There are a growing number of companies utilizing text-to-SQL models, based on integrations between generative language models (GenAI or LLMs) and data models. These are often built into products or representing the product itself. For those who are not in the know, SQL (or structured query language), is a programming language utilized to interact with relational databases and other data systems for the purpose of retrieving, manipulating, and filtering data. I have experimented personally with text-to-SQL, also called “natural language querying” or a “natural language interface”, in my Clinical Trials Exploration Tool (message me if you want to try it out) among other work projects. The idea is to let people who don’t speak SQL ask questions of their database or data warehouse as if they had an analyst on hand to help them.
At the core of these tools is the assumption that every natural language question about a dataset can be converted into a SQL query which provides the necessary context for a data visualization or a written explanation of the results. My experience tells me that people who do not know SQL often make implicit assumptions about the data they are retrieving and thus ask questions that are improperly structured for deterministic translation into a SQL query. Text-to-SQL systems can take a variety of paths in the situation where translation is not possible:
Having been an analyst, then a manager of analysts, then a data science consultant, and now a data science consultant who manages analysts among others, for a long time, I have a lot of experience with people asking me questions about data that are misformulated. It is not their fault. They don’t know the data model, and they don’t sit around staring at the data every day such that the semantic layer (the business meaning of the data model) is basically their brain. They ask questions from the perspective of the “curious other”, someone who knows something about the entity their question is about, but not what information is available to them about that entity nor about the structure of that information.
To provide a canned example very similar to those available on the Internet, let’s say a user asks the question: “How many customers have upgraded from a monthly subscription to an annual subscription over the past 12 months?”
To answer this question, the LLM needs to know what constitutes a customer, where the subscriptions are housed, how to evaluate that a customer had both a monthly and annual subscription, and when that upgrade occurred. Many of the text-to-SQL models available right now will just dump all of the schema information (the database representation of the tables available) into the prompt (the input to the LLM) and rely on the LLM to determine which fields represent which elements of the question and how they are related. I can tell you right now that this will not work.
Customers will exist in one table, subscriptions another. And the subscriptions table might have one row per subscription per timeframe (Monthly or Annual) or it might have one row for the one subscription that was upgraded with a categorical variable for the duration of the subscription and a date on which the subscription was updated. Or all the information might be sandwiched into the customer table somehow. Or there might be another table called “customer_reporting” with information about upgrades that’s separate from the “customers_stg” table. Are your eyes watering yet? Are you still here? Anyone? Bueller?
So the solution here is to create a data model with a semantic layer that translates the schema into deeper information about what the table “means”, what role the table serves, what data it contains, and the natural language meaning of each field/column in each table. I can tell you from experience that LLMs are very good at mapping from a natural language question to a well-written semantic layer in the prompt.
Yet if you have more than one table that could be asked about, or god forbid, two or more tables that need to be joined together to answer a question, then you need to pass the requisite context to the LLM to explain which tables it should look at, their semantic information, and some information about the way those tables are related to each other. Schema information contains foreign keys to tell you how to join tables together, but cloud data warehouse schemas don’t typically include this information, so it needs to be presented to the LLM in another way.
At some point, the amount of context you need to present to the LLM for a given question becomes too large to fit into a single document for all questions, so you need to divide the document into chunks and retrieve the relevant information before sending it to the LLM along with the question that was asked. Now you have gone from text-to-SQL to “retrieval augmented generation” (RAG).
RAG typically incorporates some sort of search algorithm or algorithms, such as “semantic search” that converts documents into embeddings, stores them in a “vector database”, and queries for the closest matching documents based on the user’s question. This means that your documents need to be “semantically close” to the question that the user asked (and preferably, “semantically far” from each other in order to make it easier for your search algorithm to consistently hit the mark), which implies that the chunking strategy for your documents should be based on some optimal division of the question space (the universe of possible questions that might be asked).
My intuition for the division of the question space is that it should be according to the “entities” about which users might ask questions. “Customers” as an entity is too broad because there is a huge variety of questions that a user might ask about customers. And “customer subscription upgrades over the last 12 months” is too granular because it only covers a single question. However, “customer subscriptions” or “customer subscription upgrades” might be the right level of detail to capture a solid proportion of the question space while keeping the context bounded. Then your job is to include in the document the comprehensive set of information the LLM might need to answer questions about “customer subscriptions” or “customer subscription upgrades”.
Everything we have discussed so far are table stakes in terms of developing a working natural language query interface.
Now in order to illustrate the challenge of translating text into SQL at scale, we need to understand the context the LLM needs. Let’s consider other questions that might be asked:
Now consider a variation of question (3) that illustrates the difficulty of translating questions from text to SQL.
User: What is the breakdown of new subscribers this year by location?
LLM: SELECT location, COUNT(*) FROM customer_subscriptions GROUP BY 1
Database: ERROR: column location does not exist in table customer_subscriptions
The user has asked a question about something that does not exist in the data model. Now you might think to yourself, “Well the LLM knows that there is a city, state, and zipcode in the customers table, so it should be able to infer the intent of the user from the request.” This is where you are dead wrong. Any time you ask the LLM to infer your intent from your question, you are asking to be disappointed. I wouldn’t even trust a junior analyst to infer intent in this case. This question needs more input from the user.
Back to our options:
So some avenues we can explore are:
Of these options, I prefer option (3) the best because it (a) exposes additional information to the user that assists the user with improving their questions in the future, and (b) forces us to establish an error-handling pattern and stacking of context that funnels users toward the answers they want while effectively communicating to the users the additional context about any caveats surrounding that information.
Now let’s take an even more difficult translation example: the situation where the user asks a question about data that does not exist in the data model.
User: How many active subscribers do we have within 50 miles of Atlanta?
LLM: SELECT COUNT(*) FROM customer_subscriptions WHERE city = ‘Atlanta’ and is_active = TRUE
Output: There are 73 subscribers within 50 miles of Atlanta.
In this case, the LLM has constructed a valid query and provided a reasonable response to the user, but the response is clearly wrong and based on the following assumptions:
Anyone who has ever worked with data before knows most or all of these assumptions to be false in most databases. Yet the LLM did not hallucinate in this case. The context we provided and the methodology of our approach allowed the LLM to answer this question even when it does not know the answer.
In this case, the only possible response, the response I would expect a senior analyst to provide, is that we cannot answer this question in this form, but we can answer a similar question, such as:
Yet the response that is easiest to build into the system is “I do not have sufficient data to answer this question”, an answer that will surely cause the user to shut down their laptop and walk away. Regardless, we need to make sure to log the question and response in our monitoring platform so that we can determine how to address this class of question later on.
Later, if we determine that this class of question is critical to our user base, we could explicitly geocode all of the addresses in our customer database, provide the fields in the semantic layer, add a geocode tool to our LLM (to determine the geocode of the reference point: Atlanta, GA), and explicitly provide an example SQL query that returns a count of the records within a X mile distance of the center of a given city.
This seems like overkill, but it illustrates the point, which is that our words are blunt instruments, and not because natural language cannot encode the information required to translate to SQL: because most of us are not used to thinking in this way, so we’re also not used to speaking or writing or chatting a chatbot in this way.
Those who do not live, work, sleep, and breathe in data land are not used to writing in ways that are specific enough to explain to the LLM (or to an analyst) exactly what we want the query results to contain and not contain on the first try. Experienced analysts are used to asking follow up questions to help stakeholders to be more specific about what they want so that they minimize back and forth. LLMs don’t know how to do that (yet), and expecting them to be good at this is unreasonable. It is up to those of us who design natural language query systems to build in feedback loops that prevent us from providing invalid information without annoying the shit out of our users.
-
***This post was first published by Ross Katz on LinkedIn. View the original article here***