Natural Language to SQL: What Actually Works, and Where It Breaks

Natural language to SQL is the ability to type a question in plain English and have software write the database query for you. It works well for simple lookups against one table ("how many users signed up last week"). It breaks on multi-table joins, ambiguous metrics, and picking the wrong column, and it almost never tells you when it got one wrong.
Key takeaways
- Text-to-SQL is reliable for single-table lookups where the question maps to one obvious column. It gets unreliable the moment a question needs a join, a metric definition, or a date-range rule.
- The dangerous failures are the silent ones. A model can write a query that runs cleanly, returns a confident number, and is counting the wrong thing. Nobody sees a red error.
- Three things cut the error rate: a semantic layer that pins each metric to one official definition, showing the generated query so a human can check it, and verifying the result against a number you already trust.
- The query is the easy 80%. The hard 20% is knowing your business: that "active user" excludes internal accounts, that revenue is recognized not booked, that two tables join on a column with duplicate keys.
What is natural language to SQL?
Natural language to SQL, also called text-to-SQL, turns a question typed in plain English into a database query, runs it, and returns the answer. SQL (Structured Query Language) is the language databases speak. The point is to let someone who does not write SQL ask the database a question directly, instead of filing a request and waiting for an analyst to write the query.
Is natural language to SQL accurate?
It is accurate for simple, single-table lookups, and unreliable the moment a question needs a join, a metric definition, date logic, or business context the model cannot see in the schema. Accuracy depends less on the model alone than on the quality of the metric definitions, the table relationships, and the validation checks built around it. A good model on a clean lookup gets it right consistently. The same model on "revenue by signup channel" can return a confident, wrong number.
Where natural language to SQL is reliable
It works when the question maps to one table and one obvious column. "How many orders shipped yesterday." "What was total revenue in March." "List the top 10 customers by spend." These are single-table lookups or simple aggregations. The model has seen millions of them, the mapping from words to columns is unambiguous, and there is little room to be subtly wrong. A good model handles these reliably.
This is genuinely useful. It lets a marketer or a product manager pull a number without filing a ticket and waiting in the data team's queue. At OpenAI, opening that kind of self-service access across the company meant new hires were contributing real insights on day one instead of week three, because they did not have to learn SQL first.
The trouble starts when the question stops being a lookup and starts being a question about how the business actually works. A lookup is one query. A higher-level question like "why did this move" or "is growth healthy" needs a method, an analytics playbook, not a single query.
What are the main risks of natural language to SQL?
The main risk is not that the query fails, it is that the query succeeds while measuring the wrong thing. Text-to-SQL fails most often when the model has to infer business context that is not written down in the schema. The failures cluster in three places, and all three return an answer that looks fine.
Joins. The moment a question spans two tables ("revenue by signup channel"), the model has to decide how those tables connect. Pick the wrong join key, or join on a column with duplicate values, and rows get multiplied or dropped. The query runs. The number is wrong, sometimes by a lot, and there is no error to catch it.
Ambiguous metrics. Ask for "active users" and the model picks a definition. But your company already has one: maybe it excludes internal test accounts, maybe it means logged-in-and-took-an-action, maybe sales and finance each count it differently. The model does not know your rule, so it invents a reasonable-sounding one. You get a number for "active users" that is not the number anyone in your company means by "active users."
The wrong column. Warehouses are full of near-duplicate columns: created_at versus updated_at, amount versus amount_usd, status versus status_v2. The model reads the names and guesses. Guess created_at when you needed updated_at and your whole time-based analysis shifts by days.
A plain text-to-SQL tool gives you an answer with no way to tell if it is right. The query can run cleanly, return a confident number, and be measuring the wrong thing.
Why is silent failure the real problem?
A silent failure is a query that runs successfully and returns a plausible number while answering the wrong business question. A wrong answer delivered with confidence is more dangerous than a slow one, because someone decides on it before anyone catches the mistake. A dashboard that breaks shows a broken chart. A query that joins wrong shows a clean table of numbers. The output of a confidently wrong query is indistinguishable from a correct one until someone re-derives it by hand.
This is worse with a model than with a junior analyst, for one reason: the model does not flag what it does not know. A human who is unsure about which "revenue" column to use will ask. The model picks the most statistically common pattern and presents it as the obvious answer, with no note that it guessed. The fluent, confident tone is the same whether it is right or wrong.
So the question to ask of any natural-language-to-SQL tool is not "is it fast." Speed is table stakes. The question is "when it is wrong, will I find out before I act on it."
How can you make natural language to SQL safer?
Three things move text-to-SQL from a demo trick to something you can run a business on: a semantic layer, the visible query, and verification against a number you already trust.
A semantic layer. A semantic layer is the recorded map between raw database fields and business concepts like revenue, active users, churn, and customer. It holds one official definition per metric, the relationships between tables, and which source is the truth. "Active user" is defined once. The join between orders and channels is specified once. Now the model is not guessing how your business works, it is reading it. This is the single biggest lever, because it removes the two failure modes that cause silent wrong answers: the ambiguous metric and the bad join. Gamma replaced separate Looker and Amplitude setups with one platform that holds these definitions, and their team says they trust what is on Sundial more than wading through their own raw tables.
Show the query. If the tool shows you the SQL it wrote and the steps it took, a person who knows the data can read it and catch a wrong join or a wrong column before the number leaves the room. A black-box number you cannot inspect is a number you cannot trust on anything that matters.
Verify against a known value. Run the generated query against a number you already know is right (last quarter's revenue from the closed books, a count you have validated by hand) and see if it matches. If it does not, the query is wrong, and you found out in seconds instead of in a board meeting.
How the options compare
Plain text-to-SQL and a grounded agent are not the same tool, and a dashboard does a different job again. Here is how they line up on what they are good for, where they break, and what guardrail each one needs. One note before the table: a dashboard and an AI analyst are not an either/or. Sundial has dashboard capabilities too, so you get the fixed dashboard views and the ask-anything investigation in one place.
| Tool | Best for | Breaks on | Guardrail it needs |
|---|---|---|---|
| Plain text-to-SQL | Simple single-table lookups | Joins, metric definitions, business context | A human reading the query |
| Text-to-SQL on a semantic layer | Questions about defined metrics | Open-ended "why did this move" questions | The semantic layer staying current |
| Agentic analytics | Multi-step "why" investigations | Highest-stakes one-off calls | Visible work, confidence signal, audit trail |
| Dashboard | A fixed KPI everyone watches | Any question it was not built for | A data team to build and maintain it |
| Human analyst | Architecting the context, judgment calls, the highest-stakes calls | Volume and speed of routine pulls | Time |
Sundial's approach: an agent that investigates and shows its work
Sundial is an AI analytics platform for product, growth, and data teams that investigates business questions directly against company data. It puts an AI agent between you and your data, and the agent is built around the three fixes above. You ask a question in plain language. Instead of writing one query and stopping, the agent plans an investigation, runs the queries itself, checks its own work, and hands back an answer with the reasoning attached. It shows the steps and the queries, and it carries a confidence signal and an audit trail so a data team can see exactly how it got there.
What the agent can do depends on who is using it. For data consumers, the business users who just want answers, it is read-only by default: they ask questions and get answers and cannot change the data. For data practitioners, the same agent does more. It is the one tool that can both model the tables and run data-quality checks, so practitioners use it to build and maintain the context layer that everyone else relies on. Read-only is the consumer default, not a cap on the whole tool.
The grounding comes from a context layer: the official definition of each metric, how the tables relate, which source is truth. The work splits across four agents. Quality checks whether the underlying data is fresh and complete enough to answer at all. Modeling holds the semantic layer that turns raw tables into business concepts. Analysis runs the chain of queries and reasoning to get to "why." Storytelling turns the result into something a decision-maker can act on.
That structure is why the hard questions land. Character used Sundial to answer things a single query cannot ("what happened to power users on Android in India last week," "which specific characters drove higher time spent"), and got to a level of customer understanding they say would otherwise have taken 6 to 12 months.
When should you use text-to-SQL, and when should you not?
Reach for plain natural-language-to-SQL on the shallow, single-table questions where a wrong answer is cheap and easy to spot. A quick count, a daily total, a top-10 list. The risk is low and the time saved is real.
Do not use it as the sole input to a high-stakes decision without inspecting the query or checking it against a value you trust. Be careful on anything that needs a join, a metric definition, or a date-range rule. For governed reporting where every number has to tie out exactly (the financial close, board reporting, regulated filings), keep a human signing off.
For the open-ended "why did this move" questions that turn into a ticket for the data team, you want more than a query generator: an agent that investigates by following an analytics playbook, grounds itself in your definitions, shows its work, and tells you how sure it is.
Frequently asked questions
What is natural language to SQL? It is software that turns a plain-English question into a database query, runs it, and returns the answer, so someone who does not write SQL can ask the database directly.
How accurate is natural language to SQL? Reliable for simple single-table lookups. Unreliable for joins, ambiguous metrics, wrong-column guesses, and any question that needs business context the model cannot see in the schema.
What causes text-to-SQL errors? The model infers things that are not written down: how tables join, what a metric officially means, which of several similar columns to use. It picks the most common pattern and does not flag that it guessed.
Can natural language to SQL replace data analysts? It does not remove the human, but it does change the job, and a team likely needs fewer analysts. It can do much of what analysts used to spend their days on: the repetitive pulls and the first-pass investigation. So the role shifts from reactively answering query requests to architecting the context, defining the metrics, the relationships, and the source of truth, so everyone's questions get high-quality answers. Humans stay in the loop on judgment and the highest-stakes calls.
What is a semantic layer? The recorded map between raw database fields and business concepts like revenue and active users. It pins one definition per metric and specifies how tables relate, so the model reads your business rules instead of guessing them.
How is Sundial different from a text-to-SQL tool? A text-to-SQL tool writes one query and stops. Sundial runs a multi-step investigation grounded in a context layer, shows its work, and attaches a confidence signal and an audit trail, so you can tell whether to believe the answer.
If you want to see what reliable looks like in practice, that is what we build at Sundial.