What It Actually Takes to Trust AI With Your Data
Picture this. It’s your fourth day at your new job. You’re sitting in a large team meeting, still figuring out how to connect to the office Wi-Fi, when the GM pulls up a chart and asks the room: “Retention dropped 5% last month. What’s going on?”
You glance at your laptop. You’ve been poking around the data warehouse since Day 1 because you’re eager and you’re smart. You actually have an answer. You found a dashboard, ran some queries, and you’re pretty sure you know what’s going on.
So you raise your hand.
“It looks like the Android cohort drove the decline. Their D7 retention fell significantly compared to the prior month.”
The room goes quiet. Your manager clears her throat.
“We… uh… changed the definition of ‘active user’ three weeks ago. It used to be any event and now it’s a meaningful engagement event. The drop isn’t real.”
You slowly lower your hand.
Here’s the thing: you’re eager and you’re smart. Your SQL was fine. Your chart was clean. Your explanation was coherent and plausible.
You were just missing context that no amount of raw intelligence could compensate for.
This is what AI does.

AI is the world’s smartest new hire
Today’s AI models are extraordinary. They write SQL so robust it’ll make your data engineer weep with respect. They generate charts as pristine as a McKinsey slide deck on presentation day. They can produce multi-paragraph explanations that boom with the full-regalia confidence of your most Senior, most Tenured, Crotchety Analyst.
But they don’t actually have that confidence.
When you point ChatGPT, Gemini, or Claude at your data warehouse and ask “why did retention drop?”, you are handing the question to a brilliant person who has never worked at your company, has never attended your team’s meetings, doesn’t know your metric quirks, and doesn’t know half of what you know about your business.
The problem is not intelligence. The problem is everything else.
Today, dozens of cutting-edge companies have told us that their AI agents are right maybe 70-85% of the time. This is game-changing for data practitioners, who can now do analyses in a fraction of the time. And if a response smells fishy? Well, these data ninjas have the know-how to check, veto, or rework their AI’s output.
But to let business users loose with the promise of data democratization via natural language chat? Unfortunately, 70-85% just isn’t good enough.
So how do we close the remaining gap and get something actually trustworthy?
4 things your best analyst knows that your AI should too
Think about that amazing senior analyst on your team that everyone trusts. The one whose Slack messages get screenshotted and forwarded to the exec team. What do they actually know that makes them so good?
Which numbers matter
Your company has 14 dashboards with some version of “retention” on them. Three of them use different definitions. One was built by someone who left two years ago and nobody’s touched it since. One is the “official” board metric.
Your best analyst knows which one is canonical. They know that the dashboard labeled “Retention - Master” is, ironically, the wrong one (it’s the one called “retention_jake_final” that the CFO actually uses, because Jake built it to match the board definition before he left, and nobody’s ever renamed it).
Generic AI probably doesn’t know this. It’ll grab the table that sounds most correct and spit that number back via a snazzy “Ask AI in Slack” interface that everyone’s been using, and your PM will be none the wiser as they copy it into their next presentation, where it becomes the number everyone argues about for the next 20 minutes.
What changed in the business
Revenue jumped 30% last quarter, hooray! The latest model with access to your CRM and Slack might actually piece together a solid story: it finds the enterprise deal that closed, sees the Slack thread where the sales team went crazy on emojis, and pulls the contract value from Salesforce.
But this deal was unusual: to close it, the VP of Sales gave the client 18 months of free onboarding support, a concession made in the final stages of a late-night phone call. The revenue is real, but the margin story is very different, and the strategic rationale lives in someone’s head.
Piping in context from Slack, Linear and your CRM bridges part of the gap. But there will always be judgment calls, side conversations, and unwritten context that no integration automatically captures. These misalignments are often discovered the human way, when one day some part of a story doesn’t make sense and questions are asked. Your best analyst is always listening and paying attention to the goings-on of the business.
How to actually think about the problem
Give AI and a senior analyst the same question: “Why did retention drop?”, and watch what happens. AI opens the data, starts slicing, and follows whatever looks interesting. It’ll build a beautiful cohort analysis, then do a segmentation deep-dive, and come back with a full report that technically answers the question but doesn’t actually move the decision.
Your best analyst takes a different tack. They start by asking who cares and why. They scope the problem before they touch the data. They work backwards from the decisions that need to be made, then assess whether a directional answer is sufficient or a deep, precise one is needed. They have entire playbooks for how to distill a business question into data questions, and how to separate the signal from the noise.
This is the accumulated judgment of someone who has done hundreds of analyses at this company, for these particular stakeholders, with all the weird quirks of this specific ecosystem.
Generic AI, like a junior analyst, investigates what’s asked and hands you an encyclopedic answer that’s technically impressive but may be practically useless, the analytical equivalent of answering “What should we have for dinner?” with a complete nutritional breakdown of every restaurant within ten miles.
What happened last time
Every January, your numbers dip. Every January, someone panics. And every January, your best analyst says: “It’s January. It always dips. It’ll bounce back by the third week.”
They know this because they investigated it the first time, were right, and watched what happened next. Over three years of doing this, they’ve built a calibrated sense of when to worry and when to wait.
Generic AI starts from zero, and it never closes the loop. It produces an answer and moves on. It doesn’t know which of its past recommendations were right, which were subtly wrong, or which ones led to a decision that backfired. It has no way to learn that the churn analysis it produced last quarter actually missed the real driver, or that the context it was given about metric definitions introduced a new error somewhere else.
Your best analyst is always updating their mental model with new information: this worked, that didn’t, this source is reliable, this one is error-prone. Most AI systems today have none of that. Teams add context ad hoc, fix the errors they notice, and have no systematic way to know if accuracy is improving or degrading.
What this looks like in real life
In one recent example, we gave 50 real questions from actual users to a state-of-the-art model running on top of a clean set of tables in a warehouse. These tables even had a semantic layer!
The model scored slightly better than 80%. This is great if you are a data-proficient analyst who can read SQL or Python as easily as a bookseller reads novels; you are now vastly more empowered in your work.
This is not nearly good enough if you are a business user! Person after person told us they’d rather rely on a human analyst, since they couldn’t be sure the AI was right. Even data leaders agreed: the downstream impact wasn’t just embarrassing meetings, but bad business decisions leading to thrash across the organization and greater overall skepticism of AI analysis.
What made up the failing 20%? The errors were practically all failures of context.
In a few cases, the system treated sign-ins as signups because event data used those terms in a non-standard way. In another, it interpreted “month X revenue” as a rolling 28-day window when the business expected a calendar month. In another, it pulled registration counts from the wrong source because cumulative registrations and daily registrations were defined differently, including how deleted users were handled.
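To make the second failure concrete, here is a minimal sketch (in Python, with illustrative dates) of how the same phrase can compile into two different date windows. The function names and example dates are ours, not from the evaluation:

```python
import datetime

def calendar_month_window(year: int, month: int) -> tuple[datetime.date, datetime.date]:
    """What the business meant by "month X revenue": the calendar month."""
    start = datetime.date(year, month, 1)
    # First day of the following month, handling the December rollover.
    end = (datetime.date(year + 1, 1, 1) if month == 12
           else datetime.date(year, month + 1, 1))
    return start, end  # half-open interval [start, end)

def rolling_28_day_window(as_of: datetime.date) -> tuple[datetime.date, datetime.date]:
    """What the model assumed: the 28 days ending on the as-of date."""
    return as_of - datetime.timedelta(days=28), as_of  # [start, end)

# "February revenue", asked on March 3rd, under each interpretation:
print(calendar_month_window(2025, 2))                    # window: 2025-02-01 up to 2025-03-01
print(rolling_28_day_window(datetime.date(2025, 3, 3)))  # window: 2025-02-03 up to 2025-03-03
```

Both queries run cleanly and return a plausible number; only one matches what the person asking actually meant.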
Once the appropriate context was applied, the same set of questions jumped to 98% accuracy.
Same model; same warehouse; same underlying data.
What changed was not intelligence. It was institutional know-how, as well as a heaping dose of careful monitoring, measurement, and iteration.
What can you do about this?
Here are five practical places to start:
1. Define your canonical metrics.
Create one trusted source for your most important business metrics. Call it your “golden set.” Only one definition allowed. If metrics are confusingly named, or you see the AI pulling from the wrong source, feed that context in. (Steps 1 and 2 are sketched in code after this list.)
2. Log major business changes.
Campaign launches, pricing tests, onboarding redesigns, policy changes, instrumentation changes, metric definition changes — these should live somewhere machine-readable and easy to retrieve.
3. Capture analytical playbooks.
What do your best analysts check first when retention drops? Which cuts matter? Which segments are strategic? What questions are usually noise? Write that down.
4. Continuously update your memory.
When your team investigates a recurring issue, do not let the answer disappear into Slack. Store the conclusion, the evidence, and what was ruled out so the system can use it next time.
5. Measure whether it’s working.
This is the step that matters most as your system matures. Track accuracy over time. Understand which questions your AI gets right, which it gets wrong, and why. Without this, enrichment becomes a game of whack-a-mole, because improving context in one spot can cause a regression in another. You need a feedback loop of enrichment → measurement → observability, not just growing piles of context. (This loop is also sketched below.)
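To make steps 1 and 2 concrete, here is a minimal sketch of machine-readable context. Every field name, schema, and file path here is hypothetical; the point is only that canonical definitions and business changes live somewhere a system (or a new hire) can retrieve them:

```python
import datetime
import json

# Step 1: a "golden set" of canonical metric definitions (hypothetical schema).
GOLDEN_METRICS = {
    "d7_retention": {
        "definition": "Share of a signup cohort with a meaningful engagement "
                      "event on day 7 after signup.",
        "canonical_source": "retention_jake_final",  # matches the board definition
        "known_impostors": ["Retention - Master"],   # feed this to the AI too
    },
}

# Step 2: an append-only, machine-readable log of major business changes.
def log_business_change(summary: str, affected_metrics: list[str]) -> None:
    entry = {
        "date": datetime.date.today().isoformat(),
        "summary": summary,
        "affected_metrics": affected_metrics,
    }
    with open("business_changelog.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")

log_business_change(
    "'Active user' now requires a meaningful engagement event, not any event.",
    affected_metrics=["d7_retention"],
)
```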
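And for step 5, a minimal sketch of the measurement half of the loop: score every run against a fixed set of questions with known answers, and compare runs before and after adding context. The file format and the answers_match placeholder are assumptions, not a prescribed implementation:

```python
import json

def answers_match(predicted: str, expected: str) -> bool:
    # Placeholder check; in practice, compare numbers with a tolerance
    # or have a human reviewer grade the response.
    return predicted.strip().lower() == expected.strip().lower()

def score_run(results_path: str) -> float:
    """Accuracy over a fixed eval set of {question, expected, predicted} records."""
    with open(results_path) as f:
        records = [json.loads(line) for line in f]
    correct = sum(answers_match(r["predicted"], r["expected"]) for r in records)
    return correct / len(records)

# Compare runs so you can tell whether new context helped or caused a regression.
before = score_run("run_before_context.jsonl")
after = score_run("run_after_context.jsonl")
print(f"accuracy: {before:.0%} -> {after:.0%}")
```

The per-question diffs matter as much as the headline number: a piece of context that fixes three questions while silently breaking two others still looks like progress in aggregate.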
The gap is closing, but not how you think
The funny thing about this problem is that it’s not going to be solved by meatier models.
GPT-6 or Claude 5 or whatever comes next will be even more capable at reasoning, even more fluent, even larger in its working memory. But out of the box, it still won’t know your business any better.
Part of the answer is giving the model better context. It’s getting easier and easier to connect and ingest company details, and teams that do this well will see real improvements.
But context alone hits a ceiling. Every team that’s pushed past 85% accuracy learns how fragile ad-hoc enrichment is.
The teams that actually close the gap aren’t just adding more context; they’re measuring whether that context actually improves quality. Enrich, then measure accuracy, then observe where things break, then decide what to enrich next. It’s a continuous loop.
In a world where every company has access to the same frontier models and the same integration tools, the differentiator is the system around it. How well does AI know your business? How well does it keep track of what’s changing? How well can you tell whether its trustworthiness is increasing or decreasing?

