Anthropic deployed Claude-powered analytics agents to automate approximately 95% of business analytics queries with approximately 95% aggregate accuracy, enabling its data science team to focus on strategic work rather than ad-hoc requests. The system addresses three critical failure modes in analytics agents (concept-to-entity ambiguity, data staleness, and retrieval failure) through an agentic data stack comprising data foundations, sources of truth (including a semantic layer), skills (procedural knowledge encoded in markdown), and multi-layered validation through offline evaluations, ablation testing, and online monitoring with adversarial review.
Anthropic’s implementation of Claude for self-service data analytics represents a comprehensive LLMOps case study addressing the substantial operational challenges of deploying large language models in production for business-critical analytics. The company automated approximately 95% of their business analytics queries with roughly 95% accuracy using Claude-powered agents, freeing their data science team from repetitive ad-hoc requests to focus on higher-value work like causal modeling, forecasting, and machine learning. This case study is particularly valuable because it comes from a company building LLMs, providing insider perspective on the practical challenges and solutions for production LLM deployments.
The fundamental insight driving Anthropic’s approach is that analytics accuracy with LLMs is primarily a context and verification problem rather than a code generation issue. While LLMs excel at generating SQL and writing code, the real challenge lies in mapping ambiguous business questions to specific entities in complex data models, ensuring those mappings remain current as schemas evolve, and verifying that the agent actually retrieves and uses the correct information. This contrasts with coding agents, where the solution space is open-ended and creativity is rewarded, while documentation and tests provide natural guardrails. In analytics, there’s typically only one correct answer using one correct data source, with no deterministic way to prove correctness.
Anthropic identified three failure modes that account for the overwhelming majority of inaccurate analytics responses:
Concept-to-entity ambiguity occurs when agents cannot correctly map business concepts to the right data model entities among hundreds of viable options. For example, measuring “active users” requires determining what actions constitute being “active,” whether to include fraudulent users, and what lookback window to apply. With potentially millions of fields in a data warehouse, the agent must navigate to the precise fields that answer the user’s actual question.
Data staleness happens because data sources, business definitions, and schemas change constantly. Without active maintenance, documentation and agent knowledge go stale and begin returning subtly incorrect answers. This is particularly insidious because the answers may look plausible while being technically wrong.
Retrieval failure occurs when the correct information exists in the data model and is properly annotated, but given the vastness of the search space, the agent simply doesn’t find it during query time. This represents a fundamental limitation in how agents search and navigate large information spaces.
Anthropic built a four-layer stack to systematically address these failure modes:
The data foundations layer consists of the data models, transforms, tests, and tables in the data warehouse, along with metadata describing them. This layer primarily attacks concept-to-entity ambiguity by ensuring that business concepts like “revenue” resolve to one governed dataset instead of forty plausible candidates. It also provides the first defense against staleness, as the repository defining canonical models is where currency enforcement naturally lives.
Anthropic emphasizes that standard data engineering practices like dimensional modeling, shift-left testing, and freshness/completeness checks remain essential. The key difference is that the end user is no longer a data expert but an agent acting on behalf of users with varying expertise levels. Results therefore cannot rely on users validating the underlying correctness, because end users often lack the knowledge to do so.
Key practices include creating canonical datasets—curating a small set of heavily governed, single-source-of-truth datasets that are clearly owned, consumption-ready, and discoverable, while aggressively deprecating near-duplicates. The goal is that when agents search for a concept, they find a single governed answer. Physical rollups and caches still matter for performance and cost, but they derive mechanically from canonical models rather than existing as alternatives.
Governance must be enforced through tooling (agents are structurally routed to canonical models first), CI (changes bypassing them fail review), and mandate (downstream teams build on the governed layer or explain why not). Without enforcement, governance quickly decays back to the multiple candidates problem.
Anthropic colocates nearly all data code—modeling, semantic layer, reference docs, canonical dashboard definitions—in a single repository with CI checks protecting cross-layer integrity. If a modeling change would break a downstream dashboard or invalidate a documented metric, CI flags it and the fix ships in the same pull request. This approach proved essential for preventing documentation drift.
Treating metadata as a first-class product is critical. Coding agents perform well partly because codebases are legible through READMEs, type signatures, and docstrings. Data warehouses can achieve the same legibility, but only if column and table descriptions, canonical metric definitions, grain documentation, valid value ranges, lineage, ownership, and model tiering are maintained with the same rigor as transformations themselves. This governance provides critical context helping agents choose the right dataset.
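One way to hold metadata to the same standard as transformations is a lint step that fails when a governed model is missing documentation. The following is a minimal sketch, not Anthropic's tooling: the `ModelMetadata` shape, the `REQUIRED_FIELDS` list, and the example model name are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical set of metadata fields every governed model must document.
REQUIRED_FIELDS = ("description", "grain", "owner", "tier")


@dataclass
class ModelMetadata:
    name: str
    description: str = ""
    grain: str = ""
    owner: str = ""
    tier: str = ""
    column_descriptions: dict[str, str] = field(default_factory=dict)


def missing_metadata(model: ModelMetadata) -> list[str]:
    """Return the required fields and columns that lack documentation."""
    gaps = [f for f in REQUIRED_FIELDS if not getattr(model, f).strip()]
    gaps += [
        f"column:{col}"
        for col, desc in model.column_descriptions.items()
        if not desc.strip()
    ]
    return gaps


# Illustrative model with a missing owner and one undocumented column.
example = ModelMetadata(
    name="fct_daily_active_users",
    description="One row per user per active day.",
    grain="user_id, activity_date",
    owner="",
    tier="canonical",
    column_descriptions={"user_id": "Stable user identifier.", "is_fraud": ""},
)
print(missing_metadata(example))
```

A check like this could run in CI so that an undocumented model is flagged before an agent ever has to choose between datasets.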
The sources of truth layer consists of reference surfaces the agent consults to navigate the data warehouse, reducing concept-to-entity ambiguity and turning phrases like “weekly active users” into specific governed entities. In descending order of trust:
The semantic layer contains compiled metric and dimension definitions. When a question maps cleanly to a defined metric, the agent calls a function and receives one number—the same number every other surface in the company produces. Anthropic’s agents are structurally required by skill instructions to leverage the semantic layer first. Importantly, they found that bootstrapping the semantic layer by having LLMs auto-generate metric definitions from raw tables and query logs was net-negative versus a smaller, human-curated layer. The auto-generated definitions encoded the very ambiguities they were trying to eliminate. They recommend generating documentation with Claude but having humans own definitions.
Lineage and the transformation graph enable the agent to reason about which upstream models feed a concept, which are deprecated, and which share grain when the semantic layer doesn’t cover a question. This transforms “I don’t know the metric” into “I know which governed model to aggregate from.” It’s also the backbone of freshness and provenance signals surfaced in online validation.
The query corpus represents historical SQL from dashboards, notebooks, and prior analyses. Counterintuitively, Anthropic found that giving agents raw retrieval access to thousands of prior queries moved accuracy by less than a percentage point. Unstructured retrieval couldn’t map new questions to the right precedent. What does work is distilling that corpus into structured per-domain reference docs and reusable analysis patterns described in skills. They treat query history as raw material for curation, not as a source of truth agents read directly.
Business context is the layer most teams skip and the one Anthropic underrated longest. Agents that don’t understand the business will answer what users asked, but not what they meant. They won’t know that “the Q2 launch” refers to a specific product, that two teams define the same term differently, or that a question is being asked because a board meeting is Thursday. Anthropic pipes in a company knowledge graph consisting of indexed docs, roadmaps, decision logs, and organizational structure so agents can resolve ambient references and ask better clarifying questions.
The common failure pattern across all four sources is poor or stale documentation. Claude is exceptionally useful for closing the gap—drafting column descriptions, proposing metric docs from query patterns, flagging undocumented models in CI—but curation and ownership remain human responsibilities.
If sources of truth represent the agent’s declarative knowledge (what a metric means), then skills represent procedural knowledge: which sources to consult in what order, how to navigate ambiguous data, and what a finished analysis looks like. In Claude Code, a skill is a folder of markdown the agent reads on demand.
Skills proved highly valuable at Anthropic. Without skills, Claude’s ability to answer analytics questions accurately didn’t exceed 21% on their evaluations. Adding skills consistently gets accuracy above 95% in aggregate and regularly around 99% in certain domains.
Anthropic pairs two skills. The first, a knowledge skill, acts as a thin top-level router that lets additional domain details load on demand. It directs agents to try the semantic layer first, but if there’s no coverage, points to approximately 30 reference files for the domain describing relevant tables, columns, joins, and gotchas. This router effectively answers retrieval failure by narrowing the search space to a few dozen curated files before any query is written, rather than letting the agent search a million-field warehouse.
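The routing order described above (semantic layer first, then curated reference files, then raw tables as a last resort) can be sketched as a small lookup. This is an illustration under stated assumptions: the registries, file names, and the placeholder metric value are hypothetical, and in practice the routing lives in skill markdown rather than in Python.

```python
from typing import Callable, Optional

# Hypothetical semantic-layer registry: metric name -> function returning one governed number.
SEMANTIC_LAYER: dict[str, Callable[[], float]] = {
    "weekly_active_users": lambda: 1234.0,  # placeholder value, for illustration only
}

# Hypothetical per-domain reference docs consulted when no metric is defined.
REFERENCE_DOCS: dict[str, list[str]] = {
    "growth": ["growth/tables.md", "growth/joins.md", "growth/gotchas.md"],
}


def route(metric: Optional[str], domain: str) -> dict:
    """Pick the most trusted source that covers the question."""
    if metric in SEMANTIC_LAYER:
        return {"tier": "semantic layer", "value": SEMANTIC_LAYER[metric]()}
    docs = REFERENCE_DOCS.get(domain)
    if docs:
        # Narrow the search space to a handful of curated files.
        return {"tier": "curated reference", "files": docs}
    # Nothing curated exists: fall back to the raw warehouse.
    return {"tier": "raw table", "files": []}


print(route("weekly_active_users", "growth"))
print(route(None, "growth"))
print(route(None, "billing"))
```

The point of the ordering is that the agent only reaches the large, ambiguous search space after the smaller, governed ones have been ruled out.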
The unbook skill encodes the process a senior analyst would follow: clarify the question, find sources (via the knowledge skill), run the query, then loop the result through adversarial review sub-agents. It bundles about a dozen reusable analysis patterns (retention curves, rate decomposition, funnel analysis) so common requests don’t get reinvented each time.
Reference docs are written specifically for LLM retrieval, describing tables (grain, scope, exclusions), mechanics of gotchas (e.g., “exclude known free-email domains, but keep custom ones”), and explicit routing triggers (e.g., “IF the question is about experiment lift… DO NOT use for raw event counts”) without prescriptive recipes that go stale.
Skill maintenance is treated as a first-class citizen. Skill docs describe a data model changing daily, so without active maintenance they’re wrong within weeks. Anthropic watched offline accuracy drift from approximately 95% at launch to approximately 65% over a month before treating this as an engineering problem. Colocating skill markdown files in the same repository as transformation models means the pull request changing a model is the same pull request updating the doc describing it. A code-review hook flags any reporting-model change that doesn’t touch a skill file. Roughly 90% of data-model pull requests now include a skill change in the same diff. They also regularly prune skill scaffolding as models improve and previous failure modes no longer apply.
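The code-review hook mentioned above can be approximated with a check over the list of files changed in a pull request. This is a minimal sketch: the directory names `models/reporting` and `skills` are assumptions, since the source does not describe the actual repository layout.

```python
from pathlib import PurePosixPath

# Hypothetical repo layout; the source does not name the real directories.
MODEL_DIR = "models/reporting"
SKILL_DIR = "skills"


def _is_under(path: str, directory: str) -> bool:
    """True if `path` sits anywhere below `directory`."""
    return PurePosixPath(directory) in PurePosixPath(path).parents


def needs_skill_update(changed_files: list[str]) -> bool:
    """Flag a change set that edits a reporting model but no skill markdown."""
    touches_model = any(_is_under(p, MODEL_DIR) for p in changed_files)
    touches_skill = any(
        _is_under(p, SKILL_DIR) and p.endswith(".md") for p in changed_files
    )
    return touches_model and not touches_skill


print(needs_skill_update(["models/reporting/revenue.sql"]))
print(needs_skill_update(["models/reporting/revenue.sql", "skills/finance/reference.md"]))
```

In a real pipeline the changed-file list would come from the version-control system, and the hook would post a review comment rather than print a boolean.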
Creating a consistent and seamless experience across all surfaces is essential. The same skill must provide the same answer to questions in Slack, in the IDE, in dashboard tools, and in standalone agent sessions. Anthropic ensures one canonical source (the data repo) and that skill changes sync automatically. On merge, the skill syncs to a plugin marketplace (for IDE users), to cloud-storage blobs (for hosted apps reading a single file), and is served directly as resources over MCP. They designed for portability from the start by avoiding hardcoded repo paths and surface-specific namespaces.
Validation is how Anthropic discovers which of the three failure modes is still leaking through. It consists of offline evaluations, ablation techniques, and online validation.
Offline evaluations are simple question/answer pairs, similar to offline testing for ML models. They don’t directly measure online agent performance but give a good sense of whether there are critical gaps. Anthropic deploys two kinds. Dashboard-based evals are auto-generated by Claude (then human validated) and cover the most common stakeholder questions. Long-tail evals are produced by feeding Claude business context (roadmaps, table docs) and having it generate plausible questions across the rest of the domain. In addition, every instance of a stakeholder correcting the agent in a thread is harvested as a candidate eval.
Best practices include anchoring ground truth so it can’t drift—pinning every eval to a snapshot date, writing it against a stable fact table, or having the grader judge the agent’s query rather than its number. They wire the suite into CI so a pull request touching a dependency re-runs affected evals. They store results like telemetry in a warehouse table with skill version, git SHA, model ID, per-assertion pass/fail, token count, and wall-clock, making “Did that change help?” a queryable question with time-series to catch slow regressions.
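Storing results "like telemetry" amounts to writing one structured row per eval run. The sketch below shows one possible row shape using the fields listed above; the field names, the `snapshot_date` column, and all example values are illustrative assumptions rather than Anthropic's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalResult:
    eval_id: str
    snapshot_date: str           # ground truth pinned to a date so it cannot drift
    skill_version: str
    git_sha: str
    model_id: str
    assertions: dict[str, bool]  # per-assertion pass/fail
    token_count: int
    wall_clock_s: float

    @property
    def passed(self) -> bool:
        """An eval passes only if every assertion passes."""
        return all(self.assertions.values())


def to_row(result: EvalResult) -> str:
    """Serialize one result as a JSON row suitable for loading into a warehouse table."""
    row = asdict(result)
    row["passed"] = result.passed
    return json.dumps(row, sort_keys=True)


# All values below are placeholders for illustration.
result = EvalResult(
    eval_id="finance-001",
    snapshot_date="2025-01-31",
    skill_version="v12",
    git_sha="abc1234",
    model_id="example-model",
    assertions={"uses_semantic_layer": True, "value_matches": False},
    token_count=1800,
    wall_clock_s=42.5,
)
print(to_row(result))
```

With rows like this accumulating over time, "Did that change help?" becomes a grouped query over `git_sha` or `skill_version` rather than a matter of recollection.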
They gate launches per domain: a domain owner can’t announce the agent to stakeholders until their slice of the eval set clears a threshold (initially approximately 90%). This forces reference-doc fixes before users see failures. The number of evals depends on business area and data model complexity, calibrated by tracking how well offline accuracy predicts online accuracy. They’ve found diminishing returns past a few dozen per topic, and that ceiling drops with each new model generation. Offline eval accuracy should approach 100%, and every correct answer should hit the semantic layer if one exists. This level of accuracy doesn’t guarantee the system won’t produce wrong answers, just that there are no obvious gaps assuming proper eval coverage.
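The per-domain launch gate reduces to computing a pass rate for each domain's slice of the eval set and comparing it to a threshold. A minimal sketch follows, assuming eval results arrive as `(domain, passed)` pairs; the domain names and counts are invented for the example, and only the roughly 90% initial threshold comes from the text.

```python
from collections import defaultdict

LAUNCH_THRESHOLD = 0.90  # initial threshold of approximately 90%, per the text


def pass_rate_by_domain(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Compute the share of passing evals for each domain."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for domain, passed in results:
        totals[domain] += 1
        passes[domain] += int(passed)
    return {domain: passes[domain] / totals[domain] for domain in totals}


def launchable_domains(
    results: list[tuple[str, bool]], threshold: float = LAUNCH_THRESHOLD
) -> list[str]:
    """Domains whose eval slice clears the launch threshold."""
    rates = pass_rate_by_domain(results)
    return sorted(d for d, rate in rates.items() if rate >= threshold)


# Toy data: finance passes 9 of 10 evals, growth passes 7 of 10.
results = (
    [("finance", True)] * 9 + [("finance", False)]
    + [("growth", True)] * 7 + [("growth", False)] * 3
)
print(pass_rate_by_domain(results))
print(launchable_domains(results))
```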
Ablation techniques inform every structural decision about skills. Holding the offline eval set fixed, they vary exactly one component and compare pass rates. Each run takes about an hour and replaces many arguments. They design for null results—their most useful ablation was negative. They gave the agent direct grep access to their entire dashboard, transformation, and analyst-notebook SQL (thousands of files), verified in transcripts that it actually read them before every answer, and found accuracy moved by less than a point in either direction. Checking confounds revealed that the answer was present in the corpus about 80% of the time, but “answer present” didn’t predict “now gets it right”—the flip rate was flat. The information was there, the agent saw it, but still didn’t use it. This single experiment revealed their bottleneck wasn’t access to prior work but structure—mapping questions to the right entity. This insight redirected months of roadmap.
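The confound check in this ablation can be expressed as two numbers: the overall change in pass rate, and the flip rate (baseline failures that now pass) split by whether the answer was present in the corpus. The sketch below uses invented toy data to show the shape of the calculation; it does not reproduce Anthropic's results.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Share of evals that pass."""
    return sum(results.values()) / len(results)


def flip_rate(
    baseline: dict[str, bool], treatment: dict[str, bool], subset: set[str]
) -> float:
    """Among evals in `subset` that failed at baseline, the share that pass under treatment."""
    failed = [e for e in subset if not baseline[e]]
    if not failed:
        return 0.0
    return sum(treatment[e] for e in failed) / len(failed)


# Toy data: same eval set, one component varied (corpus access added).
baseline = {"q1": True, "q2": False, "q3": False, "q4": True, "q5": False, "q6": False, "q7": True}
treatment = {"q1": True, "q2": True, "q3": False, "q4": False, "q5": False, "q6": True, "q7": False}

# Hypothetical labels for whether the answer existed in the retrieved corpus.
answer_present = {"q1", "q2", "q3", "q4"}
answer_absent = set(baseline) - answer_present

delta = pass_rate(treatment) - pass_rate(baseline)
print(f"pass-rate delta: {delta:+.2f}")
print(f"flip rate, answer present: {flip_rate(baseline, treatment, answer_present):.2f}")
print(f"flip rate, answer absent:  {flip_rate(baseline, treatment, answer_absent):.2f}")
```

If having the answer in the corpus helped, the first flip rate would be clearly higher than the second. Equal flip rates, as in this toy data, are what "the flip rate was flat" means.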
They ablate at pull request granularity, with every meaningful skill edit getting a before/after run on the relevant eval slice and the delta in the pull request description. This keeps “I improved the docs” honest and catches surprisingly common cases where well-intentioned additions make things worse. They maintain a short list of what didn’t work, including stacking additional rounds of doc refinement past a certain point (hitting three consecutive net-negative iterations where docs got longer, not better) and swapping the adversarial reviewer to a cheaper model to cut latency (which lost most accuracy wins for no real speedup).
Online validation ensures actual system performance is as accurate as possible. Adversarial review employs a Claude skill to aggressively challenge all underlying assumptions on potential final answers, increasing accuracy by 6% within their eval set at the cost of 32% more tokens and 72% higher latency. Every response carries a provenance footer containing which source tier it came from (semantic layer › curated reference › raw table), how fresh the underlying data is, and who owns the model. This doesn’t make answers more correct but helps consumers judge how much to trust responses. A “raw table, freshness unknown” footer signals to verify before forwarding upstream and is one of few mitigations for silent failures.
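A provenance footer of the kind described above can be rendered from three pieces of information: source tier, freshness, and owner. The sketch below is one possible rendering; the exact wording and layout of Anthropic's footer are not given in the source, so the format string, field names, and example owner are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Source tiers in descending order of trust, as listed in the text.
TIERS = ("semantic layer", "curated reference", "raw table")


@dataclass
class Provenance:
    tier: str
    owner: Optional[str]
    freshness_hours: Optional[float]  # None when freshness cannot be determined


def footer(p: Provenance) -> str:
    """Render a one-line provenance footer for an agent response."""
    if p.tier not in TIERS:
        raise ValueError(f"unknown source tier: {p.tier!r}")
    if p.freshness_hours is None:
        fresh = "freshness unknown"
    else:
        fresh = f"updated {p.freshness_hours:.0f}h ago"
    owner = p.owner or "owner unknown"
    return f"Source: {p.tier} | {fresh} | Owner: {owner}"


print(footer(Provenance("semantic layer", "finance-data", 3.0)))
print(footer(Provenance("raw table", None, None)))
```

The second line of output is the case the text warns about: a raw-table answer with unknown freshness, which a reader should verify before forwarding.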
Data quality checks ensure that even when agents use the right field appropriately, the data itself is correct. Adding basic checks to ensure referenced fields are up-to-date, complete, and have no anomalies is generally good hygiene. Passive monitoring continuously tracks two production signals: the share of agent queries resolving through the semantic layer, and the share of responses using correction language (“that’s the wrong table,” “you’re missing the fraud filter”). Both feed a dashboard reviewed weekly alongside offline pass rate.
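The two passive-monitoring signals are simple ratios over logged interactions. The sketch below assumes each interaction is logged with the source tier the agent used and the stakeholder's reply; the log shape and the phrase list are hypothetical, and a production system would likely use a more robust classifier than substring matching.

```python
# Hypothetical phrases that indicate a stakeholder is correcting the agent.
CORRECTION_PHRASES = ("wrong table", "you're missing", "that's not right")


def has_correction_language(message: str) -> bool:
    """Crude substring check for correction language in a reply."""
    text = message.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(phrase in text for phrase in CORRECTION_PHRASES)


def weekly_signals(interactions: list[dict]) -> dict[str, float]:
    """Share of queries resolved via the semantic layer and share of replies with corrections."""
    n = len(interactions)
    if n == 0:
        return {"semantic_layer_share": 0.0, "correction_share": 0.0}
    semantic = sum(i["tier"] == "semantic layer" for i in interactions)
    corrected = sum(has_correction_language(i["reply"]) for i in interactions)
    return {
        "semantic_layer_share": semantic / n,
        "correction_share": corrected / n,
    }


# Toy log entries for illustration.
log = [
    {"tier": "semantic layer", "reply": "Thanks, that matches the dashboard."},
    {"tier": "semantic layer", "reply": "Great."},
    {"tier": "semantic layer", "reply": ""},
    {"tier": "raw table", "reply": "That's the wrong table, and you're missing the fraud filter."},
]
print(weekly_signals(log))
```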
Active correction harvesting closes the loop. A scheduled agent scans stakeholder channels every few hours for correction language, drafts a one-line fix to the relevant reference doc, and opens a pull request tagged to the domain owner. The fix path is deliberately boring—edit a markdown file, merge, auto-sync everywhere—so domain owners don’t spend excessive time on the task. The same corrections feed back into the offline eval set.
The failure mode none of this fully catches is the silent one where the answer is wrong but looks plausible and is used without objection. Mitigations include the provenance footer, explicit human sign-off on anything leadership-bound, and a standing eval for each domain’s top KPIs that sanity-checks against the blessed dashboard daily, though they don’t have a robust solution yet.
One critical finding is that data is not software. LLMs’ generative abilities are a double-edged sword: mechanisms enabling creative solutions to complex problems can also hallucinate erroneous output. Coding is an open-ended solution space rewarding model creativity, while documentation and tests provide natural guardrails against hallucination. For analytics use cases, there’s often only a single correct answer using a single correct source with no deterministic way of proving correctness. The complexity lies in data ambiguity—the central problem is mapping a user’s question to specific and up-to-date entities in the data model and knowing the correct way of working with them. If this can be done, resulting execution and SQL become trivial.
The finding that raw retrieval access to thousands of prior queries moved accuracy by less than a percentage point is particularly striking and runs counter to intuition. This suggests that simply exposing more information to LLMs doesn’t solve the problem—the information must be structured and curated in ways that enable correct mapping from questions to entities.
For teams starting from zero, Anthropic recommends that a handful of canonical datasets, a few dozen offline evals, and a thin knowledge skill will capture most of the upside. Everything else in their approach represents what they added once those were built.
Teams should align on organizational principles affecting their approach by asking: How important is a correct answer today versus in the future? AI models are progressing rapidly, and building significant infrastructure to account for current model shortfalls may become moot once models improve. Knowing where models fall short and waiting for improvements has significantly less overhead but may not fit every company’s risk tolerance.
Teams should consider how the complexity of their business will change over time. Some processes may be overkill if they don’t produce much data, have few output consumers, or expect their data model to remain simple. They should assess how technical the intended audience is—if building for data scientists who can recognize incorrect answers, they may tolerate more errors compared to audiences with no data model familiarity.
Teams must decide how much they’re willing to spend for improved accuracy. Processes like adversarial validation can significantly improve accuracy but often at higher cost and latency. Finally, teams should clarify their comfort around access controls and internal data privacy. Agents are often significantly more performant with more context, but broad data access cuts against most companies’ governance posture; where a team lands on that trade-off determines whether to build one agent or many scoped ones.
This case study is noteworthy for its transparency about both successes and failures, which is refreshing given that it’s published by Anthropic to promote Claude. The detailed discussion of what didn’t work—like auto-generating semantic layer definitions or providing raw query corpus access—adds credibility to the claimed successes.
However, readers should note that the 95% accuracy figure is stated as “in aggregate” and “approximately,” with some domains reaching 99% while others presumably fall below. The case study acknowledges that silent failures—where wrong but plausible answers go unchallenged—remain an unsolved problem. This is a critical limitation for production analytics systems where incorrect data-driven decisions can have significant business impact.
The infrastructure investment described is substantial, requiring tight integration between data modeling, semantic layers, skill documentation, CI/CD, and continuous monitoring. This may be prohibitive for smaller teams or organizations with less mature data infrastructure. The observation that roughly 90% of data-model pull requests now include a skill change points to significant ongoing maintenance overhead, though Anthropic frames this as necessary cost rather than optional enhancement.
The reliance on human curation at multiple layers—metric definitions, reference docs, skill maintenance, eval creation—means this is not a “set it and forget it” solution. While Claude assists with generation, humans remain responsible for correctness and currency. This is appropriate given the stakes but means the system’s effectiveness depends heavily on organizational discipline and data team capacity.
The case study is particularly valuable for teams already operating at scale with mature data practices who want to enable self-service analytics. For teams with immature data foundations, the lesson may be that investing in canonical datasets, governance, and documentation pays dividends regardless of whether LLMs are involved—the LLMs simply make the return on that investment more immediate and visible.