Numbers Station addresses the challenge of overwhelming data-team request volume in enterprises by developing an AI-powered self-service analytics platform. Their solution combines LLM agents with retrieval-augmented generation (RAG) and a comprehensive knowledge layer to enable accurate SQL query generation, chart creation, and multi-agent workflows. The platform demonstrated significant improvements on real-world benchmarks compared to vanilla LLM approaches, reducing setup time from weeks to hours while maintaining high accuracy through contextual knowledge integration.
Numbers Station is a company founded by Stanford PhD graduates, including co-founder and chief scientist Ines, who previously worked on representation learning for structured data and knowledge graph embeddings under Chris Ré. The company’s mission is to bring AI to structured data, specifically focusing on the “last mile” problem of deploying LLMs in production for enterprise analytics use cases.
The core problem they address is self-service analytics in enterprises. Over the past decade, companies have invested heavily in modern data stacks (ETL pipelines, improved storage and compute), but this has created complexity that overwhelms data teams with support requests from data consumers. These requests include finding dashboards, understanding field definitions, and writing SQL queries—leaving data teams stuck in support functions rather than strategic data projects.
A key insight from the presentation is that the rise of LLMs has actually exacerbated this problem. End users have seen impressive demos of LLMs writing SQL and generating charts, raising expectations significantly. However, very few of these “magical demos” have made it into production, creating a gap between user expectations and reality. Numbers Station positions itself as solving this production deployment challenge.
The foundation of Numbers Station’s approach is what they call the “unified knowledge layer.” This is arguably the most critical component of their production system, as it enables accurate, business-contextualized responses. The knowledge layer is built by ingesting and curating organizational context such as metric definitions, business rules, and schema documentation.
The speaker emphasizes that building this knowledge layer involves significant “data wrangling work” that isn’t “fancy AI work” but is essential for production accuracy. For example, a metric definition like “active customers” might have a specific business rule (e.g., filtering where “closed is not zero”) that an LLM cannot guess without context.
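The talk does not show what a knowledge-layer entry looks like internally; a minimal sketch of one, with a hypothetical schema and field names, might capture the “active customers” example like this:

```python
# Hypothetical knowledge-layer entry for the "active customers" metric.
# The WHERE clause encodes the business rule from the talk ("closed is not
# zero") that an LLM could not guess from the table schema alone.
ACTIVE_CUSTOMERS = {
    "metric": "active_customers",
    "description": "Count of customers considered active by the business",
    "sql_template": (
        "SELECT COUNT(DISTINCT customer_id) "
        "FROM customers WHERE closed <> 0"
    ),
    "synonyms": ["active users", "live customers"],
}
```

Curating entries like this, definition by definition, is the unglamorous “data wrangling work” the speaker describes.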
The knowledge layer is accessed through retrieval-augmented generation (RAG). The speaker explains that feeding all organizational context into prompts is impractical due to context window limitations and the sheer volume of enterprise data. Instead, they index the knowledge layer and, for each question, retrieve only the most relevant entries to include in the prompt.
This approach allows the SQL agent to use business-specific definitions rather than making up calculations, which was identified as a key accuracy problem with vanilla LLM approaches.
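A minimal sketch of this retrieve-then-prompt flow (the talk does not show code; a real system would use embedding similarity, for which keyword overlap stands in here to keep the example runnable, and the knowledge entries are invented):

```python
# Toy knowledge layer: a few curated business facts.
KNOWLEDGE = [
    "active customers metric: COUNT(DISTINCT customer_id) WHERE closed <> 0",
    "orders table: one row per order line, joined to customers on customer_id",
    "fiscal year starts in February, not January",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    # Rank entries by word overlap with the question (stand-in for
    # embedding similarity) and keep the top k.
    words = set(question.lower().split())
    ranked = sorted(KNOWLEDGE, key=lambda e: -len(words & set(e.lower().split())))
    return ranked[:k]

def build_prompt(question: str) -> str:
    # Inject only the retrieved context, not the whole knowledge layer.
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}\nSQL:"
```

The prompt the SQL agent sees thus carries the business definition of the metric, so it does not have to invent one.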
The presentation walks through the evolution of their approach, which is instructive for understanding production LLM deployment:
Stage 1: Basic Prompting - Starting with few-shot or zero-shot prompting, sending natural language questions and schema to an LLM API. This approach requires users to copy-paste generated SQL into their own executor, providing no automation.
Stage 2: Manual Control Flow with Tools - Adding execution tools in a pipeline where the model generates SQL and then executes it against the data warehouse. However, this “manual control flow” is brittle—if a query fails to compile, the system gets stuck unless extensive edge cases are coded.
Stage 3: Agentic Systems - The key shift is making the control flow non-deterministic and LLM-directed. Using tool calling, the LLM decides what actions to take next. If a query fails, the agent can read the error message, correct the query, and retry. This self-correction capability is crucial for production reliability.
The speaker provides a simplified implementation outline for their SQL agent: the LLM is given a set of tools (such as query execution), and on each turn it chooses the next action, looping until the request is complete.
The agentic design means the sequence of operations is determined by the LLM at runtime, providing flexibility to handle diverse user requests and error conditions.
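The loop described above can be sketched as follows. The mocked LLM policy and toy SQL executor are hypothetical; the point is the shape of the control flow, where the error message re-enters the context and the agent retries:

```python
def mock_llm(context: list[str]) -> dict:
    # Stand-in for an LLM tool-calling decision. Hypothetical policy:
    # emit a malformed query first, then a corrected one once an error
    # message appears in the context.
    if any("error" in msg for msg in context):
        return {"tool": "run_sql", "args": "SELECT COUNT(*) FROM customers"}
    return {"tool": "run_sql", "args": "SELECT COUNT(* FROM customers"}

def run_sql(sql: str) -> list:
    # Toy executor: reject unbalanced parentheses.
    if sql.count("(") != sql.count(")"):
        raise ValueError("syntax error near FROM")
    return [(42,)]

def agent(question: str, max_turns: int = 5):
    context = [question]
    for _ in range(max_turns):
        action = mock_llm(context)           # LLM picks the next action
        try:
            return run_sql(action["args"])
        except ValueError as err:
            context.append(f"error: {err}")  # feed the error back, retry
    return None
```

Unlike the fixed stage-2 pipeline, a failed execution here is just another observation the model can react to.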
For handling diverse request types (SQL queries, chart generation, dashboard search, email search), Numbers Station implements a hierarchical multi-agent system: a supervisor routes each incoming request to a specialized sub-agent.
The speaker notes they’ve considered scalability—with a dozen agents, the hierarchical approach works well, but with 50+ agents, they’d need multiple hierarchy levels to reduce routing complexity.
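A minimal sketch of such a hierarchy, with a single routing level as described for the current dozen-agent scale (agent names and the keyword router are hypothetical; in their system an LLM makes the routing decision):

```python
# Specialized sub-agents, keyed by capability.
AGENTS = {
    "sql": lambda q: f"SQL agent handling: {q}",
    "chart": lambda q: f"chart agent handling: {q}",
    "search": lambda q: f"dashboard-search agent handling: {q}",
}

def route(question: str) -> str:
    # Stand-in for an LLM routing decision.
    if "chart" in question or "plot" in question:
        return "chart"
    if "dashboard" in question:
        return "search"
    return "sql"

def supervisor(question: str) -> str:
    # One hierarchy level; at 50+ agents this would itself delegate
    # to intermediate routers.
    return AGENTS[route(question)](question)
```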
A critical production concern raised in Q&A is data access control. Numbers Station’s approach is pragmatic: they don’t rebuild access control internally but instead respect the governance already in place in source systems. When a user authenticates, their permissions flow through to what data and tools they can access.
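The passthrough idea can be sketched simply: look up the user's grants in the source system rather than maintaining a parallel permission store. The grant table and function names below are hypothetical:

```python
# Stand-in for permissions read from the source system's governance layer.
SOURCE_GRANTS = {
    "alice": {"customers", "orders"},
    "bob": {"orders"},
}

def accessible_tables(user: str) -> set[str]:
    # Permissions flow through from the source system; nothing is
    # re-implemented inside the analytics platform.
    return SOURCE_GRANTS.get(user, set())

def run_query(user: str, table: str) -> str:
    if table not in accessible_tables(user):
        raise PermissionError(f"{user} has no grant on {table}")
    return f"querying {table} as {user}"
```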
The speaker acknowledges the disconnect between user expectations (shaped by impressive demos) and production reality. Their solution is the knowledge layer—grounding responses in verified business logic. Importantly, they’ve built the system to express uncertainty, saying “I don’t know” rather than providing potentially incorrect answers. This confidence calibration is essential for non-technical end users who can’t verify SQL correctness themselves.
The onboarding process has evolved significantly. A year prior, setup required weeks of manual customer work. They’ve reduced this to approximately three hours of admin training where users provide feedback that gets injected back into the system. The long-term goal is full automation of this setup process. The feedback loop helps tune the system to how specific organizations phrase questions and use terminology.
The speaker mentions building their own agent framework with optimizations for routing—if a routing decision doesn’t require a model call, they skip it. This reduces latency and cost, addressing scalability concerns raised about multi-agent systems where complexity can drive up costs even as per-token prices decrease.
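The routing optimization the speaker mentions amounts to short-circuiting forced decisions. A minimal sketch (the counter is only there to make the saved model call observable; the heuristic is hypothetical):

```python
def route(question: str, agents: list[str], counter: dict) -> str:
    if len(agents) == 1:
        # Forced choice: skip the model call entirely, saving
        # latency and tokens.
        return agents[0]
    counter["llm_calls"] += 1  # stand-in for an actual LLM routing call
    return agents[0]           # stand-in for the model's decision
```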
The presentation claims “huge improvements” on real customer benchmarks compared to vanilla prompting approaches. The speaker emphasizes that real-world SQL benchmarks are more challenging than public academic benchmarks, and the combination of the knowledge layer (for context) and agentic retry mechanisms (for error correction) drives these improvements.
Vouch, an insurtech company, is mentioned as a customer example. They experienced rapid business growth that overwhelmed their data team with requests. Numbers Station helped them implement self-service analytics through a chat interface, allowing end users to get answers without burdening the data team.
While the presentation provides valuable technical depth, some claims warrant balanced consideration; the reported benchmark improvements, for instance, come from the vendor's own customer data rather than independently verifiable public datasets.
Nevertheless, the technical approach demonstrates mature thinking about production LLM deployment, particularly around grounding responses in organizational knowledge, handling errors gracefully through agentic retry loops, and respecting existing enterprise governance structures rather than rebuilding them.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Articul8 developed a generative AI platform to address enterprise challenges in manufacturing and supply chain management, particularly for a European automotive manufacturer. The platform combines public AI models with domain-specific intelligence and proprietary data to create a comprehensive knowledge graph from vast amounts of unstructured data. The solution reduced incident response time from 90 seconds to 30 seconds (a 3x improvement) and enabled automated root cause analysis for manufacturing defects, helping experts review and share daily incidents and optimize production processes that previously required manual analysis by experienced engineers.
Notion AI, serving over 100 million users with multiple AI features including meeting notes, enterprise search, and deep research tools, demonstrates how rigorous evaluation and observability practices are essential for scaling AI product development. The company uses Braintrust as their evaluation platform to manage the complexity of supporting multilingual workspaces, rapid model switching, and maintaining product polish while building at the speed of AI industry innovation. Their approach emphasizes that 90% of AI development time should be spent on evaluation and observability rather than prompting, with specialized data specialists creating targeted datasets and custom LLM-as-a-judge scoring functions to ensure consistent quality across their diverse AI product suite.