Company
Airtable
Title
Building a High-Quality Q&A Assistant for Database Research
Industry
Tech
Year
2025
Summary (short)
Airtable developed Omni, an AI assistant capable of building custom apps and extracting insights from complex databases containing customer feedback, marketing data, and product information. The challenge was creating a reliable Q&A agent that could overcome LLM limitations like unpredictable reasoning, premature conclusions, and hallucinations when dealing with large table schemas and vague questions. Their solution employed an agentic framework with contextual schema exploration, planning/replanning mechanisms, hybrid search combining keyword and semantic approaches, token-efficient citation systems, and comprehensive evaluation frameworks using both curated test suites and production feedback. This multi-faceted approach enabled them to deliver a production-ready assistant that users could trust, though the post doesn't provide specific quantitative results on accuracy improvements or user adoption metrics.
## Overview of Airtable's Omni Q&A Assistant

In June 2025, Airtable launched Omni, an AI-powered assistant designed to help users build custom applications and extract insights from information stored in Airtable bases, such as customer feedback, marketing campaigns, and product details. This case study describes the engineering challenges and LLMOps practices Airtable employed to build a production-grade Q&A agent capable of researching complex databases reliably. The core problem they addressed was creating an agent that could consistently deliver high-quality answers when dealing with large, complex table schemas and ambiguous user queries, a challenge that required numerous iterations and sophisticated engineering approaches.

The fundamental challenge with building LLM-based Q&A systems is balancing the impressive processing capabilities of LLMs with their significant limitations. While LLMs excel at processing and summarizing large volumes of information far faster than humans can, they suffer from unpredictable reasoning patterns. They often jump to conclusions prematurely, compound initial mistakes through cascading errors, or hallucinate entirely inaccurate responses. These issues become particularly acute when dealing with large, complex table schemas or vague user questions. Airtable recognized that a system with only 50% accuracy is not just unhelpful but fundamentally unusable in production, which motivated their comprehensive approach to reliability engineering.

## Agentic Framework Architecture

Airtable built their solution on an agentic framework that allows an LLM to dynamically call tools and make sequential decisions to solve problems. The system is designed to mimic how a human would approach finding answers from a database: exploring table structures, applying relevant filters, re-evaluating based on new information, and performing both quantitative and qualitative analysis. This multi-step reasoning approach is particularly effective for complex questions that cannot be answered through a single query or lookup.

The architecture consists of several interconnected components that work together to process user questions. The system takes user questions and contextual information as inputs and produces answers through a series of tool calls and reasoning steps. This approach represents a significant departure from simple prompt-response patterns, instead embracing the complexity of real-world database research tasks that require iterative exploration and refinement.

## Contextual Schema Exploration

One of the most critical challenges Airtable faced was the token efficiency problem: when dealing with databases containing hundreds or thousands of records that far exceed an LLM's context window, how can the system find answers efficiently? Their initial approach of providing the schema for all tables in the base proved problematic, as even the schema alone could consume most of the context window for larger bases, leaving insufficient room for actual data exploration and reasoning.

To address this, Airtable implemented a two-step data exploration approach. First, they provide a tool for understanding schema structure. Second, they offer a tool for querying actual data. This separation allows the system to progressively narrow its focus rather than attempting to process everything at once.
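To make the two-step exploration concrete, here is a minimal sketch of how such tools might be exposed to an agent. The tool names (`get_base_overview`, `get_table_schema`, `query_records`) and the data shapes are illustrative assumptions rather than Airtable's actual API; the point is that the agent first sees only a compact, high-level schema and drills into a single table's fields and records only when needed.

```python
from dataclasses import dataclass

# Hypothetical, simplified data model; tool and field names are assumptions
# for illustration, not Airtable's internal API.

@dataclass
class TableSummary:
    name: str
    description: str
    primary_column: str
    linked_tables: list[str]  # relationships to other tables in the base

def get_base_overview(base) -> list[TableSummary]:
    """Step 1a: return only the high-level schema (names, descriptions,
    primary columns, relationships) so even a large base fits in context."""
    return [
        TableSummary(t.name, t.description, t.primary_column, t.links)
        for t in base.tables
    ]

def get_table_schema(base, table_name: str) -> dict:
    """Step 1b: detailed field-level schema for one table, fetched only when
    the agent (or the user's current view) indicates the table is relevant."""
    table = base.get_table(table_name)
    return {field.name: field.type for field in table.fields}

def query_records(base, table_name: str, filters: dict, limit: int = 20) -> list[dict]:
    """Step 2: query actual data, already narrowed by filters, so only a small
    slice of records ever enters the LLM's context window."""
    table = base.get_table(table_name)
    return table.select(filters)[:limit]
```

An agent loop would register these as callable tools and let the model decide which to invoke next, typically the overview first, then a targeted schema lookup, then filtered record queries.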
However, they recognized that real-world schemas are often noisy, containing deprecated columns, empty fields, and highly similar or ambiguous field names that confuse LLMs. Their solution was to pare down the initial context to only the most useful and salient information. They provide a high-level schema including table names, descriptions, primary columns, and relationships between tables. They then offer more detailed schema information specifically for whatever the user is actively viewing in their interface. Finally, they include example records for the table the user is currently viewing. This context prioritization helps clarify user intent and can even predict what users are likely to do next. Interestingly, Airtable found this information useful not only for question answering but also for offering proactive suggestions, demonstrating how good context management can enable multiple product features.

## Planning and Replanning Mechanisms

Airtable leveraged chain-of-thought reasoning, a proven technique for improving LLM reasoning capabilities. Rather than asking the model to directly provide an answer, they guide it to articulate its reasoning process step by step, mimicking how humans solve problems. This approach helps LLMs tackle complex tasks by breaking them down into smaller, manageable parts. They specifically mention using Anthropic's Sonnet 4 model, which has built-in "thinking tokens" and "interleaved thinking" capabilities to facilitate this process.

A critical innovation in their system is the incorporation of both planning and replanning steps. The initial planning step helps the LLM structure its approach to answering a question, but the replanning capability allows the system to adjust course upon discovering new data. This is particularly important when dealing with confusing data schemas or when initial explorations don't yield the expected results. Airtable built an evaluation system that specifically captures complex scenarios where the data schema is confusing and multiple explorations or backtracking steps may be required.

The example they provide of replanning illustrates how the system can discover that its initial approach was insufficient and adjust accordingly. This represents a more sophisticated approach than rigid, predetermined query patterns, allowing the system to handle the messiness and ambiguity of real-world data exploration tasks. The ability to replan demonstrates a form of meta-cognition that brings the system closer to how human analysts actually work with unfamiliar databases.

## Hybrid Search Implementation

Airtable implemented a hybrid search approach combining both keyword and semantic search techniques, recognizing that Retrieval Augmented Generation (RAG) efficacy depends on two critical factors: narrowing down the data sources to search and efficiently ranking search results. Their system provides the LLM with tools to filter and search base data, where the filtering step narrows the search scope while keyword and semantic search help identify relevant results.

The hybrid approach offers important advantages. By combining both search methods, they can prioritize exact matches for named entities (through keyword search) while still finding relevant results even when search queries are vague or worded differently (through semantic search). This is particularly valuable given that LLMs are subject to random errors and could overlook tables or columns during the filtering step.
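As a rough illustration of combining the two signals, the sketch below scores candidate records with both an exact-term match score and an embedding similarity score, then merges them into a single ranking. The 50/50 weighting, the `embed` callable, and the record format are assumptions for illustration; the post does not say how Airtable actually fuses keyword and semantic relevance.

```python
import math

def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms that appear verbatim in the record text;
    rewards exact matches for named entities (people, products, IDs)."""
    terms = query.lower().split()
    hits = sum(1 for term in terms if term in text.lower())
    return hits / len(terms) if terms else 0.0

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_search(query: str, records: list[dict], embed, k: int = 10,
                  alpha: float = 0.5) -> list[dict]:
    """Blend keyword and semantic relevance; `embed` is any text-embedding
    function supplied by the caller (an assumption of this sketch)."""
    query_vec = embed(query)
    scored = []
    for record in records:
        kw = keyword_score(query, record["text"])
        sem = cosine(query_vec, embed(record["text"]))
        scored.append((alpha * kw + (1 - alpha) * sem, record))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored[:k]]
```

A production version would cache embeddings, use an inverted index or BM25 rather than naive term matching, and run only after the filtering step has narrowed which tables and columns to search; the correction mechanism described next handles the case where that narrowing was too aggressive.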
To address this, Airtable introduced a correction mechanism: if no meaningful data is found, the system performs the search again over a wider scope than the initial attempt. This fallback provides additional fault tolerance, acknowledging that LLM-based systems will sometimes make mistakes and need opportunities to recover.

The architecture of their hybrid search system demonstrates thoughtful engineering around LLM limitations. Rather than assuming the LLM will always correctly identify relevant data sources on the first attempt, they built in recovery mechanisms that allow the system to gracefully handle initial errors. This defensive design approach is characteristic of mature production LLM systems that account for the probabilistic nature of model outputs.

## Token-Efficient Citation System

Airtable recognized that citing sources is important for LLM-generated answers because it allows users to verify information and reduces the risk of relying on potentially inaccurate responses. Citations also serve as an effective mechanism to minimize hallucination, as the requirement to cite specific sources discourages the model from making up information. They use inline citation tags for any derived information, whether from internet sources or database sources, achieving compact and flexible citations that can be turned into rich content later. The LLM cites sources alongside each piece of information; for example, a generation might end with the phrase "...a question before Justin" immediately followed by an inline citation tag referencing the source record. This citation style follows the natural flow of conversation and is compact, model-agnostic, and flexible for downstream processing.

A significant technical challenge they addressed was the token inefficiency of unique IDs. Database IDs don't follow natural-language patterns, and a 17-character ID can consume up to 15 tokens in the LLM's tokenization. When thousands of IDs are included in a single invocation, this results in increased cost and latency. To address this, Airtable developed a system to encode database IDs into contextually relevant, token-efficient representations that can be as short as 3 tokens, and they apply a checksum algorithm to minimize collisions between these shortened representations. This optimization yielded over 30% improvement in latency and 15% cost savings, substantial gains for a production system handling significant traffic.

This citation optimization demonstrates sophisticated thinking about the practical constraints of production LLM systems. While it would be easy to focus solely on accuracy metrics, Airtable recognized that latency and cost are equally important for user experience and business viability. Their willingness to develop custom encoding schemes for citations shows the kind of engineering investment required to make LLM systems truly production-ready.
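The post does not reveal the encoding scheme itself, so the following sketch is only a guess at its general shape under stated assumptions: each long record ID is mapped to a short, contextually named alias plus a one-character checksum, with a lookup table to expand aliases back into real IDs when citations are rendered. The function names and the specific checksum are hypothetical.

```python
import hashlib

# Hypothetical sketch of a token-efficient citation alias scheme.
# Long record IDs are replaced by short aliases before being shown to the
# model, then expanded back when rendering citations for the user.

_alias_to_id: dict[str, str] = {}

def _checksum(real_id: str) -> str:
    """One base-36 character derived from the full ID, used to catch
    collisions or mangled aliases when the model cites one back to us."""
    digest = int(hashlib.sha256(real_id.encode()).hexdigest(), 16)
    return "0123456789abcdefghijklmnopqrstuvwxyz"[digest % 36]

def shorten_id(real_id: str, hint: str) -> str:
    """Map a long record ID to a short, contextually relevant alias,
    e.g. a 17-character ID -> something like 'feedback0x'."""
    alias = f"{hint[:8].lower()}{len(_alias_to_id)}{_checksum(real_id)}"
    _alias_to_id[alias] = real_id
    return alias

def expand_alias(alias: str) -> str | None:
    """Resolve an alias cited by the model; reject it if the checksum
    does not match the stored ID (likely a hallucinated citation)."""
    real_id = _alias_to_id.get(alias)
    if real_id is None or alias[-1] != _checksum(real_id):
        return None
    return real_id
```

The properties that matter are the ones Airtable highlights: the aliases tokenize to as few as 3 tokens instead of up to 15, and the checksum gives a cheap way to detect collisions or corrupted citations before they reach the user.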
## Evaluation Framework

Airtable emphasizes that measuring the impact of any approach requires a robust evaluation system. They developed a dual-source evaluation approach combining curated test suites with live production feedback. This two-pronged strategy allows them to maintain quality during development while also learning from real-world usage patterns.

Their evaluation suite consists of curated lists of questions selected from customer research and production usage. They use both deterministic scorers (for cases where correctness can be objectively determined) and LLM-as-a-judge scorers (for more subjective assessments) to produce scores for the metrics they care about. The examples in the eval suite are specifically selected to represent Omni's most common use cases and known failure points, ensuring that testing focuses on the scenarios that matter most to users. Importantly, they augment this list continuously as more representative examples emerge from actual usage, creating a feedback loop between production and evaluation.

The evaluation suite enables rapid and confident iteration on any aspect of the system. Since Omni is model-agnostic (not tied to a specific LLM provider), the evaluations also enable them to compare performance across different models and track regressions when updating to newer model versions. This is a crucial capability for production LLM systems, as model providers frequently release updates that can have unexpected effects on application behavior.

The second source of evaluation comes from live feedback on production data. While the case study doesn't detail the specific mechanisms for collecting and processing this feedback, the mention of it suggests they have instrumentation in place to monitor real-world performance and user satisfaction. This combination of offline evaluation (the curated suite) and online evaluation (production feedback) represents best practices for production ML systems.
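As an illustration of the dual-scorer idea, the sketch below runs a small eval suite in which some cases are checked deterministically (substring match against a known answer) and others are graded by an LLM judge. The `judge_llm` callable, the prompt wording, and the case format are assumptions; Airtable's actual harness is not described at this level of detail.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class EvalCase:
    question: str
    expected: Optional[str] = None   # set for deterministic checks
    rubric: Optional[str] = None     # set for LLM-as-a-judge checks

def deterministic_score(answer: str, expected: str) -> float:
    """Objective check: does the answer contain the known correct value?"""
    return 1.0 if expected.lower() in answer.lower() else 0.0

def judged_score(answer: str, question: str, rubric: str,
                 judge_llm: Callable[[str], str]) -> float:
    """Subjective check: ask a judge model to grade the answer from 0 to 10."""
    prompt = (
        f"Question: {question}\nAnswer: {answer}\n"
        f"Rubric: {rubric}\nReply with a single integer score from 0 to 10."
    )
    try:
        return min(max(int(judge_llm(prompt).strip()), 0), 10) / 10.0
    except ValueError:
        return 0.0  # judge returned something unparsable

def run_suite(cases: list[EvalCase], assistant: Callable[[str], str],
              judge_llm: Callable[[str], str]) -> float:
    """Run every case against the assistant under test and average the scores,
    so different models or prompt versions can be compared on one number."""
    scores = []
    for case in cases:
        answer = assistant(case.question)
        if case.expected is not None:
            scores.append(deterministic_score(answer, case.expected))
        else:
            scores.append(judged_score(answer, case.question, case.rubric or "", judge_llm))
    return sum(scores) / len(scores) if scores else 0.0
```

Because the assistant under test is just a callable, the same suite can be pointed at different models or prompt versions, which is the model-comparison and regression-tracking use Airtable describes.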
## Critical Assessment and Limitations

While Airtable's blog post describes a sophisticated and well-engineered system, it's important to note what information is not provided. The post does not include quantitative metrics on system accuracy, such as what percentage of questions are answered correctly or how often the system needs to invoke its replanning or correction mechanisms. There are no specific numbers on user satisfaction or adoption rates for Omni beyond the general claim that their techniques have been "crucial in delivering high-quality and reliable answers." The 30% latency improvement and 15% cost savings from citation optimization are concrete numbers, but these relate to a specific subsystem rather than overall system performance. The post also doesn't discuss failure modes in detail: what types of questions or database structures still cause problems for the system? What is the fallback user experience when the system cannot find an answer with confidence?

Additionally, as with many vendor blog posts, there's an inherent marketing angle here. The post is designed to showcase Airtable's engineering capabilities and promote their Omni product. While the technical details appear sound and the approaches described are consistent with best practices in the LLMOps field, readers should be aware that the post likely highlights successes while downplaying challenges or ongoing limitations. The claim that a system with 50% accuracy is "unusable" is stated as motivation for their work, but the post doesn't provide concrete data on what accuracy level they actually achieved. This is a common pattern in vendor case studies: describing the problem and solution in detail while being less specific about quantitative outcomes.

## Production LLMOps Practices Demonstrated

Despite these limitations, the case study demonstrates several important LLMOps practices for production systems. The agentic framework with tool-calling represents a move beyond simple prompt-response patterns toward more sophisticated reasoning systems. The contextual schema exploration shows thoughtful management of context windows, a critical constraint in production LLM applications. The planning and replanning mechanisms demonstrate an understanding that LLM reasoning is not always linear and that systems need to accommodate course corrections. The hybrid search implementation shows sophisticated thinking about information retrieval in LLM contexts, combining multiple techniques to achieve robustness. The token-efficient citation system demonstrates attention to the practical constraints of latency and cost in production systems. Most importantly, the comprehensive evaluation framework with both curated test suites and production feedback represents mature MLOps practices adapted for LLM systems.

Airtable's approach is model-agnostic, which is a significant architectural decision. While they mention using Anthropic's Sonnet 4 specifically, their system design allows for comparison across models and flexibility to switch providers. This is increasingly important in the rapidly evolving LLM landscape, where new models are released frequently and provider pricing and availability can change.

## Future Directions

Looking forward, Airtable identifies their next challenges as scaling these capabilities to even larger, more heterogeneous bases with low latency, and enhancing LLM reliability during extended iterative operations. This acknowledgment suggests that while they've achieved production-quality results for their current use cases, they recognize that database complexity and query complexity can increase substantially, requiring continued innovation.

The challenge of extended iterative operations is particularly interesting from an LLMOps perspective. As agents perform more steps of reasoning and tool calls, they face several challenges: context windows fill up with conversation history, costs accumulate with each model invocation, latency increases linearly with the number of steps, and error rates compound as mistakes in early steps propagate to later ones. Addressing these challenges while maintaining quality will require continued innovation in areas like context compression, efficient tool calling, parallel reasoning paths, and error detection and recovery.

In summary, Airtable's Omni Q&A assistant represents a thoughtfully engineered production LLM system that addresses many of the practical challenges of deploying agentic AI in real-world applications. While the blog post lacks some quantitative details about results, it provides valuable insights into the architectural decisions and LLMOps practices required to build reliable LLM-powered products. The emphasis on iterative development, comprehensive evaluation, and attention to practical constraints like latency and cost demonstrates mature thinking about production AI systems.
