## Overview and Company Context
Toqan developed a specialized data analyst agent as part of their broader AI platform, designed to democratize data access by allowing users to ask questions in natural language and receive SQL-generated answers with automatic visualizations. The company worked with development partners including iFood, Gloo, and OLX to bring this agent from prototype to production. The speakers, Yanis and Don (a machine learning engineer), presented their experience scaling this agent to serve hundreds of users, sharing the technical challenges and solutions they encountered during the production deployment process.
## The Technical Challenge
The Toqan data analyst agent is a sophisticated multi-step AI system: it interprets natural language questions, translates them into SQL queries, executes those queries against user databases, and presents the results with visualizations. The agent showed impressive capabilities during development, finding relevant data sources, generating and executing SQL, and recovering from its own failures, but the transition to production revealed significant reliability issues.
The team experienced what they described as "magical" moments when the agent worked perfectly, but also frustrating failures including infinite loops where agents would ignore stop commands, repetitive responses (giving the same answer 58-59 times), and inconsistent behavior across different runs. These issues made the system unsuitable for production use despite its impressive technical capabilities.
## Four Key Production Lessons and Solutions
### Lesson 1: Deterministic Workflows for Behavioral Problems
The team discovered that while LLMs excel at generating diverse content, this flexibility becomes a liability in production systems requiring consistent behavior. Users would ask vague questions, and the agent would generate answers even when it lacked sufficient context, leading to wrong or inconsistent results across different runs.
Their solution involved implementing deterministic pre-processing steps that evaluate questions in isolation before the agent attempts to solve them. This preprocessing determines whether the agent has sufficient context to answer the question, creating a more consistent user experience and preventing the agent from attempting to answer unanswerable questions.
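As a rough illustration of what such a gate could look like, the sketch below rejects questions that reference nothing in the documented schema before the agent ever runs. The function names and the keyword-overlap heuristic are assumptions for the sketch; the talk does not describe Toqan's actual pre-processing logic.

```python
# Hypothetical pre-processing gate: runs before the agent loop and decides
# whether the question can plausibly be answered with the documented schema.
from dataclasses import dataclass

@dataclass
class PrecheckResult:
    answerable: bool
    reason: str = ""

def precheck_question(question: str, documented_entities: set[str]) -> PrecheckResult:
    """Deterministically reject questions that reference nothing we know about."""
    tokens = {t.strip("?,.").lower() for t in question.split()}
    matches = tokens & {e.lower() for e in documented_entities}
    if not matches:
        return PrecheckResult(
            False,
            "Question mentions no documented table, metric, or dimension; ask the user to clarify.",
        )
    return PrecheckResult(True)

# Usage: only hand the question to the agent when the gate passes.
result = precheck_question("What was churn by region last quarter?", {"churn", "region", "orders"})
if not result.answerable:
    print(result.reason)  # surface a clarification request instead of guessing
```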
They also implemented similar deterministic checks for SQL query validation. Rather than allowing the agent to generate SQL and then discover column-doesn't-exist errors through database execution, they built validation steps that check column existence before execution. This approach reduces unnecessary database load, minimizes wasted computation cycles, and prevents the agent from consuming server resources on predictably failing queries.
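A minimal sketch of this kind of pre-execution check is shown below, using sqlglot to parse the generated query and compare referenced columns against a known schema. The library choice, schema dictionary, and function name are assumptions for illustration; the source only states that column existence is validated before execution.

```python
# Illustrative validation step: confirm that every column referenced in the
# generated SQL exists in the known schema before the query touches the database.
import sqlglot
from sqlglot import exp

SCHEMA = {"orders": {"order_id", "user_id", "created_at", "total"}}

def missing_columns(sql: str, schema: dict[str, set[str]]) -> list[str]:
    parsed = sqlglot.parse_one(sql)
    known = {col for cols in schema.values() for col in cols}
    return [c.name for c in parsed.find_all(exp.Column) if c.name not in known]

bad = missing_columns("SELECT order_date, total FROM orders", SCHEMA)
if bad:
    # feed the error back to the agent instead of executing a query that will fail
    print(f"Unknown columns: {bad}")
```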
The key insight here is that LLM-generated content can be evaluated with plain deterministic code, simple functions rather than further LLM calls or database round-trips, before being applied to external systems, cutting wasted cycles and improving reliability.
### Lesson 2: Expert Integration for Setup and Monitoring
Initially, the team attempted a traditional handoff approach where data experts provided documentation and schema information, which the engineering team would then integrate into the agent system. However, this process proved inadequate for the specialized nature of SQL generation and data analysis tasks.
The breakthrough came when they created pipelines and user interfaces that allowed data experts to directly update and optimize documentation for agent consumption. This approach enabled experts to iterate quickly without engineering intervention and, crucially, to optimize documentation specifically for AI consumption rather than human consumption.
This expert-in-the-loop approach proved essential because the data experts understood the nuances of business rules, data relationships, and domain-specific requirements that generic engineering approaches couldn't capture. The result was agents that became highly specialized and effective for their specific use cases.
### Lesson 3: Resilient System Design
When deployed to real users, the agent faced queries and use cases that the development team hadn't anticipated. Users would push boundaries, ask questions outside the intended scope, and occasionally attempt to abuse the system. The team learned that building production AI systems requires expecting and gracefully handling unexpected inputs.
Their solution involved balancing hard limits (specific rules about when the agent must stop) with soft limits (guidance to nudge the agent toward desired behaviors). Hard limits prevent infinite loops and resource abuse, while soft limits preserve the agent's creative problem-solving capabilities and self-correction abilities that make agentic systems valuable.
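One way to picture the interplay of the two limit types is the loop sketch below: a hard cap on agent steps that is always enforced, and a soft nudge injected into the conversation as the cap approaches. The agent interface, step counts, and messages are hypothetical; the source describes the pattern, not this code.

```python
# Sketch of combining a hard limit (forced stop) with a soft limit (a nudge).
MAX_STEPS = 15   # hard limit: never exceeded, prevents infinite loops
NUDGE_AT = 10    # soft limit: start steering the agent to wrap up

def run_agent(agent, question: str) -> str:
    messages = [{"role": "user", "content": question}]
    for step in range(MAX_STEPS):
        if step == NUDGE_AT:
            messages.append({
                "role": "system",
                "content": "You are running long. Summarize what you have and finish.",
            })
        action = agent.next_action(messages)  # hypothetical agent interface
        if action.is_final:
            return action.answer
        messages.append(action.as_message())
    # hard stop: return a graceful partial result rather than looping forever
    return "Stopped after reaching the step limit; partial results returned."
```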
This approach required careful engineering to ensure the system remained robust while preserving the agent's core value proposition of flexible problem-solving.
### Lesson 4: Tool Optimization for Agent Efficiency
The team recognized that agent frameworks consist of both the main agent logic and supporting tools, and that optimizing these tools significantly impacts overall system performance. They focused particularly on their schema tool, which reads documentation and determines relevant context for the agent.
Initially designed for small use cases with few tables, the schema tool needed enhancement as they scaled to larger, more complex use cases. They improved the tool's ability to determine relevance, ensuring agents received concentrated, relevant information while avoiding distraction from irrelevant data.
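The sketch below shows one simple way a schema tool can rank documented tables by relevance to the question and return only the top matches. The keyword-overlap scoring is an assumption for the sketch; an embedding-based ranker would slot into the same structure, and the source does not specify which approach Toqan used.

```python
# Illustrative relevance filter: score each documented table against the
# question and pass only the best matches to the agent as context.
def relevant_tables(question: str, table_docs: dict[str, str], top_k: int = 3) -> list[str]:
    q_tokens = set(question.lower().split())
    scored = [
        (len(q_tokens & set(doc.lower().split())), name)
        for name, doc in table_docs.items()
    ]
    scored.sort(reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]

docs = {
    "orders": "order_id user_id total created_at purchases revenue",
    "users": "user_id signup_date country plan",
    "events": "event_id user_id event_type timestamp",
}
print(relevant_tables("What is total revenue by country?", docs))  # -> ['orders', 'users']
```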
They also implemented a reflection tool within their SQL execution component - a smaller LLM call that can fix common SQL errors (like date function dialect differences) without involving the main agent. This optimization saves computation cycles and prevents the main agent's context from being cluttered with error-correction information.
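A reflection step of this kind might look like the sketch below: when execution fails, a small, cheap model gets one attempt to repair the query before the error is escalated to the main agent. The model name and prompt are assumptions; the client call uses the standard OpenAI Python SDK, but the source does not show Toqan's implementation.

```python
# Sketch of a reflection tool that repairs common SQL errors (e.g. dialect
# differences in date functions) without involving the main agent.
from openai import OpenAI

client = OpenAI()

def reflect_and_fix(sql: str, error: str, dialect: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed: any small, cheap model fits this role
        messages=[
            {"role": "system", "content": f"Fix this {dialect} SQL query. Return only the corrected SQL."},
            {"role": "user", "content": f"Query:\n{sql}\n\nError:\n{error}"},
        ],
    )
    return response.choices[0].message.content.strip()
```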
## Technical Architecture and Model Selection
The system uses a multi-model approach, selecting different LLMs based on task complexity and cost considerations. They primarily use OpenAI models but vary the specific model (including GPT-4, GPT-4 Turbo, and smaller models) depending on the component's requirements. Simpler, more focused tools can use lighter models, while the core data analyst functionality requires more capable models due to the high cost of errors in data analysis decisions.
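In practice this kind of routing often reduces to a simple mapping from component to model, as in the hedged sketch below; the specific assignments are illustrative assumptions, not Toqan's actual configuration.

```python
# Illustrative model routing: cheaper models for narrow tools, a stronger
# model for the core analyst loop where mistakes are most costly.
MODEL_BY_COMPONENT = {
    "question_precheck": "gpt-3.5-turbo",
    "schema_tool": "gpt-3.5-turbo",
    "sql_reflection": "gpt-3.5-turbo",
    "core_analyst": "gpt-4-turbo",
}

def model_for(component: str) -> str:
    # default to the strongest model when a component isn't listed
    return MODEL_BY_COMPONENT.get(component, "gpt-4-turbo")
```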
The team built their own agent framework rather than using existing solutions, partly because they began development two years ago when framework options were limited, and partly because they needed specific capabilities that weren't available in existing tools. This custom approach allowed them to implement the specialized optimizations and control mechanisms they discovered were necessary for production deployment.
## Database Access and Security
The system handles database access on a case-by-case basis, working with each client to establish appropriate credentials and access controls. They implement strict timeouts to prevent long-running queries that could lock databases, and they require limited access credentials to minimize the risk of system failures affecting broader client infrastructure.
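The sketch below shows what these guardrails can look like for a Postgres client: a read-only role plus a server-side statement timeout so a runaway agent query cannot lock the database. Connection details are placeholders, and Postgres is used only as an example; the source does not name specific databases or drivers.

```python
# Sketch of limited-access, timeout-protected database access for the agent.
import psycopg2

conn = psycopg2.connect(
    host="client-db.example.com",          # placeholder connection details
    dbname="analytics",
    user="agent_readonly",                  # limited-access credentials, SELECT only
    password="***",
    options="-c statement_timeout=10000",   # kill any query running longer than 10s
)
conn.set_session(readonly=True)             # refuse writes even if the role allowed them
```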
Role-based access control varies by client and their existing database management practices, requiring flexible security implementations tailored to each deployment environment.
## Production Results and Scaling
The implementation of these four lessons successfully transformed the Toqan data analyst from a prototype that "never works out of the box" to a production system serving hundreds of users with acceptable reliability and user adoption. The team emphasized that "AI first doesn't mean AI only," highlighting the importance of combining AI capabilities with traditional software engineering practices for production deployment.
## Key Insights for LLMOps Practitioners
This case study provides several valuable insights for practitioners deploying LLM-based systems in production. The importance of deterministic workflows challenges the assumption that LLM flexibility is always beneficial - sometimes predictable behavior is more valuable than creative responses. The expert-in-the-loop approach demonstrates that domain expertise remains crucial even with highly capable AI systems, and that tools for expert collaboration can be as important as the AI components themselves.
The balance between hard and soft limits offers a framework for maintaining system reliability while preserving AI capabilities, and the focus on tool optimization shows how supporting components can significantly impact overall system performance. Finally, the multi-model approach demonstrates that production LLM systems often benefit from using different models for different tasks rather than applying a single model to all components.
The case study also illustrates the importance of extensive testing and iteration when moving from prototype to production, and the value of building custom solutions when existing frameworks don't meet specific production requirements. Their journey from frequent failures to stable production deployment required systematic identification and resolution of reliability issues, emphasizing that successful LLMOps requires both AI expertise and traditional software engineering discipline.