Company
Explai
Title
Building Production-Ready AI Analytics Agents Through Advanced Prompt Engineering
Industry
Tech
Year
2025
Summary (short)
Explai, a company building AI-powered data analytics companions, encountered significant challenges when deploying multi-agent LLM systems for enterprise analytics use cases. Their initial approach of pre-loading agent contexts with extensive domain knowledge, business information, and intermediate results led to context pollution and degraded instruction following at scale. Through iterative learning over two years, they developed three key prompt engineering tactics: reversing the traditional RAG approach by using trigger messages with pull-based document retrieval, writing structured artifacts instead of raw data to context, and allowing agents to generate full executable code in sandboxed environments. These tactics enabled more autonomous agent behavior while maintaining accuracy and reducing context window bloat, ultimately creating a more robust production system for complex, multi-step data analysis workflows.
## Overview

Explai is a company founded two years ago (circa 2023) that focuses on applying AI agents to data analytics and business intelligence. The founder, a data scientist with 20 years of experience who previously led data teams of hundreds at companies like Zalando and Delivery Hero, presents a detailed account of their journey building production LLM systems for enterprise data analytics. This case study is particularly valuable because it candidly discusses initial failures and the tactical solutions developed to overcome them.

The company's mission centers on democratizing data science by creating AI "data companions" rather than traditional BI tools. Their fundamental insight is that business users possess tremendous domain knowledge and context but lack technical skills in SQL, Python, or statistics, while AI agents, despite having read "the whole internet," often fail at basic tasks like mathematical reasoning on tabular data. This framing guides their entire approach to building production LLM systems.

## The Problem: Context Window Management at Scale

The case study reveals a critical production challenge that emerged during their first 12 months of operation. Initially, Explai followed what seemed like best practices at the time: pre-loading agent contexts with extensive information including custom prompts with domain knowledge, RAG-based retrieval of all table information, business documents, data science process guidance, intermediate SQL results, data previews (potentially thousands of rows, even when sampled), and accumulated snapshots across multi-step analysis workflows. This approach appeared logical since data analytics inherently requires rich context spanning multiple domains, including database schemas, business logic, statistical methodologies, and intermediate computation results.

However, as they scaled to real enterprise use cases with production-level data volumes, they observed severe degradation in instruction following. The root cause was straightforward: even sampled data isn't small, and when combined with all the other contextual information they were providing, the context window became polluted with too much information, causing the LLM to lose focus on the actual task at hand.

The speaker acknowledges this was a "hard learned journey" and emphasizes that while they "felt very smart doing it," the approach simply "didn't work very well" in production. This honest assessment of failure makes the subsequent solutions more credible and valuable.

## Solution Framework: Strategic Prompt Engineering

Explai developed a comprehensive prompt engineering strategy organized around four main categories, drawing from LangChain's framework for context management:

- **Writing Context**: How information is persisted or committed to long-term and short-term memory, including techniques like scratchpads for agent reasoning.
- **Selecting Context**: How relevant information is chosen for inclusion in prompts, with emphasis on pull-based rather than push-based approaches.
- **Compressing Context**: Using summarization (preferred over simple trimming when the added latency is acceptable) to reduce context size while preserving signal, especially during agent handovers (sketched in code below).
- **Distributing Context**: Isolating contexts across different agents to partition work, similar to distributed computing patterns in traditional data processing.

The speaker focuses on three specific tactical implementations that proved most effective in their production system, detailed below.
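To make the compression category concrete, the following is a minimal sketch of summarizing older turns at an agent handover once a context budget is exceeded. The budget, the word-count token proxy, and the placeholder summarizer are illustrative assumptions; the talk does not describe Explai's actual implementation, and a production version would make the summarizer an LLM call.

```python
# Sketch only: collapse older conversation turns into a summary at handover
# so the receiving agent starts with signal rather than the full transcript.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    role: str      # "user", "assistant", "tool", ...
    content: str


def rough_token_count(turns: List[Turn]) -> int:
    # Crude word-count proxy for a real tokenizer; good enough for a sketch.
    return sum(len(t.content.split()) for t in turns)


def compress_for_handover(
    turns: List[Turn],
    summarize: Callable[[List[Turn]], str],   # would be an LLM call in production
    budget_tokens: int = 2000,
    keep_recent: int = 4,
) -> List[Turn]:
    """If over budget, collapse everything except the most recent turns into a summary."""
    if rough_token_count(turns) <= budget_tokens or len(turns) <= keep_recent:
        return turns
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary_turn = Turn(role="system",
                        content=f"Summary of earlier analysis: {summarize(older)}")
    return [summary_turn] + recent


# Usage with a trivial placeholder summarizer.
if __name__ == "__main__":
    history = [Turn("assistant", "intermediate reasoning " * 200) for _ in range(10)]
    compact = compress_for_handover(
        history,
        summarize=lambda ts: f"{len(ts)} earlier steps exploring cohort retention.",
    )
    print(f"{len(history)} turns -> {len(compact)} turns")
```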
## Tactic 1: Reversing RAG - Pull vs. Push

This tactic fundamentally reimagines how domain knowledge is provided to agents. Instead of pre-loading contexts with extensive documentation, Explai developed a structured document system with four key components:

- **Trigger Messages**: Extremely concise one-sentence descriptions that can be preloaded into every relevant agent context without consuming significant tokens. For example, for computing cohort retention metrics, the trigger simply states "when is cohort retention actually useful." These triggers essentially serve as lightweight pointers to more detailed knowledge.
- **Prerequisites**: When an agent pulls the full document (using tools provided for this purpose), it first encounters a section on prerequisites that helps the agent determine if this approach is appropriate. For instance, the agent might learn that computing cohort retention requires data from two date ranges.
- **Related Content**: Guidance on alternatives and related approaches. The example given shows that if only two consecutive years of data are available, the system guides the agent toward year-over-year metrics instead, which would be more appropriate.
- **Examples**: Concrete demonstrations of how to apply the technique, leveraging the fact that LLMs learn better inductively from examples than from abstract deductive instructions.

The critical innovation here is that agents can query multiple such documents in parallel (the speaker notes that frontier models handle 5 to 15 parallel tool calls without issues), so the latency penalty is minimal. This requires discipline in structuring knowledge and building appropriate tooling, but it dramatically reduces context pollution. The speaker references a recent Anthropic post on "skills" and notes this pattern applies equally well to domain knowledge.

While this approach is clever, it does introduce dependencies on tool calling reliability and adds complexity to the system architecture. The claim that parallel tool calls add no latency should be evaluated critically, as real-world network conditions and API rate limits may introduce variability not apparent in controlled testing environments.

## Tactic 2: Write Artifacts, Not Raw Data

This tactic addresses the problem of intermediate results polluting agent context during multi-step analysis workflows. Instead of placing actual data (even samples) into the LLM context, Explai materializes all intermediate results as structured artifacts in a backend data store (PostgreSQL or Pandas DataFrames).

The key insight is that agents only need to see metadata about these artifacts—table names, schemas, summary statistics, column scales, and lineage information (e.g., "this is a result of regression analysis")—which consumes very few tokens. Agents then have access to tools and endpoints to explore these artifacts as needed through operations like head, tail, and sampling.

The example workflow shown involves generating a smartphone product catalog table, followed by generating related order data. Neither table's actual contents enter the agent context. Instead, the agent sees artifact references and can query them programmatically.
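Both tactics reduce to the same mechanism: keep heavy content out of the prompt and expose tools the agent can call to pull it on demand. Below is a minimal sketch of that shape; the class names, document fields, and metadata format are assumptions for illustration, not Explai's actual tooling.

```python
# Sketch only: a knowledge base that preloads one-line "triggers" (Tactic 1)
# and an artifact store that returns metadata instead of rows (Tactic 2).
from dataclasses import dataclass
from typing import Dict
import pandas as pd


@dataclass
class KnowledgeDoc:
    trigger: str        # one sentence, cheap enough to preload everywhere
    prerequisites: str  # lets the agent decide whether the method applies
    related: str        # alternatives to steer toward when prerequisites fail
    examples: str       # concrete demonstrations (LLMs learn well inductively)


class KnowledgeBase:
    def __init__(self) -> None:
        self._docs: Dict[str, KnowledgeDoc] = {}

    def add(self, name: str, doc: KnowledgeDoc) -> None:
        self._docs[name] = doc

    def triggers(self) -> str:
        """What gets preloaded into agent context: one line per document."""
        return "\n".join(f"- {name}: {d.trigger}" for name, d in self._docs.items())

    def pull(self, name: str) -> str:
        """Tool the agent calls (possibly several in parallel) to get a full doc."""
        d = self._docs[name]
        return (f"PREREQUISITES:\n{d.prerequisites}\n\n"
                f"RELATED:\n{d.related}\n\nEXAMPLES:\n{d.examples}")


class ArtifactStore:
    """Intermediate results live here; the agent only ever sees metadata."""

    def __init__(self) -> None:
        self._frames: Dict[str, pd.DataFrame] = {}

    def write(self, name: str, df: pd.DataFrame, lineage: str) -> str:
        self._frames[name] = df
        # Only this metadata string enters the context window.
        return (f"artifact={name} rows={len(df)} "
                f"columns={list(df.columns)} lineage={lineage}")

    def head(self, name: str, n: int = 5) -> str:
        """Exploration tool: page through the data on demand."""
        return self._frames[name].head(n).to_string()


# Example: a generated table never enters the prompt, only its reference does.
if __name__ == "__main__":
    kb = KnowledgeBase()
    kb.add("cohort_retention", KnowledgeDoc(
        trigger="When is cohort retention actually useful.",
        prerequisites="Requires data covering at least two date ranges.",
        related="With only two consecutive years, prefer year-over-year metrics.",
        examples="Group users by signup month, track repeat activity per month.",
    ))
    store = ArtifactStore()
    print(store.write("product_catalog",
                      pd.DataFrame({"sku": ["A1", "B2"], "price": [499, 899]}),
                      lineage="generated smartphone catalog"))
    print(kb.triggers())
```

In a deployment of this kind, the `triggers()` string would be preloaded into each relevant agent's prompt, while `pull`, `head`, and similar operations would be registered as tools the model can call, several of them in parallel.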
The artifact approach provides several benefits: dramatically reduced token consumption, clear data lineage tracking, the ability to page through large datasets interactively (the example shows 200 data points with only the first 5 initially visible), and consistent infrastructure that both agents and frontend UIs can use (the frontend uses the same endpoints to render tables).

This is a solid engineering pattern that separates concerns between data storage and reasoning. However, it does raise questions about what happens when agents need to actually examine data to make decisions (e.g., identifying data quality issues or unexpected patterns). The speaker doesn't fully address whether there are cases where agents do need to see actual data samples, or how the system handles such scenarios.

## Tactic 3: Full Code Generation in Sandboxed Environments

The third tactic involves giving agents more autonomy to write complete executable code for certain tasks, rather than constraining them to limited tool calls or declarative formats. The specific example discussed is data visualization. Initially, Explai took a constrained approach: for plotting with Plotly, they had agents generate JSON declarations that would be passed to the Python runtime. This provided safety and predictability but limited flexibility. After gaining confidence in the system's reliability, they transitioned to allowing agents to write full Python code for visualization tasks.

The rationale is multi-faceted. First, visualization is considered low-risk when executed in a sandbox (unlike freestyle SQL, which they still constrain with guardrails for data protection and PII concerns). Second, full code generation is more flexible—agents can pre-aggregate data, check for label overlap, adjust layouts, and even rerun plotting code iteratively if the result is unsatisfactory. Third, it's more adaptable to changing requirements—if a customer prefers a different visualization library, the change is straightforward without rewriting declarative grammars.

The speaker contrasts this with their continued use of workflows with guardrails for SQL generation, indicating a risk-based approach to determining where agents can have full code generation autonomy versus where they need more constraints. This tactic represents a pragmatic middle ground in the ongoing debate about how much autonomy to give agents. However, the speaker doesn't discuss in detail what the sandboxing mechanism looks like, how they handle execution timeouts, resource limits, or what happens when generated code has bugs or infinite loops. These are critical production concerns that would need robust solutions.
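To make the sandbox shape concrete, the following is a minimal sketch of executing generated code in a separate interpreter process with a timeout and captured output. The talk does not describe Explai's actual isolation mechanism, so everything here is an illustrative assumption; a production sandbox would add container-level isolation, memory and CPU limits, and network restrictions.

```python
# Sketch only: run agent-generated Python out-of-process, bound its runtime,
# and return only compact results (stdout/stderr, produced files) to the agent.
import subprocess
import sys
import tempfile
from pathlib import Path


def run_generated_code(code: str, timeout_s: int = 30) -> dict:
    """Execute untrusted generated Python in a separate interpreter process."""
    with tempfile.TemporaryDirectory() as scratch:
        script = Path(scratch) / "generated.py"
        script.write_text(code)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=scratch,            # generated code writes only into scratch
                capture_output=True,
                text=True,
                timeout=timeout_s,      # guards against infinite loops
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "error": f"timed out after {timeout_s}s"}
        artifacts = [p.name for p in Path(scratch).iterdir() if p != script]
        return {
            "ok": proc.returncode == 0,
            "stdout": proc.stdout[-2000:],   # truncate before it re-enters context
            "stderr": proc.stderr[-2000:],
            "artifacts": artifacts,
        }


# Example: an agent could inspect stderr and rerun with corrected code.
if __name__ == "__main__":
    result = run_generated_code(
        "import json\n"
        "with open('figure.json', 'w') as f:\n"
        "    json.dump({'data': [{'type': 'bar', 'x': ['A', 'B'], 'y': [3, 5]}]}, f)\n"
        "print('wrote figure.json')\n"
    )
    print(result)
```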
## Architecture and Agent Orchestration

While not the primary focus of the talk, the speaker provides important context about the overall system architecture. Explai operates a multi-agent system where different agents specialize in different tasks: SQL writing, plotting, forecasting, causal inference, and other analytical operations. This specialization makes sense given that data analytics encompasses diverse disciplines requiring different approaches. The system requires agent coordination, skill development, result verification (critical given that data analytics demands accuracy and errors can accumulate across multi-step workflows), and intelligent information provisioning.

The speaker notes that SQL writing is "soon becoming a commodity" and cannot be relied upon for differentiation, whereas capabilities like optimal forecasting model selection or causal inference still offer opportunities for competitive advantage. An interesting philosophical point raised is that since they don't employ reinforcement learning or fine-tuning for most workloads, prompt engineering and context window management essentially serve as their primary mechanism for "manufacturing learning" and building end-to-end capabilities. This makes their prompt engineering tactics even more critical to system performance.

## From Workflows to Autonomous Agents

The speaker presents a maturity model for agent autonomy that their tactical improvements enable: starting with constrained workflows, progressing to ReAct-style agents with strong primitives and tool use, and eventually reaching full code generation capabilities for appropriate tasks. The key insight is that once robust primitives and infrastructure are in place (structured document systems, artifact management, sandboxed execution), agents can be granted more autonomy without sacrificing reliability. The speaker initially thought workflows would remain necessary but found that "once you have those primitives then ReAct and code works just fine." This suggests their tactics successfully addressed the underlying issues that make constrained workflows necessary in less mature systems.

## Business Philosophy: Companions, Not Tools

An important framing throughout the talk is that Explai aims to build "data companions" rather than "just another BI system." The speaker argues that great data analytics was never about who could write the best SQL or create the prettiest plots—those are necessary skills but not the essence of analytical value. Instead, analytics is a "social cultural process" that is inherently multi-step, requires human-in-the-loop interaction, and involves high context understanding.

The speaker contrasts this with much of the industry's approach to AI for analytics, which they see as simply adding natural language interfaces to existing BI tools or building natural-language-to-SQL converters. While acknowledging these can be useful, they argue this limits the potential of AI because it treats the agent as just another tool rather than as a consultant or companion in an analytical process.

This philosophy directly influences their technical approach. The emphasis on multi-step workflows, follow-up questions, and context management reflects the reality that "if a single query can answer [the question] then it wasn't an interesting question to begin with." The signal in analytics comes from surprising results that generate follow-up questions, not from routine reporting.

While this framing is compelling and likely resonates with experienced data professionals, it's worth noting that this represents a particular vision of what AI analytics should be. Many organizations may have legitimate use cases for simpler natural-language-to-SQL tools, and the speaker's characterization of such approaches as insufficient may be somewhat dismissive of valid alternative design philosophies.
## Technical Stack and Tooling

While specific technical details are limited, the case study references several components of their stack: PostgreSQL and Pandas for data storage and manipulation, Plotly for visualization (with flexibility for alternatives), LangChain patterns for agent orchestration, insights from Anthropic's documentation on skills and structured approaches, frontier LLM models capable of reliable parallel tool calling, and sandboxed Python execution environments.

The speaker doesn't mention which specific LLMs they use, whether they employ multiple models for different tasks, or how they handle model updates and versioning—all relevant concerns for production LLMOps. The reference to "frontier models" suggests they're using cutting-edge commercial APIs rather than self-hosted models, which has implications for cost, latency, and control.

## Production Considerations and Open Questions

While the case study provides valuable tactical insights, several production concerns receive limited attention. The speaker doesn't discuss evaluation and testing methodologies, monitoring and observability approaches, failure handling and recovery mechanisms, cost management and token optimization beyond context reduction, latency requirements and real-time vs. batch processing considerations, data security and compliance beyond mentioning PII protection for SQL, or how they handle model updates and maintain system stability as LLM capabilities evolve.

The speaker's background leading large data teams at major tech companies lends credibility, but the lack of quantitative results (latency improvements, accuracy metrics, customer satisfaction scores) makes it difficult to assess the magnitude of improvements their tactics provided. Statements like "it didn't work very well" and "works just fine" are qualitative and subjective.

## Critical Assessment

The case study demonstrates genuine learning from production experience and offers practical, implementable tactics that address real problems. The honest discussion of failures is refreshing and valuable. However, several caveats should be noted. First, the talk is from a company founder at what appears to be a conference or meetup, so there's inherent sales motivation even if the speaker explicitly says "it's not a sales pitch." Second, the emphasis on these specific three tactics may reflect selection bias—these are the approaches that worked for their specific use case with their specific data and customers, but may not generalize universally. Third, some claims (like parallel tool calls adding no latency) should be validated independently rather than accepted at face value.

The approach of reversing RAG is clever but adds architectural complexity and dependencies on tool calling reliability. The artifact management approach is solid engineering but may have edge cases where agents actually need to see data. The full code generation approach is pragmatic but carries risks that aren't fully addressed. Nevertheless, these tactics represent thoughtful solutions to real production problems and are likely to be valuable for others building similar systems.

## Conclusion

Explai's two-year journey building production LLM systems for data analytics illustrates the gap between initial approaches that seem theoretically sound and what actually works at scale with real enterprise data.
Their evolution from context pre-loading to pull-based retrieval, from raw data in context to artifact references, and from constrained declarative formats to sandboxed code generation represents a maturation process many teams building LLM systems will need to undertake. The case study's value lies not in presenting revolutionary new techniques, but in providing battle-tested tactical implementations of emerging best practices, along with an honest assessment of what didn't work.

For practitioners building multi-agent LLM systems, particularly in data-intensive domains, these tactics offer concrete starting points for addressing context management challenges. The emphasis on structured knowledge, separation of concerns between reasoning and data storage, and risk-based granting of autonomy provides a reasonable framework for production LLMOps in complex analytical domains.
