Optimizing RAG-based Search Results for Production: A Journey from POC to Production

Statista 2023

Statista, a global data platform, developed and optimized a RAG-based AI search system to enhance their platform's search capabilities. Working with Urial Labs and Talent Formation, they transformed a basic prototype into a production-ready system that improved search quality by 140%, reduced costs by 65%, and decreased latency by 10%. The resulting Research AI product has seen growing adoption among paying customers and demonstrates superior performance compared to general-purpose LLMs for domain-specific queries.

Industry

Research & Academia

Overview

Statista is a Hamburg-based company that operates a global statistics and data platform, serving approximately 30,000 paying customers across multiple countries including the US, Japan, UK, Germany, and France. The platform hosts millions of statistics curated by 300-400 researchers and analysts, with around 23 million monthly page views and 500,000 downloads of complex data reports. The company’s core challenge has always been discovery—helping users find the right statistics among millions of data points.

This case study documents Statista’s journey from an internal proof-of-concept to a production-ready generative AI product called “Research AI,” developed in collaboration with Urial Labs (led by CTO Mati) and facilitated by Talent Formation. The presentation was delivered as a panel discussion featuring three perspectives: the business stakeholder (Ingo, CTO of Statista), the technical optimizer (Mati from Urial Labs), and the project facilitator (Benedict from Talent Formation).

Business Context and Initial Exploration

When ChatGPT emerged in early 2023, Statista’s executive team recognized both the opportunity and threat that generative AI posed to their business model. While 66-80% of their traffic comes from organic Google search, the rise of AI assistants that could directly answer user questions threatened to disintermediate their platform. However, Statista also recognized their competitive advantage: a trusted, curated data source that paying customers rely on for professional purposes.

Rather than making top-down decisions about AI strategy, Statista took a pragmatic approach by dedicating a single engineer with a mathematics background to explore use cases for two months. This exploratory phase resulted in several potential applications, with one clear winner that was developed into a prototype by mid-2023. The prototype proved the principle but was not production-ready—it was slow, expensive, and lacked scalability.

Initial Architecture and Technical Challenges

The proof-of-concept was built as a typical RAG (Retrieval Augmented Generation) application: the user’s question is used to retrieve candidate statistic snippets via similarity search, an LLM checks each candidate for relevance (reranking), a further LLM call generates the answer from the surviving snippets, and a final call rates the answer’s quality.

Upon initial analysis, the system was making 42 LLM calls per request—40 for reranking (asking whether each retrieved document was relevant) plus one for answer generation and one for quality rating. This appeared to be an obvious optimization target, but the team wisely implemented traceability before making any changes.
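For illustration, the POC’s per-document reranking loop might look like the sketch below. `llm_call` and the prompt wording are hypothetical stand-ins (here a canned stub so the example runs), not Statista’s actual code.

```python
# Hypothetical sketch of the POC's call pattern: one LLM call per retrieved
# snippet for reranking (40 calls), plus one for answer generation and one
# for a quality rating, for a total of 42 calls per request.

def llm_call(prompt: str) -> str:
    """Canned stand-in for a real LLM client call."""
    if "Is this snippet relevant" in prompt:
        return "YES" if "GDP" in prompt else "NO"
    if "Rate this answer" in prompt:
        return "4"
    return "Germany's GDP was about 3.9 trillion USD in 2020."

def rerank(query: str, snippets: list[str]) -> list[str]:
    relevant = []
    for snippet in snippets:  # 40 snippets -> 40 separate LLM calls
        verdict = llm_call(
            f"Question: {query}\nSnippet: {snippet}\n"
            "Is this snippet relevant? Answer YES or NO."
        )
        if verdict.strip().upper().startswith("YES"):
            relevant.append(snippet)
    return relevant

def answer_request(query: str, snippets: list[str]) -> tuple[str, str]:
    context = "\n".join(rerank(query, snippets))                  # calls 1-40
    answer = llm_call(f"Answer using:\n{context}\n\nQ: {query}")  # call 41
    rating = llm_call(f"Rate this answer 1-5:\n{answer}")         # call 42
    return answer, rating
```

The per-snippet loop is what makes the call count scale linearly with the number of retrieved documents.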

The Critical Role of Traceability

Before attempting any optimizations, the team implemented comprehensive traceability to understand what the application was actually doing at each step. This observability layer captured what was being sent to the LLM, the latency of each component, and the cost of each step. The investment proved crucial: the data revealed surprising insights about where the time and money were actually going.

This highlights a fundamental LLMOps principle: intuition about performance bottlenecks is often wrong, and proper instrumentation is essential before optimization.
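Such an instrumentation layer can be sketched as a decorator that records input, latency, and cost for every pipeline step; the step name and cost figure below are invented for illustration, and a real system would ship these records to a tracing backend rather than an in-memory list.

```python
import time
from functools import wraps

TRACE: list[dict] = []  # in-memory trace log for this sketch

def traced(step_name: str, cost_per_call: float = 0.0):
    """Record input, latency, and cost for each traced pipeline step."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            TRACE.append({
                "step": step_name,
                "input": args[0] if args else None,
                "latency_s": time.perf_counter() - start,
                "cost_usd": cost_per_call,
            })
            return result
        return wrapper
    return decorator

@traced("rerank", cost_per_call=0.0004)
def is_relevant(snippet: str) -> bool:
    return "revenue" in snippet  # placeholder for an LLM relevance check

for s in ["Apple revenue 2020", "Eiffel Tower height"]:
    is_relevant(s)

total_cost = sum(t["cost_usd"] for t in TRACE)
```

Aggregating the trace entries per step is what lets surprising bottlenecks surface before any optimization work begins.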

Establishing Quality Metrics and Baselines

With traceability in place, the team needed to define success metrics. They established a clear priority order: quality first, then cost, then latency. They also set specific thresholds: costs below 5 cents per request and latency below 30 seconds on average.

To measure quality, they needed a reference dataset—a ground truth of expected answers for given questions. Statista’s content experts created this dataset, enabling an “LLM-as-a-judge” evaluation approach where a language model compares system outputs against reference answers to score quality.
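A minimal LLM-as-a-judge loop might look like the sketch below. `judge_llm` is a stand-in (here a crude word-overlap stub so the example runs) for a real judging model, and the prompt wording is an assumption, not Statista’s actual prompt.

```python
def judge_llm(prompt: str) -> str:
    """Stand-in judge: scores by word overlap. A real judge is an LLM call."""
    lines = prompt.splitlines()
    ref = set(lines[1].split())    # "Reference answer: ..." line
    cand = set(lines[2].split())   # "Candidate answer: ..." line
    overlap = len(ref & cand) / max(len(ref), 1)
    return str(1 + round(4 * overlap))  # map overlap to a 1-5 score

def judge(question: str, reference: str, candidate: str) -> int:
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Score the candidate 1-5 for agreement with the reference."
    )
    return int(judge_llm(prompt))

# Ground-truth dataset as curated by content experts (toy example)
golden = [{"q": "How tall is the Eiffel Tower?", "ref": "About 330 metres."}]
scores = [judge(it["q"], it["ref"], "It is about 330 metres tall.") for it in golden]
avg_quality = sum(scores) / len(scores)
```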

The team built a test runner that could regenerate all three metrics (quality, cost, latency) on demand after each change. The initial baseline measurements were sobering, confirming that the prototype was far too slow and expensive for production.
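Such a test runner can be sketched roughly as follows; `run_pipeline` and `judge` are hypothetical stubs, and the thresholds in the comments restate the targets defined earlier.

```python
import time
from statistics import mean

def run_pipeline(question: str) -> tuple[str, float]:
    """Stand-in for the full RAG pipeline; returns (answer, cost in USD)."""
    return f"Answer to: {question}", 0.03

def judge(question: str, reference: str, answer: str) -> int:
    """Stand-in for the LLM-as-a-judge scorer (1-5)."""
    return 4

def run_evals(golden: list[dict]) -> dict:
    """Run every reference question and aggregate the three metrics."""
    scores, costs, latencies = [], [], []
    for item in golden:
        start = time.perf_counter()
        answer, cost = run_pipeline(item["q"])
        latencies.append(time.perf_counter() - start)
        costs.append(cost)
        scores.append(judge(item["q"], item["ref"], answer))
    return {
        "quality": mean(scores),       # priority 1: as high as possible
        "cost_usd": mean(costs),       # priority 2: target < $0.05/request
        "latency_s": mean(latencies),  # priority 3: target < 30 s on average
    }
```

Running this after every change is what made the 100-plus experiments described below comparable.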

Experimentation and Optimization

With metrics infrastructure in place, the team executed over 100 experiments in three weeks to optimize the system. The key insight driving their experimentation was that semantic similarity between user queries and statistic snippets is not naturally aligned—a question and its answer have different semantic structures.

Query Rewriting

The simplest approach was asking an LLM to rewrite user queries for better retrieval. For example, “How tall is the Eiffel Tower? It looked so high when I was there last year” becomes “What’s the height of the Eiffel Tower?” This removed noise and improved wording but showed no significant net quality improvement—some tests improved while others degraded.

Multi-Query Rewriting

The breakthrough came from generating multiple query variations. Complex questions often require multiple statistics to answer. For “Which company had more revenue 2015 to 2020, Apple or Microsoft?”, the system generates separate queries for Apple revenue, Microsoft revenue, and a comparison. Each query performs its own retrieval, and results are merged, capturing multiple aspects needed for comprehensive answers. This technique showed significant quality improvements with only marginal increases in latency and cost.
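The multi-query technique can be sketched as follows, with `llm` and `vector_search` as hypothetical stand-ins (returning canned values here) for the real decomposition and retrieval components.

```python
def llm(prompt: str) -> list[str]:
    """Stand-in: a real LLM would decompose the question into sub-queries."""
    return [
        "Apple annual revenue 2015-2020",
        "Microsoft annual revenue 2015-2020",
        "Apple vs Microsoft revenue comparison",
    ]

def vector_search(query: str, k: int = 5) -> list[str]:
    """Stand-in for similarity search over statistic snippets."""
    return [f"snippet for '{query}' #{i}" for i in range(k)]

def multi_query_retrieve(question: str, k: int = 5) -> list[str]:
    merged: dict[str, None] = {}          # dict preserves insertion order
    for sub_query in llm(f"Decompose into search queries: {question}"):
        for doc in vector_search(sub_query, k):
            merged.setdefault(doc, None)  # deduplicate across sub-queries
    return list(merged)
```

Each sub-query gets its own retrieval pass, so a comparison question pulls in statistics for every entity it mentions.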

HyDE (Hypothetical Document Embeddings)

The most innovative technique addressed the fundamental mismatch between questions and answers in embedding space. HyDE asks the LLM to generate a hypothetical answer—in this case, a fake statistic snippet that looks like real data. This fabricated answer (which may contain incorrect numbers) is then used for similarity search, because it’s semantically closer to actual statistic snippets than the original question. The actual retrieved documents contain the correct data.

This technique also showed significant quality improvements, though it changed which queries succeeded versus failed, indicating it was capturing different aspects than the multi-query approach.
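A minimal HyDE sketch, assuming hypothetical `llm` and `embed` helpers (the toy character-frequency embedding stands in for a real embedding model, and the canned snippet for a real generation):

```python
def llm(prompt: str) -> str:
    """Stand-in: a real LLM would fabricate a plausible statistic snippet."""
    return "Eiffel Tower height: 324 m (2015), structure, Paris, France"

def embed(text: str) -> list[float]:
    """Toy embedding: letter-frequency vector, for illustration only."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def hyde_query_vector(question: str) -> list[float]:
    # Generate a fake answer (numbers may be wrong) and embed THAT,
    # because it sits closer to real statistic snippets than the question.
    fake_snippet = llm(
        f"Write a plausible statistic snippet answering: {question}"
    )
    return embed(fake_snippet)  # use this vector for similarity search
```

Only the embedding comes from the fabricated text; the documents actually retrieved with it contain the correct numbers.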

Combined Approach

The final architecture combines all techniques: query rewriting, multi-query expansion, and HyDE run in parallel, with their retrievals merged. While this appears complex, the parallel execution minimizes latency impact. The latency savings from other optimizations were “reinvested” into HyDE, which is slower but improves quality.
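The parallel fan-out described above can be sketched with `asyncio`; the three retrieval stubs are hypothetical, and the short sleep simulates HyDE being the slow path that parallelism hides.

```python
import asyncio

async def retrieve_rewrite(q: str) -> list[str]:
    return ["doc-a", "doc-b"]

async def retrieve_multi_query(q: str) -> list[str]:
    return ["doc-b", "doc-c"]

async def retrieve_hyde(q: str) -> list[str]:
    await asyncio.sleep(0.01)  # HyDE is slower; running in parallel hides this
    return ["doc-d"]

async def combined_retrieve(q: str) -> list[str]:
    # All three retrieval strategies run concurrently; total latency is
    # roughly that of the slowest branch, not the sum of all branches.
    results = await asyncio.gather(
        retrieve_rewrite(q), retrieve_multi_query(q), retrieve_hyde(q)
    )
    merged: dict[str, None] = {}
    for docs in results:
        for doc in docs:
            merged.setdefault(doc, None)  # dedupe, keep first-seen order
    return list(merged)

docs = asyncio.run(combined_retrieve("example question"))
```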

Final Results

After three weeks of experimentation, the optimized system achieved a 140% improvement in answer quality, a 65% reduction in cost, and a 10% reduction in latency relative to the baseline.

These metrics brought the system to production readiness, meeting the defined thresholds.

Model Selection Considerations

The presentation also covered model selection as a critical optimization lever. Using a comparison framework from Artificial Analysis, the team demonstrated that available models vary widely in speed, cost, and output quality, and that the right choice depends on the task at hand.

A live demo showed the speed difference between models completing the same summarization task. The challenge is that no single model is optimal for all requests—some queries need sophisticated reasoning while others just require summarization. This led to the development of an “AI Router” product that dynamically selects models per request, reportedly saving customers 82% on costs.
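A routing layer of this kind might, in its simplest form, classify requests heuristically and pick the cheapest adequate model. The model names, prices, and keyword heuristic below are invented for illustration; a production router would likely use a learned classifier rather than keywords.

```python
# Hypothetical per-request model router: send reasoning-heavy queries to a
# capable (expensive) model and everything else to a fast, cheap one.

MODELS = {
    "small": {"cost_per_1k_tokens": 0.0002},  # fast, cheap: summarization
    "large": {"cost_per_1k_tokens": 0.0150},  # slow, capable: reasoning
}

REASONING_HINTS = ("compare", "why", "explain", "calculate", "versus")

def route(query: str) -> str:
    """Pick a model per request instead of one model for all traffic."""
    q = query.lower()
    if any(hint in q for hint in REASONING_HINTS):
        return "large"
    return "small"
```

The savings come from the fact that most traffic never needs the expensive model at all.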

Production Deployment and Business Results

The journey from technical production-readiness to market launch took approximately five months, requiring work on UX, go-to-market strategy, and pricing beyond the core AI system.

The final product, Research AI, presents answers with key facts extracted and visualized for quick comprehension, along with source links to the original statistics. The rollout followed a phased approach.

Post-launch metrics show growing adoption among paying customers.

The team continues to optimize, achieving response times down to 14-15 seconds with further cost reductions while maintaining quality. Importantly, they benchmark their quality against public models (ChatGPT, Gemini, Perplexity) to ensure competitive performance.

Ongoing Experimentation

Since most Statista traffic comes from Google searches landing on specific statistics pages (not the Research AI interface), the team is now A/B testing integration of AI-generated follow-up questions on these pages to drive deeper engagement. This represents a shift from search-initiated AI to context-aware AI recommendations.

Business Model Evolution

The presentation noted three emerging business model patterns for data providers in the AI era.

Statista is exploring all three approaches, with the optimal balance to be determined by user behavior and market dynamics.

Key LLMOps Lessons

The case study emphasizes several LLMOps best practices: instrument before optimizing, define success metrics with an explicit priority order, build a reference dataset to enable LLM-as-a-judge evaluation, iterate through many small measured experiments, and continuously benchmark against public models.
