ZenML

Natural Language Interface to Business Intelligence Using RAG

Volvo 2024

Volvo implemented a Retrieval Augmented Generation (RAG) system that allows non-technical users to query business intelligence data through a Slack interface using natural language. The system translates natural language questions into SQL queries for BigQuery, executes them, and returns results - effectively automating what was previously manual work done by data analysts. The system leverages DBT metadata and schema information to provide accurate responses while maintaining control over data access.

Industry

Automotive

Overview

This case study emerges from a podcast conversation featuring Jesper Fikson, an AI Engineer at Volvo, who discusses practical implementations of LLMs in production environments. The primary focus is on a Retrieval Augmented Generation (RAG) system built to automate data analyst workflows at Volvo Car Mobility, a subsidiary of Volvo Cars that operates car-sharing services. The conversation provides valuable insights into the evolution from data science to AI engineering, the practical challenges of deploying LLM-based systems, and the trajectory from simple RAG implementations toward more sophisticated autonomous agents.

Company Context and Role Evolution

Jesper works in a unique position split between two domains: 50% as a data scientist for the car-sharing service optimizing algorithms, and 50% working on generative AI initiatives across Volvo Cars. The broader organization of “Commercial Digital” comprises approximately 1,500 people within the larger Volvo Cars structure.

A significant theme in the discussion is the evolution of roles in AI/ML organizations. Jesper makes a strong case for the distinction between data scientists and AI engineers, arguing that while data science focuses on building knowledge and proof of value (often in notebooks), AI engineering focuses on productionizing solutions and creating real business value. He notes that around 2022, the industry began shifting more toward product-focused, engineering-centric approaches. This observation aligns with broader industry trends where many data science POCs never reach production, highlighting the critical importance of engineering skills in LLMOps.

The Problem: Ad-Hoc Data Requests

The specific use case addresses a common pain point in data teams: business stakeholders frequently post questions to Slack channels asking about operational metrics like “How many journeys did we have yesterday?” or “How many users signed up?” These questions, while valuable to the organization, require data team members to drop their current work, write SQL queries, and return results. This creates significant context-switching overhead and reduces time available for higher-value analytical work.

Technical Solution Architecture

The solution is a Slack bot that enables non-technical users to ask natural language questions and receive data-driven answers automatically. The sections that follow walk through the key pieces of the technical pipeline.
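The overall flow can be sketched as: build a prompt from the schema context and the user's question, ask the LLM for SQL, run the SQL, and post the result back. This is a minimal sketch, not Volvo's implementation; the function names, message format, and table names are illustrative, and the LLM and BigQuery clients are injected as callables so the shape of the pipeline is visible without live services.

```python
# Sketch of the question -> SQL -> answer pipeline. All names are
# illustrative assumptions, not details from the case study.

def build_prompt(question: str, schema_context: str) -> list[dict]:
    """Assemble chat messages: schema plus metadata as system context,
    the user's Slack question as the user turn."""
    system = (
        "You translate business questions into BigQuery SQL.\n"
        "Only use the tables and columns described below.\n\n"
        f"{schema_context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

def answer_question(question: str, schema_context: str, llm, run_query) -> str:
    """llm: callable(messages) -> SQL string.
    run_query: callable(sql) -> rows (BigQuery in production).
    Returns the text that would be posted back to the Slack thread."""
    sql = llm(build_prompt(question, schema_context))
    rows = run_query(sql)
    return f"{sql}\n{rows}"
```

In production the `llm` callable would wrap a chat-completions API call and `run_query` a BigQuery client; injecting them keeps the pipeline testable.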

Context Window Management

One of the most interesting technical insights relates to context window management. When GPT-4 Turbo’s 128K token context window was released (during OpenAI’s developer day), Jesper realized he could simply include the entire database schema in the prompt, without semantic search over a vector database. This represents a pragmatic engineering decision: the schema files were small enough to fit entirely within the expanded context, eliminating architectural complexity.

However, Jesper notes important limitations around the 128K context window. While the hard limit is 128K tokens, effective usage is considerably lower; he suggests staying below roughly 40-50K tokens. This aligns with research on LLMs “forgetting” information in the middle of long contexts, with better retention at the beginning and end of prompts.
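A simple way to act on that advice is to check the prompt size against a self-imposed budget before inlining the full schema. The sketch below uses the common rough heuristic of about four characters per token rather than an exact tokenizer, and the 45K budget is an assumption drawn from the 40-50K guidance above.

```python
# Guard against overlong prompts. The ~4-chars-per-token ratio is a crude
# estimate; swap in a real tokenizer (e.g. tiktoken) for precision.

TOKEN_BUDGET = 45_000  # stay well under the 128K hard limit

def estimate_tokens(text: str) -> int:
    """Rough token count using the ~4 characters per token heuristic."""
    return len(text) // 4

def fits_in_budget(schema_context: str, question: str) -> bool:
    """True if the full schema can be inlined, i.e. no retrieval step needed."""
    return estimate_tokens(schema_context) + estimate_tokens(question) <= TOKEN_BUDGET
```

When the check fails, that is the signal to fall back to retrieval over schema fragments instead of inlining everything.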

DBT Integration for Metadata

The solution leverages DBT (Data Build Tool) as a key component. DBT provides more than just schema information: its YAML files carry model and column documentation, including human-written descriptions and accepted values for categorical columns.

This metadata is crucial because ChatGPT needs to understand not just the technical schema but also business semantics. For example, knowing that a column contains values like “B2B” or “B2C” helps the model generate accurate queries. Without this contextual information, the LLM would struggle to correctly reference specific values in WHERE clauses.
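The metadata-to-prompt step can be sketched as rendering each DBT model's parsed YAML into plain text for the system prompt. The dict below mirrors the general shape of a DBT `schema.yml` after YAML parsing (e.g. with PyYAML); the model and column names are made up for illustration, not taken from Volvo's schema.

```python
# Turning DBT model metadata into prompt context. Illustrative data only.

DBT_MODEL = {
    "name": "journeys",
    "description": "One row per completed car-sharing journey.",
    "columns": [
        {"name": "journey_id", "description": "Primary key."},
        {"name": "customer_type",
         "description": "Segment of the booking customer.",
         "accepted_values": ["B2B", "B2C"]},
    ],
}

def model_to_context(model: dict) -> str:
    """Render one DBT model as plain text the LLM can ground its SQL in.
    Accepted values matter: without them the model cannot write a correct
    WHERE customer_type = 'B2C' clause."""
    lines = [f"Table {model['name']}: {model['description']}"]
    for col in model["columns"]:
        line = f"  - {col['name']}: {col.get('description', '')}"
        if "accepted_values" in col:
            line += f" (values: {', '.join(col['accepted_values'])})"
        lines.append(line)
    return "\n".join(lines)
```

Concatenating this rendering across all models yields the schema context that gets inlined into the prompt.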

Semantic Search Considerations

Jesper mentions that they initially experimented with semantic search (embeddings-based retrieval) to find relevant parts of the schema based on question similarity. However, with the expanded context window, this became unnecessary for their use case. The system stores context “on file” rather than in a vector database, demonstrating that not every RAG implementation requires vector databases—the architecture should match the scale of the data.

Production Status and Results

The system is described as being “in production” at Volvo Car Mobility. The practical benefit is that non-technical users can ask natural language questions about car counts, user signups, journey statistics, and other operational metrics without requiring data team intervention. Jesper describes the experience of seeing it work as “like magic.”

The solution acknowledges current limitations—it returns tabular data rather than visualizations, though this is noted as a potential future enhancement.

From RAG to Autonomous Agents

The conversation extends beyond simple RAG to discuss the trajectory toward autonomous agents. Jesper frames this evolution as a spectrum of increasing capability, from retrieving and answering to planning and acting.

The Slack bot example is positioned as a very limited agent—it does take action by executing queries against BigQuery—but true autonomous agents would require more sophisticated planning and broader action capabilities.
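Because the bot's single "action" is executing a query, and the case study stresses maintaining control over data access, a natural safeguard is to gate that action on the SQL being a plain read. The check below is a minimal sketch under that assumption, not a full SQL parser and not Volvo's actual guard.

```python
# Reject generated SQL that is not a plain read before executing it.
# A keyword-based sketch; a production system would use a real SQL parser
# and BigQuery's own access controls as the primary line of defense.

FORBIDDEN = ("insert", "update", "delete", "drop", "create", "alter",
             "truncate", "merge", "grant")

def is_read_only(sql: str) -> bool:
    """Allow only SELECT/WITH statements and reject mutating keywords."""
    stripped = sql.strip().strip(";").lower()
    if not (stripped.startswith("select") or stripped.startswith("with")):
        return False
    tokens = stripped.replace("(", " ").replace(")", " ").split()
    return not any(tok in FORBIDDEN for tok in tokens)
```

A broader agent would need this kind of gating for every tool it can invoke, which is part of why the jump from "limited agent" to autonomous agent is nontrivial.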

The Rabbit R1 and Large Action Models

The discussion references the Rabbit R1 device as an interesting development in the agent space. Unlike traditional LLMs trained on next-word prediction, Rabbit claims to train a “Large Action Model” on interactions with computer interfaces. This represents a different training paradigm focused on learning action trajectories rather than text generation.

Engineering Philosophy and Pragmatism

Throughout the conversation, Jesper emphasizes pragmatic engineering over theoretical purity: match the architecture to the scale of the data, and prefer the simplest design that works, as illustrated by inlining the schema instead of standing up a vector database.

Context on Voice Interfaces and Accessibility

An interesting aside in the conversation covers using ChatGPT’s voice mode to help Jesper’s 76-year-old father who is going blind due to retinal detachment. This demonstrates LLM applications beyond traditional enterprise use cases—voice interfaces enabling access for users who cannot type or read screens. The multimodal capabilities (voice input/output, image description) represent emerging productionization opportunities.

Broader LLMOps Observations

The discussion touches on several broader LLMOps themes, from the shift toward engineering-centric roles to context window management and the gradual move from RAG toward agents.

Critical Assessment

While the case study presents a successful production deployment, some caveats apply: the account comes from a podcast conversation rather than a formal write-up, no quantitative results are shared, and the output is currently limited to tabular data.

The solution represents a practical, achievable LLMOps implementation that delivers tangible business value while acknowledging the current limitations of the technology.
