ZenML

AI-Powered Data Copilot for Autonomous Analysis in IDEs

BlaBlaCar 2025

BlaBlaCar developed an AI-powered Data Copilot to address the inefficient workflow between Software Engineers and Data Analysts, where engineers lacked data warehouse access and analysts were overwhelmed with repetitive queries. The solution embeds an LLM-powered assistant directly in VS Code that connects to BigQuery, provides contextual business logic from curated queries, generates SQL and Python code with unit tests, and enables engineers to perform their own analyses with data health checks as guardrails. The tool leverages a "zero-infrastructure" RAG approach using VS Code's native capabilities and GitHub Copilot, treating analyses as code artifacts in pull requests that analysts review, resulting in faster question resolution (from weeks to minutes) and freeing analysts to focus on high-value modeling work.

Industry

Tech

Overview

BlaBlaCar, a ridesharing platform, built an internal “Data Copilot” to fundamentally reshape how their engineering organization interacts with data. The case study presents an interesting production LLM application that addresses organizational friction between Software Engineers (SWE) and Data Analysts (DA). The company identified that engineers possessed the analytical skills and domain context needed for data analysis but were blocked by unfamiliar tooling and organizational silos, while analysts were buried under repetitive “quick questions” that prevented them from doing higher-value work.

The solution represents a “shift left” philosophy borrowed from DevOps, moving data analysis closer to the point of feature development. Rather than building yet another text-to-SQL chatbot for business users, BlaBlaCar explicitly designed their tool for engineers, embedding it directly in their IDE (VS Code) where they already work. This approach treats data analysis as a code artifact subject to the same rigor as production code, complete with pull requests, unit tests, and peer review.

Technical Architecture and LLM Integration

The technical implementation is notable for its simplicity and clever reuse of existing infrastructure. BlaBlaCar describes this as a “zero-infrastructure RAG” approach that bypasses the complexity of vector databases or separate Model Context Protocol (MCP) servers. Instead, they built a lightweight Python script that bridges BigQuery and the IDE, exporting key context into standard text files (Markdown, SQL, JSON) directly within the project repository.
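The case study does not publish the bridging script itself, but the export step it describes — pulling table metadata out of the warehouse and writing it as plain, indexable text files inside the repository — can be sketched as follows. This is an illustrative sketch, not BlaBlaCar's code: the function names and the `(name, type, description)` tuple layout are assumptions, and the actual metadata fetch from BigQuery is elided.

```python
from pathlib import Path

def schema_to_markdown(table_name, fields):
    """Render a table schema as a Markdown table so the IDE can index it.

    `fields` is a list of (name, type, description) tuples; in the real
    tool these would come from the warehouse's metadata API.
    """
    lines = [
        f"# Table: {table_name}",
        "",
        "| Column | Type | Description |",
        "|---|---|---|",
    ]
    for name, dtype, desc in fields:
        lines.append(f"| {name} | {dtype} | {desc} |")
    return "\n".join(lines) + "\n"

def export_context(out_dir, tables):
    """Write one Markdown file per table into the workspace so the IDE
    indexes it alongside the code. `tables` maps table name -> fields."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for table_name, fields in tables.items():
        (out / f"{table_name}.md").write_text(schema_to_markdown(table_name, fields))
```

Because the output is ordinary files in the project, no serving component is needed: the editor's existing indexer does the rest.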

The RAG mechanism leverages VS Code’s native indexing capabilities. When files containing schema definitions, curated “golden queries,” and table samples are placed in the workspace, VS Code automatically indexes them. GitHub Copilot can then access this context through its built-in toolset. When an engineer asks a question like “How do I calculate monthly active users?”, the system triggers VS Code’s semantic search (#codebase) and literal string matching (#textSearch) to retrieve relevant documentation and inject it into the chat context. This transforms GitHub Copilot from a generic code completion tool into a domain-specific data analyst without requiring custom AI infrastructure.
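VS Code's retrieval internals are not exposed, but the literal-match pass (#textSearch) can be approximated with a toy ranker over the exported context files. The scoring below is an assumption for illustration, not the actual mechanism:

```python
def text_search(context_files, query):
    """Toy stand-in for a literal string-matching pass (akin to #textSearch):
    rank exported context documents by how many query terms they contain.
    `context_files` maps file name -> file contents.
    """
    terms = [t.lower() for t in query.split()]
    scored = []
    for name, text in context_files.items():
        lowered = text.lower()
        score = sum(1 for t in terms if t in lowered)
        if score:
            scored.append((score, name))
    # Highest score first; tie-break on name for determinism.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [name for _, name in scored]
```

The top-ranked files are what would be injected into the chat context, grounding the model's answer in curated documentation rather than its parametric knowledge.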

The architecture involves tunneling securely into BlaBlaCar’s BigQuery environment, removing the need for engineers to use the BigQuery Console directly. The tool has access to curated queries from production DBT models and verified reporting, as well as previews of tables that users have permissions to access. This contextual grounding is crucial for addressing the hallucination problem common in generic LLM assistants.
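The case study does not detail the access layer, and BigQuery's own IAM permissions remain the real enforcement, but a bridge like this would typically add a defensive pre-check before forwarding generated SQL to the warehouse. A minimal, deliberately crude read-only guard (entirely my illustration, not BlaBlaCar's code) might look like:

```python
import re

def is_read_only(sql):
    """Crude guard: accept only a single SELECT/WITH statement before a
    generated query is forwarded to the warehouse. Warehouse-side IAM is
    the actual enforcement; this is just an early, cheap sanity check."""
    # Strip comments so "-- DROP" in a comment doesn't confuse the check.
    stripped = re.sub(r"--[^\n]*", "", sql)
    stripped = re.sub(r"/\*.*?\*/", "", stripped, flags=re.S)
    statements = [s.strip() for s in stripped.split(";") if s.strip()]
    if len(statements) != 1:
        return False  # reject multi-statement payloads outright
    return statements[0].upper().startswith(("SELECT", "WITH"))
```

A check like this is intentionally conservative: it will reject some legitimate scripts, but for an analysis tool that should never mutate the warehouse, false negatives are the cheaper failure mode.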

Context and Business Logic Integration

One of the most critical aspects of the implementation is how the system handles business context. Generic AI assistants understand SQL syntax but lack knowledge of specific business definitions. BlaBlaCar addresses this by providing the LLM with access to curated query examples that encode institutional knowledge. When an engineer asks about “driver churn rate” or “search intent,” the Copilot doesn’t hallucinate a definition but retrieves the logic actually used by the Data Team in production.

This approach reflects a sophisticated understanding of the LLMOps challenge: the value isn’t just in generating syntactically correct SQL, but in generating queries that align with established business logic and definitions that may have evolved over time. The system bridges what the authors call the gap “between raw data and business reality.”

Data Quality and Safety Mechanisms

The case study describes an interesting approach to data quality through what they call a “Data Health Card.” This functions as a linter for analytical logic rather than just syntax. While a query can be syntactically perfect, it can still be analytically disastrous (for example, joining tables incorrectly or using deprecated fields). The Data Health Card runs heuristic checks that provide soft warnings, allowing engineers to move quickly while passively learning to identify bad data patterns without being blocked.
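BlaBlaCar does not publish the actual checks, but a sketch of what such heuristic, non-blocking rules might look like is below; the deprecated-field list and the specific rules are hypothetical:

```python
import re

# Hypothetical deprecated columns; in practice such a list would be
# maintained by the Data Team alongside the schema docs.
DEPRECATED_FIELDS = {"user_id_legacy", "trip_status_old"}

def data_health_card(sql):
    """Heuristic 'linter for analytical logic': returns soft warnings
    instead of blocking the query (a sketch, not BlaBlaCar's checks)."""
    warnings = []
    lowered = sql.lower()
    for field in DEPRECATED_FIELDS:
        if field in lowered:
            warnings.append(f"'{field}' is deprecated; check the current schema docs.")
    if re.search(r"\bselect\s+\*", lowered):
        warnings.append("SELECT * may pull deprecated or unvetted columns; list fields explicitly.")
    if re.search(r"\bjoin\b", lowered) and not re.search(r"\bon\b|\busing\b", lowered):
        warnings.append("JOIN without ON/USING behaves as a cross join; verify the join keys.")
    return warnings  # soft warnings only: the query still runs
```

The key design choice is the return type: a list of advisory messages, not an exception. The engineer sees the warning, learns the pattern, and keeps moving.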

This represents a pragmatic approach to guardrails in production LLM systems. Rather than attempting to prevent all errors through hard constraints (which would slow velocity), the system provides feedback that educates users over time while allowing them to proceed with appropriate caution. The balance between safety and velocity is a key consideration in production LLM deployments.

Code Artifacts and Transparency

Unlike traditional BI tools that hide logic behind drag-and-drop interfaces, the Data Copilot treats analyses as transparent artifacts generated through a composition of code and LLM reasoning. The system doesn’t just deliver static charts; it generates the raw SQL and Python code required to build them. This transparency is particularly valuable for power users who can “open the hood,” inspect the logic, and modify parameters as needed.

More significantly, every analysis is generated as a Python script with auto-generated unit tests (assertions). This transforms the cultural practice around data work. Instead of analyses being ephemeral screenshots pasted into Slack, they become version-controlled code artifacts. Engineers commit the scripts, and Data Analysts review them as pull requests. The reviewer sees not just a chart but the underlying code and passing tests, transforming the analyst’s role from “Query Factory” to “Reviewer and Guide.”
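The case study does not show a generated script, but an analysis artifact of this shape — a small computation plus auto-generated assertions that a reviewer can run — might look like the following. The metric and data layout are invented for illustration:

```python
"""Hypothetical example of a generated analysis artifact: the computation
and its auto-generated unit tests travel together in one reviewable file."""

def monthly_active_users(events):
    """Count distinct members per month from (member_id, 'YYYY-MM') rows."""
    per_month = {}
    for member_id, month in events:
        per_month.setdefault(month, set()).add(member_id)
    return {month: len(members) for month, members in per_month.items()}

# Auto-generated assertions: a reviewer sees not just a chart but a
# passing check on known inputs.
_sample = [("a", "2025-01"), ("b", "2025-01"), ("a", "2025-01"), ("a", "2025-02")]
assert monthly_active_users(_sample) == {"2025-01": 2, "2025-02": 1}
assert monthly_active_users([]) == {}
```

Because the assertions live in the script itself, a failing check shows up in the pull request rather than in a stakeholder's dashboard.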

Repository as Memory and Knowledge Accumulation

A particularly clever aspect of the system is how it addresses the common problem of “amnesiac workflows” in data analysis. Because analyses are treated as code and committed to a central repository, the Copilot can index every merged pull request. The repository effectively becomes the system’s long-term memory, creating a positive feedback loop where past work informs future queries.

This has several practical benefits. Engineers never start from zero when asking questions similar to previous ones, as the Copilot can surface earlier scripts as starting points. Old analyses can be refreshed with new data through simple prompts rather than requiring complete rewrites. Complex logic built by senior analysts becomes reusable modules for future queries. This represents a form of organizational learning encoded in the LLM system’s retrieval mechanism.

Production Deployment and Integration

The deployment model is interesting from an LLMOps perspective. Rather than building a standalone service, BlaBlaCar piggybacks on GitHub Copilot’s infrastructure and licensing. Users need a GitHub Copilot license with access to premium models to use the tool. This reduces operational overhead significantly, as the company doesn’t need to manage LLM serving infrastructure, handle scaling, or negotiate direct relationships with model providers.

The tool lives where engineers already work (VS Code), reducing adoption friction. The authentication and permissions model leverages existing BigQuery access controls, ensuring that engineers only see data they’re authorized to access. This integration with existing infrastructure and workflows is a key factor in the tool’s reported success.

Claims and Results Assessment

BlaBlaCar claims two major impacts: engineers achieving autonomy (questions answered in 10 minutes instead of sitting in a backlog for 3 weeks) and analysts becoming scalable (freed from support queues to focus on deep modeling). While these are compelling claims, the case study is promotional in nature and should be evaluated critically.

The reported velocity improvement (from weeks to minutes) is dramatic but likely reflects best-case scenarios. The comparison is between questions that would have required analyst intervention versus questions now handled autonomously. Not all data questions are equally amenable to this approach—complex analyses requiring deep statistical reasoning or ambiguous business requirements would still benefit from analyst involvement. The tool is positioned as a “Junior Analyst,” which appropriately sets expectations that it handles routine queries rather than sophisticated analytical work.

The cultural transformation claims around pull request reviews and data quality are compelling but would require longitudinal observation to fully validate. Changing established workflows and organizational norms typically requires sustained effort beyond tool deployment. The success likely depends heavily on management support, incentive alignment, and ongoing training.

Open Source Strategy

BlaBlaCar open-sourced a version of their Data Copilot on GitHub, which adds credibility to their case study and allows external validation of their approach. The open source version can connect to BigQuery sample datasets or custom data warehouses. This strategy is pragmatic from both a community-building and recruitment perspective, though the core innovation here is more architectural and organizational than algorithmic.

LLMOps Maturity and Considerations

From an LLMOps perspective, this case study demonstrates several mature practices:

Grounding and retrieval: The system addresses hallucination through careful context engineering, providing curated examples and schema information rather than relying on the base model’s parametric knowledge.

Integration with existing workflows: Rather than requiring users to adopt new tools, the solution embeds in existing IDEs and leverages familiar development practices (pull requests, code review, version control).

Transparency and debuggability: Generated queries are exposed as code, allowing inspection and modification. This is crucial for building trust in LLM outputs.

Incremental safety: The Data Health Card provides soft warnings rather than hard blocks, balancing safety with velocity.

Knowledge accumulation: The repository-as-memory approach creates a virtuous cycle where the system improves over time as more analyses are committed.

However, several LLMOps challenges are not deeply addressed in the case study:

Model evaluation and monitoring: There’s no discussion of how query quality is measured systematically, how often the LLM generates incorrect SQL, or what monitoring exists to detect degradation over time.

Prompt engineering evolution: The system presumably relies on carefully crafted prompts to generate SQL and Python code, but there’s no mention of how these prompts are versioned, tested, or evolved as business logic changes.

Cost management: Using GitHub Copilot’s premium models presumably involves per-user costs. At scale, this could become significant, though likely less than maintaining separate LLM infrastructure.

Failure modes: The case study doesn’t discuss what happens when the LLM generates subtly incorrect queries that pass superficial checks but produce wrong results. The Data Health Card provides some protection, but heuristics have limits.

Training and onboarding: While the tool is designed to be intuitive, effective use likely requires understanding both the data model and how to formulate questions appropriately. The case study doesn’t detail training programs or adoption metrics.

Broader Context: Data Mesh and Organizational Design

The case study situates this work within the broader “Data Mesh” movement, which emphasizes domain-oriented ownership of data products. By enabling engineers to answer their own questions, BlaBlaCar is operationalizing data mesh principles, treating data quality as an upstream engineering constraint rather than a downstream analytics problem.

The “ecotone” metaphor—borrowed from ecology to describe the productive interface between disciplines—is apt. The authors argue that LLMs change the economics of inhabiting interdisciplinary spaces. Previously, thriving at the boundary between engineering and analysis required being in the top 20% of both fields. LLMs lower this bar by handling translation and synthesis, allowing more people to work effectively at the interface.

This represents a broader trend in LLM applications: not replacing specialists but enabling non-specialists to perform competently in adjacent domains. The tool doesn’t eliminate the need for Data Analysts but shifts their work toward higher-leverage activities (reviewing complex analyses, designing KPIs, running A/B tests, improving the data platform).

Technical Simplicity as Strength

Perhaps the most striking aspect of this case study is how much value BlaBlaCar extracted from relatively simple technical components. They didn't build custom embedding models, fine-tune LLMs, or deploy complex orchestration systems. Instead, they exported warehouse context (schemas, curated queries, table samples) as plain text files in the repository, relied on VS Code's native indexing and GitHub Copilot for retrieval and generation, and reused existing BigQuery permissions and Copilot licensing.

This “zero-infrastructure” approach is both a strength and a limitation. It reduces operational complexity and accelerates time-to-value, but it also constrains customization. The system is bound by GitHub Copilot’s capabilities and limitations. If GitHub changes its API or pricing model, BlaBlaCar’s tool is affected. The retrieval mechanism relies on VS Code’s indexing, which may not scale optimally as context grows.

Nevertheless, for many organizations, especially those already using GitHub Copilot, this approach offers a compelling path to production LLM deployment with minimal infrastructure investment. The case study demonstrates that effective LLMOps doesn’t always require sophisticated tooling—sometimes clever integration of existing tools is sufficient.

Conclusion and Broader Implications

BlaBlaCar’s Data Copilot represents a thoughtful application of LLMs to an organizational problem. Rather than chasing the most advanced models or techniques, they identified a specific friction point (the boundary between engineering and data analysis) and applied LLMs strategically to reduce that friction. The solution demonstrates mature LLMOps thinking around grounding, transparency, integration, and knowledge accumulation.

The claims should be evaluated with appropriate skepticism given the promotional nature of the content, but the technical approach is sound and the open source release allows external validation. The case study is most valuable as an example of how production LLM systems can be built pragmatically by leveraging existing infrastructure and applying software engineering discipline to LLM outputs. The “shift left” philosophy and treatment of analyses as code artifacts offer a replicable pattern for other organizations facing similar challenges around data democratization and analyst scalability.
