## Overview
Uber's case study presents a comprehensive LLMOps implementation centered around Genie, an internal on-call copilot designed to support thousands of engineering queries across multiple Slack channels. The system specifically focuses on engineering security and privacy domains, where accuracy and reliability are paramount. This case study demonstrates how Uber evolved from a traditional RAG architecture to an enhanced agentic RAG (EAg-RAG) approach to achieve near-human precision in automated responses.
The business context is particularly compelling: Uber needed to scale support for domain-specific engineering queries while reducing the burden on subject matter experts (SMEs) and on-call engineers. The challenge was ensuring that an LLM-powered system could provide accurate, reliable guidance in critical areas like security and privacy, where incorrect information could have serious consequences.
## Initial Implementation and Challenges
Uber's initial approach used a traditional RAG framework that integrated with nearly all internal knowledge sources, including engineering wikis, Terrablob PDFs, Google Docs, and custom documents. The system supported the full RAG pipeline: document loading, processing, vector storage, retrieval, and answer generation. Despite this sophisticated infrastructure, the system faced significant accuracy problems when tested against a golden set of 100+ queries curated by SMEs.
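For orientation, here is a minimal sketch of that kind of traditional RAG pipeline using common LangChain components. This is not Uber's Langfx implementation; the loader, models, chunk sizes, and file path are illustrative placeholders.

```python
# Minimal traditional RAG pipeline of the kind described above.
# Not Uber's implementation; all parameters are placeholders.
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Load and chunk source documents (wikis, PDFs, Google Docs exports).
docs = PyPDFLoader("policies/data_retention.pdf").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
).split_documents(docs)

# 2. Embed the chunks and index them in a vector store.
index = FAISS.from_documents(chunks, OpenAIEmbeddings())

# 3. Retrieve top-k chunks for a query and generate an answer.
query = "How long can we retain driver location data?"
retrieved = index.as_retriever(search_kwargs={"k": 5}).invoke(query)
context = "\n\n".join(d.page_content for d in retrieved)
answer = ChatOpenAI(model="gpt-4o-mini").invoke(
    f"Answer using only this context:\n{context}\n\nQuestion: {query}"
)
print(answer.content)
```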
The evaluation revealed critical gaps in response quality. Many answers were incomplete, inaccurate, or failed to retrieve relevant information with sufficient detail from the knowledge base. SMEs determined that the response quality didn't meet the standards required for broader deployment across critical security and privacy Slack channels. This assessment highlighted a crucial LLMOps challenge: the gap between technical sophistication and real-world production requirements where accuracy and reliability are non-negotiable.
## Enhanced Document Processing
One of the most significant technical innovations in Uber's approach involved fundamentally reimagining document processing. The team discovered that existing PDF loaders failed to correctly capture structured text and formatting, particularly for complex tables spanning multiple pages with nested cells. Traditional PDF loaders like SimpleDirectoryLoader from LlamaIndex and PyPDFLoader from LangChain resulted in extracted text that lost original formatting, causing table cells to become isolated and disconnected from their contextual meaning.
To address this challenge, Uber transitioned from PDFs to Google Docs, leveraging HTML formatting for more accurate text extraction. They built a custom Google document loader using the Google Python API that recursively extracted paragraphs, tables, and the table of contents. For tables and structured text such as bullet points, they added an LLM-powered enrichment step that converted extracted table content into properly formatted Markdown tables.
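The snippet below is a hedged sketch of what such a recursive extractor can look like against the Google Docs API (`documents().get`), walking paragraphs, tables, and the table of contents. Uber's actual loader is not public; the `extract_text` helper, the credential setup, and the pipe-delimited table rendering are assumptions for illustration.

```python
from googleapiclient.discovery import build

def extract_text(elements):
    """Recursively walk Docs API structural elements."""
    parts = []
    for el in elements:
        if "paragraph" in el:
            for pe in el["paragraph"].get("elements", []):
                parts.append(pe.get("textRun", {}).get("content", ""))
        elif "table" in el:
            # Recurse into each cell so nested content is preserved;
            # rendering rows as pipe-delimited text is a simplification.
            for row in el["table"].get("tableRows", []):
                cells = [extract_text(cell.get("content", []))
                         for cell in row.get("tableCells", [])]
                parts.append("| " + " | ".join(c.strip() for c in cells) + " |\n")
        elif "tableOfContents" in el:
            parts.append(extract_text(el["tableOfContents"].get("content", [])))
    return "".join(parts)

# creds: a Google credentials object with Docs read scope (setup omitted).
docs_service = build("docs", "v1", credentials=creds)
doc = docs_service.documents().get(documentId="YOUR_DOC_ID").execute()
text = extract_text(doc.get("body", {}).get("content", []))
```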
The document enrichment process went beyond basic extraction. The team enriched metadata with custom attributes including document summaries, FAQ sets, and relevant keywords. These metadata attributes served dual purposes: some were used in precursor or post-processing steps to refine extracted context, while others like FAQs and keywords were directly employed in the retrieval process to enhance accuracy. This approach demonstrates sophisticated LLMOps thinking where data quality and preprocessing significantly impact downstream model performance.
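A sketch of what that enrichment step might look like: an LLM is prompted to produce a summary, FAQ pairs, and keywords for each document, which are attached as chunk metadata. The `enrich_document` helper, the prompt wording, and the JSON schema are hypothetical.

```python
import json
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

ENRICH_PROMPT = """Given the policy document below, return JSON with:
- "summary": a 2-3 sentence summary
- "faqs": a list of 3 objects with "q" and "a" fields
- "keywords": up to 10 domain keywords (strings)
Return only the JSON object.

Document:
{text}"""

def enrich_document(doc):
    """Attach LLM-generated summary/FAQs/keywords as metadata (hypothetical)."""
    raw = llm.invoke(ENRICH_PROMPT.format(text=doc.page_content[:8000]))
    meta = json.loads(raw.content)  # assumes the model returns valid JSON
    doc.metadata.update(meta)       # enriched fields now travel with the chunk
    return doc

chunks = [enrich_document(d) for d in chunks]
```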
## Agentic RAG Architecture
The core innovation in Uber's approach was the transition to an agentic RAG architecture that introduced LLM-powered agents to perform pre- and post-processing steps. The system implemented several specialized agents: a Query Optimizer, a Source Identifier, and a Post-Processor. The Query Optimizer refined ambiguous queries and broke complex queries down into simpler components for better retrieval. The Source Identifier took the optimized query and narrowed down the subset of policy documents most likely to contain relevant answers.
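The exact prompts Uber uses are not public, but the division of labor can be illustrated with two hypothetical prompt templates and thin wrappers around an LLM client (reusing the `llm` object from the earlier sketch):

```python
# Hypothetical prompt templates for the two pre-processing agents.
OPTIMIZER_PROMPT = """Rewrite the user's question so it is self-contained and
unambiguous. If it bundles several questions, split it into a numbered list
of simpler sub-questions.

Question: {query}"""

SOURCE_ID_PROMPT = """Given the question and this catalog of policy documents
(title plus summary), return the document IDs most likely to contain the
answer.

Question: {query}
Catalog:
{catalog}"""

def optimize_query(llm, query):
    return llm.invoke(OPTIMIZER_PROMPT.format(query=query)).content

def identify_sources(llm, query, catalog):
    return llm.invoke(SOURCE_ID_PROMPT.format(query=query, catalog=catalog)).content
```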
The retrieval process itself was enhanced with a hybrid approach combining traditional vector search with BM25-based retrieval. The BM25 retriever leveraged enriched metadata including summaries, FAQs, and keywords for each chunk. The final retrieval output represented the union of results from both approaches, which was then processed by the Post-Processor Agent for de-duplication and contextual structuring.
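A sketch of that hybrid union, assuming the enriched metadata fields (`summary`, `faqs`, `keywords`) from the earlier enrichment sketch plus a `chunk_id` added at indexing time; for brevity, the de-duplication that Uber assigns to the Post-Processor agent is folded into the retrieval helper here. `BM25Retriever` requires the `rank_bm25` package.

```python
from langchain_community.retrievers import BM25Retriever

def bm25_text(doc):
    """Combine chunk text with enriched metadata so keyword matches on
    summaries, FAQs, or keywords also surface the chunk."""
    m = doc.metadata
    faqs = " ".join(f"{f.get('q', '')} {f.get('a', '')}" for f in m.get("faqs", []))
    return " ".join([doc.page_content, m.get("summary", ""), faqs,
                     " ".join(m.get("keywords", []))])

bm25 = BM25Retriever.from_texts(
    [bm25_text(d) for d in chunks],
    metadatas=[d.metadata for d in chunks], k=5)
vector = index.as_retriever(search_kwargs={"k": 5})

def hybrid_retrieve(query):
    hits = bm25.invoke(query) + vector.invoke(query)  # union of both retrievers
    seen, unique = set(), []
    for d in hits:
        key = d.metadata.get("chunk_id")  # assumed unique per chunk
        if key not in seen:
            seen.add(key)
            unique.append(d)
    return unique
```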
This agentic approach addressed subtle distinctions within domain-specific content that simple semantic similarity couldn't capture. In Uber's security and privacy domain, document chunks often contained nuanced variations in data retention policies, classification schemes, and sharing protocols across different personas and geographies. The agentic framework provided the flexibility to handle these complexities while maintaining modularity for future enhancements.
## Evaluation and Testing Infrastructure
A critical aspect of Uber's LLMOps implementation was the development of automated evaluation using an LLM-as-Judge framework. The team addressed two key challenges that plague many production LLM systems: heavy SME involvement in evaluation and slow feedback cycles, in which assessing an improvement previously took weeks. The automated evaluation system cut experiment evaluation time from weeks to minutes, enabling faster iteration and more effective experimentation.
The LLM-as-Judge system operated through a three-stage process: a one-time manual SME review to establish quality standards, batch generation of chatbot responses, and automated LLM evaluation using the user queries, SME reference responses, and evaluation instructions as context. The judge scored responses on a 0-5 scale and provided reasoning for each score, enabling that feedback to be incorporated into future experiments.
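A minimal sketch of such a judge, assuming the golden set is available as (question, SME reference) pairs and the chatbot is callable as `bot(question)`; the rubric wording and JSON schema are illustrative, not Uber's evaluation instructions.

```python
import json
from langchain_openai import ChatOpenAI

JUDGE_PROMPT = """You are grading an on-call chatbot. Using the SME reference
answer as ground truth, score the candidate answer from 0 (harmfully wrong)
to 5 (complete and correct). Return JSON:
{{"score": <0-5>, "reason": "<one sentence>"}}

Question: {question}
SME reference answer: {reference}
Candidate answer: {candidate}"""

judge = ChatOpenAI(model="gpt-4o", temperature=0)

def evaluate(question, reference, candidate):
    raw = judge.invoke(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    return json.loads(raw.content)  # assumes well-formed JSON output

# Batch-run over the golden set and aggregate; bot/golden_set are assumed.
results = [evaluate(q, ref, bot(q)) for q, ref in golden_set]
avg_score = sum(r["score"] for r in results) / len(results)
```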
This evaluation infrastructure represents sophisticated LLMOps practice where automated testing and validation systems are essential for maintaining production quality. The ability to rapidly iterate and test improvements while maintaining quality standards demonstrates mature LLMOps thinking that balances automation with human oversight.
## Technical Implementation Details
Uber built the agentic RAG framework using Langfx, their internal LangChain-based service within Michelangelo, their machine learning platform. For agent development and workflow orchestration, they used LangGraph from LangChain, which provides a scalable, developer-friendly framework for agentic AI workflows. While the current implementation follows a sequential flow, the LangGraph integration allows for future expansion into more complex agentic frameworks.
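A sketch of how that sequential flow can be wired with LangGraph's `StateGraph`, reusing the illustrative helpers (`llm`, `catalog`, `optimize_query`, `identify_sources`, `hybrid_retrieve`) from the sketches above:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class GenieState(TypedDict):
    query: str
    sources: str
    context: str
    answer: str

def optimize(state: GenieState) -> dict:
    return {"query": optimize_query(llm, state["query"])}

def identify(state: GenieState) -> dict:
    return {"sources": identify_sources(llm, state["query"], catalog)}

def retrieve(state: GenieState) -> dict:
    docs = hybrid_retrieve(state["query"])  # source filtering omitted for brevity
    return {"context": "\n\n".join(d.page_content for d in docs)}

def post_process(state: GenieState) -> dict:
    answer = llm.invoke(
        f"Context:\n{state['context']}\n\nQuestion: {state['query']}")
    return {"answer": answer.content}

graph = StateGraph(GenieState)
for name, fn in [("optimize", optimize), ("identify", identify),
                 ("retrieve", retrieve), ("post_process", post_process)]:
    graph.add_node(name, fn)

# Sequential flow, matching the current pipeline described above.
graph.add_edge(START, "optimize")
graph.add_edge("optimize", "identify")
graph.add_edge("identify", "retrieve")
graph.add_edge("retrieve", "post_process")
graph.add_edge("post_process", END)

app = graph.compile()
result = app.invoke({"query": "Can we share rider PII with a third-party vendor?"})
```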
The system's modularity was designed as configurable components within the Michelangelo Genie framework, making them easily adoptable by other domain teams across Uber. This architectural decision demonstrates good LLMOps practices where reusability and scalability are built into the system design from the beginning.
## Production Results and Business Impact
The enhanced agentic RAG system achieved substantial improvements: a 27% relative improvement in acceptable answers and a 60% relative reduction in incorrect advice. These metrics enabled deployment across multiple security and privacy help channels, providing real-time responses to common user queries. The system measurably reduced support load for on-call engineers and SMEs, allowing them to focus on more complex and high-value tasks.
The business impact extended beyond immediate efficiency gains. By demonstrating that better-quality source documentation enables improved bot performance, the system encouraged teams to maintain more accurate and useful internal documentation. This creates a virtuous cycle where improved documentation quality enhances system performance, which in turn incentivizes better documentation practices.
## Operational Considerations and Lessons Learned
While the case study presents impressive results, it's important to note several operational considerations. The system required significant investment in custom document processing infrastructure and specialized agent development. The transition from PDF to Google Docs, while beneficial for extraction quality, required organizational changes in how documentation was created and maintained.
The agentic approach introduced additional complexity in system architecture and debugging. Each agent represents a potential point of failure, and the sequential processing approach could introduce latency concerns at scale. The team's acknowledgment of future needs for iterative Chain-of-RAG approaches and self-critique agents suggests ongoing complexity management challenges.
## Future Development and Scalability
Uber's roadmap indicates several areas for continued development. They plan to extend their custom Google Docs plugin to support multi-modal content including images. The team is considering iterative Chain-of-RAG approaches for multi-hop reasoning queries and self-critique agents for dynamic response refinement. They're also exploring tool-based architectures where LLM-powered agents can choose appropriate tools based on query type and complexity.
This forward-looking approach demonstrates mature LLMOps thinking where current solutions are viewed as foundations for future capabilities rather than final implementations. The emphasis on flexibility and modularity suggests an understanding that production LLM systems require continuous evolution and adaptation.
The case study represents a sophisticated example of LLMOps implementation that goes beyond basic RAG deployment to address real-world production challenges through thoughtful architecture, rigorous evaluation, and systematic quality improvement. While the results are impressive, the approach required significant engineering investment and organizational changes, highlighting the complexity of deploying high-quality LLM systems in production environments.