Company: Uber
Title: LLM-Powered Data Labeling Quality Assurance System
Industry: Tech
Year: 2025

Summary (short):
Uber AI Solutions developed a production LLM-based quality assurance system called Requirement Adherence to improve data labeling accuracy for their enterprise clients. The system addresses the costly and time-consuming problem of post-labeling rework by identifying quality issues during the labeling process itself. It works in two phases: first extracting atomic rules from client Standard Operating Procedure (SOP) documents using LLMs with reflection capabilities, then performing real-time validation during the labeling process by routing different rule types to appropriately sized models with optimization techniques like prefix caching. This approach resulted in an 80% reduction in required audits, significantly improving timelines and reducing costs while maintaining data privacy through stateless, privacy-preserving LLM calls.
## Overview

Uber AI Solutions has deployed a sophisticated LLM-powered quality assurance system in their production data labeling operations. This case study provides valuable insights into how LLMs can be operationalized to solve real-world quality control challenges in data annotation workflows. The system, named Requirement Adherence, is integrated into Uber's proprietary labeling tool (uLabel) and serves their entire enterprise client base across diverse industries including autonomous vehicles, healthcare, retail, and finance.

The fundamental challenge this system addresses is common in data labeling operations: traditional quality control happens after labeling is complete, either through post-labeling checks or inter-annotator agreement measures. While effective at catching errors, this approach means mislabeled data must be sent back for expensive rework, creating delays and poor client experiences. The team's insight was to shift quality control left in the process—catching issues during labeling rather than after—but doing so in a way that could scale across diverse client requirements without custom engineering for each project.

## Technical Architecture and Design Decisions

The production system operates through a carefully designed two-phase architecture: rule extraction and in-tool validation. This separation of concerns represents a thoughtful LLMOps approach that optimizes for both accuracy and operational efficiency.

### Phase 1: Rule Extraction from SOP Documents

The rule extraction phase processes Standard Operating Procedure documents that contain client requirements. The team made several pragmatic design choices here. First, they convert SOP documents to markdown format to optimize LLM processing—a small but important preprocessing step that likely improves parsing reliability. They discovered through experimentation that attempting to enforce all requirements in a single LLM call led to hallucinations and missed enforcements, a finding that aligns with broader industry learnings about LLM task complexity. Their solution was to decompose requirements into atomic, unambiguous, and self-contained rules, with each rule extracted individually.

A particularly sophisticated aspect of their approach is the classification of rules into four complexity tiers: formatting checks (handled with regex), deterministic checks (requiring specific words or exact matches), subjective checks (requiring reasoning based on input), and complex subjective checks (involving intricate logical flows). This taxonomy enables intelligent routing in the validation phase—a key LLMOps pattern for balancing accuracy, latency, and cost.

The team implements reflection capabilities in the rule extraction logic, allowing the LLM to analyze SOP text, generate structured JSON output, review its own work, and make corrections. This self-correction mechanism improves reliability but presumably adds latency and cost to the offline rule extraction phase—a reasonable tradeoff since this happens once per project rather than per labeling task.

Importantly, the system includes a human-in-the-loop step where operators can manually add rules to cover gaps or add requirements not in the SOP. This pragmatic acknowledgment that LLMs won't perfectly extract all rules demonstrates mature production thinking. Another LLM call then validates manual rules for format compliance and checks for overlap with auto-extracted rules, suggesting changes to create a final consolidated rule set.
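The article stays at the architecture level, but the extraction flow is easy to sketch. The snippet below is a minimal illustration of the four-tier taxonomy and the reflection pass, assuming a generic synchronous `call_llm(prompt)` completion function; the class names, prompts, and JSON field names are hypothetical, not Uber's implementation.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class RuleTier(str, Enum):
    FORMATTING = "formatting"                   # regex/code check, no LLM needed
    DETERMINISTIC = "deterministic"             # exact words or matches, small model
    SUBJECTIVE = "subjective"                   # reasoning over the submission
    COMPLEX_SUBJECTIVE = "complex_subjective"   # intricate logic, may add reflection


@dataclass
class Rule:
    rule_id: str
    text: str       # atomic, unambiguous, self-contained requirement
    tier: RuleTier


EXTRACTION_PROMPT = (
    "Extract every requirement from the SOP below as atomic, self-contained rules. "
    "Return a JSON list of objects with fields rule_id, text, and tier "
    "(formatting | deterministic | subjective | complex_subjective).\n\nSOP (markdown):\n"
)

REFLECTION_PROMPT = (
    "Review the extracted rules against the SOP. Split compound rules, remove "
    "ambiguity, add anything missed, and return only the corrected JSON list.\n\n"
)


def extract_rules(sop_markdown: str, call_llm: Callable[[str], str]) -> list[Rule]:
    """One extraction pass, then a reflection pass in which the model reviews
    and corrects its own output before anyone labels against the rules."""
    draft = call_llm(EXTRACTION_PROMPT + sop_markdown)
    revised = call_llm(
        REFLECTION_PROMPT + "SOP:\n" + sop_markdown + "\n\nRules:\n" + draft
    )
    return [Rule(r["rule_id"], r["text"], RuleTier(r["tier"])) for r in json.loads(revised)]
```

In the production system a further LLM call reconciles operator-added rules with the auto-extracted set and flags overlaps; that consolidation step is omitted here for brevity.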
### Phase 2: Real-Time In-Tool Validation

The validation phase showcases several production LLMOps optimizations. Rather than a monolithic validation call, the system performs one validation call per rule, with calls running in parallel. This architectural choice significantly reduces latency in the feedback loop for labelers—critical for user experience in an interactive tool. The team leverages prefix caching to further reduce latency, a specific optimization technique where common prompt prefixes are cached to avoid recomputation across similar requests.

The intelligent routing strategy based on rule complexity is particularly noteworthy from an LLMOps perspective. Formatting checks bypass LLMs entirely and use code-based validation. Deterministic checks use lightweight, non-reasoning models for speed and cost efficiency. Subjective checks employ more powerful reasoning models, while complex subjective checks may add reflection or self-consistency mechanisms. This tiered approach represents sophisticated production thinking about the accuracy-latency-cost tradeoff space. The team explicitly notes that this "intelligent routing ensures that we apply the right computational power to the right problem, maximizing both accuracy and cost-efficiency."

Beyond binary pass/fail validation, the system provides actionable suggestions to labelers on how to fix quality issues—a user experience enhancement that likely contributes to the system's adoption. Spelling and grammar checks run alongside requirement validations, providing comprehensive quality assistance.

## Production Deployment and Operations

The system is deployed as a standard component within Uber AI Solutions' annotation pipeline and is used across their entire client base—indicating significant production maturity and scale. The 80% reduction in required audits represents substantial operational impact, directly translating to faster delivery timelines and reduced costs. While the article presents this as a clear success, a balanced assessment would note that we lack information about false positive rates (rules flagged incorrectly), false negative rates (quality issues missed), or how the system handles edge cases and ambiguous requirements.

The team mentions building feedback collection mechanisms to continuously improve the system, with a long-term goal of auto-optimizing prompts based on real-world performance. This represents forward-thinking LLMOps practice around monitoring, feedback loops, and continuous improvement—though details on implementation are not provided. Questions remain about how feedback is collected (from labelers? from audit results?), how it's aggregated and analyzed, and what mechanisms enable prompt optimization.

## Privacy and Security Considerations

The team emphasizes that LLM interactions are stateless and privacy-preserving, with no client SOP data retained or used for model training. This is critical for enterprise applications handling sensitive client information. The stateless design likely constrains certain architectural choices (precluding memory or fine-tuning approaches) but is necessary for confidentiality. The article doesn't specify which LLM providers are used, though the OpenAI trademark notice suggests at least partial reliance on OpenAI models. Questions about data residency, API security, and handling of potentially sensitive labeling content in validation calls remain unaddressed.
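Before moving on, it may help to ground the Phase 2 routing and fan-out in code. The sketch below reuses the `Rule` and `RuleTier` types from the earlier extraction sketch; the regex registry, placeholder model names, shared prompt prefix, and async `call_llm(model, prompt)` client are illustrative assumptions rather than details from Uber's system.

```python
import asyncio
import re
from typing import Awaitable, Callable

# Illustrative registry mapping formatting rules to regexes (this tier never hits an LLM).
FORMAT_PATTERNS: dict[str, re.Pattern] = {
    "no_trailing_whitespace": re.compile(r"[ \t]+$", re.MULTILINE),
}

# Keeping rule-independent instructions in one shared prefix is what lets prefix
# caching skip recomputation across the parallel per-rule calls.
SHARED_PREFIX = (
    "You are a data-labeling QA checker. Decide whether the submission satisfies "
    "the rule. Answer PASS or FAIL, followed by a one-line fix suggestion on FAIL.\n\n"
)


async def check_rule(
    rule: Rule,
    submission: str,
    call_llm: Callable[[str, str], Awaitable[str]],
) -> dict:
    """Route a single rule to the cheapest check that can decide it."""
    if rule.tier is RuleTier.FORMATTING:
        pattern = FORMAT_PATTERNS.get(rule.rule_id)
        passed = pattern is None or not pattern.search(submission)
        return {"rule_id": rule.rule_id, "passed": passed, "suggestion": None}

    # Placeholder model names: a lightweight model for deterministic checks,
    # a reasoning model for subjective and complex-subjective checks.
    model = "small-fast-model" if rule.tier is RuleTier.DETERMINISTIC else "reasoning-model"
    prompt = SHARED_PREFIX + f"Rule: {rule.text}\nSubmission: {submission}"
    verdict = await call_llm(model, prompt)
    passed = verdict.strip().upper().startswith("PASS")
    return {"rule_id": rule.rule_id, "passed": passed,
            "suggestion": None if passed else verdict.strip()}


async def validate_submission(
    rules: list[Rule],
    submission: str,
    call_llm: Callable[[str, str], Awaitable[str]],
) -> list[dict]:
    """One validation call per rule, fanned out in parallel to keep labeler feedback fast."""
    return list(await asyncio.gather(
        *(check_rule(r, submission, call_llm) for r in rules)
    ))
```

The per-rule fan-out mirrors the article's one-call-per-rule, run-in-parallel description, while the tier check decides whether an LLM call is needed at all.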
## LLMOps Patterns and Lessons

This case study demonstrates several important LLMOps patterns for production deployments:

- **Task Decomposition**: Breaking complex requirements into atomic rules rather than attempting to enforce everything in one call significantly improved reliability and reduced hallucinations. This aligns with broader patterns of simplifying LLM tasks for better performance.
- **Intelligent Model Routing**: Using different model types and sizes based on task complexity optimizes the accuracy-cost-latency tradeoff. This requires upfront taxonomy work to classify tasks but pays dividends in production efficiency.
- **Reflection and Self-Correction**: Implementing reflection in the rule extraction phase improves output quality, trading additional compute for better reliability in an offline process where latency is less critical.
- **Parallelization and Caching**: Running validation calls in parallel and using prefix caching demonstrates attention to latency optimization, critical for interactive user experiences.
- **Human-in-the-Loop**: Acknowledging LLM limitations and incorporating manual rule addition represents mature production thinking rather than over-relying on full automation.
- **Structured Output**: Extracting rules as structured JSON rather than free-form text enables downstream processing and validation—a common LLMOps pattern for reliability.

## Critical Assessment

While the article presents impressive results, several aspects warrant critical consideration. The 80% reduction in audits is a compelling metric, but we lack baseline information: what was the initial audit failure rate? How was this measured? What's the false positive rate of the new system—are labelers frequently seeing incorrect quality flags that might degrade their experience or trust in the system?

The article doesn't discuss failure modes or limitations. How does the system handle ambiguous or contradictory requirements in SOP documents? What happens when rules are too complex for even the most sophisticated reasoning models? How are edge cases escalated and handled?

Cost information is entirely absent. While the system reduces audit costs, what are the LLM API costs? Given that validation happens on every labeling task with potentially dozens of rules checked in parallel, the cumulative API costs could be substantial. The cost-benefit analysis would be valuable for others considering similar approaches.

The choice of LLM providers and models is not disclosed, making it difficult to assess reproducibility or consider alternatives. The system's dependency on external LLM APIs creates operational risk around API availability, rate limits, and pricing changes that aren't discussed.

The feedback collection and prompt optimization mechanisms mentioned as "next steps" suggest the system is still maturing. It's unclear whether prompt engineering was done systematically during development or whether performance monitoring and A/B testing capabilities exist in production.

## Broader Context and Applicability

This case study is particularly relevant because quality assurance is a common challenge across LLM applications, not just data labeling. The patterns demonstrated here—decomposing complex requirements, intelligent model routing, parallel execution, caching—apply broadly to production LLM systems. The approach of extracting structured rules from documentation and then enforcing them could be adapted to code review, content moderation, compliance checking, and many other domains.
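To make that adaptability concrete, the fragment below points the two earlier sketches at an expense policy instead of a labeling SOP. The stub clients return canned responses purely so the example runs end to end; every name here is hypothetical and nothing is taken from the article.

```python
import asyncio


def sync_llm(prompt: str) -> str:
    # Canned extraction output so the sketch runs without a provider; swap in a real client.
    return ('[{"rule_id": "receipt_required", '
            '"text": "Every expense over $75 must include a receipt.", '
            '"tier": "deterministic"}]')


async def async_llm(model: str, prompt: str) -> str:
    # Canned validation verdict in the PASS/FAIL format the routing sketch expects.
    return "FAIL: attach the missing receipt before resubmitting."


def check_expense_report(policy_markdown: str, report_text: str) -> list[dict]:
    """Same extract-then-enforce loop, aimed at a compliance document rather than an SOP."""
    rules = extract_rules(policy_markdown, sync_llm)          # offline, once per document
    return asyncio.run(validate_submission(rules, report_text, async_llm))


policy = "# Expense policy\nEvery expense over $75 must include a receipt."
failures = [r for r in check_expense_report(policy, "Team dinner, $480, no receipt attached.")
            if not r["passed"]]
```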
The system represents a meta-application of LLMs: using LLMs to improve the quality of data that will ultimately train other ML models. This recursive relationship—LLMs ensuring quality of training data for future models—is emblematic of how AI systems are increasingly used to enable and improve other AI systems.

For organizations considering similar implementations, key takeaways include the importance of task decomposition, the value of intelligent routing based on task complexity, the need for human-in-the-loop validation, and attention to latency optimization for interactive use cases. The stateless, privacy-preserving design offers a template for enterprise applications handling sensitive data. However, implementers should carefully consider costs, monitor for false positives and negatives, and build comprehensive feedback mechanisms from day one rather than as an afterthought.

Overall, this case study demonstrates thoughtful LLMOps practices and significant production impact, though a more complete assessment would benefit from additional transparency around metrics, costs, failure modes, and long-term operational learnings.
