Company: Uber
Title: LLM-Powered In-Tool Quality Validation for Data Labeling
Industry: Tech
Year: 2025

Summary (short):
Uber AI Solutions developed a Requirement Adherence system to address quality issues in data labeling workflows, which traditionally relied on post-labeling checks that resulted in costly rework and delays. The solution uses LLMs in a two-phase approach: first extracting atomic rules from Standard Operating Procedure (SOP) documents and categorizing them by complexity, then performing real-time validation during the labeling process within their uLabel tool. By routing different rule types to appropriate LLM models (non-reasoning models for deterministic checks, reasoning models for subjective checks) and leveraging techniques like prefix caching and parallel execution, the system achieved an 80% reduction in required audits while maintaining data privacy through stateless, privacy-preserving LLM calls.
## Overview

Uber AI Solutions built an LLM-powered system called Requirement Adherence to improve data labeling quality for their enterprise clients. The company provides data labeling services across multiple industries, including autonomous vehicles, healthcare, retail, and finance, where high-quality labeled datasets are critical for training machine learning models. The core problem they faced was that traditional quality assurance relied on post-labeling checks and inter-annotator agreement, which meant errors were only discovered after work was completed, requiring expensive rework cycles and creating delays for clients.

The innovation lies in shifting quality checks left in the workflow: catching issues during the labeling process itself rather than after submission. This is implemented within their proprietary labeling tool, uLabel. The challenge they identified was scalability: each client has unique requirements, making it impractical to build custom validation logic for every engagement. Their solution uses LLMs to automatically extract validation rules from Standard Operating Procedure (SOP) documents and enforce them in real time during annotation.

## Technical Architecture and LLMOps Implementation

The system operates in two distinct phases: rule extraction and in-tool validation. This separation of concerns is a key architectural decision that lets the system handle diverse client requirements while maintaining performance.

### Rule Extraction Phase

The rule extraction phase begins with document preprocessing. SOP documents, which contain client requirements and guidelines, are converted to markdown to optimize them for LLM processing. This preprocessing step standardizes the input format regardless of how clients originally provided their documentation.
The team discovered through experimentation that a single monolithic LLM call to extract and enforce all requirements simultaneously led to hallucinations and missed requirements. This is a critical LLMOps insight: breaking complex tasks into atomic units improves reliability. The system therefore extracts each requirement as an individual rule, with rules being atomic, unambiguous, and self-contained. This granular approach allows for more reliable extraction and enforcement.

An interesting aspect of the rule extraction is the classification system. Rules are categorized into four complexity tiers: formatting checks (handled by regex without LLM involvement), deterministic checks (requiring specific words or exact matches), subjective checks (requiring reasoning or interpretation), and complex subjective checks (involving intricate logical flows). This classification is not just organizational: it directly determines how rules are validated in production, with different LLM models and techniques applied to different rule types.

The rule extraction logic incorporates reflection, a technique where the LLM analyzes the SOP text, generates structured JSON output, reviews its own work, and makes corrections if necessary. This self-correction mechanism is an important quality control measure that helps reduce extraction errors, and the structured JSON output makes the extracted rules machine-readable and easy to integrate into the validation pipeline.

Uber AI Solutions also incorporates a human-in-the-loop component at the rule extraction stage. Operators can manually add rules to cover gaps in LLM extraction or to capture new requirements not present in the original SOP. This hybrid approach acknowledges the limits of fully automated extraction while preserving efficiency.
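The extract-then-reflect loop described above can be sketched as follows. This is a minimal illustration, not Uber's implementation: the rule schema, tier names, and `call_llm` stub are hypothetical stand-ins for the real SOP-extraction prompt and model.

```python
import json
from dataclasses import dataclass

# Hypothetical tier names mirroring the four complexity categories described above.
TIERS = ("formatting", "deterministic", "subjective", "complex_subjective")

@dataclass
class Rule:
    rule_id: str
    text: str   # atomic, unambiguous, self-contained requirement
    tier: str   # one of TIERS

def call_llm(prompt: str) -> str:
    """Placeholder for the real LLM call; returns structured JSON text."""
    return json.dumps([
        {"rule_id": "R1", "text": "Dates must be ISO-8601.", "tier": "formatting"},
        {"rule_id": "R2", "text": "Answers must cite the source passage.", "tier": "subjective"},
    ])

def extract_rules(sop_markdown: str, max_reflections: int = 2) -> list[Rule]:
    """Extract atomic rules, then have the model review and correct its own output."""
    raw = call_llm(f"Extract each requirement as an atomic rule (JSON):\n{sop_markdown}")
    for _ in range(max_reflections):
        candidate = json.loads(raw)
        # Reflection pass: accept only well-formed entries, otherwise re-prompt for fixes.
        if all(r.get("tier") in TIERS and r.get("text") for r in candidate):
            break
        raw = call_llm(f"Review this rule list and fix malformed entries:\n{raw}")
    return [Rule(**r) for r in json.loads(raw)]

rules = extract_rules("# SOP\nAll dates must use ISO-8601 format...")
print([r.rule_id for r in rules])  # → ['R1', 'R2']
```

The structured output makes downstream routing trivial: each rule carries its tier, so the validation phase can dispatch it without re-reading the SOP.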
After manual rules are added, another LLM call validates that they follow the required format and checks for overlap with auto-extracted rules, suggesting changes when conflicts are detected. This additional validation layer keeps the entire rule set consistent.

### In-Tool Validation Phase

The validation phase is the production inference component of the system and demonstrates sophisticated LLMOps practices. Rather than performing a single validation call for all rules, the system executes one validation call per rule. These calls run in parallel, which significantly reduces latency and improves the experience for annotators working in real time. This parallelization strategy is a critical performance optimization for production LLM systems.

Prefix caching is employed to further reduce latency. This technique caches common prompt prefixes so that repeated similar requests reuse cached computation rather than reprocessing the same context. In a validation scenario where multiple rules share similar context about the labeling task, prefix caching can substantially improve response times and reduce compute costs.

The system also implements intelligent LLM routing based on rule complexity, a cost-optimization strategy. Formatting checks bypass LLMs entirely and use traditional code-based validation. Deterministic checks use lightweight, non-reasoning LLMs that can quickly verify exact matches or the presence of specific words. Subjective checks leverage more powerful reasoning models capable of nuanced interpretation. Complex subjective checks employ the strongest reasoning models and may incorporate additional techniques like reflection or self-consistency. This tiered approach allocates computational resources appropriately: simple checks don't waste expensive reasoning-model capacity, while complex checks get the sophisticated processing they require.
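The per-rule parallel validation with tier-based routing might look like the sketch below. The model names, rule fields, and `call_llm` stub are illustrative assumptions, not Uber's actual routing table; the structure shows the pattern of one concurrent call per rule, with formatting checks short-circuited to regex.

```python
import asyncio
import re

# Hypothetical tier-to-model routing; names are illustrative only.
MODEL_FOR_TIER = {
    "deterministic": "small-non-reasoning-model",
    "subjective": "reasoning-model",
    "complex_subjective": "strong-reasoning-model",
}

async def call_llm(model: str, prompt: str) -> bool:
    """Placeholder for a stateless LLM validation call."""
    await asyncio.sleep(0)  # stands in for network latency
    return True

async def validate_rule(rule: dict, label: str) -> tuple[str, bool]:
    if rule["tier"] == "formatting":
        # Formatting checks bypass LLMs entirely: plain regex validation.
        return rule["rule_id"], re.search(rule["pattern"], label) is not None
    model = MODEL_FOR_TIER[rule["tier"]]
    ok = await call_llm(model, f"Does this label satisfy: {rule['text']}?\n{label}")
    return rule["rule_id"], ok

async def validate(rules: list[dict], label: str) -> dict[str, bool]:
    # One validation call per rule, all executed concurrently to keep latency low.
    results = await asyncio.gather(*(validate_rule(r, label) for r in rules))
    return dict(results)

rules = [
    {"rule_id": "R1", "tier": "formatting", "pattern": r"\d{4}-\d{2}-\d{2}"},
    {"rule_id": "R2", "tier": "subjective", "text": "The answer cites its source."},
]
report = asyncio.run(validate(rules, "Labeled on 2025-01-15, citing section 2."))
print(report)  # → {'R1': True, 'R2': True}
```

Because every prompt shares the same task context as its prefix, a serving stack with prefix caching can reuse that computation across all of the concurrent per-rule calls.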
Self-consistency, used for complex subjective checks, is a technique where multiple reasoning paths are generated and compared to arrive at a more reliable answer; reflection involves having the model review and critique its own output. Both techniques improve reliability at the cost of additional inference calls, so the system applies them selectively, only where complexity justifies the overhead.

Beyond simple pass/fail validation, the system provides actionable suggestions to annotators on how to fix quality issues. This guidance transforms the tool from a gatekeeper into a coaching system that helps experts improve their work. Spelling and grammar checks run alongside the rule-based validations to provide comprehensive quality feedback.

## Production Considerations and Operational Results

The case study emphasizes data privacy as a critical consideration for the LLMOps implementation. All LLM interactions are designed to be stateless: no conversation history or context is retained between calls, and client SOP data is neither retained nor used to train models. This privacy-preserving design is essential when handling sensitive client information in industries like healthcare and finance.

The system achieved a measurable production result: an 80% reduction in required audits. While the blog post frames this as a substantial enhancement, it's worth noting that the metric measures the reduction in post-labeling audits rather than an absolute accuracy improvement. Still, fewer audits translate to faster delivery timelines and lower costs, the business outcomes that matter to clients. The system has been adopted as a standard step across Uber AI Solutions' entire client base, indicating production stability and reliability.
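The self-consistency technique described above reduces to sampling several independent reasoning paths and taking a majority vote. A minimal sketch, with the reasoning-model call stubbed out (the `seed` parameter is a hypothetical stand-in for sampling diversity):

```python
from collections import Counter

def call_reasoning_llm(prompt: str, seed: int) -> str:
    """Placeholder: in production each call samples a fresh reasoning path."""
    return "fail" if seed == 1 else "pass"  # one dissenting path, for illustration

def self_consistent_check(prompt: str, n_paths: int = 5) -> str:
    """Sample several reasoning paths and return the majority verdict."""
    votes = Counter(call_reasoning_llm(prompt, seed=i) for i in range(n_paths))
    verdict, _ = votes.most_common(1)[0]
    return verdict

verdict = self_consistent_check("Does the label satisfy rule R7?")
print(verdict)  # → 'pass' (4 of the 5 sampled paths agree)
```

The cost multiplier is explicit here: `n_paths` inference calls per check, which is why the system reserves the technique for the complex subjective tier.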
## Future Development and Continuous Improvement

The team is building feedback collection mechanisms to gather data on validation performance. This feedback loop will enable prompt optimization over time, with a long-term goal of auto-optimizing prompts based on real-world performance. This represents a mature approach to LLMOps in which the system learns and improves from production usage; automated prompt optimization is an active area of research and development, and Uber's plan to implement it shows forward-thinking engineering.

## Critical Assessment and Tradeoffs

While the case study presents impressive results, several aspects deserve balanced consideration. The 80% reduction in audits is presented as the primary success metric, but the text provides no baseline error rates or absolute quality improvements; it is possible that in-tool validation catches obvious errors while subtler quality issues still require post-labeling review. The system's reliance on SOP document quality also means that poorly written or ambiguous SOPs will yield poor rule extraction, a garbage-in, garbage-out scenario.

The human-in-the-loop component for rule extraction, while valuable for quality, introduces a manual step that may become a bottleneck as the system scales to more clients. Reflection and self-consistency increase inference cost and latency for complex rules, requiring careful cost-benefit analysis, and the multi-model routing strategy adds architectural complexity and the burden of maintaining multiple LLM integrations. Finally, the privacy-preserving, stateless design is excellent for data protection but prevents the system from automatically learning from past validations; the planned feedback collection mechanism attempts to address this, but it will require careful engineering to preserve privacy while enabling learning.
Overall, the Requirement Adherence system demonstrates sophisticated LLMOps practices including intelligent model routing, parallel execution, prefix caching, reflection, and human-in-the-loop validation. The architecture thoughtfully balances accuracy, latency, cost, and privacy considerations. The production deployment across the entire client base and measurable business impact indicate a successful LLMOps implementation, though the full picture of tradeoffs and limitations isn't completely visible from the promotional nature of the blog post.
