Company
Uber
Title
AI-Powered Developer Tools for Code Quality and Test Generation
Industry
Tech
Year
2025
Summary (short)
Uber's developer platform team built AI-powered developer tools using LangGraph to improve code quality and automate test generation for their 5,000 engineers. Their approach rests on three pillars: targeted product development for developer workflows, cross-cutting AI primitives, and intentional technology transfer. The team developed Validator, an IDE-integrated tool that flags best-practices violations and security issues with automatic fixes, and AutoCover, which generates comprehensive test suites with coverage validation. These tools demonstrate the successful deployment of multi-agent systems in production, achieving measurable improvements including thousands of daily fix interactions, a 10% increase in developer platform coverage, and 21,000 developer hours saved through automated test generation.
## Company Overview and Context

Uber operates at massive scale, serving 33 million trips daily across 15,000 cities, powered by a codebase containing hundreds of millions of lines of code. The company's developer platform team is responsible for maintaining developer productivity and satisfaction among approximately 5,000 engineers. This combination of scale and complexity creates unique challenges in maintaining code quality, security standards, and developer efficiency that traditional tooling struggles to address effectively.

The presentation was delivered by Matasanis and Sorup Sherhhati from Uber's developer platform team, showcasing how they leveraged LangGraph and multi-agent systems to build production AI developer tools. Their approach represents a mature implementation of LLMOps principles, demonstrating how large-scale organizations can successfully deploy AI agents in critical developer workflows.

## Strategic Framework and LLMOps Architecture

Uber's AI developer tool strategy is built on three foundational pillars that reflect sophisticated LLMOps thinking. The first pillar focuses on product development that directly improves developer workflows, targeting specific pain points like test writing and code review. Rather than building generic AI tools, the team identified concrete developer tasks that could be enhanced through AI automation.

The second pillar emphasizes building cross-cutting primitives and foundational AI technologies that can be reused across multiple solutions. This avoids the common trap of building isolated AI solutions that don't scale organizationally. They developed what they call "Lang Effect," an opinionated framework that wraps LangGraph and LangChain to integrate more cleanly with Uber's existing systems and infrastructure.

The third pillar, which they consider the cornerstone of their strategy, is "intentional tech transfer": a deliberate process of identifying reusable components and abstractions from initial product development that lowers the barrier for future problem-solving. This reflects a sophisticated understanding of how to scale AI capabilities across an organization while maintaining quality and consistency.

## Validator: Production Code Quality Agent

Validator represents their first major production deployment of an AI agent system. The tool provides an IDE-integrated experience that automatically flags best-practices violations and security issues as developers write code, and its design illustrates several important LLMOps principles.

Validator runs as a LangGraph agent integrated directly into the IDE. When developers open code files, the system analyzes the code in real time and surfaces diagnostics about violations, such as incorrect methods for creating temporary test files that could leak onto the host system. The experience is designed to be non-intrusive while providing actionable feedback.

One of the key technical innovations in Validator is its hybrid approach combining AI agents with deterministic tooling. The system includes sub-agents that call LLMs with curated lists of best practices, alongside deterministic components that run static linting tools.
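To make that composition concrete, here is a minimal sketch of how such a hybrid validator graph could be wired up in LangGraph. The state shape, node names, rule text, and placeholder checks are assumptions for illustration rather than Uber's actual implementation; a real deployment would replace the stand-in logic with an LLM call and a production linter.

```python
# Minimal sketch of a hybrid validator graph, assuming a recent langgraph release.
# The node names, rules, and placeholder checks are hypothetical, not Uber's code.
import operator
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, START, END


class ValidatorState(TypedDict):
    source_code: str                               # file contents under analysis
    findings: Annotated[list[str], operator.add]   # merged across sub-agents
    report: list[str]                              # final deduplicated diagnostics


def lint_agent(state: ValidatorState) -> dict:
    """Deterministic sub-agent: wraps a static linter, no LLM involved."""
    issues = []
    if "\t" in state["source_code"]:               # trivial stand-in for a real linter
        issues.append("lint: tabs found; use spaces")
    return {"findings": issues}


def best_practices_agent(state: ValidatorState) -> dict:
    """LLM-backed sub-agent: checks the file against a curated best-practices list."""
    rules = ["create temp files via the test sandbox, not /tmp"]
    # A real system would send the code plus the rule list to an LLM; this fake
    # check keeps the sketch runnable without credentials.
    issues = [f"best-practice: {rules[0]}"] if "/tmp" in state["source_code"] else []
    return {"findings": issues}


def aggregate(state: ValidatorState) -> dict:
    """Central validator node: deduplicate findings (and, in production, precompute fixes)."""
    return {"report": sorted(set(state["findings"]))}


builder = StateGraph(ValidatorState)
builder.add_node("lint", lint_agent)
builder.add_node("best_practices", best_practices_agent)
builder.add_node("aggregate", aggregate)
builder.add_edge(START, "lint")                    # fan out to both sub-agents
builder.add_edge(START, "best_practices")
builder.add_edge(["lint", "best_practices"], "aggregate")   # join at the central node
builder.add_edge("aggregate", END)
validator = builder.compile()

result = validator.invoke({"source_code": "open('/tmp/x')\n", "findings": []})
print(result["report"])
```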
This hybrid approach lets the system leverage the strengths of both AI-powered analysis and traditional static analysis, reflecting a clear sense of when to use AI versus deterministic approaches. The architecture is composable: multiple sub-agents operate under a central validator agent, so the system can handle different types of analysis - from LLM-powered best-practice evaluation to deterministic lint issue discovery - while maintaining a unified user experience. The system can precompute fixes for many issues, allowing developers to apply corrections with a single click.

In terms of production impact, Validator generates thousands of fix interactions daily, indicating substantial developer adoption and utility. The tool meets developers where they work (in the IDE) and provides immediate value without requiring changes to existing workflows.

## AutoCover: Automated Test Generation System

AutoCover represents a more complex application of multi-agent systems, orchestrating multiple specialized agents to generate comprehensive, validated test suites that include business logic testing, coverage analysis, and mutation testing.

The user experience is designed for minimal developer intervention. Engineers invoke AutoCover through simple IDE interactions such as right-clicking on a source file. The tool then performs extensive background processing: adding build system targets, setting up test files, running initial coverage checks, and analyzing the surrounding source code for business context. This orchestration reflects a thorough understanding of the complete test generation workflow.

The technical architecture showcases advanced agent composition and parallel processing. The system uses multiple domain expert agents, each specialized for a different aspect of test generation: the scaffolder agent prepares test environments and identifies business cases to test, the generator agent creates new test cases for extending existing tests or writing entirely new ones, and the executor agent handles builds, test execution, and coverage analysis.

A key innovation in AutoCover is its ability to perform parallel processing at scale: the system can execute 100 iterations of code generation simultaneously and run 100 test executions concurrently on the same file without conflicts. This was achieved through deep integration with Uber's build system, allowing the agent to manipulate build configurations so that parallel runs stay isolated.

The system incorporates Validator as a sub-agent, demonstrating the reusability principles central to their LLMOps strategy. This composition allows AutoCover to validate generated tests against the same best practices and security standards applied to human-written code, ensuring consistent quality.

## Performance and Benchmarking

Uber benchmarked AutoCover against industry-standard agentic coding tools for test generation. Their results show approximately 2-3x better coverage generation in about half the time compared to competing solutions, an advantage they attribute to the domain expert agent architecture and the speed gains from parallel processing.
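The specifics of Uber's build-system integration are not public, but the fan-out pattern itself is easy to picture. The sketch below uses plain asyncio with hypothetical `generate_candidate` and `run_tests_in_sandbox` placeholders to show the shape of running many generation and execution passes concurrently over one file, each in its own isolated sandbox, then keeping the highest-coverage result.

```python
# Rough sketch of the parallel fan-out idea; the placeholders below stand in for
# an LLM generation pass and an isolated build/execute step, not Uber's tooling.
import asyncio
from dataclasses import dataclass


@dataclass
class Candidate:
    test_code: str
    coverage: float        # fraction of lines covered, 0.0 - 1.0


async def generate_candidate(source_file: str, seed: int) -> str:
    """Placeholder for one LLM generation pass over the source file."""
    await asyncio.sleep(0)                 # stands in for an LLM call
    return f"# generated tests for {source_file} (variant {seed})"


async def run_tests_in_sandbox(test_code: str, sandbox_id: int) -> Candidate:
    """Placeholder for building and running tests in an isolated build target."""
    await asyncio.sleep(0)                 # stands in for build + execution
    return Candidate(test_code=test_code, coverage=0.5)


async def autocover(source_file: str, iterations: int = 100) -> Candidate:
    # Fan out: many generation passes over the same file, executed concurrently.
    tests = await asyncio.gather(
        *(generate_candidate(source_file, seed) for seed in range(iterations))
    )
    # Each candidate runs in its own sandboxed build target so runs don't collide.
    candidates = await asyncio.gather(
        *(run_tests_in_sandbox(code, sandbox_id=i) for i, code in enumerate(tests))
    )
    # Keep the candidate with the best measured coverage.
    return max(candidates, key=lambda c: c.coverage)


best = asyncio.run(autocover("payments/service.py"))
print(best.coverage)
```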
The measurable business impact includes a 10% increase in developer platform coverage, translating to approximately 21,000 developer hours saved. The system generates thousands of tests monthly, indicating sustained adoption and value delivery. These metrics demonstrate the transition from experimental AI tooling to production systems delivering measurable business value.

## Ecosystem and Extensibility

Beyond Validator and AutoCover, Uber has developed an ecosystem of AI-powered developer tools on the same foundational primitives. Their internal "Uber Assistant Builder" functions as a custom GPT store where teams can build specialized chatbots with access to Uber-specific knowledge and the same tooling primitives used in Validator and AutoCover.

The Security ScoreBot exemplifies this extensibility, incorporating the same best-practices detection and security antipattern recognition as Validator but in a conversational interface. Developers can ask architectural questions and receive security guidance before writing code, extending the value of these AI investments across different interaction modalities.

Picasso, their internal workflow management platform, includes a conversational AI called "Genie" that understands workflow automations and provides feedback grounded in product knowledge, showing how the foundational AI primitives apply beyond code-specific use cases to broader developer productivity scenarios.

The U-Review tool applies similar capabilities to code review, flagging potential issues and suggesting improvements during the PR review phase. This creates multiple checkpoints for code quality enforcement, from initial writing (Validator) through testing (AutoCover) to final review (U-Review).

## Technical Architecture and Learnings

The team's technical learnings offer valuable insights for LLMOps practitioners. Building highly capable, specialized "domain expert agents" has proven more effective than relying on general-purpose agents: domain experts use context more effectively, can encode rich state information, hallucinate less, and produce higher quality results.

Combining LLM-powered agents with deterministic sub-agents wherever possible has proven crucial for reliability. When a problem can be solved deterministically, they prefer that approach to ensure consistent, reliable outputs. This is exemplified by the lint agent under Validator, which provides reliable signals that can be passed to other parts of the agent graph.

Their approach to agent composition and reusability reflects mature thinking about scaling AI systems. By solving bounded problems with agents and then reusing those agents across multiple applications, they've been able to scale development efforts significantly. The build system agent, used by both Validator and AutoCover, exemplifies this lower-level reuse.

## Organizational and Strategic Impact

From an organizational perspective, Uber's approach demonstrates how proper encapsulation and abstraction can boost collaboration across teams. Their security team was able to contribute rules for Validator without needing deep knowledge of AI agents or graph construction, showing how well-designed abstractions can democratize AI tool development across an organization.
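The mechanics of that contribution path are not described in detail, but the underlying idea is that rules live as plain data the Validator's LLM sub-agent reads at runtime. Below is a small hypothetical sketch of such a registry; the `BestPracticeRule` fields and the example rule are illustrative assumptions, not Uber's actual schema.

```python
# Hypothetical sketch of a data-driven rule registry: teams contribute checks as
# plain records, and the validator agent consumes them without any graph changes.
from dataclasses import dataclass


@dataclass(frozen=True)
class BestPracticeRule:
    rule_id: str
    description: str     # natural-language rule handed to the LLM sub-agent
    severity: str        # e.g. "warning" or "error"


RULE_REGISTRY: list[BestPracticeRule] = []


def register_rule(rule: BestPracticeRule) -> None:
    """Called from any team's module; the validator agent reads the registry at runtime."""
    RULE_REGISTRY.append(rule)


# A security engineer contributes a rule without touching any LangGraph code:
register_rule(
    BestPracticeRule(
        rule_id="SEC-017",
        description="Never log raw credentials or access tokens.",
        severity="error",
    )
)

# Inside the best-practices sub-agent, the curated rule list is simply the registry:
prompt_rules = "\n".join(f"- [{r.severity}] {r.description}" for r in RULE_REGISTRY)
print(prompt_rules)
```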
The graph-based modeling approach they've adopted often mirrors how developers already interact with systems, making the AI tools feel natural and integrated rather than foreign additions to existing workflows. This alignment with existing mental models appears to be crucial for adoption and effectiveness.

An important strategic insight is that work on AI workflows often surfaces and addresses inefficiencies in existing systems, benefiting both AI and non-AI use cases. Uber's work on agentic test generation led to improvements in mock generation, build file modification, build system interaction, and test execution that improved the experience for all developers, not just those using AI tools.

## Production Deployment Considerations

The case study demonstrates several LLMOps best practices for production deployment. The team's focus on meeting developers where they already work (in IDEs) rather than requiring new tools or workflows appears crucial for adoption, and their emphasis on delivering immediate, actionable value within existing workflow patterns shows a sound grasp of change management in technical organizations.

The parallel processing capabilities developed for AutoCover represent significant engineering investment in making AI agents performant at scale. The ability to execute 100 concurrent operations while maintaining isolation shows the level of systems integration required for production AI deployment in complex environments.

The reusability framework built around their agents, particularly the intentional tech transfer approach, provides a model for how organizations can avoid the trap of building isolated AI solutions that don't scale. The Lang Effect framework shows how an organization can create opinionated abstractions that work well with existing systems while leveraging powerful frameworks like LangGraph.

The impact achieved - thousands of daily fix interactions, a measurable coverage increase, and significant developer-hour savings - demonstrates that AI developer tools can deliver concrete business value when properly designed and implemented. That success, however, appears to depend on deep integration with existing systems, a sophisticated understanding of developer workflows, and substantial engineering investment in making the tools performant and reliable at scale.
