Uber: Building AI Developer Tools Using LangGraph for Large-Scale Software Development

Company: Uber

Title: Building AI Developer Tools Using LangGraph for Large-Scale Software Development

Industry: Tech

Link: https://www.youtube.com/watch?v=Bugs0dVcNI8

Year: 2025

Summary (short)

Uber's developer platform team built a suite of AI-powered developer tools using LangGraph to improve productivity for 5,000 engineers working on hundreds of millions of lines of code. The solution included tools like Validator (for detecting code violations and security issues), AutoCover (for automated test generation), and various other AI assistants. By creating domain-expert agents and reusable primitives, they achieved significant impact including thousands of daily code fixes, 10% improvement in developer platform coverage, and an estimated 21,000 developer hours saved through automated test generation.

## Overview

Uber's Developer Platform team presented a comprehensive case study on how they built AI-powered developer tools using LangGraph to support their massive engineering organization. With 5,000 developers working on hundreds of millions of lines of code serving 33 million trips daily across 15,000 cities, Uber faced significant challenges in maintaining code quality and developer productivity. The team's solution involved building a suite of AI tools powered by agentic workflows that integrate seamlessly into existing developer workflows.

## Strategic Approach and Architecture

The team's AI developer tool strategy was built on three key pillars. First, they focused on products that directly improve developer workflows by targeting specific pain points like writing tests and reviewing code. Second, they invested in building cross-cutting primitives - foundational AI technologies that could be reused across multiple solutions. Third, they emphasized intentional tech transfer, deliberately extracting reusable components from their initial products to accelerate future development.

Central to their approach was the development of "Lang Effect," their opinionated framework that wraps LangGraph and LangChain to integrate better with Uber's internal systems. This framework emerged from necessity as they saw agentic patterns proliferating across their organization and needed standardized tooling to support multiple teams building AI solutions.

## Key Products and Technical Implementation

### Validator: Code Quality and Security Analysis

The first major product showcased was Validator, an IDE-integrated experience that automatically flags best-practice violations and security issues. The tool is implemented as a LangGraph agent with a user interface that provides real-time feedback directly in the developer's IDE. When a developer opens a file, Validator runs in the background and displays diagnostic information for any violations found.
For example, it might detect incorrect methods for creating temporary test files and suggest secure alternatives. The system offers multiple remediation options: developers can apply pre-computed fixes generated by the system, or send the issue to their IDE's agentic assistant for a custom solution.

The technical architecture demonstrates the power of agent composition. Validator consists of multiple sub-agents working together under a central coordinator. One sub-agent focuses on LLM-based analysis using curated best practices, while another handles deterministic analysis through static linting tools. This hybrid approach allows the system to combine the flexibility of LLM reasoning with the reliability of traditional static analysis tools.

The impact has been substantial, with thousands of fix interactions occurring daily as developers address code quality issues before they become larger problems in production. The tool successfully meets developers where they work, providing contextual assistance without disrupting their workflow.

### AutoCover: Intelligent Test Generation

The second major product, AutoCover, tackles the time-consuming task of writing comprehensive tests. The tool generates high-quality, validated tests that include business case coverage and mutation testing, aiming to save developers significant time in test creation while maintaining high standards.

The user experience is designed to be seamless. Developers can invoke AutoCover on an entire file through a simple right-click menu. Once activated, the system performs several operations in parallel: it adds new targets to the build system, sets up test files, runs initial coverage checks to understand the testing scope, and analyzes surrounding source code to extract business context. While these background processes run, developers see a dynamic experience where tests stream into their IDE in real time.
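
The parallel preparation phase described above can be pictured with a minimal, framework-free sketch. It is not Uber's implementation and does not use LangGraph: the four coroutine names, the `_test` file-naming convention, and the placeholder return values are all assumptions made for illustration. It only shows the general pattern of starting independent setup steps together and collecting their results before generation begins.

```python
import asyncio
from pathlib import PurePosixPath


async def add_build_target(source: str) -> str:
    """Hypothetical step: register a test target with the build system."""
    await asyncio.sleep(0.01)  # stands in for real I/O
    return f"test target for {source}"


async def set_up_test_file(source: str) -> str:
    """Hypothetical step: derive the test file path (naming is assumed)."""
    await asyncio.sleep(0.01)
    path = PurePosixPath(source)
    return str(path.with_name(f"{path.stem}_test{path.suffix}"))


async def measure_coverage(source: str) -> float:
    """Hypothetical step: initial coverage check; returns a placeholder."""
    await asyncio.sleep(0.01)
    return 0.0


async def extract_business_context(source: str) -> list[str]:
    """Hypothetical step: scan nearby code; returns a placeholder."""
    await asyncio.sleep(0.01)
    return []


async def prepare(source: str) -> dict:
    """Run the four independent setup steps concurrently."""
    # asyncio.gather returns results in the order the awaitables were passed.
    target, test_file, coverage, context = await asyncio.gather(
        add_build_target(source),
        set_up_test_file(source),
        measure_coverage(source),
        extract_business_context(source),
    )
    return {
        "target": target,
        "test_file": test_file,
        "coverage": coverage,
        "context": context,
    }


if __name__ == "__main__":
    print(asyncio.run(prepare("pricing/fare.go")))
```

Because the steps do not depend on one another, running them concurrently means the wait is bounded by the slowest step rather than the sum of all four.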
The system continuously builds and validates tests, removing those that fail, merging redundant tests, and adding new ones including performance benchmarks and concurrency tests. This creates what the presenters described as a "magical" experience where developers watch their test suite build itself.

The technical implementation showcases sophisticated agent orchestration. The LangGraph workflow includes specialized agents for different aspects of test generation: a scaffolder that prepares the test environment and identifies business cases, a generator that creates new test cases, and an executor that runs builds and coverage analysis. Notably, the system reuses the Validator agent as a component, demonstrating the composability benefits of their architecture.

The team achieved significant performance improvements by allowing parallel execution of up to 100 test generation iterations simultaneously on large source files. This parallelization, combined with domain-specific optimizations, resulted in performance that was 2-3 times faster than industry-standard agentic coding tools while achieving better coverage. The impact has been measurable: AutoCover helped raise developer platform coverage by 10%, translating to approximately 21,000 saved developer hours, with thousands of tests being generated monthly.

## Additional Products and Ecosystem

Beyond these two flagship products, the team demonstrated the scalability of their approach through several additional tools. The Uber Assistant Builder functions as an internal custom GPT store where teams can create chatbots with deep Uber knowledge. Examples include security scorebots that can detect anti-patterns using the same primitives as Validator, allowing developers to get architectural feedback before writing code.

Picasso, Uber's workflow management platform, incorporates a conversational AI called Genie that understands workflow automations and provides guidance grounded in product knowledge.
This demonstrates how the same underlying AI primitives can be adapted to different domains within the organization.

The UReview tool extends quality enforcement to the code review process, flagging issues and suggesting improvements before code gets merged. This creates multiple layers of quality assurance, from IDE-time detection through Validator to review-time checking through UReview.

## Technical Learnings and Best Practices

The team shared several key technical insights from their development experience. First, they found that building highly capable domain expert agents produces superior results compared to general-purpose solutions. These specialized agents leverage context more effectively, maintain richer state, and exhibit reduced hallucination. The executor agent exemplifies this approach - it includes sophisticated knowledge about Uber's build system that enables parallel test execution without conflicts and separate coverage reporting.

Second, they discovered the value of combining LLM-based agents with deterministic sub-components when possible. The lint agent within Validator demonstrates this principle - by using reliable static analysis tools for certain types of issues, they achieve consistent output quality while reserving LLM capabilities for more complex reasoning tasks.

Third, they found that solving bounded problems through reusable agents significantly scales development efforts. The build system agent, used across both Validator and AutoCover, represents a lower-level abstraction that multiple products can leverage. This component approach reduces development time and ensures consistency across different tools.

## Organizational and Strategic Learnings

Beyond technical insights, the team emphasized important organizational lessons for scaling AI development. Proper encapsulation through well-designed abstractions like LangGraph enables horizontal scaling of development efforts.
Their security team was able to contribute rules for Validator without understanding the underlying AI agent implementation, demonstrating how good abstractions enable broader organizational participation.

The graph-based modeling approach mirrors how developers naturally interact with systems, making the AI tools more intuitive and effective. Process improvements made for AI workloads often benefit traditional developer workflows as well, creating compound value rather than competitive tension between AI and traditional tooling.

## Production Considerations and Reliability

While the presentation focused on capabilities and impact, several production considerations emerge from the case study. The system's ability to handle thousands of daily interactions suggests robust infrastructure and reliability measures, though specific details about monitoring, error handling, and fallback mechanisms weren't extensively covered.

The integration with existing IDE environments and build systems demonstrates significant engineering effort in making AI tools production-ready rather than just experimental prototypes. The real-time streaming experience in AutoCover, in particular, suggests sophisticated state management and user interface considerations.

## Assessment and Future Implications

This case study represents a mature approach to deploying AI in software development environments. Rather than building isolated AI tools, Uber created an integrated ecosystem with shared primitives and consistent user experiences. The measured impact - thousands of daily interactions, measurable coverage improvements, and quantified time savings - suggests genuine productivity benefits rather than just technological novelty.

The emphasis on composability and reusability indicates sustainable development practices that can scale as the organization's needs evolve.
The combination of specialized domain expertise with general AI capabilities creates a balanced approach that leverages the strengths of both paradigms.

However, the case study comes from a large technology company with significant resources and engineering expertise. The transferability of their specific approaches to smaller organizations or different domains remains an open question. The heavy integration with internal systems and custom frameworks may also create maintenance overhead that wasn't fully addressed in the presentation.

Overall, this represents a sophisticated example of LLMOps in practice, demonstrating how large-scale AI deployment can create measurable business value while maintaining high engineering standards and developer experience quality.
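
As a concrete illustration of the hybrid pattern described for Validator (a coordinator combining a deterministic lint sub-agent with an LLM sub-agent), the following framework-free sketch may help. It is not Uber's code and does not use LangGraph or Lang Effect: the `Finding` type, the function names, the single regex rule, and the injected `ask_model` callable are all hypothetical. The temporary-file rule uses Python's `tempfile.mktemp`, which the standard library documents as deprecated and unsafe, as an assumed stand-in for the kind of violation the talk mentioned.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Finding:
    line: int
    rule: str
    message: str
    source: str  # "lint" or "llm"


def lint_agent(code: str) -> List[Finding]:
    """Deterministic sub-agent: one regex stands in for a real linter."""
    findings = []
    for number, text in enumerate(code.splitlines(), start=1):
        if re.search(r"\btempfile\.mktemp\(", text):
            findings.append(Finding(
                number, "insecure-temp-file",
                "tempfile.mktemp is race-prone; prefer tempfile.mkstemp",
                "lint",
            ))
    return findings


def llm_agent(code: str,
              ask_model: Callable[[str], List[Finding]]) -> List[Finding]:
    """LLM sub-agent: the model call is injected so the sketch runs offline."""
    return ask_model(code)


def validate(code: str,
             ask_model: Callable[[str], List[Finding]]) -> List[Finding]:
    """Coordinator: merge both sub-agents, keeping one finding per
    (line, rule). Lint results come first, so they win on duplicates."""
    merged = {}
    for finding in lint_agent(code) + llm_agent(code, ask_model):
        merged.setdefault((finding.line, finding.rule), finding)
    return sorted(merged.values(), key=lambda f: (f.line, f.rule))


def fake_model(code: str) -> List[Finding]:
    """Canned response used in place of a real LLM (assumption)."""
    return [
        Finding(1, "insecure-temp-file", "model also flagged this", "llm"),
        Finding(2, "missing-cleanup", "temporary file is never removed", "llm"),
    ]


if __name__ == "__main__":
    snippet = "path = tempfile.mktemp()\nopen(path, 'w')\n"
    for finding in validate(snippet, fake_model):
        print(finding)
```

Preferring the deterministic finding when both sub-agents flag the same issue is one possible design choice: it keeps output stable for rules a linter can already express, and leaves the model to contribute only what static analysis cannot.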
