Uber's developer platform team built AI-powered developer tools using LangGraph to improve code quality and automate test generation for their 5,000 engineers. Their approach rests on three pillars: targeted product development for developer workflows, cross-cutting AI primitives, and intentional technology transfer. The team developed Validator, an IDE-integrated tool that flags best-practice violations and security issues and offers automatic fixes, and AutoCover, which generates comprehensive test suites with coverage validation. These tools show multi-agent systems running in production with measurable results: thousands of daily fix interactions, a 10% increase in developer platform coverage, and an estimated 21,000 developer hours saved through automated test generation.
Uber’s developer platform team presented their approach to building AI-powered developer tools at scale, serving approximately 5,000 developers working on a codebase with hundreds of millions of lines of code. The company processes 33 million trips daily across 15,000 cities, and maintaining this infrastructure requires robust developer tooling. The presentation, delivered by Matas Rastenis and Sourabh Shirhatti, focused on how they leveraged LangGraph to create composable, reusable AI agents that power multiple developer-facing products.
The team operates on three core pillars for their AI developer tool strategy:
Product Bets: They focus on identifying specific developer workflows that can be improved with AI, such as writing tests and reviewing code. The emphasis is on eliminating toil and making existing workflows faster rather than creating entirely new paradigms.
Cross-cutting Primitives: Building foundational AI technologies and abstractions that can be reused across multiple solutions. This includes developing the right frameworks and tooling that accelerate the development of new AI-powered features.
Intentional Tech Transfer: Perhaps the most interesting aspect of their strategy is the deliberate effort to identify reusable components from individual products and spin them out as shared primitives. This has led to the creation of “Lang Effect,” their internal opinionated framework that wraps LangGraph and LangChain to work better with Uber’s internal systems.
Validator is an IDE-integrated experience that automatically flags best practice violations and security issues in code. The architecture is built as a LangGraph agent with a polished IDE user experience. When a developer opens a file (the example shown was a Go file), Validator analyzes the code and displays diagnostics for issues—for instance, flagging when an incorrect method is used to create temporary test files that could leak into the host system.
The user experience is well-considered: developers can either apply a pre-computed fix that was prepared in the background, or they can send the fix request to their IDE’s agentic assistant. This demonstrates the composability thinking that runs through their architecture—the same fix capability can be consumed through different interfaces.
A key architectural insight from Validator is the composition of multiple sub-agents under a central coordinator. They have an LLM-powered sub-agent that evaluates code against a list of best practices, but they also incorporate deterministic components—specifically, they run static linters and pass those findings through the rest of the graph. This hybrid approach of combining LLM-based analysis with deterministic tooling is a pattern that recurs throughout their architecture.
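The coordinator pattern described above can be sketched in plain Python, without LangGraph. This is a minimal illustration under stated assumptions, not Uber's implementation: the `Finding` type, the function names, and the single toy lint rule are all hypothetical, and the LLM call is injected as a callable so the example runs without a model.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Finding:
    line: int
    message: str
    source: str  # "lint" or "llm"


def lint_agent(code: str) -> List[Finding]:
    """Deterministic sub-agent: one toy rule standing in for a real static linter."""
    findings = []
    for number, text in enumerate(code.splitlines(), start=1):
        # Hypothetical rule, chosen only for illustration.
        if re.search(r"\bos\.CreateTemp\(", text):
            findings.append(Finding(number, "prefer t.TempDir() in tests", "lint"))
    return findings


def best_practice_agent(
    code: str, ask_llm: Callable[[str], List[Finding]]
) -> List[Finding]:
    """LLM-backed sub-agent: delegates to whatever model client is injected."""
    return ask_llm(code)


def validate(code: str, ask_llm: Callable[[str], List[Finding]]) -> List[Finding]:
    """Coordinator: run both sub-agents and keep one finding per (line, message).

    Lint findings are inserted first, so a deterministic result wins over an
    LLM result that reports the same issue.
    """
    merged = {}
    for finding in lint_agent(code) + best_practice_agent(code, ask_llm):
        merged.setdefault((finding.line, finding.message), finding)
    return sorted(merged.values(), key=lambda f: f.line)
```

Letting the deterministic result win on duplicates is a design choice made for this sketch; the source only states that linter findings are passed through the rest of the graph.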
The reported impact is thousands of fix interactions per day, as engineers resolve code issues before they grow into larger problems.
AutoCover is a more complex tool designed to automatically generate high-quality, passing tests that raise code coverage. The emphasis is on generating tests that are validated through actual execution and mutation testing, not just syntactically correct code. The goal is saving developer time so engineers can focus on implementing business features rather than writing boilerplate tests.
The workflow begins when a developer invokes AutoCover on a source file. Multiple background operations kick off simultaneously: adding a new target to the build system, setting up a test file, running an initial coverage check to establish a baseline, and analyzing surrounding source code to understand business context.
What the developer sees is a test file that’s in “constant flux”—tests stream in rapidly, builds run, failing tests get removed, tests may be merged or marked redundant, and eventually benchmark and concurrency tests appear. The final output is a vetted set of validated tests.
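One way to picture the pruning that produces this "constant flux" is a filter that drops failing tests and then drops tests that add no new coverage. The sketch below is an assumption about how such a step could work, not a description of AutoCover's actual logic; in particular, defining "redundant" as "covers no new lines" and the `Candidate` type are illustrative choices.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Candidate:
    name: str
    passed: bool
    covered_lines: FrozenSet[int]


def prune(candidates: List[Candidate]) -> List[Candidate]:
    """Keep passing tests that contribute at least one uncovered line."""
    kept: List[Candidate] = []
    covered = set()
    # Visit the broadest tests first so narrower duplicates are dropped.
    ordered = sorted(candidates, key=lambda c: len(c.covered_lines), reverse=True)
    for test in ordered:
        if not test.passed:
            continue  # failing tests are removed
        if test.covered_lines <= covered:
            continue  # adds no new coverage, treated as redundant
        kept.append(test)
        covered |= test.covered_lines
    return kept
```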
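The underlying graph architecture is particularly sophisticated, composing several specialized agents, among them an executor that runs builds and tests and a validation step that reuses the Validator agent.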
A critical performance optimization is parallelization. Because there’s no human in the loop during generation, they can “supercharge” the graph—running up to 100 iterations of code generation simultaneously and 100 test executions at the same time for sufficiently large source files. This parallelization, combined with deep integration with their build system, gives them significant performance advantages.
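Bounded fan-out of this kind can be sketched with `asyncio`. The example is illustrative only: `fake_generation` is a stand-in for a model call, and the function names are hypothetical. The default limit of 100 mirrors the figure quoted above.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_bounded(
    jobs: List[Callable[[], Awaitable[str]]], limit: int = 100
) -> List[str]:
    """Run all jobs concurrently, with at most `limit` in flight at once."""
    gate = asyncio.Semaphore(limit)

    async def guarded(job: Callable[[], Awaitable[str]]) -> str:
        async with gate:
            return await job()

    # gather preserves the order of its arguments in the result list.
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def fake_generation(index: int) -> str:
    """Stand-in for one code-generation call (hypothetical)."""
    await asyncio.sleep(0.01)
    return f"test_case_{index}"


def generate_all(count: int, limit: int = 100) -> List[str]:
    # Bind i as a default argument so each job keeps its own index.
    jobs = [lambda i=i: fake_generation(i) for i in range(count)]
    return asyncio.run(run_bounded(jobs, limit))
```

The semaphore keeps the fan-out from overwhelming a downstream service such as a build farm or a model endpoint, which is the practical concern once no human is pacing the loop.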
The benchmarking results are notable: 2-3x more coverage in half the time compared to industry agentic coding tools for test generation. In terms of business impact, AutoCover has raised developer platform coverage by approximately 10%, which they calculate as equivalent to 21,000 dev hours saved, with thousands of tests generated monthly.
The presenters also highlighted several other products built on the same foundations:
Uber Assistant Builder: An internal “Custom GPT store” where teams can build chatbots steeped in Uber knowledge. One example is a security scorebot that has access to the same tools as Validator, allowing developers to ask architecture questions and get security feedback before writing code.
Picasso/Genie: A conversational AI interface for their internal workflow management platform. It understands workflow automations and can provide feedback grounded in product truth.
UReview: A code review tool that reinforces quality checks at PR merge time. It uses the same underlying tools as Validator and AutoCover to flag issues and suggest fixes during the review process, catching issues that slip through earlier in the workflow.
The team found that building highly capable domain-specific agents produces outsized results compared to generic agents. These specialized agents use context more effectively, can encode rich state, and hallucinate less. A concrete example is their executor agent, which is finely tuned to interact with Uber’s build system—it can execute 100 tests on the same file without collision and generate separate coverage reports for each.
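Collision-free execution generally comes down to giving every run its own workspace and its own report path. The following sketch assumes that approach; it does not invoke any real build system, and `run_isolated`, `run_many`, and the report format are hypothetical.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Iterable, List


def run_isolated(test_name: str, root: Path) -> Path:
    """Run one test in a private directory and return its coverage report path."""
    workdir = Path(tempfile.mkdtemp(prefix=f"{test_name}-", dir=root))
    report = workdir / "coverage.json"
    # A real executor would call the build system here; this stub only
    # records which test the report belongs to.
    report.write_text(json.dumps({"test": test_name, "covered_lines": []}))
    return report


def run_many(test_names: Iterable[str], root: Path, workers: int = 8) -> List[Path]:
    """Execute tests concurrently; each gets a separate report, in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda name: run_isolated(name, root), test_names))
```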
When possible, they compose LLM-powered agents with deterministic sub-agents or make entire components deterministic. The lint agent within Validator is a prime example: static analysis tools produce reliable, repeatable output that needs no LLM inference. Those findings are then passed through the graph for downstream processing. The hybrid approach reduces hallucination risk and improves overall system reliability.
Creating agents that solve bounded, well-defined problems enables significant reuse. The Validator agent appears both as a standalone IDE experience and as a validation step within AutoCover. At an even lower level, the build system agent is used across multiple products, providing a consistent interface for executing builds and tests.
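The reuse described here follows naturally when each agent is a function over shared state. In this minimal sketch the same `validator` node serves as a standalone check and as a step inside a larger pipeline. All names and the toy "TODO" rule are hypothetical, and plain function composition stands in for a LangGraph graph.

```python
from typing import Callable, Dict

State = Dict[str, object]
Node = Callable[[State], State]


def pipeline(*nodes: Node) -> Node:
    """Compose nodes left to right into a single node."""
    def run(state: State) -> State:
        for node in nodes:
            state = node(state)
        return state
    return run


def validator(state: State) -> State:
    """Bounded agent: flags issues in state['code'] (toy rule for illustration)."""
    code = str(state["code"])
    issues = ["TODO left in code"] if "TODO" in code else []
    return {**state, "issues": issues}


def generate_tests(state: State) -> State:
    """Stand-in for an LLM test generator."""
    return {**state, "code": "def test_ok():\n    assert True  # TODO tighten\n"}


def drop_if_flagged(state: State) -> State:
    """Accept generated code only when the validator found nothing."""
    return {**state, "accepted": not state["issues"]}


# The same validator node is reused in both products.
ide_check = pipeline(validator)
autocover = pipeline(generate_tests, validator, drop_if_flagged)
```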
Beyond the technical aspects, the presenters emphasized organizational patterns that enabled their success:
Encapsulation Boosts Collaboration: Well-thought-out abstractions like LangGraph, combined with opinions on state management and concurrency, allow horizontal scaling of development. The security team was able to write rules for Validator without understanding the underlying AI agent architecture—they simply needed to understand the interface for specifying rules.
Graphs Model Developer Interactions: The graph-based architecture often mirrors how developers already interact with systems. This alignment means that identifying process bottlenecks for AI workloads also improves experiences for non-AI tooling.
Symbiotic Improvement: Building agentic systems surfaced inefficiencies in underlying systems (mock generation, build file modification, test execution). Fixing these “paper cuts” improved the experience for all developers, not just those using AI tools. This represents a virtuous cycle where AI tool development drives broader infrastructure improvements.
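The encapsulation point can be illustrated with a rule interface that hides the agent behind plain data. The sketch assumes rules are short natural-language statements rendered into a review prompt; the source does not describe the actual rule format, so `Rule`, `build_review_prompt`, and the prompt wording are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rule:
    rule_id: str
    description: str


def build_review_prompt(rules: List[Rule], code: str) -> str:
    """Render rule data into a prompt; rule authors never touch agent code."""
    rule_lines = "\n".join(f"- [{r.rule_id}] {r.description}" for r in rules)
    return (
        "Review the code against these rules and cite rule ids.\n"
        f"Rules:\n{rule_lines}\n"
        f"Code:\n{code}\n"
    )
```

A team contributing rules under this scheme would only append `Rule` entries, leaving graph structure, state handling, and concurrency to the framework owners.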
The team built “Lang Effect,” an opinionated internal framework wrapping LangGraph and LangChain to integrate with Uber’s internal systems. This framework emerged organically as multiple teams began building agentic solutions and needed shared patterns for things like state management, concurrency handling, and integration with internal services.
The decision to standardize on LangGraph was driven by observing its organic proliferation across the organization. Rather than letting fragmentation occur, they invested in a consistent abstraction layer that makes agentic development more accessible while maintaining integration with Uber’s infrastructure.
This case study represents a mature approach to deploying LLM-based tooling at scale. The emphasis on composability, reuse, and hybrid deterministic/LLM architectures reflects lessons learned from production deployment rather than theoretical design. The quantitative results (coverage improvements, dev hours saved, benchmark comparisons) provide concrete evidence of value delivery.
The organizational insights about encapsulation and collaboration are particularly valuable—they demonstrate that successful LLMOps isn’t just about the technology but about how teams structure work and share capabilities. The observation that agentic development improves underlying infrastructure creates a compelling argument for investment in these tools beyond just the direct productivity gains.
One area not deeply covered is evaluation and testing of the agents themselves, or how they handle edge cases and failures. The presentation focused more on capabilities and outcomes than on the operational aspects of maintaining these systems in production. Additionally, while the performance comparisons to “industry agentic coding tools” are mentioned, specific baselines or methodologies aren’t provided, making it difficult to fully validate those claims.