ZenML

Building AI Developer Tools Using LangGraph for Large-Scale Software Development

Uber 2025

Uber's Developer Platform team built a suite of AI-powered developer tools using LangGraph to improve productivity for 5,000 engineers working on hundreds of millions of lines of code. The suite includes Validator (for detecting best-practice violations and security issues), AutoCover (for automated test generation), and several other AI assistants. By creating domain-expert agents and reusable primitives, the team reports thousands of daily fix interactions, a 10% increase in developer platform test coverage, and an estimated 21,000 developer hours saved through automated test generation.

Industry

Tech

Overview

Uber’s Developer Platform team presented a comprehensive case study on how they built AI-powered developer tools using LangGraph to support their massive engineering organization. With 5,000 developers working on hundreds of millions of lines of code serving 33 million trips daily across 15,000 cities, Uber faced significant challenges in maintaining code quality and developer productivity. The team’s solution involved building a suite of AI tools powered by agentic workflows that integrate seamlessly into existing developer workflows.

Strategic Approach and Architecture

The team’s AI developer tool strategy was built on three key pillars. First, they focused on products that directly improve developer workflows by targeting specific pain points like writing tests and reviewing code. Second, they invested in building cross-cutting primitives - foundational AI technologies that could be reused across multiple solutions. Third, they emphasized intentional tech transfer, deliberately extracting reusable components from their initial products to accelerate future development.

Central to their approach was the development of “Lang Effect,” their opinionated framework that wraps LangGraph and LangChain to integrate better with Uber’s internal systems. This framework emerged from necessity as they saw agentic patterns proliferating across their organization and needed standardized tooling to support multiple teams building AI solutions.

Key Products and Technical Implementation

Validator: Code Quality and Security Analysis

The first major product showcased was Validator, an IDE-integrated experience that automatically flags best practices violations and security issues. The tool is implemented as a LangGraph agent with a sophisticated user interface that provides real-time feedback directly in the developer’s IDE environment.

When a developer opens a file, Validator runs in the background and displays diagnostic information for any violations found. For example, it might detect incorrect methods for creating temporary test files and suggest secure alternatives. The system offers multiple remediation options: developers can apply pre-computed fixes generated by the system, or send the issue to their IDE’s agentic assistant for a custom solution.
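The two remediation paths can be pictured with a small data model. This is a minimal sketch, not Uber's implementation: the `Diagnostic` type, the `remediate` function, and the choice of Python's `tempfile.mktemp` as the flagged call are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Diagnostic:
    line: int                              # 1-based line number of the violation
    message: str                           # text shown to the developer
    precomputed_fix: Optional[str] = None  # replacement for the flagged line, if any


def apply_precomputed_fix(code: str, diag: Diagnostic) -> str:
    """Replace the flagged line with the fix that was computed ahead of time."""
    if diag.precomputed_fix is None:
        raise ValueError("no pre-computed fix available for this diagnostic")
    lines = code.splitlines()
    lines[diag.line - 1] = diag.precomputed_fix
    return "\n".join(lines)


def assistant_request(code: str, diag: Diagnostic) -> str:
    """Build the text handed to an IDE assistant when a custom fix is wanted."""
    return f"Fix the issue on line {diag.line}: {diag.message}\n\n{code}"


def remediate(code: str, diag: Diagnostic, prefer_precomputed: bool = True) -> Tuple[str, str]:
    """Return ("fixed", new_code) or ("handoff", request_text)."""
    if prefer_precomputed and diag.precomputed_fix is not None:
        return "fixed", apply_precomputed_fix(code, diag)
    return "handoff", assistant_request(code, diag)


# Hypothetical example: an insecure temporary-file call in a test helper.
code = "import tempfile\npath = tempfile.mktemp()"
diag = Diagnostic(
    line=2,
    message="tempfile.mktemp() is insecure; create the file atomically.",
    precomputed_fix="fd, path = tempfile.mkstemp()",
)
kind, fixed_code = remediate(code, diag)
handoff_kind, request = remediate(code, diag, prefer_precomputed=False)
```

Keeping the fix as data on the diagnostic means the IDE can apply it instantly, while the hand-off path only needs the message and the surrounding code.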

The technical architecture demonstrates the power of agent composition. Validator consists of multiple sub-agents working together under a central coordinator. One sub-agent focuses on LLM-based analysis using curated best practices, while another handles deterministic analysis through static linting tools. This hybrid approach allows the system to combine the flexibility of LLM reasoning with the reliability of traditional static analysis tools.
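The coordinator pattern described above can be sketched in plain Python. This is an illustrative sketch only: it does not use the LangGraph API, the names `Finding`, `lint_agent`, `make_llm_agent`, and `coordinator` are hypothetical, and the rule that deterministic findings win over duplicate LLM findings is an assumption rather than something the presentation stated.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Finding:
    line: int
    rule: str
    message: str
    source: str  # "lint" for deterministic checks, "llm" for model-based checks


Agent = Callable[[str], List[Finding]]


def lint_agent(code: str) -> List[Finding]:
    """Deterministic sub-agent: one toy regex rule standing in for a real linter."""
    findings = []
    for number, text in enumerate(code.splitlines(), start=1):
        if re.search(r"\btempfile\.mktemp\(", text):
            findings.append(
                Finding(number, "insecure-tempfile", "Use tempfile.mkstemp() instead.", "lint")
            )
    return findings


def make_llm_agent(ask_model: Agent) -> Agent:
    """LLM sub-agent: the model call is injected so the sketch runs offline."""
    def llm_agent(code: str) -> List[Finding]:
        return ask_model(code)
    return llm_agent


def coordinator(code: str, sub_agents: List[Agent]) -> List[Finding]:
    """Run every sub-agent and merge findings, one per (line, rule) pair."""
    merged: Dict[Tuple[int, str], Finding] = {}
    for agent in sub_agents:
        for finding in agent(code):
            key = (finding.line, finding.rule)
            # Assumed policy: a deterministic finding replaces an LLM duplicate.
            if key not in merged or finding.source == "lint":
                merged[key] = finding
    return sorted(merged.values(), key=lambda f: (f.line, f.rule))


def fake_model(code: str) -> List[Finding]:
    # Stand-in for a real model call; returns one duplicate and one new finding.
    return [
        Finding(2, "insecure-tempfile", "Temp file creation looks unsafe.", "llm"),
        Finding(1, "import-style", "Prefer importing only what is used.", "llm"),
    ]


sample = "import tempfile\npath = tempfile.mktemp()\n"
findings = coordinator(sample, [make_llm_agent(fake_model), lint_agent])
```

Because each sub-agent has the same signature, adding another analysis source means appending one more callable to the list.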

The impact has been substantial, with thousands of fix interactions occurring daily as developers address code quality issues before they become larger problems in production. The tool successfully meets developers where they work, providing contextual assistance without disrupting their workflow.

AutoCover: Intelligent Test Generation

The second major product, AutoCover, tackles the time-consuming task of writing comprehensive tests. The tool generates high-quality, validated tests that include business case coverage and mutation testing, aiming to save developers significant time in test creation while maintaining high standards.

The user experience is designed to be seamless. Developers can invoke AutoCover on an entire file through a simple right-click menu. Once activated, the system performs several operations in parallel: it adds new targets to the build system, sets up test files, runs initial coverage checks to understand the testing scope, and analyzes surrounding source code to extract business context.

While these background processes run, developers see a dynamic experience where tests stream into their IDE in real-time. The system continuously builds and validates tests, removing those that fail, merging redundant tests, and adding new ones including performance benchmarks and concurrency tests. This creates what the presenters described as a “magical” experience where developers watch their test suite build itself.

The technical implementation showcases sophisticated agent orchestration. The LangGraph workflow includes specialized agents for different aspects of test generation: a scaffolder that prepares the test environment and identifies business cases, a generator that creates new test cases, and an executor that runs builds and coverage analysis. Notably, the system reuses the Validator agent as a component, demonstrating the composability benefits of their architecture.
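A rough sketch of that scaffold, generate, validate, execute flow follows. It is written in plain Python rather than against the LangGraph API, and every name in it (`CoverState`, `make_generator`, `make_validator`, `make_executor`, `run_pipeline`) is hypothetical. The model call, the review check, and the build are injected as simple functions so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CoverState:
    source: str
    business_cases: List[str] = field(default_factory=list)
    candidates: List[str] = field(default_factory=list)
    accepted: List[str] = field(default_factory=list)


Step = Callable[[CoverState], CoverState]


def scaffolder(state: CoverState) -> CoverState:
    """Toy stand-in: treat every top-level function name as one business case."""
    state.business_cases = [
        line[4:].split("(")[0]
        for line in state.source.splitlines()
        if line.startswith("def ")
    ]
    return state


def make_generator(propose: Callable[[List[str]], List[str]]) -> Step:
    """Generator step: asks an injected function (a model in practice) for tests."""
    def generator(state: CoverState) -> CoverState:
        state.candidates = propose(state.business_cases)
        return state
    return generator


def make_validator(is_acceptable: Callable[[str], bool]) -> Step:
    """Reused review step: drops candidates that violate best practices."""
    def validator(state: CoverState) -> CoverState:
        state.candidates = [t for t in state.candidates if is_acceptable(t)]
        return state
    return validator


def make_executor(passes: Callable[[str], bool]) -> Step:
    """Executor step: keeps passing tests, skips duplicates, discards failures."""
    def executor(state: CoverState) -> CoverState:
        for test in state.candidates:
            if test in state.accepted:
                continue  # merge redundant tests
            if passes(test):
                state.accepted.append(test)
        state.candidates = []
        return state
    return executor


def run_pipeline(state: CoverState, steps: List[Step]) -> CoverState:
    for step in steps:
        state = step(state)
    return state


def fake_propose(cases: List[str]) -> List[str]:
    # Stand-in for a model: one test per case, plus a duplicate and two bad ones.
    return [f"test_{c}" for c in cases] + ["test_fare", "test_surge_flaky", "test_fare_sleeps"]


final = run_pipeline(
    CoverState(source="def fare(km):\n    return 2.0 * km\n\ndef surge(x):\n    return x\n"),
    [
        scaffolder,
        make_generator(fake_propose),
        make_validator(lambda t: "sleeps" not in t),  # hypothetical best-practice rule
        make_executor(lambda t: "flaky" not in t),    # hypothetical build/run result
    ],
)
```

The validator step is an ordinary entry in the step list, which is the sense in which one agent can be reused inside another workflow.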

The team achieved significant performance improvements by allowing parallel execution of up to 100 test generation iterations simultaneously on large source files. This parallelization, combined with domain-specific optimizations, resulted in performance that was 2-3 times faster than industry-standard agentic coding tools while achieving better coverage.
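One common way to cap concurrent work at a fixed number is a semaphore around each job. The sketch below assumes an asyncio-based runner, which the presentation did not specify; `run_bounded` and `demo` are hypothetical names, and only the limit of 100 comes from the text.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

MAX_PARALLEL = 100  # upper bound on simultaneous iterations, taken from the text


async def run_bounded(
    jobs: List[Callable[[], Awaitable[str]]], limit: int = MAX_PARALLEL
) -> List[str]:
    """Run all jobs concurrently, never more than `limit` at once."""
    gate = asyncio.Semaphore(limit)

    async def guarded(job: Callable[[], Awaitable[str]]) -> str:
        async with gate:
            return await job()

    # gather() returns results in the same order as the jobs were given.
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def demo(n_jobs: int = 250, limit: int = MAX_PARALLEL) -> Tuple[List[str], int]:
    active = 0
    peak = 0

    async def iteration(i: int) -> str:
        # Stand-in for one test generation iteration; tracks peak concurrency.
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.001)
        active -= 1
        return f"iteration-{i}"

    jobs = [lambda i=i: iteration(i) for i in range(n_jobs)]
    results = await run_bounded(jobs, limit)
    return results, peak


results, peak = asyncio.run(demo())
```

The cap matters because each iteration in a real system would hold build and model resources; the semaphore keeps the queue long but the live set bounded.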

The impact has been measurable: AutoCover helped raise test coverage across the developer platform by 10%, which the team estimates at approximately 21,000 saved developer hours, with thousands of tests generated monthly.

Additional Products and Ecosystem

Beyond these two flagship products, the team demonstrated the scalability of their approach through several additional tools. The Uber Assistant Builder functions as an internal custom GPT store where teams can create chatbots with deep Uber knowledge. Examples include security scorebots that can detect anti-patterns using the same primitives as Validator, allowing developers to get architectural feedback before writing code.

Picasso, Uber’s workflow management platform, incorporates a conversational AI called Genie that understands workflow automations and provides guidance grounded in product knowledge. This demonstrates how the same underlying AI primitives can be adapted to different domains within the organization.

The UReview tool extends quality enforcement to the code review process, flagging issues and suggesting improvements before code gets merged. This creates multiple layers of quality assurance, from IDE-time detection through Validator to review-time checking through UReview.

Technical Learnings and Best Practices

The team shared several key technical insights from their development experience. First, they found that building highly capable domain-expert agents produces better results than general-purpose solutions. These specialized agents use context more effectively, maintain richer state, and hallucinate less. AutoCover's executor agent exemplifies this approach: it encodes detailed knowledge of Uber’s build system, which enables parallel test execution without conflicts and separate coverage reporting.

Second, they discovered the value of combining LLM-based agents with deterministic sub-components when possible. The lint agent within Validator demonstrates this principle - by using reliable static analysis tools for certain types of issues, they achieve consistent output quality while reserving LLM capabilities for more complex reasoning tasks.

Third, they found that solving bounded problems through reusable agents significantly scales development efforts. The build system agent, used across both Validator and AutoCover, represents a lower-level abstraction that multiple products can leverage. This component approach reduces development time and ensures consistency across different tools.

Organizational and Strategic Learnings

Beyond technical insights, the team emphasized important organizational lessons for scaling AI development. Proper encapsulation through well-designed abstractions like LangGraph enables horizontal scaling of development efforts. Their security team was able to contribute rules for Validator without understanding the underlying AI agent implementation, demonstrating how good abstractions enable broader organizational participation.
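One way such a boundary can work is to treat rules as plain data that the agent consumes. The following is a sketch under that assumption, not a description of Uber's actual interface: `Rule`, `RuleRegistry`, and `build_prompt` are hypothetical names, and the two sample rules are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Rule:
    rule_id: str
    owner: str      # contributing team
    guidance: str   # plain-language statement of the best practice


class RuleRegistry:
    """Holds rules contributed by any team; knows nothing about agents."""

    def __init__(self) -> None:
        self._rules: Dict[str, Rule] = {}

    def register(self, rule: Rule) -> None:
        if rule.rule_id in self._rules:
            raise ValueError(f"duplicate rule id: {rule.rule_id}")
        self._rules[rule.rule_id] = rule

    def all(self) -> List[Rule]:
        return sorted(self._rules.values(), key=lambda r: r.rule_id)


def build_prompt(registry: RuleRegistry, code: str) -> str:
    """Agent-side code: turns whatever rules exist into review instructions."""
    lines = ["Review the code against these rules:"]
    lines += [f"- [{rule.rule_id}] {rule.guidance}" for rule in registry.all()]
    lines += ["", "Code:", code]
    return "\n".join(lines)


registry = RuleRegistry()
# A contributing team only writes entries like these (invented examples).
registry.register(Rule("SEC-001", "security", "Never create temp files with predictable names."))
registry.register(Rule("SEC-002", "security", "Do not log secrets or access tokens."))
prompt = build_prompt(registry, "path = tempfile.mktemp()")
```

The contributor's surface area is the `Rule` record alone, so the agent's prompting or graph structure can change without touching contributed rules.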

The graph-based modeling approach mirrors how developers naturally interact with systems, making the AI tools more intuitive and effective. Process improvements made for AI workloads often benefit traditional developer workflows as well, creating compound value rather than competitive tension between AI and traditional tooling.

Production Considerations and Reliability

While the presentation focused on capabilities and impact, several production considerations emerge from the case study. The system’s ability to handle thousands of daily interactions suggests robust infrastructure and reliability measures, though specific details about monitoring, error handling, and fallback mechanisms weren’t extensively covered.

The integration with existing IDE environments and build systems demonstrates significant engineering effort in making AI tools production-ready rather than just experimental prototypes. The real-time streaming experience in AutoCover, in particular, suggests sophisticated state management and user interface considerations.

Assessment and Future Implications

This case study represents a mature approach to deploying AI in software development environments. Rather than building isolated AI tools, Uber created an integrated ecosystem with shared primitives and consistent user experiences. The measured impact - thousands of daily interactions, measurable coverage improvements, and quantified time savings - suggests genuine productivity benefits rather than just technological novelty.

The emphasis on composability and reusability indicates sustainable development practices that can scale as the organization’s needs evolve. The combination of specialized domain expertise with general AI capabilities creates a balanced approach that leverages the strengths of both paradigms.

However, the case study comes from a large technology company with significant resources and engineering expertise. The transferability of their specific approaches to smaller organizations or different domains remains an open question. The heavy integration with internal systems and custom frameworks may also create maintenance overhead that wasn’t fully addressed in the presentation.

Overall, this represents a sophisticated example of LLMOps in practice, demonstrating how large-scale AI deployment can create measurable business value while maintaining high engineering standards and developer experience quality.
