Uber's Developer Platform team explored three major initiatives using LLMs in production: a custom IDE coding assistant (later abandoned in favor of GitHub Copilot), an AI-powered test generation system called AutoCover, and an automated Java-to-Kotlin code migration system. The team combined deterministic approaches with LLMs to achieve significant developer productivity gains while maintaining code quality and safety. They found that while pure LLM approaches could be risky, hybrid approaches combining traditional software engineering practices with AI showed promising results.
This case study covers Uber’s Developer Platform team’s approximately two-year journey of integrating AI into their software development lifecycle. The presentation was delivered by Adam (a senior engineering manager) and Ty, who shared three distinct stories about applying LLMs to developer productivity challenges across Uber’s massive codebase of over 100 million lines of code spread across six monorepos supporting different language platforms.
Uber has positioned itself to be “AI-driven” by creating a horizontal team called “AI Developer Experience” that brings together experts from each monorepo who understand the nuances of developer workflows across different languages. Importantly, these team members don’t need to be ML experts—they focus on applying AI to solve developer problems. A separate team with ML expertise provides foundational capabilities like model hosting, fine-tuning, and a model gateway that other teams can build upon.
The company’s AI journey began around October 2022 when they had zero AI projects in their hackathons. Every hackathon since has been focused on AI, demonstrating a deliberate skill-building progression from “exploring things and building toys” to “building tools that deliver impact.”
The first story is particularly valuable as a cautionary tale about the gap between AI MVPs and production systems. Uber had been using GitHub Copilot and hypothesized that fine-tuning their own model on their 100+ million lines of proprietary code could yield better acceptance rates.
They set ambitious requirements for their in-house solution:
The architecture gathered code context locally and passed it through internal proxies to backend-hosted models, each fine-tuned on one of the six monorepos. Results were verified for syntactic correctness before being displayed to the developer.
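The final verification stage can be sketched as a simple gate: a candidate completion is only surfaced if splicing it after the developer's code still parses. This is a minimal Python-only stand-in (Uber's pipeline ran an equivalent check for each monorepo's language), and the function and variable names here are illustrative, not Uber's.

```python
import ast

def is_syntactically_valid(prefix: str, completion: str) -> bool:
    """Return True if the completion, spliced after the prefix,
    still parses. Python-only stand-in for a per-language check."""
    try:
        ast.parse(prefix + completion)
        return True
    except SyntaxError:
        return False

# Only surface candidate completions that pass the check.
candidates = ["return x + 1", "return x +"]
shown = [c for c in candidates
         if is_syntactically_valid("def f(x):\n    ", c)]
```

A check like this is cheap relative to model inference, so it sits at the very end of the pipeline and silently drops malformed suggestions rather than showing them.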
They evaluated multiple LLMs including Starcoder and Code Llama. The project required significant investment across multiple engineering streams: IDE integration for each platform, model hosting infrastructure, fine-tuning pipelines, and service infrastructure for context fetching.
Their goal was roughly a 10% improvement in acceptance rate over non-fine-tuned models while staying within the latency requirements.
After six months of work to reach V1, they made the difficult decision to pause the project. Key factors included:
The team extracted several important lessons that apply broadly to LLMOps initiatives:
Rather than competing, Uber adopted an “ecosystem principle”—building on top of existing solutions rather than replacing them. They went all-in on GitHub Copilot, investing in evangelism, reducing friction, and automatically provisioning it in cloud developer environments.
From the abandoned project, they extracted several reusable components:
They built on top of Copilot using the Chat Participants API to integrate their internal support bot “Genie,” allowing developers to get context from Jira, design docs, and other internal resources without UI cannibalization.
Uber invested heavily in internal developer relations, finding that using these tools effectively requires education. They developed coursework and conducted worldwide workshops teaching best practices like iterating with the LLM rather than expecting correct answers on the first try, using chat participants, and providing relevant context.
This investment pushed adoption to around 60% of developers as active 30-day Copilot users, with data showing approximately 10% lift in PR velocity.
The second initiative applies agentic design patterns to automated test generation, targeting the massive technical debt in test coverage across their monorepos.
Uber uses LangChain and LangGraph with an internal abstraction layer for infrastructure deployment. Their approach treats developer workflows as state machines where some steps are deterministic (invoking builds, creating files) and others require LLM reasoning.
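The state-machine pattern can be sketched without the LangGraph dependency: state flows as a plain dict through a sequence of nodes, where some nodes are deterministic (builds, file creation) and others would call an LLM (stubbed below). All node names and state keys here are hypothetical, chosen only to illustrate the shape of the workflow.

```python
from typing import Callable

State = dict  # mutable state passed node to node, LangGraph-style

def discover_targets(state: State) -> State:   # deterministic step
    state["targets"] = ["payments/service.go"]
    return state

def draft_tests(state: State) -> State:        # LLM step (stubbed here)
    state["tests"] = {t: f"// generated test for {t}"
                      for t in state["targets"]}
    return state

def run_build(state: State) -> State:          # deterministic step
    state["build_ok"] = all(state["tests"].values())
    return state

def run_pipeline(nodes: list[Callable[[State], State]]) -> State:
    state: State = {}
    for node in nodes:
        state = node(state)
    return state

result = run_pipeline([discover_targets, draft_tests, run_build])
```

The design point is that the LLM is confined to the steps that need reasoning; everything a build system or file API can do deterministically stays deterministic, which keeps failures diagnosable.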
For test generation, they identified approximately 29 manual steps a developer would normally execute. The agentic pipeline simplifies this to:
AutoCover is integrated into VS Code using GitHub Copilot’s Chat Participants API. Developers can invoke it with an “@AutoCover” mention and slash commands, streaming tests directly into their workflow while maintaining the ability to iterate.
When senior developers questioned test quality, the team added a validation step to the agentic pipeline. This step:
The team treats test quality as an ongoing research area, with active iteration based on developer feedback.
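One way to picture the validation step is as a gate that accepts a generated test only if it both executes green and measurably raises coverage. The thresholds and signature below are illustrative assumptions, not Uber's actual criteria.

```python
def validate_generated_test(passed: bool,
                            coverage_before: float,
                            coverage_after: float,
                            min_gain: float = 0.001) -> bool:
    """Accept a generated test only if it passes AND increases
    line coverage by at least min_gain (hypothetical threshold)."""
    return passed and (coverage_after - coverage_before) >= min_gain

assert validate_generated_test(True, 0.62, 0.65)       # green + coverage gain
assert not validate_generated_test(True, 0.62, 0.62)   # green but no gain
assert not validate_generated_test(False, 0.62, 0.70)  # failing test
```

A gate like this filters out trivially passing or redundant tests before they ever reach a human reviewer, which directly addresses the quality concerns senior developers raised.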
Several interesting extensions are planned:
The third story demonstrates a sophisticated approach to large-scale code migration that combines deterministic AST-based transformations with LLM-assisted rule generation.
Uber’s Android monorepo contains over 10 million lines of code across Java and Kotlin, with nearly 100,000 files. Kotlin adoption began in 2017 when Google standardized it for Android, but scaling challenges (build times, annotation processing) required years of optimization before full product support arrived in 2021. By 2022, organic adoption had accelerated, and in 2025 they banned new Java code outright, enforcing the ban with linters.
The existing assisted migration pipeline is developer-driven:
At current adoption rates, reaching zero Java would take approximately 8 years—too long for Uber’s goals.
The first automation step runs IntelliJ/Android Studio headlessly in CI, fully indexing the monorepo with all context and build caches. This enables automated conversion, testing, and code review generation. However, this approach still requires extensive manual rule writing for post-processors, with an estimated 3 years to completion.
The team explicitly considered a pure LLM approach but rejected it due to:
The innovative solution combines both paradigms: use LLMs to accelerate the creation of deterministic AST rules, rather than having LLMs generate migration code directly.
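The key idea is that the artifact the LLM produces is a rule, not migrated code; the rule itself is a deterministic AST transform that can be reviewed once and then applied mechanically across the codebase. Below is a Python analogy of such a rule (rewriting `x == None` to `x is None`); Uber's actual rules target Java-to-Kotlin post-processing through IntelliJ's AST tooling, so the rule content here is purely illustrative.

```python
import ast

class NullCheckRule(ast.NodeTransformer):
    """A deterministic rewrite rule of the kind the LLM pipeline
    would propose: `x == None` becomes `x is None`."""
    def visit_Compare(self, node: ast.Compare) -> ast.Compare:
        self.generic_visit(node)
        if (len(node.ops) == 1
                and isinstance(node.ops[0], ast.Eq)
                and isinstance(node.comparators[0], ast.Constant)
                and node.comparators[0].value is None):
            node.ops[0] = ast.Is()  # rewrite == to is, in place
        return node

source = "if x == None:\n    y = 1"
tree = NullCheckRule().visit(ast.parse(source))
rewritten = ast.unparse(tree)
```

Because the transform operates on the syntax tree rather than text, it applies identically to every occurrence in the codebase, which is exactly the determinism guarantee a pure LLM rewrite cannot offer.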
Training Data Collection: They built a system using their DevPods (remote development environments) to:
Agentic Rule Generation Pipeline:
This approach is projected to cut the migration timeline to approximately 4 years—a 50% improvement. This estimate aligns with a recent Google study on LLM-assisted developer migrations showing similar speedups.
Several challenges remain:
The team shared their evolving approach to measurement. While they have data showing 10% PR velocity lift for Copilot users, they identified challenges with pure quantitative metrics:
Their current approach leads with qualitative signals (developer surveys showing “significantly more productive” responses) and normalizes quantitative data around “developer time saved”—a metric applicable across unit test generation, outage mitigation, code migrations, and other initiatives.
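The normalization idea is simple arithmetic: express each initiative's impact in one shared unit, developer hours saved, so that heterogeneous wins can be compared and summed. All figures below are made up for illustration; they are not Uber's numbers.

```python
# Normalize heterogeneous initiatives into one unit: developer hours saved.
# Event counts and per-event savings are illustrative assumptions.
initiatives = {
    "unit test generation": {"events": 1200, "hours_saved_each": 0.5},
    "code migration":       {"events": 300,  "hours_saved_each": 2.0},
    "outage mitigation":    {"events": 40,   "hours_saved_each": 4.0},
}

total_hours = sum(v["events"] * v["hours_saved_each"]
                  for v in initiatives.values())
```
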
The case study highlights several organizational insights for LLMOps: