Uber developed FixrLeak, a framework combining generative AI and Abstract Syntax Tree (AST) analysis to automatically detect and fix resource leaks in Java code. The system processes resource leaks identified by SonarQube, analyzes code safety through AST, and uses GPT-4 to generate appropriate fixes. When tested on 124 resource leaks in Uber's codebase, FixrLeak successfully automated fixes for 93 out of 102 eligible cases, significantly reducing manual intervention while maintaining code quality.
Uber’s Programming Systems group developed FixrLeak, a production system that leverages generative AI to automatically fix Java resource leaks at scale. Resource leaks—where resources like files, database connections, or streams aren’t properly released after use—represent a persistent challenge in Java applications that can lead to performance degradation and system failures. While static analysis tools like SonarQube effectively identify such leaks, the fixing process traditionally remained manual, time-consuming, and error-prone. FixrLeak addresses this gap by combining traditional code analysis techniques with large language models to automate the repair process.
The system represents an interesting case study in applying LLMs to a well-defined, scoped problem in software engineering. Rather than attempting to solve all code quality issues with AI, Uber focused on a specific class of problems where generative AI could be highly effective: intra-function resource leaks that can be safely fixed using Java’s try-with-resources pattern.
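To make the pattern concrete, here is a minimal sketch (not Uber's code) of the kind of intra-function leak FixrLeak targets and the try-with-resources rewrite it applies. The leaky variant misses `close()` whenever an exception is thrown mid-function; the fixed variant closes the resource on every path.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class TryWithResourcesDemo {
    // Leaky version: if anything between allocation and close() throws,
    // the reader is never closed.
    // static long countLinesLeaky(String text) throws IOException {
    //     BufferedReader reader = new BufferedReader(new StringReader(text));
    //     long n = reader.lines().count();
    //     reader.close();   // skipped on exception -> resource leak
    //     return n;
    // }

    // Fixed version: try-with-resources guarantees close() on every path,
    // including exceptional ones.
    static long countLines(String text) throws IOException {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return reader.lines().count();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countLines("a\nb\nc")); // prints 3
    }
}
```

Because the resource's entire lifetime sits inside one method, the rewrite is mechanical and safe, which is exactly the property FixrLeak's filtering checks for.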
FixrLeak employs a multi-stage pipeline that carefully orchestrates traditional static analysis with LLM-based code generation. This hybrid approach is notable because it recognizes the limitations of both traditional tools and pure LLM solutions, combining them strategically.
The system begins by scanning resource leaks reported by SonarQube, an established static analysis tool. This is a pragmatic design choice—rather than relying on AI for leak detection (which would introduce additional uncertainty), FixrLeak leverages a trusted, deterministic tool for this phase. Key details like file names and line numbers are gathered, and a deterministic hash based on file and function name is used for accurate tracking of leaks and their fixes across codebase changes.
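The deterministic-hash idea can be sketched as follows; the exact inputs Uber hashes are not specified, so keying on file path plus enclosing function name is an assumption here. The point is that the fingerprint stays stable when unrelated edits shift line numbers.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public class LeakFingerprint {
    // Hypothetical sketch: derive a stable id for a leak from its file path
    // and enclosing function name, so the same leak can be matched to its
    // fix even after the surrounding code moves.
    static String fingerprint(String filePath, String functionName) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update((filePath + "#" + functionName).getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(md.digest()).substring(0, 16);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }

    public static void main(String[] args) {
        // Same file/function yields the same id even if line numbers shift.
        System.out.println(fingerprint("src/Foo.java", "openStream")
                .equals(fingerprint("src/Foo.java", "openStream"))); // true
    }
}
```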
Once identified, FixrLeak uses the Tree-sitter library to parse the code and extract the relevant function for analysis. Tree-sitter is a well-established incremental parsing library that provides robust AST (Abstract Syntax Tree) manipulation capabilities across many programming languages.
A critical aspect of FixrLeak’s design is its use of AST-level analysis to determine which leaks are safe to fix automatically. This represents an important lesson in responsible LLM deployment: not all problems should be handed to the AI. The system specifically filters out cases where the resource escapes the allocating function, for example by being returned to the caller or otherwise outliving the function’s scope.
These scenarios typically involve resources that outlive the function’s scope, where blindly applying try-with-resources could introduce use-after-close errors. By focusing only on intra-function leaks where the resource’s lifetime is confined to the allocating function, FixrLeak achieves higher accuracy and avoids introducing new bugs.
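A toy example (not from the source) illustrates why escaping resources are excluded. When a function's whole purpose is to hand the resource to its caller, wrapping the allocation in try-with-resources would close it before the caller ever uses it:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class EscapeDemo {
    // The resource ESCAPES this function: the caller now owns its lifetime.
    static BufferedReader open(String text) {
        return new BufferedReader(new StringReader(text));
    }

    // An unsafe "fix" would close the reader on return, producing exactly
    // the use-after-close bug the AST filter exists to prevent:
    // static BufferedReader openBroken(String text) {
    //     try (BufferedReader r = new BufferedReader(new StringReader(text))) {
    //         return r;   // r is closed as this method returns!
    //     }
    // }

    public static void main(String[] args) throws IOException {
        try (BufferedReader r = open("hello")) { // caller manages the lifetime
            System.out.println(r.readLine());    // prints hello
        }
    }
}
```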
This filtering is particularly important from an LLMOps perspective. It demonstrates a “narrow the scope” principle: by carefully constraining the problem space presented to the LLM, the team achieved much higher success rates than previous approaches like InferFix, which attempted to handle more complex cases and achieved only 70% accuracy.
For leaks that pass the AST-level safety checks, FixrLeak crafts tailored prompts for OpenAI’s GPT-4o model. The text doesn’t provide extensive details on the prompt engineering approach, but the context-specific nature of the prompts is emphasized—they include the relevant function code and information about the specific resource leak to be fixed.
The choice of GPT-4o as the underlying model is notable. While this creates a dependency on an external API and proprietary model, it provides access to state-of-the-art code generation capabilities without the need to train or fine-tune custom models.
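Since the actual prompt template is not published, the following is only a hypothetical sketch of the "context-specific" prompting the article describes: the leaking function's source plus the resource details reported by SonarQube, with instructions to change as little as possible.

```java
public class LeakFixPrompt {
    // Hypothetical prompt builder; the template text and fields are
    // illustrative assumptions, not Uber's actual prompt.
    static String buildPrompt(String functionSource, String resourceType, int line) {
        return """
                The following Java method leaks a %s allocated on line %d.
                Rewrite the method to release the resource using
                try-with-resources, changing as little else as possible.
                Return only the rewritten method.

                %s""".formatted(resourceType, line, functionSource);
    }

    public static void main(String[] args) {
        String prompt = buildPrompt("void read() { ... }", "FileInputStream", 42);
        System.out.println(prompt.contains("try-with-resources")); // true
    }
}
```

Constraining the model to one named pattern ("use try-with-resources") and to a single function keeps the output space small, which is consistent with the scoping philosophy described above.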
The LLM response is processed to extract the suggested fix, which replaces the original leaky function. However, the system doesn’t blindly trust the AI output: before submitting a pull request, FixrLeak runs multiple automated validation checks on the patched code.
This multi-layer validation pipeline is essential for production LLM deployments. It acknowledges that LLM outputs, while often correct, can occasionally contain subtle errors or break assumptions elsewhere in the codebase. The automated verification catches these issues before human review.
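The gating logic can be sketched as a chain of predicates that all must pass before a pull request is opened. The specific checks below (patch sanity, build, existing tests) are illustrative assumptions; the source only says multiple checks run before human review.

```java
import java.util.List;
import java.util.function.Predicate;

public class FixValidationPipeline {
    // Hypothetical model of a candidate fix and the gate in front of the PR.
    record Candidate(String patchedSource, boolean compiles, boolean testsPass) {}

    static boolean accept(Candidate c) {
        List<Predicate<Candidate>> checks = List.of(
                x -> x.patchedSource().contains("try ("), // fix was actually applied
                Candidate::compiles,                      // patched code still builds
                Candidate::testsPass                      // existing tests stay green
        );
        return checks.stream().allMatch(p -> p.test(c)); // all checks must pass
    }

    public static void main(String[] args) {
        System.out.println(accept(new Candidate("try (var r = open())", true, true)));  // true
        System.out.println(accept(new Candidate("try (var r = open())", true, false))); // false
    }
}
```

Dropping a fix is cheap; merging a bad one is not, so the pipeline is biased toward rejecting anything that fails any single check.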
Finally, pull requests are generated for developer review. The text notes that “usually, all they need to do is one-click accept,” suggesting high confidence in the fixes, though human oversight remains part of the workflow.
The case study provides concrete metrics on FixrLeak’s performance at Uber: of the 124 resource leaks identified, 102 passed the eligibility filters, and 93 of those were fixed automatically.
This represents an approximately 91% success rate on eligible cases, with 75% of all non-deprecated leaks fixed automatically. While impressive, it’s worth noting the careful scoping that went into achieving these results—the AST-level filtering removed the harder cases before they reached the LLM.
The system is deployed as a continuous process that “runs periodically on the Java codebase and will quickly generate fixes for resource leaks introduced in the future,” representing a mature production deployment rather than a one-time batch fix.
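A periodic deployment of this kind can be sketched with a standard scheduler; the interval and structure below are illustrative assumptions, since the article only says the pipeline runs periodically.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PeriodicScan {
    // Hypothetical sketch of a recurring scan-and-fix job; the real cadence
    // and orchestration at Uber are not described in the source.
    static final AtomicInteger scansCompleted = new AtomicInteger();

    static void scanAndFix() {              // stands in for the full pipeline:
        scansCompleted.incrementAndGet();   // scan -> filter -> prompt -> validate -> PR
    }

    public static void main(String[] args) throws InterruptedException {
        var scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(PeriodicScan::scanAndFix,
                0, 10, TimeUnit.MILLISECONDS); // short period only for the demo
        Thread.sleep(200);                     // let a few scan cycles run
        scheduler.shutdown();
        System.out.println(scansCompleted.get() >= 1); // true
    }
}
```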
The case study contextualizes FixrLeak against previous solutions:
RLFixer (non-GenAI): Relied on pre-designed templates and the WALA analysis framework. While effective for some leaks, it struggled to scale in massive codebases and required extensive manual setup for each new programming idiom.
InferFix (GenAI-based): An earlier LLM-based approach that achieved only 70% fix accuracy and had challenges with complex leaks. It also relied on proprietary models that couldn’t easily adapt to evolving technologies.
FixrLeak’s improvement comes from its “template-free approach” that leverages modern LLMs’ code generation capabilities, combined with strategic use of AST analysis to focus on well-scoped problems.
The case study articulates several key takeaways that align with LLMOps best practices:
Prioritize structured code analysis: AST-based techniques help ensure fixes are safe and context-aware. This represents a broader principle of combining traditional deterministic tools with probabilistic LLM outputs.
Automate targeted fixes: Focus on well-scoped, high-confidence fixes first to maximize success rates. This is essentially a guidance to “start narrow and expand” rather than attempting to solve all cases at once.
Integrate AI responsibly: Validate AI-generated code with rigorous testing and code review processes. Human oversight remains important even with high-accuracy systems.
The team outlines planned expansions, though the specifics are not enumerated here; the stated directions suggest confidence in the current approach and ambition to tackle progressively harder problems.
While the case study presents compelling results, several aspects warrant consideration:
Fixing 93 of 102 eligible leaks is a strong result, but those 102 cases are a carefully filtered subset of the original 124 leaks. The true complexity lies in the remainder: inter-procedural leaks, deprecated code, and the 9 failures even among eligible cases.
The reliance on OpenAI’s GPT-4o creates an external dependency that may have cost, latency, and availability implications at scale. The text doesn’t discuss these operational considerations.
The “one-click accept” characterization for code review may oversimplify the cognitive load on developers reviewing AI-generated fixes. Even high-quality automated fixes require careful review, particularly for subtle resource management issues.
Overall, FixrLeak represents a mature, pragmatic application of LLMs to a well-defined software engineering problem, with appropriate safeguards and validation pipelines. The combination of traditional static analysis tools with generative AI, rather than relying solely on either approach, appears to be a key factor in its success.