Company
Uber
Title
GenAI-Powered Automated Resource Leak Fixing in Java Codebases
Industry
Tech
Year
2025
Summary (short)
Uber developed FixrLeak, a generative AI-based framework to automate the detection and repair of resource leaks in their Java codebase. Resource leaks occur when files, database connections, or streams aren't properly released; they cause performance degradation and system failures, and while tools like SonarQube detect them, fixing them remains manual and error-prone. FixrLeak combines Abstract Syntax Tree (AST) analysis with generative AI (specifically OpenAI's ChatGPT-4o) to produce accurate, idiomatic fixes following Java best practices like try-with-resources. When tested on 124 resource leaks in Uber's codebase, FixrLeak successfully automated fixes for 93 out of 102 eligible cases (after filtering out deprecated code and complex inter-procedural leaks), significantly reducing manual effort and improving code quality at scale.
## Overview

Uber's FixrLeak represents a sophisticated production deployment of generative AI for automated code repair at industrial scale. The system addresses a specific, well-defined software engineering problem: resource leaks in Java code, where resources like files, database connections, or streams aren't properly released after use. This case study demonstrates how Uber combined traditional static analysis techniques with modern large language models to create a practical, production-ready tool that operates continuously on their extensive Java codebase.

The implementation showcases several key LLMOps principles, including careful scope definition, multi-stage validation, and integration into existing development workflows. Rather than attempting to solve all resource leak scenarios, Uber strategically focused on a subset of problems where AI could deliver high accuracy, demonstrating pragmatic AI deployment that prioritizes reliability over breadth of coverage.

## Problem Context and Technical Background

Resource leaks are a persistent challenge in Java applications: allocated resources (file descriptors, database connections, network sockets, streams) fail to be properly released, accumulate over time, and lead to performance degradation, resource exhaustion, and ultimately system failures. The traditional approach involved manual code reviews and fixes, which is time-consuming and error-prone at scale. While detection tools like SonarQube effectively identify resource leaks through static analysis, the remediation process remained entirely manual.

Previous automated solutions had significant limitations. Non-GenAI tools like RLFixer relied on pre-designed templates and frameworks like WALA, but struggled to scale in massive codebases and required extensive manual setup for each programming idiom. Early GenAI solutions like InferFix achieved only 70% fix accuracy and faced challenges with complex leaks requiring advanced code analysis. Additionally, InferFix relied on proprietary models that couldn't easily adapt to evolving technologies.

Uber recognized an opportunity to leverage generative AI while addressing these limitations through careful engineering and scope management. The key insight was to focus on intra-procedural leaks (where resource lifetime doesn't exceed the allocating function), where GenAI could achieve higher accuracy with proper guardrails.

## Architecture and LLMOps Implementation

FixrLeak's architecture demonstrates a mature approach to integrating LLMs into production workflows, combining multiple validation stages to ensure quality and reliability.

### Input Processing and Leak Detection

The system begins by consuming resource leak reports from SonarQube, extracting metadata including file names and line numbers. To maintain accuracy as the codebase evolves, FixrLeak implements a deterministic hashing mechanism based on file and function names. This allows the system to track leaks and their fixes across code changes, avoiding redundant work and maintaining audit trails.

The input processing phase uses Tree-sitter, a parser generator tool and incremental parsing library, to parse Java source code and extract the relevant functions for analysis. This structured parsing approach is crucial for the subsequent AST analysis stage and demonstrates the importance of combining traditional program analysis with AI-driven approaches.
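To ground the discussion, here is a minimal, hypothetical illustration of the kind of leak SonarQube flags and the idiomatic try-with-resources rewrite FixrLeak aims to produce. The code is illustrative only and not taken from Uber's codebase or the blog:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class UserLoader {
    // Leaky version: if readLine() throws, close() is never reached,
    // and the underlying file descriptor leaks.
    static String firstLineLeaky(String path) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(path));
        String line = reader.readLine();
        reader.close(); // skipped entirely on the exception path
        return line;
    }

    // Fixed version: try-with-resources guarantees close() is called
    // on every exit path, including exceptions.
    static String firstLineFixed(String path) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            return reader.readLine();
        }
    }
}
```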
### AST-Based Pre-Filtering

A critical component of FixrLeak's LLMOps strategy is the AST-level analysis that occurs before engaging the generative AI model. This pre-filtering stage embodies an important principle: not all problems should be sent to the LLM. By performing deterministic analysis first, Uber avoids wasting API calls on scenarios where automated fixes would be unsafe or incorrect.

The AST analysis identifies and filters out several complex scenarios where simple function-level fixes would be inappropriate. Specifically, it excludes cases where resources are passed as parameters to functions, returned from functions, or stored in class fields. In these situations the resource lifetime extends beyond the function scope, so an intra-procedural fix would be incorrect and could introduce new bugs such as use-after-close errors; for example, a method that opens a JDBC `Connection` and returns it to its caller cannot safely close that connection locally.

This filtering strategy is noteworthy from an LLMOps perspective because it demonstrates understanding of model limitations. Rather than relying solely on the LLM to make these determinations (which could be unreliable), Uber uses deterministic program analysis where it's most effective. This hybrid approach maximizes success rates by only presenting the LLM with problems it can reliably solve.

### Prompt Engineering Strategy

Once a resource leak passes the AST-based filtering, FixrLeak crafts a tailored prompt for the generative AI model. While the case study doesn't provide detailed prompt templates, it indicates that prompts are customized for each specific leak scenario. The system uses OpenAI's ChatGPT-4o as the underlying model, suggesting a reliance on state-of-the-art commercial LLM capabilities.

The prompt engineering approach likely includes relevant code context (the function with the leak), expected fix patterns (try-with-resources statements per Java best practices), and potentially examples or constraints. The fact that FixrLeak achieves a high success rate (93 out of 102 eligible cases) suggests effective prompt design that guides the model toward idiomatic Java solutions.

From an LLMOps perspective, this phase represents the core AI integration point. The choice to use a commercial API (OpenAI) rather than self-hosted models reflects common tradeoffs: commercial APIs offer cutting-edge capabilities and eliminate infrastructure management overhead, though they introduce dependencies on external services and potentially higher costs at scale.

### Multi-Stage Validation and Quality Assurance

FixrLeak implements a sophisticated validation pipeline that runs before any code changes are proposed to developers. This multi-stage verification is essential for production AI systems generating code that will be merged into critical systems.

The validation process includes several layers. First, FixrLeak verifies that the target binary builds successfully with the proposed fix, ensuring syntactic correctness and compatibility with the existing codebase. Second, it runs all existing unit tests to confirm that the fix doesn't break functionality, a crucial safety check that prevents AI-generated code from introducing regressions. Third, the system can re-run SonarQube analysis on the fixed code to verify that the resource leak has actually been resolved, providing end-to-end validation of the fix's effectiveness. This comprehensive testing strategy addresses a fundamental challenge in LLMOps: AI models can generate plausible-looking code that is nonetheless incorrect or introduces subtle bugs.
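As a rough sketch of what such validation gates could look like, the following hypothetical Java driver shells out to build, test, and re-analysis commands in sequence. The blog does not describe Uber's actual build system or commands, so Maven and the `sonar:sonar` goal are assumed here purely for illustration:

```java
import java.io.IOException;
import java.nio.file.Path;

public class FixValidator {
    // Runs a command in the repository root; true means exit code 0.
    static boolean run(Path repo, String... command)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .directory(repo.toFile())
                .inheritIO()
                .start();
        return process.waitFor() == 0;
    }

    // Hypothetical gate sequence: each stage must pass before the next runs.
    static boolean validateFix(Path repo) throws IOException, InterruptedException {
        return run(repo, "mvn", "-q", "compile")   // 1. the fix must build
            && run(repo, "mvn", "-q", "test")      // 2. existing unit tests must pass
            && run(repo, "mvn", "sonar:sonar");    // 3. re-run SonarQube analysis
                                                   //    (leak-resolution check simplified)
    }
}
```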
By requiring fixes to pass multiple automated checks before human review, Uber significantly reduces the risk of problematic AI-generated code reaching production.

### Deployment and Integration into Development Workflow

FixrLeak is deployed as a continuously running service that periodically scans Uber's Java codebase for resource leaks. When it identifies and fixes leaks, it automatically generates pull requests for developer review. This integration into existing development workflows is a key LLMOps success factor: the AI augments rather than replaces existing processes.

The pull request automation represents the final integration point where AI-generated fixes enter the human review process. According to Uber, developers typically need only perform a "one-click accept" for these pull requests, suggesting high confidence in the generated fixes. However, maintaining human review as a final gate provides accountability and allows developers to catch edge cases or issues that automated validation might miss.

The continuous operation model means FixrLeak doesn't just fix existing leaks but also catches new ones as they're introduced, providing ongoing code quality improvement. This represents a mature deployment where the AI system operates autonomously within well-defined guardrails.

## Results and Effectiveness

Uber tested FixrLeak on 124 resource leaks identified by SonarQube in their Java codebase. After excluding 12 leaks in deprecated code, the AST-level analysis filtered the remaining 112 leaks, ultimately identifying 102 as eligible for automated fixing (cases where resources were confined to function scope). Of these 102 eligible cases, FixrLeak successfully automated fixes for 93 leaks, a 91% success rate on the filtered subset. This high success rate is noteworthy and likely results from the careful scoping strategy: by filtering out complex inter-procedural cases upfront, Uber ensured the LLM only tackled problems where it could succeed.

It's important to interpret these results carefully. The 91% success rate applies only to the subset of leaks that passed AST filtering (102 of the 112 remaining after removing deprecated code). The AST analysis filtered out roughly 9% of those cases (10 out of 112) because fixing them would require inter-procedural analysis. This demonstrates a pragmatic engineering tradeoff: achieving high reliability on a focused problem set rather than attempting comprehensive coverage with lower accuracy.

From an LLMOps perspective, this scoping strategy is instructive. Rather than claiming to solve all resource leaks, Uber clearly defines the boundaries of what FixrLeak handles well, uses deterministic analysis to enforce those boundaries, and achieves high reliability within that scope. This approach is more sustainable for production systems than attempting to use AI for all cases and accepting lower overall reliability.

## LLMOps Considerations and Tradeoffs

Several important LLMOps lessons emerge from Uber's FixrLeak deployment:

**Hybrid AI-Traditional Approaches**: FixrLeak demonstrates the power of combining traditional program analysis (AST-based filtering) with modern LLMs. The deterministic analysis handles what it does well (identifying resource lifetime patterns) while the LLM handles what it does well (generating idiomatic code fixes). This division of labor is more effective than relying solely on either approach.
**Scope Management for Reliability**: By explicitly limiting the problem space to intra-procedural leaks, Uber achieves success rates high enough to enable automated deployment. This contrasts with attempting to solve all cases at lower accuracy, which would require more human intervention and reduce the benefits of automation. The tradeoff is coverage (some leaks aren't addressed) versus reliability (the fixes that are produced work well).

**Validation as a Critical Safety Layer**: The multi-stage validation pipeline (build verification, test execution, SonarQube re-check) is essential for safely deploying AI-generated code. This represents significant engineering investment beyond just calling an LLM API, but it is necessary for production deployment in critical systems.

**API Dependency Considerations**: The reliance on OpenAI's ChatGPT-4o introduces external dependencies that warrant consideration. While commercial APIs provide cutting-edge capabilities, they create potential issues around cost, rate limiting, service availability, and lack of control over model updates. Uber's scale likely makes API costs substantial, though the blog doesn't discuss these economic considerations.

**Prompt Engineering as a Core Competency**: Though not detailed extensively in the blog, the success of FixrLeak depends significantly on effective prompt engineering. The ability to craft prompts that consistently produce idiomatic, correct fixes is a critical LLMOps skill that likely required significant iteration and refinement.

**Continuous Operation Model**: Deploying FixrLeak as a continuously running service that catches new leaks represents mature LLMOps, where AI systems operate autonomously within guardrails. This requires robust error handling, monitoring, and alerting infrastructure that the blog doesn't detail but that is essential for production operation.

## Future Directions and Limitations

Uber acknowledges several areas for future enhancement that highlight current limitations:

**Inter-Procedural Fixes**: The current system only handles leaks where resources are confined to a single function. Expanding to leaks that span multiple functions would increase coverage but presents significant challenges: inter-procedural analysis is more complex, and ensuring fix correctness across function boundaries would require more sophisticated validation. A hypothetical example of such a leak appears below.

**GenAI-Based Leak Detection**: Current detection relies on SonarQube's rule-based static analysis. Incorporating GenAI for detection could identify leaks that rule-based tools miss, particularly for user-defined resource classes. This would represent another application of LLMs in the workflow, though with its own accuracy and false-positive challenges.

**Multi-Language Support**: Uber plans to extend FixrLeak to Golang, which currently lacks robust resource leak detection tools. This expansion demonstrates the potential for the approach to generalize beyond Java, though each language would require adaptation of the AST analysis and prompt engineering.

These future directions indicate that while FixrLeak represents a successful production deployment, it addresses a focused subset of the overall resource leak problem. The system's current limitations around inter-procedural leaks and language support reflect pragmatic scoping decisions rather than fundamental technical barriers.
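To illustrate why inter-procedural cases are harder, consider this minimal, hypothetical Java example. The stream created in `openLog` escapes the allocating function, so wrapping it in try-with-resources there would close it before the caller writes; a correct fix requires reasoning about both functions at once:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class LogWriter {
    // The stream escapes this function, so a try-with-resources fix here
    // would be wrong: it would close the stream before the caller uses it.
    // An AST filter like FixrLeak's would skip this case.
    static OutputStream openLog(String path) throws IOException {
        return new FileOutputStream(path);
    }

    static void writeEntry(String path, byte[] entry) throws IOException {
        OutputStream log = openLog(path);
        log.write(entry); // if write() throws, the stream is never closed
        log.close();
    }

    // A correct fix must change the caller, not the allocator:
    static void writeEntrySafely(String path, byte[] entry) throws IOException {
        try (OutputStream log = openLog(path)) {
            log.write(entry);
        }
    }
}
```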
## Critical Assessment

While Uber's blog naturally presents FixrLeak positively, a balanced assessment should consider several factors.

The reported success on 93 out of 102 cases is impressive but applies only to pre-filtered cases; the true end-to-end automation rate from all detected leaks to automated fixes is lower once the filtered cases are counted. Additionally, the blog doesn't discuss false positives: cases where FixrLeak generated fixes that appeared correct but actually introduced subtle bugs that were caught during review.

The reliance on commercial LLM APIs like OpenAI's introduces dependencies and costs that may not be sustainable for all organizations. The blog doesn't discuss API costs, rate-limiting challenges, or strategies for handling API outages, all of which are real concerns for production systems.

The "one-click accept" description of developer review may oversimplify the reality. While many fixes may be straightforward, some likely require careful review, and the blog doesn't quantify what percentage genuinely required only minimal review versus deeper analysis.

Despite these caveats, FixrLeak represents a genuinely impressive production deployment of GenAI for code generation. The careful engineering around scope management, validation, and workflow integration demonstrates mature LLMOps practices. The focus on a well-defined problem where AI can excel, combined with robust guardrails, provides a template for successfully deploying AI in software engineering contexts.

## Broader Implications for LLMOps

FixrLeak offers several lessons for organizations considering similar AI-driven code repair systems. The importance of combining AI with traditional analysis techniques cannot be overstated: pure AI approaches may struggle with reliability, while hybrid systems can leverage the strengths of both paradigms. Comprehensive validation pipelines for AI-generated code are essential; automated testing, build verification, and potentially specialized checks are necessary before human review. Strategic scope limitation to maximize success rates is a pragmatic approach: it's often better to solve a focused problem well than to attempt a comprehensive solution with lower reliability. And integration into existing workflows (such as pull requests), rather than replacing processes entirely, tends to be more successful and more acceptable to developers.

The case study demonstrates that GenAI for code generation is moving beyond experimental phases into production deployment at major technology companies. However, success requires substantial engineering investment in validation, filtering, and integration, not just API calls to LLMs. Organizations considering similar systems should carefully evaluate the economics of commercial API usage at scale, invest in prompt engineering expertise, develop comprehensive testing strategies, and clearly define problem scope to maximize reliability. The FixrLeak case study provides a valuable reference point for these considerations.
