Automated Unit Test Improvement Using LLMs for Android Applications

Meta 2024

Meta developed TestGen-LLM, a tool that leverages large language models to automatically improve unit test coverage for Android applications written in Kotlin. The system uses an Assured Offline LLM-Based Software Engineering approach to generate additional test cases while maintaining strict quality controls. When deployed at Meta, particularly for Instagram and Facebook platforms, the tool successfully enhanced 10% of the targeted classes with reliable test improvements that were accepted by engineers for production use.

Industry

Tech

Overview

Meta developed TestGen-LLM, an innovative tool that leverages Large Language Models to automate the improvement of unit tests for Android applications written in Kotlin. This case study, published in February 2024, represents a significant application of LLMs in production software engineering workflows at one of the world’s largest technology companies. The tool was deployed and evaluated on major platforms including Instagram and Facebook, demonstrating practical utility in an industrial setting.

The core problem TestGen-LLM addresses is the challenge of comprehensive test coverage. Even well-tested codebases often have edge cases that remain untested, leaving potential bugs undiscovered. Traditional approaches to improving test coverage require significant manual effort from developers. By automating this process with LLMs, Meta aimed to systematically enhance test quality across their massive codebase while reducing the burden on engineering teams.

Technical Architecture and Methodology

TestGen-LLM employs a methodology Meta terms “Assured Offline LLM-Based Software Engineering” (Assured Offline LLMSE). This approach is particularly noteworthy from an LLMOps perspective because it emphasizes the importance of validation and quality assurance when deploying LLM-generated artifacts in production environments. The “assured” aspect refers to the system’s rigorous verification of LLM outputs before they are recommended for inclusion in the codebase.

The system architecture involves a dual-use approach focused on both evaluation and deployment. This is a critical design decision that allows the tool to be continuously assessed while simultaneously providing value in production. The separation of concerns between evaluation and deployment enables Meta to track the effectiveness of the tool over time and make improvements to the underlying LLM prompting strategies or filtering mechanisms.

A cornerstone of TestGen-LLM’s design is its filtration process. Rather than blindly accepting all test cases generated by the LLM, the system applies stringent criteria to filter candidates. This is a crucial LLMOps pattern that acknowledges the limitations of current LLM technology—namely that LLM outputs are not always correct or reliable. The filtering criteria include whether each candidate test builds, whether it passes, whether it passes reliably across repeated runs (screening out flaky tests), and whether it measurably increases coverage over the existing test suite.

This multi-stage validation pipeline ensures that only high-quality test cases make it through to the recommendation stage. This approach represents a best practice in LLMOps: treating LLM outputs as candidates that must be verified rather than as authoritative results.
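The staged filter described above can be sketched in a few lines. This is a minimal illustration, not Meta's implementation: the `builds`, `passes`, and `coverage_gain` callables are hypothetical stand-ins for real build, test-runner, and coverage tooling, and the repeated-run flakiness check follows the description in the text.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    """An LLM-generated test case awaiting validation."""
    name: str
    source: str


def filter_candidates(
    candidates: List[Candidate],
    builds: Callable[[Candidate], bool],          # does the test compile?
    passes: Callable[[Candidate], bool],          # does a single run pass?
    coverage_gain: Callable[[Candidate], float],  # extra lines/branches covered
    flaky_runs: int = 5,
) -> List[Candidate]:
    """Keep only candidates that build, pass repeatedly (non-flaky),
    and measurably improve coverage."""
    accepted = []
    for cand in candidates:
        if not builds(cand):
            continue
        # Re-run several times to screen out flaky tests.
        if not all(passes(cand) for _ in range(flaky_runs)):
            continue
        if coverage_gain(cand) <= 0.0:
            continue
        accepted.append(cand)
    return accepted


# Toy usage: stub checkers stand in for real build/test/coverage tooling.
cands = [Candidate("t_ok", "..."), Candidate("t_noisy", "..."), Candidate("t_dup", "...")]
kept = filter_candidates(
    cands,
    builds=lambda c: True,
    passes=lambda c: c.name != "t_noisy",
    coverage_gain=lambda c: 2.0 if c.name == "t_ok" else 0.0,
)
print([c.name for c in kept])  # → ['t_ok']
```

Only candidates that clear every stage survive; anything failing an earlier, cheaper check (building, passing) is discarded before the more expensive coverage measurement runs.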

Deployment and Evaluation

TestGen-LLM was deployed at Meta through test-a-thons, organized events where the tool was applied to platforms like Instagram and Facebook. This deployment strategy is interesting from an LLMOps perspective as it combines automated tooling with human oversight and feedback collection. Specifically, evaluations were conducted within Instagram’s Reels and Stories features, which are high-traffic, business-critical components of the application.

The evaluation found that a significant portion of the test cases generated by TestGen-LLM were reliable and offered tangible improvements in coverage. While the paper does not specify exact percentages for reliability rates, the fact that these tests were deemed acceptable for production use indicates a high bar was met.

Results and Production Impact

The quantitative results reported are notable: TestGen-LLM was able to enhance 10% of the classes it was applied to. While this might seem modest at first glance, in the context of a mature codebase that likely already has substantial test coverage, finding opportunities for meaningful improvement in one out of every ten classes represents significant value. Moreover, most of the recommendations generated by TestGen-LLM were accepted by Meta’s software engineers for inclusion in production.

The acceptance by software engineers is a critical validation metric. It demonstrates that the generated tests were not only technically valid (passing the automated filters) but also met the quality standards and expectations of human reviewers. This human-in-the-loop validation is an important component of responsible LLM deployment, particularly when the outputs will become part of the permanent codebase.

LLMOps Patterns and Considerations

Several LLMOps patterns emerge from this case study that are worth highlighting:

Output Validation Pipeline: The filtration process is a prime example of building robust validation around LLM outputs. Rather than trusting the LLM to generate correct code, the system applies multiple automated checks to verify quality. This pattern is applicable across many LLM use cases where output correctness is critical.

Ensemble Learning: The paper mentions the use of ensemble learning, suggesting that multiple LLM calls or models may be used to generate candidate tests, with the best results being selected or combined. This technique can improve the diversity and quality of generated outputs compared to single-shot generation.
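One simple way to realize the ensemble idea is to pool candidates from several prompt or model configurations and deduplicate before filtering. The sketch below is an assumption about how such pooling might look, not the paper's method: each `generator` is a hypothetical wrapper around one LLM configuration, and deduplication here is just whitespace-normalized string matching.

```python
from typing import Callable, List


def ensemble_generate(
    generators: List[Callable[[str], List[str]]],  # each wraps one prompt/model config
    class_under_test: str,
) -> List[str]:
    """Pool candidate tests from several configurations, dropping exact
    duplicates, so downstream filters see a diverse candidate set."""
    seen = set()
    pooled = []
    for gen in generators:
        for test_src in gen(class_under_test):
            key = " ".join(test_src.split())  # normalize whitespace for dedup
            if key not in seen:
                seen.add(key)
                pooled.append(test_src)
    return pooled


# Toy usage: stub generators emit overlapping candidate tests.
g1 = lambda cls: ["fun testA() {}", "fun testB() {}"]
g2 = lambda cls: ["fun testB() {}", "fun testC() {}"]
pooled = ensemble_generate([g1, g2], "ReelsPlayer")
print(len(pooled))  # → 3
```

The diversity benefit comes from the union: a candidate that one configuration misses may be produced by another, while the dedup step keeps the downstream filtering cost proportional to the number of distinct candidates.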

Human-in-the-Loop: Even after automated filtering, human engineers review and approve the recommended tests before they enter production. This layered approach to quality assurance is essential when LLM outputs will have lasting impact on the codebase.

Evaluation-Deployment Duality: The system architecture supports both evaluation and deployment, enabling continuous measurement of effectiveness. This is a mature LLMOps practice that allows for data-driven improvements over time.

Limitations and Balanced Assessment

While the results are promising, it’s important to maintain a balanced perspective. The 10% improvement rate means that 90% of classes did not receive beneficial test improvements from the tool. This could be because those classes already had comprehensive test coverage, or because the LLM was unable to generate valid improvements for them.

The paper does not provide details about the computational costs of running TestGen-LLM, the latency of test generation, or the false positive rate (tests that passed filtering but were rejected by human reviewers). These operational metrics would be valuable for understanding the full picture of deploying such a system at scale.

Additionally, the focus on Kotlin/Android limits the immediate generalizability of the results, though the methodology could presumably be adapted for other languages and platforms.

Implications for the Field

This case study demonstrates the viability of LLM-based tools for software engineering tasks in production environments at scale. The success of TestGen-LLM at Meta opens avenues for applying similar techniques to other aspects of software development and testing, including test maintenance, bug detection, and code review automation.

The emphasis on assured outputs through rigorous filtering provides a template for other organizations looking to deploy LLMs in contexts where correctness matters. Rather than waiting for LLMs to become perfectly reliable, Meta’s approach acknowledges current limitations while still extracting value through careful system design.
