Meta developed TestGen-LLM, a tool that leverages large language models to automatically improve unit test coverage for Android applications written in Kotlin. The system uses an Assured Offline LLM-Based Software Engineering approach to generate additional test cases while maintaining strict quality controls. When deployed at Meta, particularly on the Instagram and Facebook platforms, the tool successfully enhanced 10% of the targeted classes with reliable test improvements that were accepted by engineers for production use.
This case study, published in February 2024, represents a significant application of LLMs in production software engineering workflows at one of the world's largest technology companies. TestGen-LLM was deployed and evaluated on major platforms including Instagram and Facebook, demonstrating practical utility in an industrial setting.
The core problem TestGen-LLM addresses is the challenge of comprehensive test coverage. Even well-tested codebases often have edge cases that remain untested, leaving potential bugs undiscovered. Traditional approaches to improving test coverage require significant manual effort from developers. By automating this process with LLMs, Meta aimed to systematically enhance test quality across their massive codebase while reducing the burden on engineering teams.
TestGen-LLM employs a methodology Meta terms “Assured Offline LLM-Based Software Engineering” (Assured Offline LLMSE). This approach is particularly noteworthy from an LLMOps perspective because it emphasizes the importance of validation and quality assurance when deploying LLM-generated artifacts in production environments. The “assured” aspect refers to the system’s rigorous verification of LLM outputs before they are recommended for inclusion in the codebase.
The system architecture involves a dual-use approach focused on both evaluation and deployment. This is a critical design decision that allows the tool to be continuously assessed while simultaneously providing value in production. The separation of concerns between evaluation and deployment enables Meta to track the effectiveness of the tool over time and make improvements to the underlying LLM prompting strategies or filtering mechanisms.
A cornerstone of TestGen-LLM's design is its filtration process. Rather than blindly accepting all test cases generated by the LLM, the system applies stringent criteria to filter candidates. This is a crucial LLMOps pattern that acknowledges the limitations of current LLM technology—namely that LLM outputs are not always correct or reliable. The filtering criteria include:

- the generated test must build successfully;
- it must pass reliably when executed repeatedly, screening out flaky tests;
- it must measurably increase coverage over the existing test suite.
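The progressively stricter checks described above can be sketched as a simple chain of predicates. This is a hypothetical illustration, not Meta's implementation: the helpers `builds`, `passes_repeatedly`, and `improves_coverage` stand in for internal build, test-execution, and coverage tooling, and here just inspect toy candidate strings.

```python
# Hypothetical sketch of an LLM-output filtration pipeline. The three
# predicates are stand-ins for real build/execution/coverage tooling.

def builds(test_src: str) -> bool:
    # Stand-in: would compile the candidate test against the target class.
    return "fun test" in test_src

def passes_repeatedly(test_src: str, runs: int = 5) -> bool:
    # Stand-in: would execute the test several times to reject flaky tests.
    return all("flaky" not in test_src for _ in range(runs))

def improves_coverage(test_src: str) -> bool:
    # Stand-in: would compare line coverage with and without the new test.
    return "untested_branch" in test_src

def filter_candidates(candidates: list[str]) -> list[str]:
    """Keep only candidates that survive every stage, applied in order."""
    survivors = []
    for test in candidates:
        if builds(test) and passes_repeatedly(test) and improves_coverage(test):
            survivors.append(test)
    return survivors

candidates = [
    "fun testStoryRenders_untested_branch()",  # builds, stable, adds coverage
    "broken syntax",                           # fails the build stage
    "fun testFlakyTiming_flaky()",             # builds but is flaky
]
print(filter_candidates(candidates))
```

Only candidates that clear every stage reach human review; the order matters because the cheapest check (building) prunes candidates before the more expensive repeated-execution and coverage checks run.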
This multi-stage validation pipeline ensures that only high-quality test cases make it through to the recommendation stage. This approach represents a best practice in LLMOps: treating LLM outputs as candidates that must be verified rather than as authoritative results.
TestGen-LLM was deployed at Meta through test-a-thons, organized events where the tool was applied to platforms like Instagram and Facebook. This deployment strategy is interesting from an LLMOps perspective as it combines automated tooling with human oversight and feedback collection. Specifically, evaluations were conducted within Instagram’s Reels and Stories features, which are high-traffic, business-critical components of the application.
The evaluation found that a significant portion of the test cases generated by TestGen-LLM were reliable and offered tangible improvements in coverage. While the paper does not specify exact percentages for reliability rates, the fact that these tests were deemed acceptable for production use indicates a high bar was met.
The quantitative results reported are notable: TestGen-LLM was able to enhance 10% of the classes it was applied to. While this might seem modest at first glance, in the context of a mature codebase that likely already has substantial test coverage, finding a meaningful improvement in one out of every ten classes represents significant value. Moreover, the majority of TestGen-LLM's recommendations were accepted by Meta's software engineers for inclusion in production.
The acceptance by software engineers is a critical validation metric. It demonstrates that the generated tests were not only technically valid (passing the automated filters) but also met the quality standards and expectations of human reviewers. This human-in-the-loop validation is an important component of responsible LLM deployment, particularly when the outputs will become part of the permanent codebase.
Several LLMOps patterns emerge from this case study that are worth highlighting:
Output Validation Pipeline: The filtration process is a prime example of building robust validation around LLM outputs. Rather than trusting the LLM to generate correct code, the system applies multiple automated checks to verify quality. This pattern is applicable across many LLM use cases where output correctness is critical.
Ensemble Learning: The paper mentions the use of ensemble learning, suggesting that multiple LLM calls or models may be used to generate candidate tests, with the best results being selected or combined. This technique can improve the diversity and quality of generated outputs compared to single-shot generation.
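Assuming the ensemble works by sampling several completions under varying configurations and pooling the distinct results, the idea can be sketched as follows. Everything here is illustrative: `generate` is a hypothetical stand-in for a real model call, and the temperature values are arbitrary.

```python
# Hypothetical sketch of ensemble-style candidate generation: sample multiple
# completions (e.g. different temperatures), then pool and deduplicate them
# before they enter the filtration pipeline.
import random

def generate(prompt: str, temperature: float, seed: int) -> str:
    # Stand-in for an LLM call; a real system would query a model API here.
    rng = random.Random(seed)
    suffix = rng.choice(["EmptyInput", "NullField", "MaxLength"])
    return f"fun test{suffix}() {{ /* generated for: {prompt} */ }}"

def ensemble_candidates(prompt: str, temperatures=(0.2, 0.5, 0.8), samples=3):
    """Pool candidates from several sampling configurations, deduplicated."""
    seen, pooled = set(), []
    for i, temp in enumerate(temperatures):
        for j in range(samples):
            test = generate(prompt, temp, seed=i * len(temperatures) + j)
            if test not in seen:  # keep one copy of each distinct test
                seen.add(test)
                pooled.append(test)
    return pooled

pool = ensemble_candidates("extend tests for StoriesController")
print(len(pool))  # number of distinct candidates (at most 3 in this toy setup)
```

Pooling before filtering increases the diversity of candidates the downstream checks see, which is the practical benefit ensembles offer over single-shot generation.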
Human-in-the-Loop: Even after automated filtering, human engineers review and approve the recommended tests before they enter production. This layered approach to quality assurance is essential when LLM outputs will have lasting impact on the codebase.
Evaluation-Deployment Duality: The system architecture supports both evaluation and deployment, enabling continuous measurement of effectiveness. This is a mature LLMOps practice that allows for data-driven improvements over time.
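One concrete way to support this duality, sketched here with assumed stage names rather than anything reported in the paper, is to track how many candidates survive each pipeline stage as a funnel, so the evaluation side can watch conversion rates over time:

```python
# Hypothetical funnel metrics for an LLM test-generation pipeline.
# Input: the furthest stage each candidate reached; output: the fraction
# of all generated candidates that reached each stage.
from collections import Counter

STAGES = ["generated", "builds", "passes", "adds_coverage", "accepted"]

def record_funnel(outcomes: list[str]) -> dict[str, float]:
    """Compute per-stage survival rates from furthest-stage-reached labels."""
    counts = Counter(outcomes)
    reached, running = {}, 0
    # A candidate that reached stage k also reached every earlier stage.
    for stage in reversed(STAGES):
        running += counts.get(stage, 0)
        reached[stage] = running
    total = reached["generated"]
    return {stage: reached[stage] / total for stage in STAGES}

# Toy data: furthest stage reached by each of 10 candidates.
outcomes = (["generated"] * 3 + ["builds"] * 2 + ["passes"] * 2 +
            ["adds_coverage"] * 2 + ["accepted"] * 1)
rates = record_funnel(outcomes)
print(rates["accepted"])  # fraction of generated candidates ultimately merged
```

Tracking rates at each stage separates generation quality (how many candidates build and pass) from recommendation quality (how many are accepted by engineers), which is exactly the distinction the evaluation side of such a system needs.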
While the results are promising, it’s important to maintain a balanced perspective. The 10% improvement rate means that 90% of classes did not receive beneficial test improvements from the tool. This could be because those classes already had comprehensive test coverage, or because the LLM was unable to generate valid improvements for them.
The paper does not provide details about the computational costs of running TestGen-LLM, the latency of test generation, or the false positive rate (tests that passed filtering but were rejected by human reviewers). These operational metrics would be valuable for understanding the full picture of deploying such a system at scale.
Additionally, the focus on Kotlin/Android limits the immediate generalizability of the results, though the methodology could presumably be adapted for other languages and platforms.
This case study demonstrates the viability of LLM-based tools for software engineering tasks in production environments at scale. The success of TestGen-LLM at Meta opens avenues for applying similar techniques to other aspects of software development and testing, including test maintenance, bug detection, and code review automation.
The emphasis on assured outputs through rigorous filtering provides a template for other organizations looking to deploy LLMs in contexts where correctness matters. Rather than waiting for LLMs to become perfectly reliable, Meta’s approach acknowledges current limitations while still extracting value through careful system design.