Automated Unit Test Generation Pipeline for iOS Using LLMs

Duolingo 2026

Duolingo built an automated pipeline using LLMs to generate unit tests for their iOS codebase, addressing the bottleneck where verification speed couldn't keep pace with their rapid development cycle that increasingly includes LLM-generated code. The system uses Claude Code integrated with Temporal workflows to autonomously identify untested files, generate test code, manage pull requests through their lifecycle, auto-heal CI failures, and coordinate reviewer assignment. Over 17 weeks, the pipeline merged 250 PRs containing approximately 85,000 lines of test code and 4,460 test functions, more than tripling test coverage of core MVVM components from 9% to 30%, with 76% of PRs passing CI on first attempt and minimal manual intervention required.

Industry

Education

Overview

Duolingo’s iOS engineering team built a comprehensive LLM-powered system for automated unit test generation that operates in production at substantial scale. This case study is particularly interesting because it addresses a meta-problem in LLMOps: as LLMs increasingly generate production code, the verification and testing bottleneck becomes more critical than the code generation itself. The team’s solution demonstrates sophisticated orchestration, state management, and lifecycle automation around LLM code generation agents.

The business context is that Duolingo ships iOS releases weekly with a codebase growing by tens of thousands of lines monthly, and a meaningful portion of new code is now LLM-generated. This creates a fundamental shift where verification speed, not code writing speed, becomes the limiting factor. Unit tests are positioned as the cheapest and fastest verification layer, making automated test generation a strategic priority for sustaining development velocity.

System Architecture and Components

The system is architected around four major components that evolved over approximately 17 weeks from initial experimentation to full production deployment.

Local Validation Phase: The team started with manual experimentation, running Claude Code locally for several weeks to understand which files LLMs could test effectively and to identify common failure modes. This phase was critical for prompt tuning and building heuristics. The team tracked two early batches of 17 and 40 PRs respectively, achieving approximately 47-48% merge rates. This relatively modest initial success rate was valuable for identifying systematic issues that could be addressed at scale.

The failures clustered into specific categories: mock type mismatches, Swift 6 Sendable violations, mixing of SwiftTesting and XCTest APIs, access control problems, missing try/throws syntax, and publisher timing issues. Importantly, some failures revealed architectural problems in the codebase itself rather than LLM limitations. For example, repository classes were taking concrete UserClient instances instead of the protocol interface, making dependency injection impossible. The team migrated the entire codebase to use protocols and added linter rules to enforce this pattern going forward. They also rewrote the LLM rules to be explicit about framework selection and added guidance about @Sendable annotations and @preconcurrency imports. These fixes improved CI pass rates for all engineers, not just the automated pipeline.
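The dependency injection problem can be illustrated with a minimal sketch. The codebase in question is Swift; this is a Python analogue using `typing.Protocol`, and every name in it (`UserClientProtocol`, `UserRepository`, `FakeUserClient`, `fetch_display_name`) is hypothetical rather than taken from Duolingo's code.

```python
from typing import Protocol


class UserClientProtocol(Protocol):
    """Hypothetical interface standing in for the protocol a repository accepts."""

    def fetch_display_name(self, user_id: int) -> str: ...


class UserRepository:
    # Depends on the interface rather than a concrete client,
    # so a test can inject a fake without touching the network.
    def __init__(self, client: UserClientProtocol) -> None:
        self._client = client

    def greeting(self, user_id: int) -> str:
        return f"Hello, {self._client.fetch_display_name(user_id)}!"


class FakeUserClient:
    """Test double: satisfies the protocol structurally."""

    def __init__(self, names: dict[int, str]) -> None:
        self._names = names

    def fetch_display_name(self, user_id: int) -> str:
        return self._names[user_id]
```

Had `UserRepository` constructed or required a concrete client type, a generated test would have had no seam at which to substitute `FakeUserClient`, which is the class of failure described above.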

Label Trigger Component: For generating tests alongside new development, developers can add a trigger:write-tests label to any iOS PR. A GitHub webhook fires when the label is applied, triggering the iOSTestGenerationForPRWorkflow in Temporal via their Nexus gateway. The workflow fetches the PR’s changed files and calls CodingAgentWorkflow to generate tests for all testable files in a single agent session. The result is a draft PR branched off the developer’s feature branch with a comment linking back to the original PR. This allows developers to review and merge test PRs into their feature branch before the main PR lands, maintaining a tight feedback loop.
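As a rough sketch of the label check, the function below decides whether a GitHub webhook delivery should start test generation. It relies on the documented shape of GitHub's `pull_request` event (an `action` field, a `label` object on `labeled` events, and `pull_request.number`). The function name is an assumption, and how the result is handed to Temporal is deliberately left out because the source does not describe it.

```python
from typing import Any, Optional

TRIGGER_LABEL = "trigger:write-tests"


def pr_number_to_generate_for(event_name: str, payload: dict[str, Any]) -> Optional[int]:
    """Return the PR number if this delivery should start test generation, else None."""
    if event_name != "pull_request":  # value of the X-GitHub-Event header
        return None
    if payload.get("action") != "labeled":
        return None
    if payload.get("label", {}).get("name") != TRIGGER_LABEL:
        return None
    return payload["pull_request"]["number"]
```

Filtering on the specific label name matters because a `labeled` event fires for every label added to a PR, not only the trigger label.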

Scheduled Backfill Workflow: This is described as the heart of the pipeline and runs every few hours to systematically improve coverage across the existing codebase. The workflow pulls the latest Xcode coverage data from S3 (uploaded by CI on every merge to main) and scores every Swift file using a weighted system. The scoring considers file type (2.5x weight favoring clean MVVM patterns like ViewModels and Repositories, downweighting ambiguous types), coverage gap (0.5x weight for uncovered lines), size (-0.3x weight preferring smaller files), and module priority (0.5x weight with Feature and Legacy libraries scoring highest). This scoring system evolved directly from the heuristics developed during local validation.
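The weighted scoring can be sketched as a linear combination. Only the four weights come from the case study; the assumption that each feature is normalized to the range 0 to 1, along with the `FileFeatures` type and its field names, is illustrative.

```python
from dataclasses import dataclass

# Weights as reported in the case study.
WEIGHTS = {
    "file_type": 2.5,
    "coverage_gap": 0.5,
    "size": -0.3,
    "module_priority": 0.5,
}


@dataclass(frozen=True)
class FileFeatures:
    """Hypothetical per-file features, each assumed to lie in [0, 1]."""

    path: str
    file_type: float        # 1.0 for clean MVVM types, lower for ambiguous ones
    coverage_gap: float     # fraction of lines not covered
    size: float             # normalized file size; larger files are penalized
    module_priority: float  # higher for prioritized modules


def score(f: FileFeatures) -> float:
    """Weighted sum of the features; higher means a better backfill candidate."""
    return (
        WEIGHTS["file_type"] * f.file_type
        + WEIGHTS["coverage_gap"] * f.coverage_gap
        + WEIGHTS["size"] * f.size
        + WEIGHTS["module_priority"] * f.module_priority
    )
```

Under these assumptions the file type term dominates: a small, fully uncovered helper of an ambiguous type still scores well below a partially covered ViewModel.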

Before generation, files pass through three filters: checking for existing open PRs on that file, checking for existing test files in the repo, and checking a cooldown period (30 days) if the lifecycle workflow previously gave up on the file. After filtering, the workflow selects the top N files (currently 5 per run) and spawns child workflows for each.
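The three filters and the top-N selection might look like the following sketch. The 30-day cooldown and the limit of 5 files per run come from the text; the `Candidate` type and its fields are hypothetical, and in the real pipeline the inputs would come from GitHub, the repository, and the shared state store.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

COOLDOWN = timedelta(days=30)
FILES_PER_RUN = 5


@dataclass(frozen=True)
class Candidate:
    """Hypothetical view of one scored source file."""

    path: str
    score: float
    has_open_pr: bool = False
    has_test_file: bool = False
    gave_up_at: Optional[datetime] = None  # when the lifecycle workflow last gave up


def is_eligible(c: Candidate, now: datetime) -> bool:
    if c.has_open_pr or c.has_test_file:
        return False
    if c.gave_up_at is not None and now - c.gave_up_at < COOLDOWN:
        return False
    return True


def select_files(candidates: list[Candidate], now: datetime,
                 limit: int = FILES_PER_RUN) -> list[Candidate]:
    """Apply the filters, then keep the highest-scoring files."""
    eligible = [c for c in candidates if is_eligible(c, now)]
    return sorted(eligible, key=lambda c: c.score, reverse=True)[:limit]
```

Filtering before ranking keeps a high-scoring file that is in cooldown from occupying one of the five slots on every run.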

Each child iOSTestGenerationWorkflow calls a CodingAgentWorkflow that kicks off a remote Claude Code session. The agent clones the iOS repo, reads a 12-step testing process document plus patterns for mock and fake generation, analyzes the source file, and generates the test file plus any required mocks and fakes. PRs are created with the generated-test label for easy filtering. A significant constraint is that Temporal workers run on Linux machines, so the agent cannot build or compile the code it writes during generation. This limitation forces reliance on CI feedback rather than local validation, which becomes a key challenge discussed later.

PR Lifecycle Manager: Generating PRs is positioned as only half the problem; without proper lifecycle management, PRs accumulate merge conflicts, sit in review queues, or create noise rather than value. The lifecycle workflow runs on a schedule and acts as an automated project manager for every open generated-test PR.

The workflow evaluates a decision tree for each PR: merge conflicts trigger immediate closure without cooldown (since regenerating from clean main will likely succeed); stale PRs over 14 days are closed with a 30-day cooldown on the source file; CI-pending PRs are skipped until next cycle; green PRs without reviewers get assigned and auto-merge is enabled; green PRs with reviewers are left alone; red PRs exceeding max retries (5) are closed with failure labels and 30-day cooldown; red PRs within retry budget get the fix-ci label triggering another Temporal workflow to fetch CI logs and push fix commits; pre-commit failures get the apply-pre-commit label triggering a GitHub Action to run pre-commit and commit fixes; and rejected PRs trigger 30-day cooldowns.
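The decision tree can be written down as a single pure function. The thresholds (14 days, 5 retries) and the outcomes are from the description above. The order in which the branches are checked, the handling of rejected PRs as closures, the reading of "exceeding max retries" as `retries >= 5`, and all type and field names are assumptions of this sketch.

```python
from dataclasses import dataclass
from enum import Enum, auto

MAX_RETRIES = 5
STALE_AFTER_DAYS = 14


class Action(Enum):
    CLOSE_NO_COOLDOWN = auto()
    CLOSE_WITH_COOLDOWN = auto()
    SKIP = auto()
    ASSIGN_REVIEWER_AND_AUTOMERGE = auto()
    LEAVE_ALONE = auto()
    LABEL_FIX_CI = auto()
    LABEL_APPLY_PRE_COMMIT = auto()


@dataclass(frozen=True)
class PRState:
    """Hypothetical snapshot of one open generated-test PR."""

    has_merge_conflict: bool = False
    age_days: int = 0
    rejected: bool = False
    ci_status: str = "pending"  # "pending", "green", or "red"
    has_reviewer: bool = False
    retries: int = 0
    pre_commit_failed: bool = False


def decide(pr: PRState) -> Action:
    if pr.has_merge_conflict:
        return Action.CLOSE_NO_COOLDOWN  # regenerating from clean main is cheap
    if pr.rejected or pr.age_days > STALE_AFTER_DAYS:
        return Action.CLOSE_WITH_COOLDOWN
    if pr.ci_status == "pending":
        return Action.SKIP
    if pr.ci_status == "green":
        return Action.LEAVE_ALONE if pr.has_reviewer else Action.ASSIGN_REVIEWER_AND_AUTOMERGE
    # CI is red from here on.
    if pr.retries >= MAX_RETRIES:
        return Action.CLOSE_WITH_COOLDOWN
    if pr.pre_commit_failed:
        return Action.LABEL_APPLY_PRE_COMMIT
    return Action.LABEL_FIX_CI
```

Keeping the decision separate from the side effects (closing, labeling, assigning) makes each branch easy to unit test without calling GitHub.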

State Management and Coordination

A particularly interesting LLMOps consideration is how the team handles state coordination across multiple asynchronous workflows. The backfill and lifecycle workflows need to share state about which files are in cooldown, PR histories, retry counts, and action histories. The team considered DynamoDB but settled on a lightweight S3-based state store where each source file and PR gets its own JSON record. The backfill workflow checks these records before selecting files, and the lifecycle workflow updates them after every action. This demonstrates a pragmatic approach to state management that avoids over-engineering while maintaining necessary coordination.
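A one-JSON-record-per-key store is simple enough to sketch in a few lines. The version below uses an in-memory dictionary as a stand-in for the S3 bucket so that it runs anywhere; the key layout, record fields, and function names are all assumptions, since the source does not publish the schema.

```python
import json
from datetime import datetime, timedelta, timezone


class JsonStateStore:
    """In-memory stand-in for an object store holding one JSON document per key."""

    def __init__(self) -> None:
        self._objects: dict[str, str] = {}

    def get(self, key: str) -> dict:
        raw = self._objects.get(key)
        return json.loads(raw) if raw is not None else {}

    def put(self, key: str, record: dict) -> None:
        self._objects[key] = json.dumps(record, sort_keys=True)


def file_key(path: str) -> str:
    return f"files/{path}.json"  # hypothetical key layout


def record_give_up(store: JsonStateStore, path: str, now: datetime,
                   cooldown_days: int = 30) -> None:
    """Lifecycle side: note that the file is in cooldown and log the action."""
    record = store.get(file_key(path))
    record["cooldown_until"] = (now + timedelta(days=cooldown_days)).isoformat()
    record.setdefault("actions", []).append(
        {"action": "closed_with_cooldown", "at": now.isoformat()}
    )
    store.put(file_key(path), record)


def in_cooldown(store: JsonStateStore, path: str, now: datetime) -> bool:
    """Backfill side: check the record before selecting the file."""
    until = store.get(file_key(path)).get("cooldown_until")
    return until is not None and now < datetime.fromisoformat(until)
```

One record per file keeps each write small and independent, which suits two scheduled workflows that rarely touch the same key at the same moment.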

Production Results and Metrics

Over 17 weeks, the pipeline merged 250 test PRs adding approximately 85,000 lines of test code and 4,460 test functions across 233 unique classes. Overall MVVM test coverage climbed from 9% to 30%, a relative increase reported as 240% (the rounded endpoints imply roughly 233%). Repository coverage grew 352%, ViewModels increased 203%, and DataSources grew 192%. The pipeline went from zero to fully automated in under two months and now runs autonomously, generating approximately 20 test PRs per day.

Critically, every one of the 250 PRs was reviewed and approved by a human engineer before merging. The pipeline assigns reviewers using git blame history to ensure reviewers have appropriate context on the source file being tested. This human-in-the-loop approval is an important safeguard and quality control mechanism.
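One plausible way to pick a reviewer from blame history is to count which author owns the most lines of the source file. The sketch below parses the output of `git blame --line-porcelain <file>`, which repeats an `author-mail` header for every line. The function name, the exclusion set, and the choice of "most lines wins" are assumptions; the source only says that git blame history is used.

```python
from collections import Counter
from typing import Optional


def pick_reviewer(porcelain: str, exclude: frozenset[str] = frozenset()) -> Optional[str]:
    """Return the email owning the most blamed lines, skipping excluded authors."""
    counts: Counter[str] = Counter()
    prefix = "author-mail "
    for line in porcelain.splitlines():
        # Source lines in porcelain output start with a tab, so they never match.
        if line.startswith(prefix):
            email = line[len(prefix):].strip("<>")
            if email not in exclude:
                counts[email] += 1
    return counts.most_common(1)[0][0] if counts else None
```

The `exclude` set is there so that bot accounts or the PR author can be skipped; mapping an email to a GitHub handle is a separate step not shown here.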

From a CI quality perspective, 76% of merged PRs passed CI on their first attempt. For the remaining 24%, the auto-healing mechanisms (pre-commit fixes, CI retry loops, and ultimately closure with cooldown) handled the failures. The 76% first-pass rate and the 47-48% merge rates from initial local validation measure different things (first-attempt CI success among merged PRs versus the share of generated PRs that merged), so they are not directly comparable, but together they point to the value of the prompt tuning and architectural fixes implemented during that phase.

Challenges and Limitations

The case study candidly discusses several challenges, which provides valuable balance to the success metrics.

SwiftLint Failures: The most recent batch saw a 13.6% failure rate, with the vast majority caused by SwiftLint violations. Two rules alone accounted for 62% of these failures. This stems directly from the constraint that Temporal workers run on Linux and cannot compile or lint code before opening PRs. The team is upgrading workflows to run on Mac instances soon, which will enable local SwiftLint validation before PR creation. This highlights a common LLMOps tradeoff: infrastructure convenience (using existing Linux workers) versus validation completeness (needing Mac environments for iOS tooling).

Agents Going Off-Script: There was an incident where the CI auto-fix agent modified production source files in a test PR instead of limiting changes to test files. The team is actively adding guardrails to enforce stricter rules about what files agents can and cannot touch. This represents a critical safety consideration in production LLM systems: agents with write access to repositories need robust constraints to prevent unintended modifications. The incident demonstrates the importance of defensive guardrails even when prompts seem clear.
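A guardrail of this kind can be enforced outside the prompt by checking the list of changed paths before a commit is pushed. The sketch below is one possible shape for such a check; the allowed patterns and the function name are hypothetical and do not reflect Duolingo's actual repository layout or implementation.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for files a test-generation agent may touch.
ALLOWED_PATTERNS = ("*Tests/*", "*Tests.swift", "*/Mocks/*", "*/Fakes/*")


def disallowed_changes(changed_files: list[str],
                       allowed: tuple[str, ...] = ALLOWED_PATTERNS) -> list[str]:
    """Return the changed paths that match none of the allowed patterns."""
    return [
        path for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed)
    ]
```

A non-empty result would be grounds to reject the agent's commit. Because the check runs on the diff rather than on the agent's instructions, it holds even when the agent ignores its prompt.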

Reviewer Bandwidth: Perhaps the most important bottleneck identified is that generating tests is now the easy part, while review is the constraint. The pipeline can produce PRs faster than the existing review process can absorb them. PR quality determines velocity through the reviewer queue. The team’s focus going forward is improving generated PR quality through prompt tuning and adding a second reviewer agent that verifies each generated test before reaching a human reviewer. This represents a multi-agent architecture evolution where one agent generates and another validates before human review.

Prompt Engineering and LLM Rules

While specific prompts aren’t shared, the case study indicates the team maintains a 12-step testing process document plus patterns for mock and fake generation that the agent reads before generating tests. The rules were iteratively refined based on failure analysis, becoming explicit about which testing framework to use (SwiftTesting vs XCTest), providing guidance on @Sendable annotations and @preconcurrency imports, and encoding architectural patterns learned from early failures.

This represents a classic LLMOps pattern of treating prompts and rules as code that evolves through empirical feedback from production results. The team spent weeks analyzing CI outcomes to systematically address prompt issues, demonstrating disciplined iteration rather than one-shot prompt engineering.

Broader Implications and Future Directions

The team positions this as a generalizable pattern of “prompt → generate PR → get CI feedback → iterate” that applies beyond unit testing to refactors, migrations, and dead code cleanup. They plan to extend the system to these additional engineering workflows.

The architecture demonstrates several LLMOps best practices: using orchestration systems (Temporal) to manage complex workflows, maintaining state to coordinate across asynchronous processes, implementing automatic retry and healing mechanisms, preserving human oversight through required reviews, using empirical feedback (CI results) to improve system quality, and building incrementally from local validation through to full automation.

Assessment and Balanced View

From a balanced perspective, the results are impressive in absolute terms (85,000 lines of test code and a tripling of coverage) but should be contextualized. The 76% first-pass CI success rate is good but not exceptional, indicating room for improvement. The 13.6% failure rate in the most recent batch, driven mostly by SwiftLint violations, shows that lint errors the agent cannot check on Linux workers remain the dominant failure mode. The need for up to five retry attempts and 30-day cooldowns indicates a non-trivial failure rate even with auto-healing.

The human review requirement for all 250 PRs means this isn’t fully autonomous code generation but rather LLM-assisted test generation with human validation. This is appropriate and responsible, but the reviewer bandwidth bottleneck suggests the system may have reached capacity at current quality levels. The planned addition of a reviewer agent is logical but adds complexity and latency.

The incident with agents modifying production code is concerning and highlights the risks of giving LLMs write access to repositories, even within seemingly constrained contexts. The need for additional guardrails after deployment indicates the initial safety measures were insufficient.

The architectural decision to use Linux workers despite requiring Mac tooling for proper validation created a fundamental limitation that persisted for 17 weeks and is only now being addressed. This represents a technical debt tradeoff where infrastructure convenience was prioritized over validation completeness, though the team is now correcting it.

Overall, this case study represents a sophisticated production LLMOps implementation that achieved meaningful impact while honestly acknowledging limitations and challenges. The system demonstrates mature engineering practices including empirical validation, iterative improvement, human oversight, automated lifecycle management, and candid assessment of failure modes. It provides a realistic picture of LLM-assisted code generation at scale, neither overselling capabilities nor understating the engineering required to make such systems reliable in production.
