## Overview
Duolingo's AI agent for feature flag removal represents a practical application of LLMs in production for automating routine software engineering tasks. The case study describes how the language learning platform built an autonomous tool that removes obsolete feature flags from their codebase, addressing a common source of technical debt in software development. Beyond the specific tool, Duolingo explicitly positions this as a foundational effort to establish reusable patterns for deploying additional AI agents that operate on their code infrastructure.
The project demonstrates a pragmatic approach to LLMOps, prioritizing rapid development and immediate utility while establishing architectural patterns that can be extended to future agentic applications. The team reports moving from initial experimentation to production deployment in approximately one week after settling on their technology stack, with a working prototype operational within a single day.
## Technical Architecture and Infrastructure
The system architecture centers on Temporal, a workflow orchestration platform that Duolingo selected for several key operational characteristics. Temporal provides the foundational infrastructure for managing the agent's execution lifecycle, offering trivially easy local testing capabilities that proved essential for rapid prompt engineering iteration. The platform's robust retry logic addresses a critical challenge in agentic workflows: dealing with AI non-determinism. The team explicitly acknowledges that even well-prompted agents can exhibit unpredictable behaviors—going off the rails and crashing, failing to produce changes, hanging or freezing, or otherwise failing to complete tasks—making retry capabilities essential for production reliability.
The workflow begins when engineers trigger the agent through Duolingo's Platform Self-Service UI. This initiates a workflow in Temporal's gateway dispatcher namespace, which then kicks off the feature flag removal worker workflow. The system first executes an activity to retrieve the user's GitHub account name, then launches the main work activity. This activity clones the relevant repository to a temporary directory before invoking Codex CLI to perform the actual code modification work. Upon completion, if the agent has identified changes to make, the system automatically creates a pull request and assigns it to the requesting engineer.
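The case study doesn't include code, but a minimal sketch of this orchestration using Temporal's Python SDK makes the shape concrete. The workflow, activity, and parameter names here are illustrative assumptions rather than Duolingo's actual implementation; the retry policy reflects the retry behavior described above.

```python
from datetime import timedelta
from temporalio import workflow
from temporalio.common import RetryPolicy

with workflow.unsafe.imports_passed_through():
    # Hypothetical activity functions; see the activity sketch below.
    from activities import get_github_username, remove_feature_flag


@workflow.defn
class FeatureFlagRemovalWorkflow:
    """Sketch of the worker workflow kicked off by the gateway dispatcher."""

    @workflow.run
    async def run(self, requester: str, repo_url: str, flag_name: str) -> str:
        # Retries absorb non-deterministic agent failures: crashes, hangs, no output.
        retry = RetryPolicy(maximum_attempts=3)

        github_user = await workflow.execute_activity(
            get_github_username,
            requester,
            start_to_close_timeout=timedelta(minutes=1),
            retry_policy=retry,
        )

        # Clone, run Codex, and open the PR inside one activity so every step
        # runs on the same worker instance.
        return await workflow.execute_activity(
            remove_feature_flag,
            args=[repo_url, flag_name, github_user],
            start_to_close_timeout=timedelta(minutes=30),
            retry_policy=retry,
        )
```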
Duolingo made a deliberate architectural decision to perform most operations through standard coding practices rather than relying exclusively on AI-native approaches. They chose to access GitHub through conventional means and only invoke the AI agent when specifically needed, creating what they describe as a "clean separation of work." This design reflects a finding that AI coding tools work most efficiently on local code, and that a GitHub Model Context Protocol (MCP) integration is simply unnecessary for tasks that don't touch open pull requests or code history.
The team notes an important Temporal-specific constraint that influenced their architectural decisions: each activity in Temporal may run on a separate worker instance. This makes it impractical to split operations like repository cloning into separate, reusable activities—doing so could result in cloning code on one instance and then attempting to operate on it from a different instance that lacks access to that code. While splitting work into more granular, reusable activities might seem architecturally cleaner in the abstract, the distributed nature of Temporal's execution model makes this approach unviable.
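Concretely, that means the main work activity owns the whole local lifecycle. A rough sketch under the same assumptions as above, with `run_codex_cli` and `create_pull_request` standing in as hypothetical helpers (`run_codex_cli` is sketched in the subprocess example further below):

```python
import subprocess
import tempfile

from temporalio import activity

# Hypothetical helpers, not Duolingo's actual code.
from codex_helpers import run_codex_cli, create_pull_request


@activity.defn
def remove_feature_flag(repo_url: str, flag_name: str, github_user: str) -> str:
    """Clone, edit, and open a PR entirely on one worker instance.

    Splitting these steps into separate activities could schedule the clone
    and the edit on different machines, so they stay together.
    """
    with tempfile.TemporaryDirectory() as workdir:
        # Deterministic steps use standard tooling: a plain git clone, no MCP.
        subprocess.run(["git", "clone", repo_url, workdir], check=True)

        # AI is reserved for the core code modification. Placeholder prompt;
        # the actual prompt is not published.
        run_codex_cli(workdir, prompt=f"Remove the obsolete feature flag {flag_name} ...")

        # If the agent made no edits, skip PR creation.
        status = subprocess.run(
            ["git", "status", "--porcelain"], cwd=workdir, capture_output=True, text=True
        )
        if not status.stdout.strip():
            return ""

        # Open the PR and assign it back to the requesting engineer.
        return create_pull_request(workdir, flag_name, assignee=github_user)
```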
For security and isolation, Duolingo sandboxes the agent's work onto a dedicated ECS (Elastic Container Service) instance separate from their other tasks. This isolation makes it significantly safer to run Codex in its "dangerously bypass approvals and sandbox" mode, which provides full autonomous control necessary for agentic operation.
## Technology Selection and Evolution
The case study reveals an interesting technology evaluation process that ultimately converged on OpenAI's Codex CLI. Duolingo initially pursued parallel development of two versions using different toolchains: one based on LangChain and another using fast-agent. The LangChain implementation leveraged Baymax, an internally developed toolset for operating directly on local code, while the fast-agent version used the GitHub MCP. After one to two weeks of development, the team successfully got the LangChain version working well using a set of three prompts run in loops, with the fast-agent version making similar progress.
The release of Codex fundamentally changed this trajectory. The team consolidated their three separate prompts into a single prompt, tested it in the Codex web UI, and found that it "just worked." They then tried the same prompt through Codex CLI with the same successful result. This immediate efficacy led them to abandon their previous development efforts in favor of the Codex-based approach.
For agentic operation, Duolingo runs Codex CLI in full-auto mode with quiet mode enabled, executing it as a Python subprocess. The current command structure is: `codex exec --dangerously-bypass-approvals-and-sandbox -m {model} {prompt}`. The team acknowledges this is not ideal—they would prefer using a proper API through a hypothetical Codex SDK—but running the CLI directly allows them to move forward without waiting for such an SDK to become available. They explicitly state their expectation to replace this approach with an official Codex SDK if and when one is released.
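A minimal sketch of that subprocess call, assuming the command quoted above; the timeout and error handling are added assumptions, and the model string is left as configuration since the case study doesn't name one.

```python
import os
import subprocess

# Which model Duolingo uses isn't stated in the case study; treat it as configuration.
CODEX_MODEL = os.environ.get("CODEX_MODEL", "")


def run_codex_cli(workdir: str, prompt: str, model: str = CODEX_MODEL) -> str:
    """Run Codex CLI in full-auto mode against a local checkout."""
    cmd = ["codex", "exec", "--dangerously-bypass-approvals-and-sandbox"]
    if model:
        cmd += ["-m", model]
    cmd.append(prompt)

    result = subprocess.run(
        cmd,
        cwd=workdir,           # Codex operates on the locally cloned code.
        capture_output=True,
        text=True,
        timeout=1800,          # Agents can hang; bound the run and let Temporal retry.
    )
    result.check_returncode()  # Surface failures so the retry policy can kick in.
    return result.stdout
```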
The team identifies a significant limitation of the current Codex CLI implementation: it does not provide control over output format or enable structured JSON responses like other AI tools. While sufficient prompt engineering can coax Codex into producing structured output, this approach lacks the determinism of a true response format specification and proves inconvenient in practice. This limitation likely constrains certain types of workflows and forces workarounds in validation and error handling.
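Without a response format option, a common workaround is to prompt for a machine-readable summary and parse it defensively. A hedged sketch of that pattern follows; the summary schema is invented for illustration and is not Duolingo's.

```python
import json


def parse_agent_summary(raw_output: str) -> dict:
    """Best-effort extraction of a JSON summary line from free-form CLI output.

    Assumes the prompt asks the agent to finish with a single-line JSON object
    such as {"changed": true, "files": ["src/flags.py"]}; this parser tolerates
    any prose the agent prints around it.
    """
    for line in reversed(raw_output.splitlines()):
        line = line.strip()
        if not (line.startswith("{") and line.endswith("}")):
            continue
        try:
            summary = json.loads(line)
            if isinstance(summary, dict):
                return summary
        except json.JSONDecodeError:
            continue
    # Conservative default the workflow can act on if no summary was emitted.
    return {"changed": False, "files": []}
```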
## Prompt Engineering and Model Usage
While the case study doesn't provide extensive detail on the specific prompts used, it reveals several key aspects of their prompt engineering approach. The team successfully consolidated three separate prompts that were necessary in their LangChain implementation into a single prompt that proved effective with Codex. This consolidation suggests that Codex's architecture and capabilities enabled a more streamlined approach to expressing the task requirements.
The team reports spending a significant portion of their development week on prompt engineering, alongside feature development work such as extending the initial Python-only prototype to also handle Kotlin code. This indicates an iterative refinement process to optimize the agent's behavior across different programming languages and edge cases in feature flag removal.
The case study doesn't specify which underlying model they use with Codex, referring only to a `{model}` parameter in their command structure. This suggests they may experiment with or switch between different models, or that the specific model choice is configurable based on task requirements or performance considerations.
## Deployment and Production Operations
The production deployment model centers on self-service accessibility. Engineers can initiate feature flag removal directly from Duolingo's internal Platform Self-Service UI, lowering the barrier to cleaning up technical debt. This integration into existing internal tooling likely increases adoption by embedding the capability into engineers' existing workflows rather than requiring them to use separate tools or interfaces.
The system automatically handles the end-to-end process from code modification to pull request creation. Generated PRs are assigned back to the requesting engineer and include a friendly automated comment suggesting they "apply pre-commit" to fix simple formatting or linting errors. This indicates the current implementation doesn't always produce code that passes all pre-commit hooks, requiring some manual intervention.
A critical limitation acknowledged by the team is the current lack of robust validation before PR creation. The system sends PRs "as is" without verifying that the changes pass continuous integration checks, pre-commit hooks, or unit tests. Duolingo identifies this as a key area for improvement, with ongoing work to add testing and validation tools to their agentic framework. Their goal is to only send PRs that either pass all automated checks or are clearly marked as requiring manual work before submission. This validation gap represents a meaningful trade-off in their initial implementation—prioritizing rapid deployment and immediate utility while acknowledging the need for more sophisticated quality gates.
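The case study describes this gating as planned work rather than a shipped feature. A rough sketch of what such a gate could look like, run after the agent's edits and before the PR is opened; the specific commands are assumptions standing in for Duolingo's actual pre-commit and test tooling.

```python
import subprocess


def validate_changes(workdir: str) -> list[str]:
    """Run the planned quality gates and report which ones failed.

    The commands below are illustrative placeholders for Duolingo's real
    pre-commit, CI, and unit-test checks.
    """
    checks = {
        "pre-commit": ["pre-commit", "run", "--all-files"],
        "unit tests": ["pytest", "-q"],
    }
    failures = []
    for name, cmd in checks.items():
        result = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(name)
    return failures


# Intended policy: only send clean PRs, or clearly label the rest as needing manual work.
# failures = validate_changes(workdir)
# title_prefix = "[needs manual fixes] " if failures else ""
```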
## Development Velocity and Iteration
The case study emphasizes the remarkably rapid development timeline. After settling on their technology stack (Codex CLI + Temporal), the team had a working prototype within approximately one day and a production-ready version within roughly one week. This acceleration compared to their initial parallel development efforts (one to two weeks without reaching full production readiness) demonstrates the impact of selecting tools well-matched to the use case.
The team attributes much of their efficiency to Temporal's local testing capabilities, which enabled rapid iteration on prompt engineering without requiring deployment to remote environments. This tight feedback loop proved essential for developing effective prompts and debugging agent behavior.
Duolingo explicitly positions this project as establishing reusable patterns rather than just building a single tool. They invested effort in understanding generalizable architectural approaches, with the expectation that future agent development will build on these patterns. They anticipate that subsequent agents will be developed even more rapidly, allowing teams to "focus entirely on understanding the problem we are solving and developing one or more prompts to perform it" rather than solving infrastructure and orchestration challenges repeatedly.
## LLMOps Maturity and Practices
The case study reveals an organization in the early stages of operationalizing LLMs for internal automation, with a clear-eyed view of both capabilities and limitations. Several aspects indicate pragmatic LLMOps practices:
**Sandboxing and Safety**: Running the agent on isolated ECS instances demonstrates awareness of security concerns when granting autonomous code modification capabilities. The willingness to bypass Codex's built-in approval mechanisms within this sandboxed environment shows a risk-calibrated approach—accepting certain risks within controlled boundaries to enable autonomous operation.
**Monitoring and Reliability**: The emphasis on Temporal's retry logic as "very necessary for agentic workflows" indicates direct experience with the non-deterministic nature of LLM-based systems in production. The acknowledgment that even well-prompted agents can fail in various ways—crashing, hanging, or producing no output—reflects realistic expectations about AI reliability that inform their infrastructure choices.
**Validation and Quality Control**: The ongoing work to add pre-commit, CI, and unit test validation before PR creation represents an evolution from "ship and iterate" to more robust quality gates. This progression is typical of LLMOps maturity, where initial implementations prioritize proving value quickly, with subsequent iterations adding the guardrails and quality controls necessary for broader adoption and trust.
**Local Development Capabilities**: The team's emphasis on "trivially easy local testing" as a key selection criterion for Temporal highlights the importance of developer experience in LLMOps tooling. The ability to rapidly iterate on prompts and test agent behavior locally accelerates development and debugging, addressing one of the significant challenges in developing non-deterministic AI systems.
**Technology Pragmatism**: The decision to run Codex CLI as a subprocess despite preferring a proper SDK demonstrates practical engineering judgment—using what's available now rather than waiting for ideal solutions. This pragmatism balanced against clear articulation of technical debt (the subprocess approach) and migration plans (moving to an SDK when available) shows mature engineering planning.
## Scope and Limitations
The case study is notably transparent about current limitations and areas for improvement. The lack of structured output format from Codex CLI constrains certain workflow patterns and makes validation more challenging. The current absence of automated testing before PR creation means engineers receive pull requests that may require manual fixes before they can be merged. The team's acknowledgment that Codex CLI is "a new tool with minimal documentation" indicates they're operating somewhat at the bleeding edge, accepting documentation gaps as a trade-off for capabilities.
The feature flag removal task itself represents a relatively well-scoped problem with clear success criteria—flags are either successfully removed or they're not, and the code either compiles and passes tests or it doesn't. This bounded problem space likely contributed to the rapid development success. More complex or ambiguous tasks may prove more challenging as Duolingo extends their agentic framework to additional use cases.
The case study doesn't provide quantitative metrics on success rates, time savings, or adoption levels. We don't know what percentage of automatically generated PRs are successfully merged without modification, how often the agent fails to produce usable output, or how much engineer time this actually saves. The enthusiastic tone suggests positive results, but specific impact measurement isn't detailed.
## Broader Context and Strategic Direction
Duolingo frames this project within a broader strategic initiative to "build agentic tools to automate simple tasks for Duos and save engineers time." The feature flag remover is explicitly positioned as the first tool in what's intended to be a suite of autonomous coding agents. This suggests organizational commitment to investing in AI-powered developer productivity tools beyond a single experiment.
The emphasis on establishing reusable patterns indicates an intention to scale this approach across multiple use cases. By solving orchestration, deployment, testing, and safety concerns once at the infrastructure level, Duolingo aims to commoditize the "plumbing" of agentic development, allowing future efforts to focus primarily on problem understanding and prompt development.
The case study concludes with active recruiting messaging ("If you want to work at a place that uses AI to solve real engineering problems at scale, we're hiring!"), suggesting this work is part of positioning Duolingo as an AI-forward engineering organization. Whether feature flag removal represents "real engineering problems at scale" is somewhat debatable—it's certainly a real problem, though perhaps not the most challenging or impactful one—but it serves as a credible proof point for their AI engineering capabilities.
## Assessment and Trade-offs
From an LLMOps perspective, this case study demonstrates several sound practices: rapid prototyping and iteration, pragmatic technology selection based on actual results rather than theoretical preferences, appropriate sandboxing and safety measures, and transparent acknowledgment of limitations. The architectural decision to use standard tooling (GitHub APIs, repository cloning) for deterministic operations while reserving AI for the core code modification task shows good engineering judgment about where AI adds value versus where traditional approaches suffice.
However, several questions remain unanswered that would provide a more complete picture: What are the actual success and adoption rates? How much engineer time does this save in practice? What percentage of generated PRs require manual fixing? How does the team handle cases where the agent makes incorrect or unsafe changes? What monitoring and observability exists around agent performance?
The rapid development timeline is impressive but should be contextualized—feature flag removal is a relatively straightforward task compared to many other potential applications of coding agents. The success here may not directly predict the difficulty of more complex or ambiguous automation tasks. Additionally, the team's abandonment of nearly two weeks of LangChain and fast-agent development work upon Codex's release, while pragmatically justified, represents a sunk cost that should factor into any complete accounting of development effort.
The reliance on a newly released tool (Codex CLI) with minimal documentation and no proper SDK introduces dependency risk. If OpenAI changes the CLI interface, deprioritizes the tool, or shifts to a different product strategy, Duolingo may need to refactor or rebuild. The subprocess-based integration is inherently more fragile than a proper API integration would be.
Overall, this represents a solid early-stage LLMOps implementation that prioritizes proving value quickly while establishing extensible patterns. The team demonstrates appropriate awareness of limitations and clear plans for addressing them. The true test will be whether this pattern successfully extends to more complex automation tasks and whether the anticipated efficiency gains in developing subsequent agents materialize as expected.