Factory presents "Missions," an LLM-based autonomous development system designed to solve a fundamental limitation: single-agent contexts become diluted and unreliable during complex, multi-day software projects. The solution employs a multi-agent architecture with separation of concerns: an orchestrator for planning and coordination, workers for implementation, and independent validators for quality assurance. The system implements test-driven development at both the unit and system levels, uses externalized shared state to avoid context overload, and employs model specialization for different roles. In a real-world demonstration, the system autonomously built a Slack clone over 16.5 hours and 185 agent runs, generating 38.8k lines of code with 89.25% test coverage, evidence that structured multi-agent orchestration with validation loops can produce reliable, production-quality software autonomously.
Factory’s “Missions” system represents a sophisticated approach to deploying LLMs for autonomous, multi-day software development projects. The case study, published in April 2026, describes an architectural solution to a fundamental challenge in production LLM systems: how to maintain agent reliability and focus when tasks exceed the effective capacity of a single context window. Rather than attempting to cram increasingly complex information into a single agent session, Factory has designed a multi-agent orchestration system with explicit separation of concerns, validation gates, and shared state management.
The core insight driving the architecture is empirical: agents are highly reactive to their context. Since agent trajectories are append-only, the model’s reasoning at any point becomes a function of every past thought, observation, and action. Large language models seek coherence and integrate everything in their context into a unified worldview, which means they perform optimally only when every previous step in the trajectory urges them toward the next optimal step. When irrelevant or adversarial context accumulates, performance degrades measurably.
Factory identifies two critical failure modes that emerge from poor context management in production agent systems. First, irrelevant context accumulates when tasks are unfocused or overly broad, causing the agent’s context window to fill with information that isn’t relevant to the current subtask. As scope broadens, less of the context actively pulls the agent toward optimal next steps, creating what they term “context dilution.” Second, adversarial context accumulates when an agent that implemented something is later asked to evaluate its own work. The prior reasoning creates confirmation bias, making objective evaluation nearly impossible.
These observations have significant implications for LLMOps practitioners. It’s insufficient to simply decompose work into smaller chunks; each agent’s goal must be narrowly focused and its trajectory must remain directionally consistent throughout execution. Every run must avoid accumulating context that is either not useful to the agent’s current task or not aligned with the agent’s incentive structure and the ideal outcome for that specific run.
The Missions architecture embeds four key principles that address these context management challenges while enabling reliable autonomous operation at scale.
Separation of concerns and incentives forms the foundation. Each role in the system has a single, well-defined goal, and the architecture ensures nothing in an agent's trajectory pulls it away from that goal.

The orchestrator plans and decomposes approaches to the user's goal, steering execution to completion through all validation gates. Critically, it avoids accumulating overly granular implementation context by delegating all investigation and detailed implementation to subagents and workers. The orchestrator doesn't drive validation directly; instead, the system injects independent validators at milestones to surface gaps.

Workers complete well-specified features with clear success criteria, iterating until they believe the work is correct, but they don't make the final judgment on correctness; an independent validator makes that determination. Validators evaluate completed work for correctness and completeness, surfacing bugs and gaps, but they don't implement fixes. Instead, they report issues back to the orchestrator, which creates targeted fix features that future workers implement.
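The role boundaries described above can be sketched as a small policy check. This is a minimal illustration, not Factory's actual implementation; the `Role` fields and the `can_judge` rule are assumptions that encode the constraint that no agent evaluates its own work.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Role:
    name: str
    goal: str              # the single, well-defined goal for this role
    may_implement: bool    # may this role write code?
    may_judge: bool        # may this role decide correctness?

# Goals paraphrased from the article; structure is illustrative.
ORCHESTRATOR = Role("orchestrator", "plan, decompose, and steer the mission", False, False)
WORKER = Role("worker", "complete one well-specified feature", True, False)
VALIDATOR = Role("validator", "evaluate completed work for correctness", False, True)

def can_judge(author: Role, reviewer: Role) -> bool:
    """Only a judging role that did not author the work may decide correctness."""
    return reviewer.may_judge and reviewer.name != author.name
```

Under this rule a worker can never sign off on its own feature, which is exactly the adversarial-context failure mode the architecture is designed to prevent.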
This separation addresses the adversarial context problem directly: by ensuring the agent evaluating work has never seen the implementation reasoning, Factory maintains objectivity in the validation process. It’s a production-oriented solution to a fundamental challenge in agent reliability.
Test-driven development at two levels operates at both micro and macro scales. At the worker level, each agent writes tests before code, ensuring tests reflect intended behavior rather than being retrofitted to match implementation details. At the mission level, the orchestrator defines correctness first by creating a validation contract—a set of behavioral assertions that define success—before defining any features. This ordering matters significantly from an LLMOps perspective. When creating the validation contract, the orchestrator draws from its understanding of requirements in isolation. If features were created first, the contract would be influenced by the planned implementation, creating subtle confirmation bias even at the architectural level.
The validation assertions are later verified by fresh agents that exercise the system as a black box, using it the way a real user would rather than inspecting implementation code. This mirrors human software testing best practices but implements them through careful agent orchestration and context management.
Externalized state ensures no single agent needs to hold a complete picture in its context at once. The full state is distributed across shared artifacts: the validation contract, the feature list, research notes, operational guidelines, and an evolving knowledge base. Each agent reads only what’s relevant to its current job. Even the orchestrator delegates deep investigation to subagents to avoid consuming every detail itself. This is a pragmatic approach to working within context window limitations while maintaining system coherence—rather than fighting the constraints of current models, the architecture works with them.
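A minimal sketch of this externalized-state pattern, assuming the artifacts named in the text live in a shared store and each role reads only a slice of them. The role-to-artifact mapping here is an illustrative assumption, not Factory's actual wiring.

```python
# Mission state lives outside any single agent's context, in named artifacts.
SHARED_ARTIFACTS = {
    "validation_contract": "behavioral assertions defining mission success",
    "feature_list": "bounded features, each claiming assertions",
    "research_notes": "findings from investigation subagents",
    "operational_guidelines": "boundaries and procedures for workers",
    "knowledge_base": "knowledge accumulated over the mission",
}

# Hypothetical mapping: which artifacts each role actually needs to read.
READS_BY_ROLE = {
    "orchestrator": ["validation_contract", "feature_list"],
    "worker": ["feature_list", "operational_guidelines", "knowledge_base"],
    "validator": ["validation_contract"],
}

def context_for(role: str) -> dict:
    """Assemble only the slice of shared state relevant to this role's job."""
    return {name: SHARED_ARTIFACTS[name] for name in READS_BY_ROLE[role]}
```

The key property is that a validator's context contains the contract but none of the implementation artifacts, and no role ever loads the full picture.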
Model specialization recognizes that different models have different strengths in reasoning, discipline, creativity, thoroughness, speed, and cost, and that no single model excels at everything. Once roles are cleanly separated, model choice becomes local to each role: the orchestrator might use a model optimized for broad planning and judgment, workers might use models balancing reliable execution with cost efficiency, and validators might employ models known for thoroughness and skepticism. This represents mature LLMOps thinking—rather than treating model selection as a global decision, the architecture makes it a role-specific optimization problem.
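Role-local model choice reduces to a per-role configuration. The model names below are placeholders (the case study does not name the actual models); only the shape of the decision is taken from the text.

```python
# Hypothetical role-to-model configuration; model identifiers are invented
# placeholders illustrating the stated optimization criteria per role.
MODEL_BY_ROLE = {
    "orchestrator": {"model": "planner-large", "optimized_for": "broad planning and judgment"},
    "worker":       {"model": "coder-medium",  "optimized_for": "reliable execution at lower cost"},
    "validator":    {"model": "critic-large",  "optimized_for": "thoroughness and skepticism"},
}

def pick_model(role: str) -> str:
    """Model selection is local to the role, not a global decision."""
    return MODEL_BY_ROLE[role]["model"]
```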
The actual execution of a Mission follows a structured workflow that implements these principles. A user describes what they want built, and the orchestrator investigates and asks clarifying questions until requirements are unambiguous. Then it writes the validation contract—a finite checklist of testable behavioral assertions that define completion and correctness for the entire mission. Each assertion specifies which tool will verify it and what evidence constitutes proof. For example, a validation assertion might specify that a user with valid credentials submits a login form and is redirected to the dashboard, verified using an agent-browser tool with screenshot and network response evidence.
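The login example above suggests a natural shape for one contract entry: a behavioral claim, the verifying tool, and the required evidence. The field names here are assumptions for illustration, not Factory's schema.

```python
from dataclasses import dataclass

@dataclass
class ContractAssertion:
    behavior: str        # testable behavioral claim
    verify_with: str     # which tool will verify it
    evidence: list       # what evidence constitutes proof
    passed: bool = False

# The login assertion from the text, expressed in this hypothetical schema.
login = ContractAssertion(
    behavior="user with valid credentials submits the login form and is redirected to the dashboard",
    verify_with="agent-browser",
    evidence=["screenshot", "network response"],
)

def contract_complete(contract: list) -> bool:
    """The mission is complete only when every assertion has been verified."""
    return all(a.passed for a in contract)
```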
From there, the orchestrator decomposes work into features, where each feature is a bounded piece of implementation that explicitly claims which assertions it will fulfill. Features are grouped into milestones, each encompassing a logical unit of functionality. The orchestrator also creates shared state files—boundaries and procedures for its workers that enforce optimal structure and behavior, along with a library that accumulates knowledge over the mission’s duration.
A programmatic runner takes the feature list and spawns a worker for each feature in order. Each worker starts with a fresh context, receives its feature spec, writes tests first, then implements. This fresh-context-per-feature approach is central to avoiding context dilution—no worker is burdened with the implementation details of previous features unless explicitly provided through shared state.
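The runner loop can be sketched as follows; `run_worker` is a stand-in for an actual agent invocation, and the point of the sketch is that each run assembles its context from scratch, carrying nothing over from previous workers.

```python
def run_worker(feature_spec: dict, shared_state: dict) -> dict:
    # A real worker would write tests first, then implement. Here we only
    # record what the run started with: its spec plus explicit shared state,
    # and nothing from any prior worker's trajectory.
    context = {"spec": feature_spec, "shared": shared_state}  # fresh every run
    return {"feature": feature_spec["name"], "context_keys": sorted(context)}

def run_milestone(features: list, shared_state: dict) -> list:
    """Spawn one fresh-context worker per feature, in order."""
    results = []
    for spec in features:
        results.append(run_worker(spec, shared_state))  # no carry-over context
    return results
```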
Once all features within a milestone are complete, the runner triggers validation using fresh agents operating in two modes. Scrutiny validators review each worker’s implementation and trajectory for quality and correctness, encoding relevant knowledge updates into shared state for future workers. User-testing validators exercise the system as a black box, verifying behavior against the validation contract assertions without access to implementation details.
After validation, the orchestrator reviews what workers and validators flagged and creates fix features targeted at actionable gaps. These fixes are executed before the milestone re-validates, creating a correction loop that continues until milestone validation passes. If implementation or validation is blocked, the orchestrator halts the mission and hands control back to the user, maintaining human-in-the-loop oversight for critical decisions.
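The correction loop above amounts to validate, convert gaps into fix features, and re-validate until the milestone passes or the mission blocks. A minimal sketch, where `validate` and `implement` are stand-ins for fresh validator and worker runs:

```python
def correction_loop(milestone, validate, implement, max_rounds: int = 10) -> int:
    """Repeat validation until it passes; return the passing round number."""
    for round_no in range(1, max_rounds + 1):
        issues = validate(milestone)         # fresh validators, black-box checks
        if not issues:
            return round_no                  # milestone passes this round
        for issue in issues:                 # orchestrator creates fix features
            implement(milestone, issue)      # executed by future workers
    # Mirrors the halt-and-hand-back behavior described in the text.
    raise RuntimeError("mission blocked; hand control back to the user")
```

The reported convergence (no milestone passing in round one, all six by round four) is exactly this loop iterating under real validator findings.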
Factory provides detailed metrics from a production mission that built a Slack clone with workspace authentication, channels and threads, real-time messaging with reactions and mentions, file uploads, search, and presence and notifications. While these metrics are from Factory’s own system and should be interpreted with appropriate skepticism about potential cherry-picking, they provide valuable insight into the operational characteristics of production autonomous agent systems.
The mission completed in 16.5 hours of total runtime, distributed as 2.3% orchestration (0.38 hours), 60.5% implementation (9.98 hours), and 37.2% validation (6.14 hours). The substantial time allocation to validation—more than a third of total runtime—reflects the architectural commitment to reliability over pure speed. The mission progressed through six milestones following a consistent implementation-validation cadence.
The system executed 185 total agent runs: one orchestrator run spawning 12 subagents, 63 worker runs, and 27 validator runs spawning 82 subagents. The high number of validator subagents suggests that validation involves substantial investigation and testing activity, not just simple checks. The system consumed 778.5 million tokens total, with 30.3M input tokens, 744.9M cache read tokens, and 3.4M output tokens. The cache read numbers indicate Factory is making extensive use of prompt caching to manage costs—without caching, this mission would have been substantially more expensive to run.
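A back-of-the-envelope calculation makes the caching point concrete. The token counts are from the case study, but the per-million prices below are invented placeholders (real pricing varies by model and provider); the only claim is the structural one, that cache reads billed at a steep discount dominate the savings.

```python
# ASSUMED prices in USD per million tokens -- illustrative only.
PRICE_PER_M = {"input": 3.00, "cache_read": 0.30, "output": 15.00}

# Token counts (millions) reported in the case study.
tokens_m = {"input": 30.3, "cache_read": 744.9, "output": 3.4}

cost_with_cache = sum(tokens_m[k] * PRICE_PER_M[k] for k in tokens_m)

# Counterfactual: every cache read billed as fresh input.
cost_without_cache = (
    (tokens_m["input"] + tokens_m["cache_read"]) * PRICE_PER_M["input"]
    + tokens_m["output"] * PRICE_PER_M["output"]
)
```

Under these assumed prices the mission costs roughly $365 with caching versus roughly $2,377 without, a gap of more than 6x, which is why caching effectiveness is central to the economics of long-running multi-agent systems.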
Token distribution by role shows 29.2M tokens for orchestration, 485.5M for implementation, and 263.8M for validation, roughly tracking the time distribution but with validation using proportionally more tokens relative to time, likely due to the need for thorough investigation and testing.
The system generated 38.8k lines of code with 52.5% being tests (20.4k test lines vs 18.5k source lines), achieving 89.25% statement coverage. The test-heavy output reflects the architectural commitment to test-driven development, though it’s worth noting that statement coverage is only one dimension of test quality and doesn’t guarantee the tests are actually effective at catching bugs.
The reliability metrics show convergence behavior: across six milestones and four validation rounds, zero milestones passed initially, one passed in round two, two more in round three, and the remaining three in round four, at which point all six had passed. Validators surfaced 81 total issues (65 blocking, 11 non-blocking, 5 suggestions), and the orchestrator generated 21 fix features representing 34.4% of implementation work. This correction loop demonstrates the system's ability to iteratively improve toward correctness, though it also reveals that initial implementations were far from perfect, requiring substantial rework through multiple validation cycles.
Trajectory lengths remained bounded throughout execution, with median run lengths of 51 assistant turns for implementation (90th percentile at 123 turns) and 30 turns for validation (90th percentile at 37 turns). These bounded trajectories suggest the architecture successfully prevents runaway agent loops and maintains focus within individual runs.
From an LLMOps perspective, Factory’s Missions system demonstrates several mature practices for production LLM deployment, though the case study is ultimately a product announcement and should be evaluated critically.
Strengths in the architectural approach include the explicit handling of context management as a first-class concern, recognizing that context quality determines agent reliability in production systems. The separation of concerns between implementation and validation mirrors established software engineering practices adapted thoughtfully for LLM agents. The multi-level testing approach (unit tests by workers, system tests by validators) creates defense in depth against errors. Externalized state and shared knowledge bases allow for learning and consistency across agent runs without context pollution. Model specialization by role enables cost-performance optimization at a granular level. The system’s transparency around metrics (tokens, runs, trajectories) suggests operational maturity in monitoring and understanding production LLM behavior.
Potential concerns and limitations include that the case study presents results from a single, potentially cherry-picked example rather than aggregate statistics across many missions, which limits our ability to assess typical vs. best-case performance. The 16.5-hour runtime for a Slack clone is significant, and while the output quality metrics look reasonable, there’s no comparison to human developer time or discussion of when autonomous development is cost-effective versus traditional approaches. The 34.4% fix-feature ratio suggests substantial rework is normal, raising questions about efficiency and total cost. The system requires 185 agent runs and 778.5M tokens for one application, and the economics depend heavily on prompt caching (744.9M cached tokens)—actual costs could be substantially higher without effective caching strategies.
The validation approach relies on agents evaluating agent-generated code and behavior, which may miss entire classes of issues that human reviewers would catch. There’s no discussion of failure modes, how often missions need to be halted for human intervention, or what types of problems the system cannot handle autonomously. The statement coverage metric of 89.25% is presented positively but without discussion of whether these tests are actually effective at catching bugs or just mechanically achieve coverage. The architecture assumes well-specified requirements upfront, but real software development often involves evolving understanding of requirements, and it’s unclear how the system handles requirements that shift during execution.
Production LLM operations insights that emerge from this case study include the importance of architectural patterns that prevent context dilution through fresh-context execution and role separation. The value of validation loops and multi-round correction for achieving reliability in autonomous agent systems is clearly demonstrated, even at the cost of substantial time and token investment. The metrics show that prompt caching is essential for making long-running, multi-agent systems economically viable—the 744.9M cached token reads would be prohibitively expensive as fresh reads. Bounded trajectory lengths through focused goals and scoped roles help prevent runaway execution and unpredictable costs.
The substantial allocation of resources to validation (37.2% of time, 33.9% of tokens) suggests that achieving reliability in autonomous systems requires treating testing and validation as equal partners to implementation, not afterthoughts. The system demonstrates that model-based evaluation of model outputs can work at scale when properly structured with separation of concerns, though the ultimate effectiveness remains an open question.
Factory’s Missions system represents a significant data point in the evolution of production LLM applications toward longer-running, more autonomous operation. The architectural principles—context management through separation of concerns, externalized state, test-driven development at multiple levels, and validation loops—provide a framework that other teams building production agent systems can learn from, even if they’re not using Factory’s specific product.
The case study illustrates that current LLM capabilities can support genuinely autonomous, multi-hour software development tasks when wrapped in appropriate orchestration architecture, though with non-trivial resource requirements and validation overhead. The system’s reliance on multiple models, extensive prompt caching, and substantial validation loops highlights that production autonomous agent systems are complex infrastructure projects requiring careful engineering, not just clever prompting.
The transparency around metrics like token usage, agent runs, and trajectory lengths is valuable for the LLMOps community’s developing understanding of what production autonomous agent systems actually look like in operation. However, practitioners should approach these results with appropriate skepticism given that Factory is selling this as a product and the case study presents a single, potentially optimistic example rather than aggregate performance data across diverse projects and failure cases.
As LLMs continue to improve in reasoning, planning, execution, and computer use capabilities, Factory argues that each improvement will compound through their architecture. This is plausible—better planners will produce tighter specs, better workers will make fewer mistakes, and better validators will judge correctness more reliably. Similarly, as models become faster and cheaper, tighter validation loops and more ambitious missions become economically viable. However, the fundamental question of when autonomous development provides better outcomes than human development at acceptable cost remains open and likely depends heavily on project characteristics, team capabilities, and specific use cases.
This panel discussion brings together engineering leaders from HRS Group, Netflix, and Harness to explore how AI is transforming DevOps and SRE practices. The panelists address the challenge of teams spending excessive time on reactive monitoring, alert triage, and incident response, often wading through thousands of logs and ambiguous signals. The solution involves integrating AI agents and generative models into CI/CD pipelines, observability workflows, and incident management to enable predictive analysis, intelligent rollouts, automated summarization, and faster root cause analysis. Results include dramatically reduced mean time to resolution (from hours to minutes), elimination of low-level toil, improved context-aware decision making, and the ability to move from reactive monitoring to proactive, machine-speed remediation while maintaining human accountability for critical business decisions.
Union's Chief ML Engineer shares lessons learned from productionizing agentic systems at scale, addressing the critical infrastructure challenges that arise when deploying LLM agents in production environments. The presentation introduces six design principles for building crash-proof, durable agents using the Flyte 2.0 orchestration platform, focusing on how agents can recover from multi-layer failures (infrastructure, network, logical, semantic) through proper context engineering and durability mechanisms. A key case study with Dragonfly demonstrates these principles in action, where a tiered agent architecture processes 250,000+ software products with 200+ steps and 100+ LLM calls each, achieving 2,000+ concurrent runs, 50% reduction in failure recovery time, 30% increased development velocity, and 12 hours per week saved on infrastructure maintenance.
AI practitioners Aishwarya Raanti and Kiti Bottom, who have collectively supported over 50 AI product deployments across major tech companies and enterprises, present their framework for successfully building AI products in production. They identify that building AI products differs fundamentally from traditional software due to non-determinism on both input and output sides, and the agency-control tradeoff inherent in autonomous systems. Their solution involves a phased approach called Continuous Calibration Continuous Development (CCCD), which recommends starting with high human control and low AI agency, then gradually increasing autonomy as trust is built through behavior calibration. This iterative methodology, combined with a balanced approach to evaluation metrics and production monitoring, has helped companies avoid common pitfalls like premature full automation, inadequate reliability, and user trust erosion.