Cognition developed Devin Fusion, a multi-model routing system designed to reduce the cost of AI-powered code generation while maintaining frontier-level performance. The problem addressed is the unsustainable cost of using expensive frontier models for every coding task, coupled with the poor performance of existing model routing solutions on real-world code quality. Devin Fusion employs a "sidekick" architecture where a frontier model delegates appropriate tasks to a cheaper model while retaining decision-making authority, combined with dynamic mid-session routing that switches models during execution based on task complexity. The solution achieved a 35% cost reduction compared to frontier models while maintaining equivalent performance on the FrontierCode benchmark, with 88% of internally merged pull requests being driven entirely by the automated routing system.
Cognition’s Devin Fusion represents a sophisticated approach to multi-model orchestration in production AI systems, specifically targeting the economic challenges of deploying frontier LLMs for software engineering tasks. The case study, published in June 2026, describes a production system designed to address a fundamental tension in LLMOps: the need to balance cost efficiency with the intelligence capabilities of frontier models. While the text is promotional in nature and makes strong performance claims, it provides substantial technical detail about real-world challenges in deploying multi-model systems at scale.
The core problem Cognition identifies is that engineering teams cannot sustainably use the most expensive frontier models for every task, yet existing model routing solutions fail to produce code of mergeable quality despite performing well on traditional benchmarks. This highlights a critical gap between benchmark performance and production utility—a recurring theme in LLMOps where synthetic evaluation metrics often fail to capture real-world quality requirements.
The centerpiece of Devin Fusion’s approach is what Cognition calls the “sidekick” architecture. Rather than simply routing individual prompts to different models based on complexity classification, this system runs two agents in parallel: a frontier model serving as the “main agent” and a more cost-effective model serving as the “sidekick.” Both are fully fledged agents, each with its own toolset and the ability to gather and act on context independently.
The architectural philosophy is that the main frontier model should take minimal actions and read only what is absolutely necessary, defaulting to delegation and monitoring while retaining responsibility for significant decisions including overall planning, interpretation of ambiguous requirements, and final review. This design pattern attempts to capture the benefits of frontier model intelligence—creativity, nuanced judgment, and general problem-solving capability—while offloading mechanical and well-defined subtasks to cheaper models.
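The delegation default described above can be sketched in a few lines. This is a minimal illustration and not Cognition's implementation: the step kinds, the `JUDGMENT_WORK` set, the `Agent` class, and the relative costs are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass, field

# Hypothetical set of step kinds that stay with the main agent. The source
# names planning, interpreting ambiguous requirements, and final review.
JUDGMENT_WORK = {"plan", "interpret_requirements", "final_review"}


@dataclass
class Agent:
    name: str
    cost_per_step: float  # illustrative relative cost, not real pricing
    log: list = field(default_factory=list)

    def run(self, description: str) -> None:
        # A real agent would call a model and tools; here we only record the step.
        self.log.append(description)


def assign(step_kind: str, main: Agent, sidekick: Agent) -> Agent:
    """Default to delegation; keep only judgment-heavy work on the main agent."""
    return main if step_kind in JUDGMENT_WORK else sidekick


def run_session(steps, main: Agent, sidekick: Agent) -> float:
    """Run every step on the assigned agent and return the total relative cost."""
    total = 0.0
    for kind, description in steps:
        agent = assign(kind, main, sidekick)
        agent.run(description)
        total += agent.cost_per_step
    return total


main = Agent("frontier", cost_per_step=10.0)
sidekick = Agent("sidekick", cost_per_step=1.0)
steps = [
    ("plan", "decide migration strategy"),
    ("edit", "apply mechanical rename across files"),
    ("run_tests", "run slow test suite"),
    ("final_review", "review diff before merge"),
]
total_cost = run_session(steps, main, sidekick)
all_frontier_cost = len(steps) * main.cost_per_step
```

In this toy session the main agent handles only the plan and the final review, so the total relative cost is 22 instead of 40 for an all-frontier run. A static lookup table is the simplest possible policy; in the system described here, the frontier model itself decides what to delegate.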
This approach addresses several fundamental problems with simpler routing strategies. First, it retains what Cognition calls “real frontier intelligence” rather than “benchmark-score intelligence,” acknowledging that routers often overfit to specific evaluation metrics. By keeping a frontier model actively involved throughout task execution, the system continues to benefit from capabilities that may not be well-captured by benchmarks. Second, the sidekick pattern generalizes beyond single-prompt tasks. Many routing systems make a one-time decision about which model to use based on the initial prompt, but prompts often lack sufficient information to accurately assess task difficulty, and users may follow simple initial requests with difficult clarifications. The sidekick architecture allows dynamic movement between models throughout a session.
Third, and particularly important from an LLMOps efficiency perspective, the sidekick approach avoids costly cache misses when switching between models. Cognition notes they previously explored a “Smart Friend” tool (and Anthropic released a similar “Advisor” tool) where one model could query another for advice. However, these tool-based approaches suffered from a critical limitation: each call to the secondary model required sending the full task context without being able to leverage caching, resulting in expensive reprocessing of tokens. In contrast, the sidekick architecture maintains separate, persistent cached contexts for both the main and sidekick models, allowing both to build up context incrementally over the course of a session.
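A rough cost model makes the difference concrete. The sketch below compares resending the full context on every advisor call with keeping a persistent cached prefix. The prices (10 units per uncached input token, 1 unit per cached token) and the context sizes are assumptions for illustration, not figures from the source or from any provider, and cache-write surcharges are ignored.

```python
def advisor_tool_cost(context_sizes, full_price):
    """Each advisor call resends the whole context at the full input price."""
    return sum(size * full_price for size in context_sizes)


def persistent_context_cost(context_sizes, full_price, cached_price):
    """Each turn pays full price for new tokens and the cached price for the prefix."""
    total = 0
    previous = 0
    for size in context_sizes:
        new_tokens = size - previous
        total += new_tokens * full_price   # tokens the model has not seen yet
        total += previous * cached_price   # prefix served from the cache
        previous = size
    return total


# Hypothetical session: the context grows by 10,000 tokens on each of four calls.
context_sizes = [10_000, 20_000, 30_000, 40_000]
FULL_PRICE = 10    # assumed cost units per uncached input token
CACHED_PRICE = 1   # assumed cost units per cached input token

uncached_total = advisor_tool_cost(context_sizes, FULL_PRICE)
cached_total = persistent_context_cost(context_sizes, FULL_PRICE, CACHED_PRICE)
```

Under these assumed prices the uncached approach costs 1,000,000 units and the persistent-context approach 460,000, and the gap widens as the session gets longer because the resent prefix keeps growing.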
The case study includes specific examples from the FrontierCode benchmark that illustrate when the sidekick pattern works well and when it struggles. For instance, on a task to modernize JavaScript code to ES6 and verify with a slow test suite, delegating the time-consuming test execution to the sidekick achieved a 62% cost reduction with no quality impact. Similarly, mechanically removing OpenTracing integration across many Go files—work requiring consistency but minimal judgment—was successfully delegated for a 32% cost reduction. However, the examples also acknowledge failure modes: a complex React/Redux feature requiring subtle judgment calls saw degraded quality when critical decisions were delegated to the sidekick, despite the cost savings. This transparency about limitations is notable in what is otherwise a promotional piece, and it highlights a fundamental challenge in production multi-model systems: accurately identifying which aspects of a task require frontier-level intelligence.
The case study reveals sophisticated attention to the practical engineering challenges of maintaining model context in a multi-agent system. Cognition notes that most cached inputs have only a 5-minute expiry, presenting a significant operational challenge for longer-running tasks. While they don’t fully detail their solution (explicitly encouraging readers to “think about how to engineer around this”), the mention of this constraint highlights a critical but often under-discussed aspect of production LLM systems: the engineering required to work within the operational constraints of API providers.
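Since the source leaves the solution open, the following is only one possible building block, not Cognition's approach: a small tracker that records when a cached context was last used, so an orchestrator can tell whether the next call is likely to hit a warm cache. The `CacheClock` class is hypothetical, and it assumes the provider refreshes the expiry on every use, which should be checked against the provider's documentation.

```python
import time

CACHE_TTL_SECONDS = 5 * 60  # the 5-minute expiry mentioned in the text


class CacheClock:
    """Tracks when a model's cached prefix was last read or written."""

    def __init__(self, ttl=CACHE_TTL_SECONDS, now=time.monotonic):
        self.ttl = ttl
        self.now = now          # injectable clock, which makes testing easy
        self.last_touch = None  # None means nothing has been cached yet

    def touch(self):
        """Call after every request that reads or writes the cached prefix."""
        self.last_touch = self.now()

    def seconds_left(self):
        """Time remaining before the cached prefix is assumed to expire."""
        if self.last_touch is None:
            return 0.0
        return max(0.0, self.ttl - (self.now() - self.last_touch))

    def is_warm(self):
        return self.seconds_left() > 0


# Usage with a fake clock so the example runs instantly.
fake_time = [0.0]
clock = CacheClock(now=lambda: fake_time[0])
clock.touch()
fake_time[0] = 240.0
warm_after_four_minutes = clock.is_warm()
fake_time[0] = 301.0
warm_after_five_minutes = clock.is_warm()
```

An orchestrator could consult `seconds_left()` before a slow step, such as a long test run, to decide whether the agent that will resume afterwards is likely to pay for a full reprocess of its context.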
The importance of cache efficiency permeates the design. Context caching can dramatically reduce costs by allowing subsequent API calls to reuse previously processed tokens at a fraction of the cost, but only if the system can maintain cache consistency. The dual-agent architecture with persistent contexts for both main and sidekick agents is explicitly designed to maximize cache utilization. This represents a more sophisticated approach to cost optimization than simply selecting cheaper models, recognizing that the interaction patterns between model calls can be as important as per-token pricing.
Beyond the basic sidekick pattern, Devin Fusion implements dynamic mid-session routing that can change which models are being used during task execution. The system employs lightweight classifiers that run during task execution to signal when switching to a different model would be beneficial—either upgrading from sidekick to main agent when a task proves more challenging than expected, or switching between different models entirely.
The elegant aspect of this design is how model switching is integrated with context compaction. Since context windows have limits and longer sessions eventually require condensing or summarizing earlier context (a process that triggers a cache miss anyway), Cognition uses these compaction points as opportunities to evaluate the situation and potentially switch models, effectively getting model switching “for free” from a cache penalty perspective. This design demonstrates thoughtful attention to the operational economics of LLM APIs, where cache misses represent significant cost events that should be strategically managed rather than incurred carelessly.
The system can even “upgrade” the sidekick model itself without reverting to the main model, providing flexibility to match model capabilities to evolving task requirements without necessarily escalating to the most expensive option. This multi-tier approach acknowledges that model selection is not binary but exists on a spectrum of capability and cost.
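The idea of switching only where the cache would be lost anyway, across more than two tiers, can be sketched as follows. The tier names, the numeric difficulty signal, and the thresholds are hypothetical. The source describes lightweight classifiers and upgrades but does not specify their outputs, and the downgrade branch is an assumption added here for symmetry.

```python
# Hypothetical capability ladder, cheapest first.
TIERS = ["sidekick-small", "sidekick-large", "frontier"]


def next_model(current, difficulty_signal, at_compaction,
               upgrade_threshold=0.7, downgrade_threshold=0.3):
    """Pick the model for the next stretch of the session.

    Switching is deferred to compaction points, because compaction already
    invalidates the cached context and the switch adds no extra cache miss.
    """
    if not at_compaction:
        return current
    index = TIERS.index(current)
    if difficulty_signal >= upgrade_threshold and index < len(TIERS) - 1:
        return TIERS[index + 1]  # step up one tier, not straight to frontier
    if difficulty_signal <= downgrade_threshold and index > 0:
        return TIERS[index - 1]  # assumed: easy stretches can step down
    return current


model = "sidekick-small"
model = next_model(model, difficulty_signal=0.9, at_compaction=False)  # no switch
model = next_model(model, difficulty_signal=0.9, at_compaction=True)   # upgrade
```

After these two calls `model` is `"sidekick-large"`: the hard signal is ignored mid-context and acted on at the compaction point, and the upgrade moves one tier up the ladder without escalating to the most expensive model.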
Cognition evaluates Devin Fusion on FrontierCode, which they describe as a “new state-of-the-art coding benchmark” measuring both code correctness and quality. The reported results show Devin Fusion achieving a 35% cost reduction compared to frontier models (GPT-5.5 and Opus 4.8) while maintaining comparable performance: Fusion scores 47.9, against 44.8 and 48.8 for the two frontier baselines. When combined with Fable 5 (a high-performance model that had its access suspended by US government directive in June 2026), the cost reduction increases to 41% while matching Fable 5’s standalone performance of 57.0-57.6.
It’s important to approach these benchmark results with appropriate skepticism. The case study itself acknowledges that “routers often overfit to specific benchmarks,” and FrontierCode is explicitly described as measuring what Cognition cares about in code generation. There’s an inherent risk that optimizing for a self-created or self-selected benchmark may not generalize to other teams’ priorities or coding contexts. Additionally, the case presents results without confidence intervals, information about benchmark composition, or details about how representative the evaluation set is of real-world software engineering work.
More compelling than the benchmark numbers is Cognition’s report of internal usage: after enabling Fusion for its own engineers, the company found that 88% of merged pull requests were driven entirely by the automated router without manual model selection. This suggests the system performs adequately in real usage by developers working on actual codebases, though the sample is limited to Cognition’s own engineering work, which may not be representative of broader software development. The remaining 12% of merged PRs involved manual model selection (or at least did not rely solely on automatic routing), which indicates the system is not appropriate for every task.
The case study notes that recent models, particularly Fable 5, perform unusually well in multi-agent setups. They observe that Fable 5 “delegates work more intelligently, requests context more efficiently, and plans more precisely,” suggesting that as base models improve, they become better at the meta-cognitive task of managing delegation and resource usage. This is an interesting observation about the scaling properties of LLM capabilities: not only do stronger models perform individual tasks better, but they may also be more effective at orchestrating complex workflows.
Cognition reports that Fusion with Fable 5 achieved 41% cost reduction versus 35% with Opus and GPT-5.5-level models, but they acknowledge this comparison may understate the difference because they had less time to tune the system for Fable 5 before access was suspended. This candor about the limitations of their evaluation is commendable, though it also means the headline numbers should be interpreted cautiously.
The case study also suggests that different frontier models have different relative strengths—some models excel at UI testing while others are better at identifying bugs in PRs. This observation supports the broader thesis that multi-model systems can capture complementary capabilities, though the text doesn’t provide enough detail to verify these specific claims.
While the case study focuses heavily on the technical architecture, it reveals several production deployment considerations. The system is deployed as part of Devin, Cognition’s AI software engineering agent, accessible through their cloud platform. The fact that they’re releasing it as a preview to external users suggests they’ve reached a level of operational maturity where they’re comfortable with external traffic, though “preview” status also provides appropriate hedging against potential issues.
The emphasis on cost reduction—35-41% depending on configuration—addresses a real concern for organizations deploying LLMs at scale. As frontier model capabilities have increased, so have their costs, and at high usage volumes, model inference expenses can become prohibitive. However, it’s worth noting that the cost comparison is relative to using expensive frontier models exclusively. Organizations already using cheaper models for appropriate tasks may see smaller benefits.
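A back-of-the-envelope calculation shows how much the baseline matters. The sketch assumes, purely for illustration, that the reported saving applies only to the share of spend that currently goes to frontier models and that spend on cheaper models is unchanged. The function name and the spend shares are hypothetical.

```python
def effective_saving_pct(routing_saving_pct, frontier_spend_share_pct):
    """Overall saving when only the frontier share of spend benefits from routing."""
    return routing_saving_pct * frontier_spend_share_pct / 100


# Reported headline figure: every call goes to a frontier model.
all_frontier = effective_saving_pct(35, 100)

# Hypothetical organization that already sends 40% of its spend to cheaper models.
already_tiered = effective_saving_pct(35, 60)
```

Under this assumption the all-frontier organization sees the full 35%, while the one that has already tiered its usage sees 21% overall.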
The system’s reliance on maintaining cached contexts introduces operational dependencies on API provider caching policies. The 5-minute cache expiry mentioned in the text suggests they’re working with constraints imposed by model providers, and changes to these policies could significantly impact system economics. This highlights a broader challenge in LLMOps: production systems built on third-party APIs are subject to provider policies and pricing changes that are outside the organization’s control.
Several aspects of this case study warrant careful consideration. First, the text is explicitly promotional—Cognition is marketing their product and encouraging readers to try it and potentially work for them. While they provide substantial technical detail and acknowledge some limitations, the framing is consistently positive and the metrics chosen naturally highlight their strengths.
Second, the benchmark used for evaluation (FrontierCode) is described as “new” and “state-of-the-art,” but no details are provided about its composition, who created it, whether it’s publicly available, or how it relates to existing coding benchmarks. The risk of overfitting to a self-selected evaluation metric is real. The claim that existing routing tools “look nice on most benchmarks but fail to write code you’d actually merge” could equally apply to FrontierCode if it doesn’t accurately represent real-world code review standards.
Third, while the 88% figure for internal merged PRs is compelling, it’s based on usage by Cognition’s own engineers working on their own codebase. This population is likely to be more tolerant of system quirks, more skilled at working around limitations, and more motivated to report success than external users would be. The generalization to other organizations, codebases, and engineering cultures is unclear.
Fourth, the specific cost savings (35-41%) depend heavily on the baseline assumption of using expensive frontier models for all work. Organizations that have already implemented basic model tiering or that work on tasks where cheaper models are generally sufficient may see much smaller benefits. The cost comparison also doesn’t account for the engineering effort required to build, maintain, and operate a sophisticated multi-model routing system, which represents an ongoing operational cost.
Fifth, the case study focuses almost entirely on cost and benchmark performance, with limited discussion of other important production concerns like latency, reliability, error handling, observability, or how the system behaves when one model or API is unavailable. For a production system, these operational characteristics can be as important as cost and quality metrics.
Despite these limitations, the case study illustrates several important themes in production LLM systems. The tension between cost and capability is fundamental to LLMOps, and as models continue to improve and become more expensive, sophisticated routing strategies will likely become more important. The shift from single-model to multi-model architectures represents a maturation of the field, acknowledging that different tasks have different requirements and that one-size-fits-all approaches leave significant value on the table.
The attention to caching and context management reflects an understanding that LLMOps economics involves more than just per-token pricing. The interaction patterns between API calls, the management of context windows, and the strategic use of caching can significantly impact total cost of ownership. This level of sophistication is characteristic of mature production systems that have moved beyond proof-of-concept implementations.
The concept of maintaining frontier model involvement while delegating appropriate work represents a middle ground between full automation with cheap models and expensive frontier-model-everywhere approaches. This philosophy of using expensive intelligence strategically rather than uniformly is likely to stay relevant for as long as the capability gap between frontier and commodity models persists.
Finally, the acknowledgment that different models have different strengths points toward a future where LLMOps platforms need to manage portfolios of models and route work based not just on general capability levels but on task-specific strengths. This adds complexity but also potentially unlocks better performance across diverse workloads. The challenge for the field will be developing routing strategies that are robust, generalizable, and don’t require extensive task-specific tuning to work effectively.