Warp, a terminal software company, developed a cloud-based agent orchestration platform called Oz to address the limitations of running multiple AI coding agents on local laptops. The problem emerged as developers increasingly shifted from writing code by hand to writing by prompt, creating laptop capacity constraints, lack of visibility into agent work across teams, and inability to run agents when laptops are offline. Warp's solution provides cloud-hosted agent execution with automatic tracking, team visibility, programmable APIs, and support for multiple agent harnesses, enabling developers to parallelize coding tasks across multiple cloud agents, create scheduled automations, and embed agent capabilities into internal applications. The platform demonstrates successful use cases including parallel feature implementation, automated issue triage, and team-wide agent coordination.
Warp is a terminal software company that has evolved from building a modern terminal interface into a platform for agentic development workflows. Founder and CEO Zach Lloyd presented a comprehensive vision for agent orchestration, predicting that 2026 will be “the year of agent orchestration.” The case study centers on Warp’s product Oz, a cloud-based platform for running and managing AI coding agents. The core insight driving this work is that developers have fundamentally shifted from writing code by hand to writing code by prompt; Zach notes he hasn’t written code manually in the last six months despite still shipping software regularly.
The presentation provides an honest look at the emerging challenges in LLMOps as AI agents become more central to software development workflows. That said, Warp is actively building and selling products in this space, so the narrative is naturally shaped to highlight problems its platform solves, and its claims should not be accepted at face value. The fundamental challenges identified around laptop capacity, visibility, security, and orchestration do, however, resonate with real production constraints teams face when scaling agent usage.
The case study identifies several interconnected problems that emerge when development teams move to agent-driven workflows at scale. The first is a surprisingly mundane but critical issue: laptop capacity constraints. Warp’s codebase consists of approximately one million lines of Rust code, and developers typically have four or five checkouts on their laptops. Running multiple coding agents simultaneously across these checkouts quickly hits the limits of what local machines can handle, particularly when agents are performing compute-intensive tasks like code analysis, testing, and generation.
Beyond raw computational constraints, the presentation highlights visibility and governance challenges. When agents run exclusively on individual developer laptops, engineering managers and team leads have no visibility into how agents are being adopted, what work they’re doing, or whether there are security vulnerabilities in how they’re being used. This creates what Zach describes as “chaos” that needs to be brought under control without sacrificing the productivity gains that agents provide.
The third major problem relates to availability and persistence. Agents tied to laptops can only run when those laptops are powered on and connected to the internet. This fundamentally limits the types of workflows agents can support, particularly for long-running tasks, scheduled automations, or work that needs to happen asynchronously from developer activity.
Finally, there’s an emerging need for longer-running and multi-threaded agent workflows. Interactive pairing with a single agent is becoming insufficient as tasks grow more complex. Developers need agents that can work on hard tasks for extended periods without exceeding context windows or losing track of their work. They also need multiple agents to work in parallel or collaboratively on different aspects of a problem, which requires orchestration infrastructure that doesn’t exist when agents run in isolation on individual machines.
Warp’s Oz platform addresses these challenges through several key primitives that provide a foundation for cloud-based agent orchestration. At the core is the concept of environments, which define what an agent has access to, including toolchain configurations, code repositories, and other resources. The platform provides hosting infrastructure for agents to run in the cloud, built on sandboxing technologies that Warp employs such as Daytona, E2B, Docker sandboxes, and namespaces.
A critical component is what Warp calls auto-tracking, which ensures that no matter where or how a developer launches an Oz coding agent, they automatically receive a link to track what it’s doing. All artifacts from agent runs are automatically recorded, including reasoning traces, code changes, and conversation history. This creates a persistent record that can be revisited to understand how agents arrived at particular solutions.
The platform emphasizes keeping humans in the loop even for cloud-based agents. Developers can see what agents are doing in real-time, steer them mid-task, and pick up work when an agent completes 80% of a task but needs human intervention for the remaining 20%. This handoff capability is positioned as essential for production workflows where complete automation isn’t always possible or desirable.
Importantly, Oz is designed to be agent-agnostic, supporting multiple agent harnesses rather than locking users into a single approach. While Warp has its own coding agent, the platform is architected to work with other agents such as Claude Code or Codex. This flexibility acknowledges that different agent implementations have different strengths and weaknesses for various tasks.
The entire platform is built with programmability as a first-class concern. All functionality is exposed through APIs, SDKs, and CLI tools, allowing developers to launch agents programmatically, configure environments through code, pull artifacts, and access conversation history. This programmability is explicitly designed not just for human developers but also to enable agents to orchestrate each other, which Warp sees as a key future capability.
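Since the talk described this API surface only at a high level, the following is a minimal sketch of what a programmatic launch might look like. The `AgentLaunchRequest` shape, its field names, and the idea of a JSON launch payload are assumptions for illustration, not Warp’s documented SDK.

```python
# Hypothetical sketch of launching an Oz agent programmatically.
# All names and fields here are assumptions, not Warp's actual API.
from dataclasses import dataclass, field

@dataclass
class AgentLaunchRequest:
    prompt: str
    environment: str          # named environment: toolchain + repo access
    harness: str = "warp"     # which agent harness to run (e.g. "warp", "codex")
    metadata: dict = field(default_factory=dict)

    def to_payload(self) -> dict:
        """Serialize to the JSON body a hypothetical launch endpoint might accept."""
        return {
            "prompt": self.prompt,
            "environment": self.environment,
            "harness": self.harness,
            "metadata": self.metadata,
        }

req = AgentLaunchRequest(
    prompt="Implement the missing SUMIF formula",
    environment="spreadsheet-app",
    metadata={"tracking": "auto"},
)
payload = req.to_payload()
print(payload["environment"])
```

The same payload could equally be produced by a CLI wrapper or another agent, which is the point of an API-first design: humans and agents launch runs through one interface.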
The presentation included several live demonstrations showing how Oz enables different patterns of agent orchestration. The first use case focused on parallel agent execution to accelerate development work. Using an internal project that recreates Google Docs functionality, built through what Zach calls “vibe coding,” he demonstrated launching multiple cloud agents simultaneously to implement different sets of missing spreadsheet formulas. Rather than running these agents sequentially, or overloading a local laptop by running them in parallel locally, he launched five separate cloud agents, each responsible for implementing formulas starting with different letters.
Each agent appeared to use the same interactive interface developers are accustomed to, but execution happened in the cloud with full tracking. The demonstration showed how a developer could launch these agents from the terminal and then access them through a web interface that provides team-wide visibility. This web view showed all agent runs across the team, allowing developers to see what their colleagues’ agents were working on, examine artifacts produced, and review PRs that agents created. Zach positioned this as creating a “Google Docs for agents” where agent-driven development becomes collaborative and visible rather than isolated on individual machines.
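The fan-out pattern from this demo can be sketched as follows; `run_agent` is a stand-in for a real cloud launch call, and the tracking URL format is invented for illustration.

```python
# Sketch of the demo's fan-out pattern: one cloud agent per group of
# formulas, each returning a tracking link. run_agent is a stub standing
# in for a real launch; the URL scheme is invented, not Warp's.
from concurrent.futures import ThreadPoolExecutor

def run_agent(task: str) -> dict:
    """Stand-in for launching a cloud agent; returns a tracking record."""
    return {
        "task": task,
        "status": "running",
        "track_url": f"https://oz.example/runs/{hash(task) & 0xffff:04x}",
    }

letter_groups = ["A-E", "F-J", "K-O", "P-T", "U-Z"]
tasks = [f"Implement spreadsheet formulas starting with {g}" for g in letter_groups]

# Launch all five agents concurrently; each run is tracked automatically.
with ThreadPoolExecutor(max_workers=5) as pool:
    runs = list(pool.map(run_agent, tasks))

for run in runs:
    print(run["task"], "->", run["track_url"])
```

Because every launch yields a tracking link, the web view described above falls out naturally: it is just a shared index over these run records.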
The second major use case centered on scheduled automations built using what Warp calls skills. These are version-controlled configuration files that define repeatable agent tasks. Examples included an agent that runs after each Warp release to analyze the codebase and documentation, generating PRs to update documentation based on changes in the release. Another automation runs weekly to identify feature flags guarding dead code that will never be turned off, creating PRs to clean up this technical debt.
These skill-based agents represent a shift from interactive pairing to fully automated workflows. Because they run in the cloud on schedules rather than on developer laptops, they can execute reliably without requiring developer presence. The tracking infrastructure means teams can still review what these automated agents did, examining the reasoning and artifacts to ensure quality.
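A skill file of this kind might look something like the following sketch; the field names (`name`, `schedule`, `environment`, `prompt`) are assumptions about what such a version-controlled definition could contain, not Warp’s actual schema.

```python
# Hypothetical shape of a version-controlled "skill" definition for the
# post-release documentation updater described above. All field names
# are assumptions, not Warp's actual skill format.
import json

DOC_UPDATER_SKILL = {
    "name": "post-release-doc-updater",
    "schedule": "on_release",          # trigger: run after each release
    "environment": "warp-monorepo",    # environment the agent runs in
    "prompt": (
        "Compare the latest release diff against the documentation and "
        "open PRs updating any docs that are now stale."
    ),
}

def validate_skill(skill: dict) -> bool:
    """Reject skills missing any required field before they are committed."""
    required = {"name", "schedule", "environment", "prompt"}
    return required <= skill.keys()

assert validate_skill(DOC_UPDATER_SKILL)
print(json.dumps(DOC_UPDATER_SKILL, indent=2))
```

Treating such definitions as code means they go through the same review and versioning as the rest of the repository, which is what makes the automations auditable.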
The third use case demonstrated embedding agent capabilities into custom applications. Warp built an internal tool called Power Fixer for triaging GitHub issues. The application provides a terminal user interface showing live GitHub issues and uses Oz agents for two distinct purposes. First, agents automatically analyze every filed issue to detect duplicates, running this analysis in the background. Second, human triagers can select an issue and launch an agent directly from the TUI to generate a fix. This demonstrates using agents not just as standalone tools but as intelligent components embedded in larger application workflows.
While the presentation focused primarily on use cases rather than deep technical implementation details, several architectural elements emerged as important to the platform’s LLMOps capabilities. The platform handles access control and permissions, allowing teams to configure who can see what agent runs and artifacts. The demonstration showed public access for presentation purposes, but this is configurable for production use where security and privacy are concerns.
The environment definition system appears to be flexible enough to support different tech stacks and toolchains, though specific implementation details weren’t provided. The hosting infrastructure abstracts away the complexity of provisioning cloud resources, managing sandboxes, and handling cleanup after agent runs complete.
Artifact management is handled automatically, with the platform capturing and storing everything produced during agent runs including code changes, conversation logs, and reasoning traces. This creates an audit trail and knowledge base that teams can reference when reviewing agent work or debugging issues.
The web interface provides real-time views of agent execution that are synchronized with terminal interfaces, allowing developers to switch between contexts seamlessly. A developer can launch an agent from their terminal, view it on their phone, and make adjustments from a web browser, with all views reflecting the same underlying agent state.
While the presentation highlighted significant capabilities, several production considerations warrant careful evaluation. The platform’s value proposition rests heavily on teams hitting laptop capacity limits or needing team visibility, which may not apply uniformly across all organizations. Smaller teams or those working on less compute-intensive codebases might not immediately feel these pain points, potentially making the cloud orchestration overhead unnecessary complexity.
The emphasis on supporting multiple agent harnesses is architecturally sound but raises questions about the depth of integration with each. Supporting many different agents could lead to lowest-common-denominator functionality where advanced features of specific agents aren’t fully exposed through the orchestration layer. The presentation didn’t deeply address how the platform handles differences in agent capabilities, prompt engineering approaches, or optimization strategies across different harnesses.
The human-in-the-loop emphasis is pragmatic and likely reflects real production experience where full automation remains challenging. However, this also suggests that even with cloud orchestration, these agents still require significant human oversight and intervention. The 80% completion scenario mentioned indicates that agents frequently produce incomplete work requiring developer cleanup, which is an honest acknowledgment of current limitations but also means the productivity multiplier may be lower than pure automation would suggest.
Security and access control were mentioned but not demonstrated in depth. Production deployments would need careful configuration of what code agents can access, what actions they can take, and how credentials and secrets are managed. Running agents in cloud environments introduces additional attack surface compared to local execution, and the presentation didn’t address how the platform mitigates these risks.
The cost model for cloud agent execution wasn’t discussed. Running multiple agents in cloud sandboxes with access to codebases and toolchains likely incurs non-trivial infrastructure costs beyond what teams pay for local execution. Understanding the economics of cloud agent orchestration versus local execution would be important for teams evaluating adoption.
While much of the demonstration focused on parallel execution of independent agents, the presentation positioned multi-agent orchestration as a key future direction. Zach explicitly noted that the next major development would be moving from one-off cloud agent runs to coordinating multiple agents that work together on complex tasks based on a concrete plan.
This represents a significant architectural evolution from the current state. True multi-agent coordination requires additional primitives for agent communication, task decomposition, work synchronization, and result aggregation. The programmable API foundation provides building blocks for this, but actual orchestration logic would need to handle complex scenarios like agents waiting for each other, sharing intermediate results, and adapting plans based on what other agents discover.
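One way such coordination primitives might compose is a dependency-aware fan-out/join loop: decompose a plan into subtasks, launch each task only once its prerequisites have produced results, and aggregate as you go. The plan format and the `run_agent` stand-in below are invented for illustration, not anything Warp demonstrated.

```python
# Minimal sketch of dependency-aware multi-agent coordination: tasks
# whose inputs depend on other agents' outputs wait until those agents
# complete. The plan format and run_agent stub are invented.
def run_agent(task: str, context: dict) -> str:
    """Stand-in for a cloud agent run; real runs would return artifacts."""
    return f"result({task})"

plan = {
    "design-schema": [],                           # task -> prerequisite tasks
    "write-backend": ["design-schema"],
    "write-frontend": ["design-schema"],
    "integration-test": ["write-backend", "write-frontend"],
}

results: dict[str, str] = {}
pending = dict(plan)
while pending:
    # Launch every task whose prerequisites are all satisfied.
    ready = [t for t, deps in pending.items() if all(d in results for d in deps)]
    for task in ready:
        # Each agent receives its prerequisites' outputs as context.
        results[task] = run_agent(task, {d: results[d] for d in plan[task]})
        del pending[task]

print(list(results))
```

Even this toy loop surfaces the open questions the presentation left unanswered: what happens when a prerequisite agent fails, produces conflicting output, or discovers the plan itself needs revising.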
The skill-based automation system hints at how this might work, with version-controlled definitions specifying not just individual agent tasks but potentially multi-agent workflows. However, the presentation didn’t provide concrete examples of agents orchestrating each other or demonstrate how the platform would handle failures, conflicts, or coordination challenges in multi-agent scenarios.
A notable strength of Warp’s approach is the emphasis on developer experience and minimizing friction in the transition from local to cloud execution. The demonstration showed that launching a cloud agent uses nearly identical interfaces to local agents, reducing the learning curve. The automatic tracking means developers don’t need to explicitly configure monitoring or logging; it happens by default.
The integration with existing terminal workflows is strategic. Developers already working in Warp terminals can adopt cloud agent execution without switching tools or significantly changing their workflows. The web interface provides additional capabilities for team visibility and mobile access but isn’t required for basic usage.
However, this tight integration with Warp’s terminal also creates potential lock-in. Teams not already using Warp would need to adopt it to leverage Oz, which represents a significant tool change beyond just adopting cloud agent orchestration. The presentation didn’t address how teams using other terminals or IDEs would integrate with the platform, though the API-first design suggests programmatic access could work from any environment.
The presentation acknowledged that building custom agent orchestration infrastructure is a viable alternative to using Warp’s platform. Companies like Stripe and Ramp have reportedly built internal systems for this purpose, such as one called Minions. For organizations of sufficient size and with specific requirements, building custom infrastructure provides maximum flexibility and control.
Warp’s value proposition is essentially providing “Vercel for cloud agents” or “Supabase for cloud agents”: managed infrastructure that handles the undifferentiated heavy lifting so teams can focus on using agents rather than operating them. This is a common platform play in developer tools, and its validity depends on whether the abstraction level and feature set match what teams need.
GitHub’s cloud agents were mentioned in the Q&A as offering similar functionality. Zach acknowledged the similarity but positioned Warp’s differentiators as programmability and flexibility in bringing agents to your own infrastructure rather than being locked into rigid environments. However, without detailed feature comparisons, it’s difficult to assess how substantive these differences are versus marketing positioning.
The platform’s agent-agnostic design suggests Warp views itself as infrastructure rather than trying to own the entire agent stack. Supporting Claude Code, Codex, and other agent harnesses in addition to Warp’s own agent positions Oz as a layer that works with the broader ecosystem rather than requiring lock-in to Warp’s agent technology.
The skill system using version-controlled configuration files suggests integration with standard development workflows and tools like Git. This allows teams to treat agent configurations as code, applying the same review, versioning, and deployment practices they use for application code.
The API and SDK approach enables integration with other tools in the development workflow. The Power Fixer demonstration showed embedding agent capabilities in custom applications, but the same primitives could enable integration with CI/CD systems, issue trackers, project management tools, and other parts of the development toolchain.
This case study represents an honest attempt to address real pain points that emerge as AI agents become central to software development workflows. The problems identified around laptop constraints, team visibility, persistent execution, and orchestration are legitimate challenges that production deployments of AI agents face. Warp’s approach of providing managed infrastructure with automatic tracking, programmability, and human-in-the-loop capabilities addresses many of these needs.
However, the presentation is fundamentally a product pitch, and teams should carefully evaluate whether they actually face the problems Oz solves before adopting additional platform complexity. The demonstrations showed genuine functionality but were carefully curated to highlight strengths. Production adoption would require deeper evaluation of cost, security, integration requirements, and whether the platform’s abstractions align with team needs.
The emphasis on future multi-agent orchestration capabilities is promising but also indicates the platform is still evolving. Teams adopting now would be betting on Warp’s ability to deliver on this roadmap while many features are still emerging. The broader trend toward cloud-based agent execution and orchestration appears sound, but whether any particular platform becomes the standard remains to be seen, as the market is still very early and rapidly evolving.