GitHub Next presents Ace, a research prototype addressing the critical alignment bottleneck in agentic software development. The problem identified is that existing coding agents are single-player tools that accelerate individual implementation without supporting team coordination, leading to wasted work, coordination debt, and misaligned outputs. Ace combines real-time multiplayer chat, cloud-based microVMs, shared agent access, and integrated development tools into a unified workspace where teams can align on plans, collaborate with AI agents, and maintain shared context throughout the development lifecycle. Early demonstrations show that teams can prompt agents collaboratively, share live development environments instantly, and maintain alignment through continuous planning-implementation cycles rather than relying on delayed PR reviews.
GitHub Next’s Ace prototype represents an important exploration in LLMOps focused on the organizational and collaborative challenges of deploying AI coding agents at scale within development teams. Presented in early 2026 by Maggie Appleton, a staff research engineer at GitHub Next, this case study tackles a fundamental problem that has emerged as AI agents have matured: the misalignment between single-player agentic tools and the inherently collaborative nature of software development.
The core thesis is that the current generation of coding agents (circa early 2026) are designed as isolated, individual experiences—what Appleton calls the “one man, two dozen claudes” model. While these tools excel at accelerating individual implementation, they fail to address the coordination, alignment, and context-sharing requirements that determine whether teams build the right things. This represents a critical LLMOps challenge: as implementation becomes faster and cheaper through AI agents, the bottleneck shifts from “how to build it” to “should we build it” and ensuring team alignment on priorities, approach, and context.
The case study articulates several interconnected problems that emerge when powerful coding agents are deployed without adequate collaborative infrastructure:
Misaligned Primitives and Infrastructure: Traditional development tools like GitHub (PRs and issues), Slack, Jira, and Linear were designed for a different era of software development with slower implementation cycles. These tools cannot handle the speed, volume, and shape of agent-generated work. Pull requests, in particular, become overburdened as the sole checkpoint for alignment, occurring too late in the development cycle when code has already been written.
Collapsed Implementation Windows: The time between logging an issue and an agent opening a PR has shrunk to minutes. This acceleration has eliminated many natural touchpoints for team alignment that previously occurred during planning, building, and review phases. Teams often skip proper planning because implementation appears so cheap, leading to agents executing on poorly considered plans.
Hidden Planning and Context: Most coding agents in 2026 feature local plan modes that remain unshared with teammates. This means developers can dispatch agents to implement features without team discussion of the approach, losing critical alignment checkpoints. Furthermore, the context needed for good decision-making—business constraints, political dynamics, product vision, user research, organizational history—exists in people’s heads rather than in codebases, making it inaccessible to agents working in isolation.
Coordination Debt: The lack of shared visibility into agent work creates several problems: merge conflicts from multiple agents touching the same files, duplicated work when multiple team members assign agents to the same feature, and massive PR backlogs with insufficient context for reviewers. Wasted work results from features that don’t solve real problems or that receive critical feedback only after completion.
Speed-Context Mismatch: Teams now ship five features per day instead of one, making it extremely difficult to maintain awareness of what coworkers are doing. The volume and velocity of agent-generated output overwhelm traditional coordination mechanisms.
Ace is positioned as a research prototype exploring how to enable collaborative AI engineering. It’s explicitly not a production product as of the talk (early 2026) but is entering technical preview with thousands of users for iterative learning. The architecture combines several key components:
Session-Based MicroVM Infrastructure: Each work session in Ace is backed by a sandboxed cloud computer (microVM) running on its own git branch. This architecture provides several LLMOps advantages. First, it enables instant context switching between parallel workstreams without local git management overhead. Second, it allows teammates to join sessions instantly without pulling branches or managing local worktrees—they simply join and see the current state. Third, work persists independently of individual laptops; sessions don’t die when a developer closes their computer, enabling true asynchronous collaboration and even mobile access (planned for future development).
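To make the join-without-cloning property concrete, here is a minimal sketch of what a session-per-microVM data model could look like. All names (`Session`, `SessionRegistry`, `vmId`) are hypothetical, not Ace's actual API; the point is that each session owns a branch, the VM outlives any one participant, and joining is just a lookup rather than a local checkout.

```typescript
// Hypothetical sketch of a session-per-microVM model; not Ace's real schema.
interface Session {
  id: string;
  branch: string;            // dedicated git branch for this workstream
  vmId: string;              // sandboxed cloud microVM backing the session
  participants: Set<string>;
  createdAt: Date;
}

class SessionRegistry {
  private sessions = new Map<string, Session>();

  create(id: string, branch: string, vmId: string): Session {
    const session: Session = {
      id, branch, vmId,
      participants: new Set(),
      createdAt: new Date(),
    };
    this.sessions.set(id, session);
    return session;
  }

  // Joining requires no local clone or worktree: the caller simply
  // becomes a participant in the already-running environment.
  join(sessionId: string, userId: string): Session {
    const session = this.sessions.get(sessionId);
    if (!session) throw new Error(`unknown session: ${sessionId}`);
    session.participants.add(userId);
    return session;
  }

  // A participant leaving does not tear down the VM; work persists
  // independently of any individual laptop.
  leave(sessionId: string, userId: string): void {
    this.sessions.get(sessionId)?.participants.delete(userId);
  }
}
```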
Multiplayer Agent Prompting: Unlike traditional single-user coding agents, Ace allows multiple team members to prompt the same agent within a session. The agent has access to the full conversation history, treating team chat as input to its prompts. This enables several workflows: designers and PMs without deep coding skills can participate in agent-driven development alongside engineers; teams can iteratively refine agent outputs through collaborative prompting; and the full reasoning and decision context becomes visible to all participants rather than hidden in one developer’s local terminal.
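A small sketch can illustrate the core mechanic here: folding a multi-author chat transcript into a single agent prompt. The message shape and `buildPrompt` function are assumptions for illustration, not Ace's implementation; what matters is that every participant's messages, not just one developer's, form the agent's context.

```typescript
// Hypothetical sketch: team chat as agent input. Not Ace's actual API.
interface ChatMessage {
  author: string; // e.g. an engineer, a designer, a PM, or "agent"
  body: string;
  sentAt: Date;
}

function buildPrompt(history: ChatMessage[], task: string): string {
  // Serialize the full conversation so the agent sees everyone's input.
  const transcript = history
    .map((m) => `${m.author}: ${m.body}`)
    .join("\n");
  return [
    "You are a coding agent working with a team. The conversation",
    "below is your context; weigh input from every participant,",
    "not only the most recent speaker.",
    "",
    transcript,
    "",
    `Current task: ${task}`,
  ].join("\n");
}
```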
Integrated Development Environment: Ace integrates multiple development primitives into a unified interface. It combines Slack-like chat for communication, shared terminal access for all session participants, live preview capabilities viewable by the entire team, real-time multiplayer code editing (when manual intervention is needed), and direct PR creation with bidirectional links between Ace sessions and GitHub. This integration means context isn’t fragmented across tools—planning discussions, agent prompting, implementation, and review can all occur in the same environment with full shared visibility.
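One way to picture this integration is a single session event stream into which every primitive emits, so planning chat, terminal output, agent edits, previews, and PR links share one timeline. The following discriminated union is a hypothetical modeling choice, not Ace's real schema:

```typescript
// Hypothetical: all primitives share one session timeline.
type SessionEvent =
  | { kind: "chat"; author: string; body: string }
  | { kind: "terminal"; command: string; output: string }
  | { kind: "edit"; author: string; file: string; diff: string }
  | { kind: "preview"; url: string }
  | { kind: "pr"; number: number; url: string }; // bidirectional GitHub link
```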
Model Selection and Multi-Model Support: Users can select which model to use for agent tasks (demonstrated with Claude Opus 4.6 in the talk), indicating that Ace supports multiple LLM backends. This flexibility is important for LLMOps at scale, allowing teams to match models to task requirements and manage costs.
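A hedged sketch of per-task model selection follows. The talk only demonstrates Claude Opus 4.6 being chosen, so the second model identifier, the dispatch table, and `runAgentTask` are illustrative assumptions rather than Ace's backend list.

```typescript
// Hypothetical model-selection dispatch; identifiers are illustrative.
type ModelId = "claude-opus-4.6" | "cheaper-model";

type Backend = (prompt: string) => Promise<string>;

// In a real system each entry would call a provider SDK; stubs keep
// the sketch self-contained.
const backends: Record<ModelId, Backend> = {
  "claude-opus-4.6": async (p) => `frontier-model response to: ${p}`,
  "cheaper-model": async (p) => `cheap response to: ${p}`,
};

// Teams can match the model to the task: a frontier model for planning
// and refactors, a cheaper one for routine edits, which also bounds cost.
async function runAgentTask(model: ModelId, prompt: string): Promise<string> {
  return backends[model](prompt);
}
```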
Automatic Commit Generation: Agents automatically create commits with readable commit messages as they make changes, maintaining development hygiene without manual overhead. This also provides granular visibility into what changes agents have made and enables easy rollback if needed.
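The mechanic is simple enough to sketch: after each change set, summarize the staged diff into a readable message and commit, so every agent step is visible and individually revertable. The `summarizeDiff` helper stands in for an LLM call and is an assumption; the git invocations are standard.

```typescript
import { execFileSync } from "node:child_process";

// Assumed helper: ask a model for a one-line summary of the staged diff.
async function summarizeDiff(diff: string): Promise<string> {
  return "feat: placeholder summary of the staged changes";
}

// Hypothetical sketch of agent-side commit hygiene, not Ace's code.
async function commitAgentChanges(repoDir: string): Promise<void> {
  execFileSync("git", ["add", "--all"], { cwd: repoDir });
  const diff = execFileSync("git", ["diff", "--cached"], {
    cwd: repoDir,
    encoding: "utf8",
  });
  if (!diff.trim()) return; // nothing to commit
  const message = await summarizeDiff(diff);
  execFileSync("git", ["commit", "-m", message], { cwd: repoDir });
  // Rolling back any single agent step is then just:
  //   git revert <commit-sha>
}
```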
Context Summarization and Awareness: Ace maintains continuously updated summary blocks showing the latest changes in each session, whether from humans or agents. This addresses the speed-context mismatch problem by providing at-a-glance orientation when switching between sessions or returning to work. The dashboard provides proactive AI-driven summaries of recent work, unfinished tasks from previous sessions (e.g., reminding developers what they left incomplete before the weekend), and team pulse summaries of coworker activity. This represents an interesting LLMOps pattern: using agents not just for code generation but for synthesizing and surfacing relevant organizational context.
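A minimal sketch of the continuously updated summary block, assuming a regenerate-on-new-events cadence (the actual trigger and shape in Ace are not described in the talk). The `summarizeEvents` function stands in for an LLM call; folding the previous summary back in keeps the block short but cumulative.

```typescript
// Hypothetical summary-block refresh loop; cadence and shape are assumptions.
interface SummaryBlock {
  sessionId: string;
  text: string;
  coveredThrough: Date;
}

// Assumed LLM call: "summarize the latest changes in this session
// for a teammate who has been away".
async function summarizeEvents(events: string[]): Promise<string> {
  return `Summary of ${events.length} recent events`;
}

async function refreshSummary(
  sessionId: string,
  newEvents: string[],
  previous?: SummaryBlock,
): Promise<SummaryBlock> {
  // Fold the prior summary in so context accumulates without growing.
  const input = previous ? [previous.text, ...newEvents] : newEvents;
  return {
    sessionId,
    text: await summarizeEvents(input),
    coveredThrough: new Date(),
  };
}
```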
Several important LLMOps patterns emerge from this case study:
Collaborative Planning-Implementation Cycles: Ace embodies a shift from sequential planning-then-building phases to continuous interleaved cycles. The collaborative plan editing feature allows teams to iterate on agent plans together before execution, with all team members seeing cursors and edits in real-time. This addresses a key LLMOps challenge: ensuring AI systems receive good specifications before acting, which is much more efficient than correcting bad outputs after the fact.
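One way such a cycle could be enforced is a plan-before-execution gate: the agent drafts a plan, the team edits it collaboratively, and execution is blocked until the plan is explicitly approved. This state machine is an assumption about how the checkpoint could work, not Ace's actual mechanism.

```typescript
// Hypothetical plan-approval gate; not Ace's real workflow engine.
type PlanState = "drafted" | "under-review" | "approved";

interface Plan {
  steps: string[];
  state: PlanState;
  editors: Set<string>; // everyone who touched the plan before execution
}

function editPlan(plan: Plan, editor: string, steps: string[]): Plan {
  plan.editors.add(editor);
  return { ...plan, steps, state: "under-review" };
}

function approve(plan: Plan): Plan {
  return { ...plan, state: "approved" };
}

function execute(plan: Plan): void {
  if (plan.state !== "approved") {
    // Cheap implementation makes skipping planning tempting; the gate
    // restores the alignment checkpoint before any code is written.
    throw new Error("plan must be approved by the team before execution");
  }
  for (const step of plan.steps) {
    console.log(`agent executing: ${step}`);
  }
}
```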
Shared Context Management: By making all conversations, prompts, and implementation visible to the team and accessible to agents, Ace creates what Appleton calls a “social information fabric.” This has several implications for LLMOps. Agents can potentially understand team dynamics and organizational context that would be invisible to traditional code-only systems. The system could eventually notify developers of relevant decisions or pull them into conversations about features they originally built. Context is preserved across sessions rather than lost when individual developers finish work.
Backwards Compatibility: Ace doesn’t completely replace existing workflows—PRs created in Ace can be opened in GitHub with links back to the originating session. Code can still be edited in VS Code. This pragmatic approach recognizes that LLMOps solutions need to integrate with existing toolchains rather than require wholesale replacement.
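The bidirectional link is straightforward to sketch with the standard GitHub API: when a session opens a PR, the PR body carries a link back to the originating session, and the PR URL is stored back on the session. The Octokit call below is real GitHub API usage; the session URL format is an assumption.

```typescript
import { Octokit } from "@octokit/rest";

// Hypothetical sketch of the session-to-PR link; the session URL
// scheme is invented for illustration.
async function openLinkedPr(
  octokit: Octokit,
  owner: string,
  repo: string,
  session: { id: string; branch: string; title: string },
): Promise<string> {
  const { data } = await octokit.pulls.create({
    owner,
    repo,
    title: session.title,
    head: session.branch,
    base: "main",
    body: `Opened from Ace session: https://example.test/sessions/${session.id}`,
  });
  return data.html_url; // stored on the session for the reverse link
}
```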
Quality Over Velocity: The underlying philosophy is that faster implementation should enable higher-quality work through more rigorous thinking, not just more output. This represents a maturing perspective on LLMOps: the goal isn’t maximum code generation but using AI to free up human time for the irreducibly difficult work of deciding what to build and ensuring it’s done well.
While Ace addresses real problems, several considerations warrant balanced evaluation:
Unproven Claims: As a research prototype just entering technical preview, most benefits are theoretical rather than empirically demonstrated. The talk doesn’t provide metrics on actual improvements in alignment, reduction in wasted work, or quality outcomes. Claims about how much time agents save and how that translates to better decision-making remain to be validated.
Complexity and Adoption: Ace requires teams to adopt an entirely new development environment combining chat, terminals, code editing, and agent prompting. The learning curve and change management challenges could be significant. It’s unclear whether teams will embrace this consolidation or prefer specialized tools integrated through APIs.
Noise and Information Overload: As one commenter noted, managing the “flood of information and wall-o-text” from agents in a chat-like interface could become overwhelming, similar to Slack channels with excessive CI/CD notifications. The talk acknowledges agents are “shit at CSS” for design work, suggesting significant limitations remain in agent capabilities that may require frequent manual intervention.
Context Accessibility Limitations: While Ace makes conversation context available to agents, many critical alignment factors—business constraints, political dynamics, product vision—remain primarily in human heads. It’s unclear how effectively these can be surfaced and incorporated into agent prompting without adding process overhead, which the talk explicitly wants to avoid.
Scaling Questions: The demo shows small team collaboration, but it’s uncertain how the approach scales to larger organizations with many parallel work streams, complex approval processes, and diverse stakeholder groups. The microVM approach also raises questions about infrastructure costs and management at scale.
Mobile and Accessibility: While mobile access is mentioned as a future feature, the current implementation appears desktop-focused. True accessibility for all team members (especially non-technical stakeholders) remains to be proven.
This case study illuminates several important considerations for organizations deploying LLMs in production development workflows:
The shift from implementation bottlenecks to alignment bottlenecks is real and significant. Organizations need to think carefully about how team coordination mechanisms will adapt as AI agents accelerate individual coding. Simply deploying coding agents without collaborative infrastructure risks creating coordination chaos.
The PR as the sole review mechanism is increasingly inadequate for high-velocity agent-generated code. Teams need earlier checkpoints and continuous alignment mechanisms rather than relying on post-implementation review to catch problems.
Shared context and visibility are critical for effective multi-agent and human-agent collaboration. Isolated terminal instances where developers work with agents in private before sharing results create information silos that prevent effective team coordination.
The value of agents extends beyond code generation to context synthesis, summarization, and proactive awareness-building. Using LLMs to help teams navigate the increased complexity and velocity of agent-driven development may be as important as using them for implementation.
Infrastructure choices matter significantly for collaborative AI engineering. The microVM approach enables sharing and persistence patterns that wouldn’t be possible with purely local agent execution. Organizations should consider whether cloud-based, session-oriented architectures better support their collaborative needs than laptop-local agents.
Quality differentiation becomes more important as implementation becomes commoditized. The strategic question shifts from “can we build it fast enough” to “are we building the right things and doing them well.” LLMOps infrastructure should support this shift by enabling better planning and decision-making, not just faster coding.
Ace represents an important exploration in collaborative LLMOps, addressing real pain points that have emerged as coding agents have matured. The fundamental insight—that single-player agents create alignment problems in team environments—is sound and important. The architectural approach of combining multiplayer chat, cloud microVMs, shared agent access, and integrated development tools offers a coherent vision for how teams might work effectively with AI agents.
However, as an early research prototype, the actual impact on software quality, team alignment, and development effectiveness remains to be demonstrated through empirical evaluation in the technical preview phase. The success of this approach will depend on whether the benefits of integration and shared context outweigh the complexity of adopting a new unified environment, and whether the collaborative features genuinely improve alignment or simply add overhead. Organizations watching this space should pay attention to how the technical preview evolves and what data emerges about actual usage patterns and outcomes.