
AI-Powered Developer Productivity with Minions and Machine-to-Machine Payments

Stripe 2026

Stripe has deployed an internal AI agent system called "Minions" that autonomously handles software development tasks, landing approximately 1,300 pull requests per week with no human assistance beyond code review. Engineers can initiate development work from Slack by simply adding an emoji reaction, which provisions cloud-based development environments and uses AI agents built on the Goose harness to implement features, update documentation, and make code changes. The system leverages Stripe's existing developer productivity infrastructure including hosted development environments, comprehensive CI/CD pipelines, and internal tooling accessible through MCP servers. Additionally, Stripe is pioneering machine-to-machine payment capabilities that allow AI agents to act as economic actors, autonomously purchasing services from third-party APIs to complete tasks, demonstrated through an agent that planned a birthday party by paying for browser automation, venue search, and mail services.


Overview and Context

Stripe has implemented a comprehensive AI-driven developer productivity system that represents one of the most sophisticated examples of LLMs in production for software engineering. The system, internally branded as “Minions,” demonstrates how large-scale engineering organizations can leverage AI agents not just for code assistance but for fully autonomous software development workflows. Steve Khalifski, a software engineer at Stripe with over six years at the company, presents this case study showing both the internal developer tooling innovations and forward-looking machine-to-machine payment protocols.

The fundamental innovation is that Stripe has moved beyond AI-assisted coding to AI-autonomous coding at scale. The company reports landing approximately 1,300 pull requests per week that have no human assistance beyond code review. This represents a significant shift in how software development work flows through the organization, with AI agents taking on complete implementation tasks from initial prompt to submitted pull request.

Architecture and Infrastructure Foundation

The Minions system is built on top of Stripe’s existing developer productivity infrastructure, which predates the AI initiative by many years. This foundation proves critical to the success of the AI implementation. Stripe maintains hosted cloud-based development environments that any engineer or agent can provision on demand. These environments come fully configured with the entire Stripe codebase, all necessary services running, proper database configurations, git setup, and VS Code server access. This infrastructure investment, originally made for human developer productivity, becomes even more valuable in the AI era.

The cloud-based nature of these environments is essential for the multi-threaded agentic approach Stripe employs. Rather than being limited to what can run on a local laptop, engineers can spin up multiple isolated development environments simultaneously, each running its own AI agent working on different tasks in parallel. This architecture enables the scale of 1,300 autonomous PRs per week. An engineer can trigger multiple Minions from Slack with simple emoji reactions, and each will work in its own isolated environment without interference.
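The parallel, isolated-environment pattern described above can be sketched as follows. This is a minimal illustration, not Stripe's actual internal API: `provision_env` and `run_minion` are hypothetical stand-ins for environment provisioning and the agent harness.

```python
from concurrent.futures import ThreadPoolExecutor

def provision_env(task_id: str) -> dict:
    # Hypothetical stand-in for an internal dev-environment service:
    # each call returns an isolated cloud environment with the repo,
    # running services, and a fresh branch for this task.
    return {"env_id": f"devbox-{task_id}", "branch": f"minion/{task_id}"}

def run_minion(task: dict) -> dict:
    env = provision_env(task["id"])
    # A real harness (e.g. a Goose process) would run inside env here,
    # iterating until it opens a pull request.
    return {"task": task["id"], "env": env["env_id"], "status": "pr_opened"}

tasks = [{"id": "t1", "prompt": "update docs"},
         {"id": "t2", "prompt": "add retry logic"},
         {"id": "t3", "prompt": "fix flaky test"}]

# Each Minion runs in its own isolated environment, in parallel,
# so tasks never interfere with one another.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_minion, tasks))
```

Because each task gets its own environment, scaling to many concurrent Minions is a matter of provisioning more cloud instances rather than contending for a single laptop.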

The development environments are production-grade, not simplified sandbox versions. They include full CI/CD integration, comprehensive test suites, synthetic end-to-end testing capabilities, and blue-green deployment infrastructure for safe rollouts and rollbacks. This robust testing and deployment infrastructure provides the confidence needed to trust AI-generated code changes at scale.

Agent Harness and Tooling

Stripe’s Minions are built using Goose as the base agent harness. Goose is an open-source agentic framework developed by Block that Stripe has forked and customized for their specific needs. The choice to use and extend an open-source framework rather than building entirely from scratch or using commercial solutions reflects Stripe’s approach of leveraging community tools while customizing for their specific developer environment requirements.

The agents have access to an extensive suite of internal tooling through MCP (Model Context Protocol) servers. These tools include code search capabilities for navigating the large Stripe codebase, access to internal documentation, integration with Stripe’s ticket systems, connection to test data, and interaction with CI/CD pipelines. The system prompt for agents is notably simple, essentially just “implement this task completely, no mistakes,” with the complexity residing in the tool availability and the harness orchestration rather than in elaborate prompting strategies.

Stripe has invested significantly in creating high-quality MCP servers for most internal tools that could benefit agents. This allows agents to interact programmatically with the same resources human developers use, from searching documentation to running tests to querying internal APIs. The quality and comprehensiveness of these tools directly impacts agent success rates in one-shot task completion.

Workflow and User Experience

The typical Minion workflow begins in Slack, where Stripe's entire company already collaborates. An engineer encounters a task, whether in a Google doc during feature planning, in a Jira ticket, or in a Slack conversation, and can initiate development work by simply adding an emoji reaction. This low-friction activation is key to the system's adoption. The prompt itself can be informal and conversational, not requiring a carefully engineered prompt structure.

Once triggered, the system provisions a cloud development environment, which takes a few seconds. The environment boots up with all necessary configuration, checks out the relevant repository with a new branch created specifically for this task, and launches the Goose agent harness with the user’s prompt. The agent then begins its work, searching through code, reading documentation, making file modifications, running tests, and iterating until it believes the task is complete.
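The trigger flow above can be sketched as an event handler: a reaction event arrives, the message text becomes the prompt, and an environment plus agent run are kicked off. The event shape, reaction name, and function behavior here are assumptions for illustration, not Stripe's actual implementation.

```python
# Hedged sketch of the Slack-reaction trigger; all names are hypothetical.
def handle_reaction(event: dict) -> dict:
    if event.get("reaction") != "minion":
        return {"status": "ignored"}
    # The Slack message text becomes the agent's prompt as-is,
    # no special prompt engineering required.
    prompt = event["message_text"]
    # A provisioning service would return an isolated environment with
    # the repo checked out on a fresh branch for this task.
    env = {"env_id": "devbox-42", "branch": "minion/slack-42"}
    # ...launch the agent harness with `prompt` inside `env`...
    return {"status": "started", "env": env["env_id"], "prompt": prompt}

result = handle_reaction({
    "reaction": "minion",
    "message_text": "Add pagination to the invoices list",
})
```

Any reaction other than the trigger emoji is ignored, which keeps the activation mechanism unambiguous while still living entirely inside Slack.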

The entire process is observable. Engineers can click a “follow along” link to watch the agent work in real-time, seeing which tools it invokes, what files it modifies, how it handles test failures, and how it tracks its own progress through internal to-do lists. This transparency builds trust and allows engineers to understand agent reasoning when reviewing the final output.

When the agent completes its work, it commits the changes and creates a pull request, which enters the normal human code review process. The engineer receives a notification in Slack with a link to the PR. Critically, the agent doesn’t merge code autonomously; human review remains a required gate before code reaches production.

Code Review at Scale

The volume of 1,300 AI-generated PRs per week raises an important question about code review capacity. Stripe addresses this through multiple strategies. First, as engineers spend less time writing code themselves, they have more capacity for reviewing code. Second, and more importantly, the comprehensive CI/CD infrastructure provides automated confidence in code quality. Extensive test coverage, synthetic end-to-end tests, and integration tests catch issues that might otherwise require deep manual inspection.

The philosophy is that whether code is written by a human or an AI agent, the same quality gates apply. Strong CI provides confidence that code works correctly regardless of authorship. The test suite becomes even more critical in an AI-generated code environment. Additionally, Stripe’s blue-green deployment capabilities mean that even if an issue makes it through review, it can be quickly rolled back without customer impact.

There’s an acknowledgment that as coding becomes less of a bottleneck, attention and effort will shift to other areas. If coding effort approaches zero, code review could become the new bottleneck, or perhaps gathering sufficient high-quality ideas for what to build, or ensuring proper distribution and rollout of changes. The system doesn’t eliminate bottlenecks but shifts them around in the product development lifecycle.

Developer Experience Philosophy

A key insight shared in the case study is the virtuous cycle between developer experience and agent experience. Investments made for human developer productivity directly benefit AI agents, and vice versa. Good documentation written for engineers is equally valuable for agents trying to understand how to implement a feature. Well-structured internal APIs that are easy for humans to use are also easy for agents to call through MCP servers. Clear coding patterns and conventions that help human onboarding also help agent success rates.

This creates a compelling business case for continued investment in developer experience tooling. Organizations can justify DX investments not just for human productivity but also for enabling AI agent capabilities. Conversely, infrastructure built to support agents such as comprehensive MCP servers and improved observability into automated workflows also benefits human developers.

Stripe’s long-standing developer productivity team, which existed well before AI agents became feasible, provided the foundation that made Minions possible. The team maintains all tooling from git workflows to text editor configurations to development environments to CI/CD pipelines. This centralized focus on developer experience enabled rapid AI agent adoption because the infrastructure was already in place.

Adoption Beyond Engineering

While Minions originated in the engineering organization, the Slack-based interface enables adoption by non-engineers. Product managers and designers who can articulate what they want in plain text can trigger Minions to create prototypes, update documentation, or make simple changes without needing to open a text editor or understand the technical implementation details.

The low activation energy is crucial here. Even if a non-engineer theoretically had access to development tools and documentation, the intimidation factor and learning curve would prevent usage. But writing a plain text description of a desired outcome in Slack is something anyone can do. The system handles all the technical complexity of translating that description into working code.

This democratization of development capability allows good ideas to surface and be prototyped regardless of where in the organization they originate. It reduces the friction between idea and implementation that often exists in large organizations, where ideas from non-technical teams might get deprioritized or lost in translation when handed off to engineering.

Machine-to-Machine Payment Protocol

Beyond internal developer productivity, Stripe is exploring a paradigm shift in how AI agents interact with commercial services through their Machine Payment Protocol, co-designed with Tempo Labs. The core insight is that agents should be able to act as economic actors, spending money to purchase services needed to complete their tasks.

The demonstration involves an agent planning a birthday party using Claude in a local development environment. The prompt provides a person's website and asks the agent to research their interests, find an appropriate venue, send physical invitations, and offset the carbon cost of the token usage. The agent proceeds to autonomously purchase access to several third-party services: Browserbase for a web scraping session to research the person's website, Parallel AI for venue search capabilities, Postal Form to send physical mail, and Stripe Climate for carbon offset credits.

The key innovation is that these are real micro-transactions happening programmatically without human intervention. The agent doesn’t require pre-existing accounts with these services, doesn’t need stored credentials, and doesn’t involve a human logging in and entering payment information. Instead, it uses the Machine Payment Protocol to make ephemeral, purpose-specific purchases paying only for exactly what it needs for this specific task.
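The ephemeral, purpose-scoped purchase pattern can be sketched as follows. This is an illustration of the idea only: the token-minting and charging functions here are assumptions, not the actual Machine Payment Protocol specification.

```python
import uuid

def mint_payment_token(max_amount_usd: float, purpose: str) -> dict:
    # A wallet service would authorize a single-use token capped at
    # max_amount_usd and tied to a declared purpose; no stored account
    # or pre-existing credentials are involved.
    return {"token": uuid.uuid4().hex, "cap": max_amount_usd, "purpose": purpose}

def call_paid_service(service: str, token: dict, cost_usd: float) -> dict:
    # The service verifies the token, charges it once, and returns a
    # line item for the agent's receipt.
    if cost_usd > token["cap"]:
        raise ValueError("charge exceeds token cap")
    return {"service": service, "charged": cost_usd}

# The agent mints a narrowly scoped token and pays only for this one use.
tok = mint_payment_token(0.50, "web research session")
receipt_line = call_paid_service("browser-session", tok, 0.12)
```

The cap on the token is the key safety property: even a misbehaving service or agent cannot spend more than the amount authorized for that single purpose.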

Each service integration represents a real API call and real payment, with costs measured in fractions of cents to a few dollars. The agent generates a receipt showing each service used, the cost, and the purpose. In this example, the entire birthday party planning task cost approximately $5.47, including the LLM token costs alongside the third-party service costs.
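A unified receipt of the kind described can be sketched as a simple aggregation over line items. The line items and prices below are purely illustrative and do not reflect the actual breakdown of the demo's $5.47 total.

```python
# Illustrative receipt aggregation; all amounts are made up.
line_items = [
    ("LLM tokens",       1.50),
    ("browser session",  0.10),
    ("venue search API", 0.40),
    ("postal mail",      2.00),
]

total = round(sum(cost for _, cost in line_items), 2)

for name, cost in line_items:
    print(f"{name:<18} ${cost:.2f}")
print(f"{'total':<18} ${total:.2f}")
```

Folding token costs into the same receipt as third-party charges is what makes the full economic cost of the task visible in one place.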

Economics of AI Agent Operations

The case study makes explicit something often left implicit: AI agent operations have real economic costs, even when agents aren’t making external purchases. Every prompt consumes tokens, which have monetary costs borne either directly or through subscription plans with usage limits. By surfacing these costs alongside the costs of external services, Stripe highlights the economic reality of agent operations.

There’s a natural convergence between token costs and service costs. Whether an agent is spending hundreds of thousands of tokens to generate text or spending dollars to call a third-party API, both represent economic resources consumed to achieve the desired outcome. Making these costs visible in a unified agent receipt provides clarity about the true cost of agent-driven workflows.

This economic transparency becomes important as agents become more capable and handle more complex tasks. Understanding whether a task cost $1 or $100 in combined tokens and services helps inform decisions about when to use agentic approaches versus traditional development. It also enables organizations to set budgets and guard rails for agent operations.

The ephemeral, pay-per-use model for third-party services could enable a new generation of businesses that primarily serve AI agents rather than humans. Instead of needing to build landing pages, dashboards, admin panels, and subscription management systems, a business could focus on providing a single high-quality API with micro-transaction support and market primarily to agents.

Technical Implementation Details and Observations

The implementation involves several technical considerations. The agent harness runs locally or in a cloud environment and has access to code generation capabilities for tasks like creating PDFs or writing Playwright scripts for browser automation. Some tasks are handled by the LLM itself using its native capabilities, while others are delegated to specialized third-party services based on which approach is more appropriate.

The system maintains a balance between what the agent can do with its own tools versus what requires external services. For example, generating a PDF invitation uses local code generation, while actually sending that PDF through postal mail requires paying Postal Form. The agent makes these decisions based on tool availability and task requirements.

For code implementation tasks, agents search through codebases, reference existing implementations, read documentation, and generate code following established patterns. When tests fail, agents iterate on their implementations, reading error messages and adjusting code. This iterative debugging process happens autonomously, with the agent troubleshooting and fixing issues without human intervention.
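The autonomous test-and-fix loop can be sketched as a bounded iteration. Both `run_tests` and `propose_fix` are stand-ins: a real agent would execute the actual test suite and feed failure output back to the model.

```python
# Minimal sketch of the iterate-until-green loop; stand-in functions
# simulate a fix landing on the second attempt.
def run_tests(code: str) -> list:
    return [] if "retry" in code else ["test_retry failed: NameError"]

def propose_fix(code: str, failures: list) -> str:
    # A real agent would read the failure messages and adjust the code;
    # here we just append the missing piece.
    return code + " + retry helper"

code = "initial implementation"
for attempt in range(5):  # bounded iteration, never an infinite loop
    failures = run_tests(code)
    if not failures:
        break
    code = propose_fix(code, failures)
```

Bounding the loop matters in practice: an agent that cannot converge should stop and surface its failures rather than burn tokens indefinitely.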

Prompt engineering strategies that prove effective include asking agents to explain or justify their reasoning when they struggle, providing directional breadcrumbs by starting the work and having the agent examine the diff to understand the intended direction, and saving prompts or skills that worked well for reuse in similar future situations. The approach resembles teaching: gently guiding agents toward correct solutions rather than issuing prescriptive instructions.

Integration with Existing Development Practices

Minions integrate into existing development workflows rather than replacing them. Code review remains a human responsibility, with reviewers examining AI-generated code the same way they would human-written code. The same CI/CD pipelines run, the same test suites execute, and the same deployment processes apply. The difference is in the authoring phase, not the validation and deployment phases.

Engineers can still manually edit code generated by Minions. The cloud development environments support direct SSH access, VS Code web access, and local IDE connections. An engineer might let a Minion get 80% of the way on a complex task, then jump in to handle the nuanced final 20% that requires deeper domain knowledge or judgment calls.

The system also supports hybrid workflows where multiple approaches work together. An engineer might use Claude or Cursor for some development work, trigger a Minion for parallel tasks, and manually code other components, with all work eventually converging in the same codebase through the standard git workflow.

Organizational and Cultural Implications

The case study reveals several organizational insights. First, having a dedicated developer productivity team that treats internal engineers as customers is essential for building the infrastructure that enables successful AI agent deployment. This team existed long before AI agents and provided the foundation.

Second, activation energy and friction matter enormously for adoption. The emoji-in-Slack interaction model succeeds because it requires minimal effort and no context switching. More complex activation mechanisms would see less usage regardless of agent capabilities.

Third, transparency and observability build trust. Being able to watch agents work, see their reasoning, and understand their tool usage helps engineers trust the output and feel comfortable handing off tasks. Black-box agent systems would face more resistance.

Fourth, the democratization of technical capability through natural language interfaces expands who can contribute to product development. Product managers, designers, and other roles can now initiate technical work directly rather than always routing through engineering queues.

Production Readiness Considerations

Several factors contribute to Stripe’s success in running AI agents at production scale. Comprehensive test coverage catches issues that might slip through in generated code. Blue-green deployments and rollback capabilities provide safety nets. Isolated cloud environments prevent agents from interfering with each other or with human development work.

The quality of internal documentation and tooling directly impacts agent success rates. Well-documented APIs, clear coding patterns, and comprehensive internal knowledge bases all improve the agent’s ability to complete tasks correctly on the first attempt. This creates a feedback loop where improving documentation for agents also improves it for humans.

The system doesn’t attempt to be fully autonomous. Human review remains a required gate, acknowledging that AI agents, like human developers, can make mistakes. The goal is to automate the routine implementation work while keeping human judgment in the loop for validation and decision-making.

Future Directions and Implications

The case study points toward several future directions. As more businesses adopt machine-to-machine payment protocols, agents will be able to compose complex workflows using multiple specialized services, paying micro-amounts for each capability rather than requiring pre-integrated monolithic platforms.

The shift from subscription-based to usage-based pricing models for AI services aligns with this agentic future. Rather than paying monthly fees for tools that agents might use occasionally, organizations pay only for what gets consumed. Stripe’s work on token-level billing for LLM usage exemplifies this trend.

The convergence of agent-as-developer and agent-as-economic-actor creates new possibilities. An agent could not only write code but also purchase the cloud resources to run it, subscribe to the APIs it needs, and pay for monitoring and observability services, all autonomously based on requirements inferred from the initial prompt.

As coding becomes less of a bottleneck through AI automation, organizational focus shifts to other areas such as generating high-quality ideas, making sound prioritization decisions, ensuring proper product validation, and maintaining effective communication. The bottleneck moves but doesn’t disappear.
