Paperclip is an open-source agent orchestration platform designed to manage AI agents in production environments for business automation. The platform addresses the challenge of coordinating multiple AI agents across different organizational functions by providing a centralized control plane with organizational hierarchies, task management, quality assurance workflows, and vendor-neutral agent integration. The creator demonstrates using Paperclip to manage its own development, including creating marketing videos through agent collaboration, managing code reviews, and coordinating work across engineering and marketing teams. The platform achieved rapid adoption, reaching 50,000 GitHub stars within roughly five weeks of release, though it remains at an early stage, with multi-user support, cloud deployment, and improved organizational learning still on the roadmap.
Paperclip represents an open-source approach to production-scale multi-agent orchestration, positioning itself as a “human control plane for AI labor.” The platform was released on March 4th, 2026, and within approximately 34 days had gained significant traction in the open-source community. The creator uses Paperclip to manage Paperclip’s own development, providing a self-referential case study of AI agents building and maintaining the very platform that orchestrates them. This meta-level application offers insights into both the capabilities and limitations of current agent orchestration approaches.
The fundamental problem Paperclip addresses is the chaos that emerges when managing multiple AI agents across different tasks and organizational functions. Users often find themselves with numerous agent sessions open simultaneously, losing track of context, unable to enforce quality standards consistently, and struggling to maintain coherent workflows when agents have different personalities and capabilities across different model providers. The platform’s tagline about “zero-human companies” should be viewed critically as aspirational marketing rather than current reality, as the demonstration clearly shows significant human involvement in task definition, quality control, and organizational design.
Paperclip’s architecture centers on a vendor-neutral approach to agent integration, which is one of its most significant LLMOps contributions. The platform supports multiple model providers including Claude, GPT models through Codex, Gemini, Pi, Hermes, and others, with integration through OpenRouter for accessing additional models at various price points. This “bring your own agent” philosophy acknowledges a critical reality in production LLM systems: no single model excels at all tasks, and model availability, pricing, and capabilities constantly evolve.
The platform implements agent integration through a standardized interface that allows different model types to communicate within a shared organizational structure. Each agent maintains its own memory and context within the Paperclip system, enabling cross-agent collaboration without requiring manual context transfer between different model providers. This addresses a real pain point in production scenarios where teams might prefer Claude for certain reasoning tasks, GPT models for others, and open-source alternatives for cost-sensitive operations.
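Paperclip's actual interface is not documented in the demonstration, but the behavior described implies something like a provider-agnostic adapter layer. The following Python sketch is purely illustrative: the class names, methods, and the stubbed OpenRouter call are all invented for this example, not Paperclip's real API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class AgentContext:
    """Per-agent memory and role carried across tasks (hypothetical)."""
    role: str
    instructions: str
    memory: list[str] = field(default_factory=list)


class AgentAdapter(ABC):
    """Provider-agnostic interface so Claude, GPT, Gemini, etc. are interchangeable."""

    @abstractmethod
    def run(self, task: str, context: AgentContext) -> str:
        """Execute a task and return the agent's output."""


class OpenRouterAdapter(AgentAdapter):
    """Example adapter routing any OpenRouter-hosted model through one interface."""

    def __init__(self, model: str, api_key: str):
        self.model = model
        self.api_key = api_key

    def run(self, task: str, context: AgentContext) -> str:
        # OpenRouter exposes an OpenAI-compatible chat completions endpoint;
        # a real implementation would POST there. Stubbed for illustration.
        prompt = f"{context.instructions}\n\nTask: {task}"
        return f"[{self.model}] response to: {prompt[:60]}..."
```

The key design point is that the orchestration layer owns memory and instructions, so swapping the model behind an agent does not require re-transferring context by hand.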
The use of OpenRouter integration demonstrates practical cost management in production LLM deployments. The example of Qwen 3.6 Plus being available at no cost for certain usage levels illustrates how a vendor-neutral orchestration layer enables dynamic model selection based on task requirements and budget constraints. However, the creator acknowledges that cheaper models often lack the intelligence required for complex tasks, suggesting a tiered approach where frontier models handle critical reasoning while less expensive models handle routine operations.
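The demonstration does not show how this routing works internally. A minimal sketch of the tiered-routing pattern it describes might look like the following, with the tier thresholds, prices, and model names chosen arbitrarily for illustration:

```python
from dataclasses import dataclass


@dataclass
class Task:
    description: str
    complexity: int  # 1 (routine) .. 5 (critical reasoning), assigned upstream


# Illustrative tier table; real model IDs and prices would come from OpenRouter.
# Each row: (minimum complexity this tier handles, model name, $ per 1M tokens).
MODEL_TIERS = [
    (4, "frontier-model", 15.00),   # critical reasoning only
    (2, "mid-tier-model", 1.00),    # standard tasks
    (0, "free-tier-model", 0.00),   # routine operations
]


def select_model(task: Task) -> str:
    """Pick the first (most capable) tier whose floor the task's complexity meets,
    so cheap models absorb routine work and frontier models are reserved."""
    for min_complexity, model, _price in MODEL_TIERS:
        if task.complexity >= min_complexity:
            return model
    return MODEL_TIERS[-1][1]


print(select_model(Task("write release notes", complexity=1)))   # free-tier-model
print(select_model(Task("design org restructuring", complexity=5)))  # frontier-model
```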
Paperclip implements organizational hierarchies as a core abstraction for agent coordination. The system models traditional business structures with CEOs, CTOs, CMOs, and individual contributors, each represented by AI agents with specific roles and capabilities. This organizational metaphor serves multiple purposes: it provides familiar mental models for task delegation, creates clear chains of responsibility, and enables hierarchical task decomposition.
The demonstrated workflow shows the CEO agent receiving high-level instructions, breaking them down into subtasks, hiring specialized agents as needed, and delegating work through the organizational hierarchy. For example, when asked to create a celebratory video for reaching 40,000 GitHub stars, the CEO agent autonomously hired a video writer agent, installed the necessary Remotion skills, and coordinated the work through the marketing organization. This represents genuine autonomous task decomposition, though the degree of human oversight required suggests this is augmentation rather than full automation.
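Paperclip's internal representation of this hierarchy is not shown, but the demonstrated hire-and-delegate flow suggests a simple tree of agents. This sketch is hypothetical; the `Agent`, `hire`, and `delegate` names are invented for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    role: str
    reports: list["Agent"] = field(default_factory=list)

    def hire(self, name: str, role: str) -> "Agent":
        """Add a direct report (e.g. the CEO agent hiring a video writer)."""
        new_hire = Agent(name, role)
        self.reports.append(new_hire)
        return new_hire

    def delegate(self, task: str) -> None:
        """Walk the hierarchy: managers decompose, leaf agents execute."""
        if not self.reports:
            print(f"{self.role} {self.name} executes: {task}")
            return
        for report in self.reports:
            report.delegate(f"{task} (sub-scoped for {report.role})")


# Mirrors the demonstrated workflow: CEO -> marketing org -> video writer.
ceo = Agent("ceo-agent", "CEO")
cmo = ceo.hire("cmo-agent", "CMO")
cmo.hire("video-writer", "Video Writer")
ceo.delegate("Create a celebratory video for 40,000 GitHub stars")
```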
The platform includes built-in support for organizational learning through agent configuration and skill management. Each agent can be customized with specific instructions that encode organizational preferences and best practices. The creator demonstrates extensive customization of their Codex coder agent, with instructions such as never raising a blocking issue without proposing a solution and limiting how many tests the agent generates. This instruction layer represents a form of prompt engineering at the organizational level, where accumulated knowledge about agent behavior gets encoded into persistent configuration.
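A hypothetical configuration for such a coder agent, showing how these behavioral preferences might be persisted, could look like the following; the field names and format are invented for this example, not Paperclip's actual schema:

```python
# Hypothetical persistent agent configuration encoding the preferences
# described above; field names are illustrative, not Paperclip's schema.
coder_agent_config = {
    "name": "codex-coder",
    "model": "openai/gpt-codex",  # placeholder model identifier
    "instructions": [
        "Never raise a blocking issue without proposing a concrete solution.",
        "Limit generated test suites to the cases the change actually affects.",
    ],
    "reviewers": ["qa-agent"],
    "max_concurrent_tasks": 1,    # matches the platform's conservative default
}
```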
A significant LLMOps contribution of Paperclip is its structured approach to quality assurance in agent workflows. The platform implements reviewer and approver roles that can be assigned to tasks, creating multi-stage validation processes. This addresses a common failure mode in agent-based systems where agents claim to complete tasks without actually verifying their work, particularly in testing scenarios.
The QA agent with browser skill capabilities demonstrates one approach to automated verification, where specialized agents test work products before they’re marked complete. The distinction between reviewers who iterate with the primary agent and approvers who make final acceptance decisions creates a workflow that mirrors human organizational structures while attempting to maintain quality standards without constant human intervention.
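The reviewer/approver distinction described here maps naturally onto a small state machine: reviewers iterate with the worker until satisfied, then an approver makes a binary acceptance decision. This is a sketch of the pattern, not Paperclip's implementation; the callables are placeholders:

```python
from enum import Enum, auto


class TaskState(Enum):
    IN_PROGRESS = auto()       # worker is producing or revising output
    IN_REVIEW = auto()         # reviewer is iterating with the worker
    AWAITING_APPROVAL = auto() # final acceptance decision pending
    DONE = auto()


def review_cycle(output: str, reviewer, worker, approver, max_rounds: int = 3) -> TaskState:
    """Reviewers iterate with the worker; the approver makes the final call.

    reviewer(output) returns feedback text, or None when satisfied;
    worker(feedback) returns a revised output; approver(output) returns bool.
    """
    for _ in range(max_rounds):
        feedback = reviewer(output)
        if feedback is None:          # reviewer satisfied, escalate to approver
            break
        output = worker(feedback)     # worker revises and resubmits
    return TaskState.DONE if approver(output) else TaskState.IN_PROGRESS
```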
However, the demonstration reveals that these quality mechanisms still require significant human oversight. The creator repeatedly reviews plans, provides feedback on agent outputs, and makes approval decisions. This suggests that while Paperclip provides structure for quality assurance, achieving reliable autonomous quality control remains an unsolved challenge. The platform’s value appears to lie more in making quality control processes explicit and trackable rather than eliminating the need for human judgment.
Paperclip implements a skills system that represents a practical approach to capability management in multi-agent systems. Skills are installable packages that provide agents with specific capabilities, similar to plugins or tools in other agent frameworks. The demonstration shows skills for Remotion video creation, Greptile code review integration, and agent browser automation. There’s also reference to skills.sh as a community resource for discovering and sharing skills.
The skills manager built into Paperclip allows agents to autonomously install skills they need for their tasks. In the video creation example, the CEO agent knew to install the Remotion best practices skill for the newly hired video writer. This represents a form of meta-learning where agents understand their own capability gaps and can acquire the tools needed to address them.
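A minimal sketch of what such autonomous skill acquisition might look like follows; the `SkillsManager` API and the registry behavior are assumptions for illustration, not Paperclip's actual interface:

```python
# Hypothetical skills manager: an agent detects a capability gap and installs
# a skill package before starting work. All names here are invented.
class SkillsManager:
    def __init__(self):
        self.installed: dict[str, dict] = {}

    def install(self, skill_name: str) -> None:
        """Fetch a skill package, e.g. from a community registry like skills.sh."""
        # A real implementation would download a manifest plus prompt/tool files.
        self.installed[skill_name] = {"name": skill_name}
        print(f"installed skill: {skill_name}")

    def has(self, skill_name: str) -> bool:
        return skill_name in self.installed


def prepare_agent_for_task(skills: SkillsManager, required: list[str]) -> None:
    """Close capability gaps before delegating, as the CEO agent did for Remotion."""
    for skill in required:
        if not skills.has(skill):
            skills.install(skill)


prepare_agent_for_task(SkillsManager(), ["remotion-best-practices"])
```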
However, the platform also reveals tension between skills and routines, with some overlap in functionality. Routines provide templated workflows that can run on schedules or manually with template variables, while skills provide reusable capabilities. The creator acknowledges this overlap, suggesting that the system’s abstractions are still evolving. In production scenarios, this kind of conceptual overlap can create confusion about where to encode different types of organizational knowledge.
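To make the distinction concrete: a routine is closer to a parameterized workflow than a capability. The sketch below illustrates one plausible shape for the Discord-changelog routine mentioned later in this piece, with the structure and field names invented for this example:

```python
from string import Template

# Hypothetical routine: a templated workflow runnable on a schedule or on
# demand with template variables. Structure is illustrative only.
changelog_routine = {
    "name": "release-changelog",
    "schedule": "weekly",  # or None for manual runs
    "template": Template(
        "Summarize the changes merged into $repo since $since_tag and "
        "post a release changelog to the $channel Discord channel."
    ),
}


def run_routine(routine: dict, **variables: str) -> str:
    """Expand template variables into a concrete task handed to an agent."""
    return routine["template"].substitute(**variables)


task = run_routine(changelog_routine, repo="paperclip",
                   since_tag="v0.4.0", channel="#releases")
print(task)
```

A skill, by contrast, would be the reusable capability (e.g. the ability to post to Discord at all), which explains why the two abstractions overlap in practice.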
A core claim of Paperclip is that it solves context management problems that plague standalone agent interactions. In the video creation example, the agent had access to brand guidelines, statistics dashboards, and project-specific information without explicit prompting. This context came from the Paperclip system’s shared knowledge base and organizational memory.
The platform maintains memory across agent interactions, allowing agents to reference previous work, access organizational resources, and maintain consistency across tasks. This addresses the common problem of having to repeatedly provide the same context to different agent sessions. For example, the video writer agent could access the existing dashboard for statistics and the branding guide for visual consistency without these being explicitly attached to the task description.
However, the demonstration doesn’t deeply explore how this memory system works technically. There’s no discussion of vector databases, retrieval mechanisms, or how the system determines what context to provide to which agents. The knowledge base feature is mentioned as something being actively developed, suggesting that context management remains an area of active work. In production LLM systems, context management is often the difference between successful and failed agent interactions, so the lack of detail here represents a gap in understanding the platform’s actual capabilities.
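Since the source gives no technical detail, any concrete picture is speculative. For readers unfamiliar with the design space, the dominant pattern in comparable systems is embedding-based retrieval over an organizational knowledge base; the following is a generic sketch of that pattern, explicitly not Paperclip's design:

```python
import math

# Generic retrieval sketch: rank knowledge-base entries by cosine similarity
# to the task embedding and inject the top hits into the agent's context.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def retrieve_context(query_vec: list[float], store: list[dict], top_k: int = 2) -> list[str]:
    """Return the top-k documents most similar to the task embedding."""
    ranked = sorted(store, key=lambda doc: cosine(query_vec, doc["vec"]), reverse=True)
    return [doc["text"] for doc in ranked[:top_k]]


# Toy store: brand guidelines and the stats dashboard would surface for a
# video task without being explicitly attached to the task description.
store = [
    {"text": "Brand guideline: logo, palette, typography", "vec": [0.9, 0.1]},
    {"text": "Stars dashboard: live GitHub metrics",        "vec": [0.8, 0.3]},
    {"text": "Unrelated HR onboarding policy",              "vec": [0.0, 1.0]},
]
print(retrieve_context([0.85, 0.2], store))
```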
Paperclip provides first-class support for plans, which represent an important pattern in production agent systems. Rather than immediately executing on instructions, agents can first create plans that humans review and refine. In the video creation workflow, the video writer agent produced a detailed plan that the creator critiqued, specifically requesting 2-second cuts instead of 6-second cuts and adjustments to the animation approach.
This plan-review-iterate cycle represents a more reliable approach to complex tasks than single-shot agent execution. It provides checkpoints where humans can inject preferences and course-correct before significant resources are spent on the wrong approach. The demonstration shows multiple rounds of iteration on the video, with feedback being incorporated into subsequent attempts.
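The checkpoint structure of that cycle is simple to state precisely. In this sketch, `plan_fn`, `human_review`, and `execute_fn` are placeholder callables standing in for whatever Paperclip actually invokes:

```python
# Sketch of the plan-review-iterate checkpoint pattern described above.
# plan_fn(task) drafts a plan; human_review(plan) returns feedback text or
# None to approve; execute_fn(plan) runs the approved plan.
def plan_review_execute(plan_fn, human_review, execute_fn, task: str, max_rounds: int = 5):
    plan = plan_fn(task)
    for _ in range(max_rounds):
        feedback = human_review(plan)   # e.g. "use 2-second cuts, not 6-second cuts"
        if feedback is None:            # plan approved at the checkpoint
            return execute_fn(plan)
        plan = plan_fn(f"{task}\nRevise the plan per feedback: {feedback}")
    raise RuntimeError("plan not approved within iteration budget")
```

The value of the checkpoint is economic as much as procedural: feedback is injected before expensive rendering or execution, not after.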
The creator notes that currently this organizational learning happens manually through agent instruction updates, but future versions aim to automate this learning process. The idea is that when feedback is consistently provided about certain agent behaviors, the system should automatically update agent configurations or skill definitions to incorporate these preferences. This represents an interesting vision for self-improving agent systems, though it raises questions about how to prevent degradation or drift in agent behavior over time.
While the creator emphasizes that Paperclip is not specifically a coding tool, much of the demonstration focuses on software development workflows. The platform integrates with development infrastructure including GitHub for pull requests, branch management, and code reviews. The experimental workspace support allows isolated development environments for managing different features or work streams.
The Greptile integration demonstrates how third-party tools can be incorporated into Paperclip workflows through the skills system. Community contributions receive automated first-pass code reviews through this integration, representing a practical application of agent-based quality control in open-source development. However, the creator’s emphasis on using other tools like Cursor or GitHub Copilot for actual coding suggests that Paperclip’s value is more in coordination than in the coding itself.
The routine for creating Discord messages about merged changes and writing release changelogs represents the kind of documentation and communication automation that provides clear value in production software development. These are tasks that require some intelligence and context but don’t require human creativity or judgment, making them good candidates for agent automation.
Paperclip includes built-in cost tracking and budgeting features that reflect the practical realities of running multiple agents in production. The platform tracks monthly spending across agents and projects, allowing organizations to monitor and control inference costs. Individual agents and projects can have budget limits set, providing guardrails against runaway costs from autonomous agent activity.
The demonstration shows zero monthly spending due to using API subscriptions rather than pay-per-token pricing, but the creator notes that at scale this approach may not be sustainable. This highlights an important consideration for production agent systems: subscription pricing designed for individual human users doesn’t necessarily translate to agent usage patterns that might consume significantly more tokens.
The ability to assign different model tiers to different agents based on task complexity represents a practical cost optimization strategy. Using expensive frontier models only for tasks that truly require their capabilities while routing routine work to cheaper alternatives is a key pattern in production LLM deployments. However, actually implementing this effectively requires understanding which tasks need which capabilities, which itself requires experimentation and organizational learning.
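A budget guardrail of the kind described can be sketched in a few lines; the limits, prices, and class names below are illustrative assumptions, not Paperclip's actual accounting model:

```python
# Hypothetical budget guardrail: pause an agent's model calls once its
# monthly spend limit is reached, rather than letting costs run away.
class BudgetTracker:
    def __init__(self, monthly_limit_usd: float):
        self.limit = monthly_limit_usd
        self.spent = 0.0

    def charge(self, tokens: int, price_per_million: float) -> None:
        cost = tokens / 1_000_000 * price_per_million
        if self.spent + cost > self.limit:
            raise RuntimeError("monthly budget exceeded; task paused for human review")
        self.spent += cost


video_writer_budget = BudgetTracker(monthly_limit_usd=50.0)
video_writer_budget.charge(tokens=120_000, price_per_million=15.0)  # ok: $1.80
```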
The creator is refreshingly transparent about Paperclip’s current limitations. The platform launched only about five weeks before this demonstration and is described as “very, very early” with “missing pieces.” Critical features like multi-user support are acknowledged as “huge gaps” actively being worked on. The inability to have multiple human users working in the same Paperclip instance is a significant limitation for actual business deployment.
Sandboxing and cloud deployment capabilities are also in development, which are essential for production security and scalability. Running agents locally without proper isolation creates security and reliability risks, particularly when agents have access to organizational resources and can execute code or API calls. The planned integration with E2B, dev.exe, and cloud agent deployments suggests awareness of these requirements, but their absence in the current version limits production readiness.
The “maximizer mode” concept mentioned in the roadmap represents an interesting direction but also highlights a current limitation: agents apparently need to frequently stop and seek approval rather than working continuously toward goals. While the stop-and-check approach is probably more reliable given current agent capabilities, it contradicts the “zero-human company” marketing and represents a significant operational overhead.
The manual nature of organizational learning, where humans must update agent instructions based on observed behavior patterns, represents another gap between current capabilities and the vision. The creator mentions a “skill consultant” meta-agent that helps other agents use their skills better, which is an interesting approach to organizational learning, but it’s unclear how effective this is compared to simply having humans refine the prompts.
From an LLMOps perspective, Paperclip represents an interesting approach to the orchestration layer problem but with significant caveats about production readiness. The platform provides valuable abstractions for organizing agent work, tracking costs, and implementing quality controls. The vendor-neutral approach and organizational hierarchy metaphor offer genuine value for teams trying to coordinate multiple agents.
However, the demonstration reveals that successful use requires significant upfront investment in agent configuration, continuous refinement of instructions, and ongoing human oversight. The creator emphasizes the importance of carefully crafting agent behaviors and building organizations agent-by-agent rather than using templates, which suggests a high learning curve and operational overhead.
The platform’s reliability for business-critical work remains questionable based on this demonstration. The need for reviewers, approvers, and frequent human intervention suggests these agents are augmenting human work rather than replacing it. This isn’t necessarily a criticism, as augmentation may be the more realistic and valuable use case, but it contrasts with the “zero-human company” positioning.
The rapid community adoption with 50,000 GitHub stars indicates strong interest in agent orchestration solutions, but stars don’t necessarily translate to production deployments. The active development roadmap suggests the platform is evolving quickly, which is positive for addressing current limitations but creates uncertainty for teams considering production adoption.
Several architectural patterns in Paperclip merit attention from an LLMOps perspective. The organizational hierarchy as a coordination mechanism provides a clear delegation structure but may also introduce unnecessary complexity for simpler workflows. The distinction between skills, routines, and agent instructions creates multiple places to encode organizational knowledge, which could lead to confusion but also provides flexibility for different types of reusability.
The first-class support for plans represents a valuable pattern for complex tasks, creating explicit checkpoints for human review. The reviewer/approver workflow provides structure for quality assurance. The integration of cost tracking and budgeting directly into the platform acknowledges the economic realities of production agent deployments.
However, some technical decisions raise questions. The default concurrency limit of one task per agent seems conservative, potentially limiting throughput. The local-first architecture without cloud deployment options limits accessibility and collaboration. The lack of detailed information about context retrieval and memory systems makes it difficult to assess how effectively the platform manages one of the most critical aspects of agent reliability.
The platform’s positioning as a general business tool rather than a coding-specific tool is ambitious but not fully demonstrated. The majority of examples involve software development workflows, with marketing content creation as the main non-coding use case. Broader adoption across finance, sales, and other business functions would require additional skills, integrations, and demonstrated patterns that aren’t shown here.
In summary, Paperclip is an early-stage but interesting approach to production agent orchestration, offering valuable abstractions for cost management, quality control, and multi-agent coordination. Significant limitations in its current form prevent it from delivering on the “zero-human company” vision, but the platform still provides real value for agent augmentation and coordination.