Factory is building a platform to transition from human-driven to agent-driven software development, targeting enterprise organizations with 5,000+ engineers. Their platform enables delegation of entire engineering tasks to AI agents (called "droids") that can go from project management tickets to mergeable pull requests. The system emphasizes three core principles: planning with subtask decomposition and model predictive control, decision-making with contextual reasoning, and environmental grounding through AI-computer interfaces that interact with existing development tools, observability systems, and knowledge bases.
Factory is a company co-founded by Eno (CTO) that is building a platform for what they term “agent-driven software development.” The core thesis is that the industry is transitioning from human-driven to agent-driven software development, and that current approaches of adding AI capabilities to existing IDEs are fundamentally limited. Factory argues that to achieve transformative productivity gains (5-20x rather than 10-15%), organizations need to shift from collaborating with AI to delegating entire tasks to AI systems. Their platform, which powers autonomous agents they call “droids,” is designed specifically for enterprise organizations with large engineering teams (5,000-10,000+ engineers).
It’s worth noting that this presentation is essentially a product pitch, so claims about productivity gains (5-20x improvement, 50%+ task delegation targets) should be viewed with appropriate skepticism until validated by independent benchmarks or customer testimonials with concrete metrics.
Factory’s approach represents a departure from the current industry trend of augmenting existing developer tools with AI capabilities. The company argues that truly transformative AI-assisted development requires delegating entire engineering tasks to autonomous agents rather than layering AI assistance onto existing IDE-centric workflows.
The platform is designed to allow engineers to operate primarily in the “outer loop” of software development—reasoning about requirements, working with colleagues, listening to customers, and making architectural decisions—while delegating the “inner loop” (writing code, testing, building, code review) to autonomous agents.
Factory identifies three core characteristics that define agentic systems, and has built its platform around optimizing for each of them: planning, decision-making, and environmental grounding.
Factory emphasizes that “a droid is only as good as its plan.” The challenge of ensuring agents create high-quality plans and adhere to them is addressed through several techniques borrowed from robotics and control systems:
Subtask decomposition: Breaking complex tasks into manageable subtasks rather than attempting to execute high-level directives directly. An example provided shows a droid creating separate plan steps for frontend work, backend work, tests, and feature flag rollout.
Model predictive control (MPC): Plans are continuously updated based on environmental feedback during execution, allowing the agent to adapt to changing conditions and new information discovered during task execution.
Explicit plan templating: For certain types of tasks, Factory implements predefined templates or workflows. They acknowledge the trade-off between rigidity (reducing creativity) and structure (improving reliability). For instance, knowing that a coding task will ultimately result in a pull request or commit allows the system to reason about beginning, intermediate, and end steps more effectively.
The planning system is designed to keep droids from wasting customer time and resources by executing on an incorrect interpretation of a task.
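A minimal Python sketch of the first two planning techniques, assuming a stub model: a task is decomposed into subtasks mirroring the frontend/backend/tests/feature-flag example above, and an MPC-style loop lets the model revise the unexecuted tail of the plan after each step’s feedback. The function names and structure are illustrative, not Factory’s actual API.

```python
from dataclasses import dataclass, field

@dataclass
class PlanStep:
    description: str
    done: bool = False

@dataclass
class Plan:
    goal: str
    steps: list[PlanStep] = field(default_factory=list)

def decompose(goal: str) -> Plan:
    # A real system would ask an LLM for subtasks; these are hard-coded
    # to mirror the example in the text.
    return Plan(goal, [
        PlanStep("implement frontend changes"),
        PlanStep("implement backend changes"),
        PlanStep("write and run tests"),
        PlanStep("roll out behind a feature flag"),
    ])

def execute_with_mpc(goal: str, execute_step, revise_plan) -> Plan:
    """Receding-horizon loop: act on the next step, observe the
    environment, then let the model rewrite the remaining steps."""
    plan = decompose(goal)
    while any(not s.done for s in plan.steps):
        step = next(s for s in plan.steps if not s.done)
        feedback = execute_step(step)  # tool calls, test output, etc.
        step.done = True
        done = [s for s in plan.steps if s.done]
        todo = [s for s in plan.steps if not s.done]
        # Model predictive control: only the unexecuted tail of the
        # plan is open for revision in light of new observations.
        plan.steps = done + revise_plan(goal, todo, feedback)
    return plan
```

The `execute_step` and `revise_plan` callables stand in for tool execution and a model call; constraining revision to the unexecuted tail is what keeps the plan stable while still adaptive.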
This is identified as “probably the hardest thing to control” in agentic systems. When building software, agents must make hundreds or thousands of micro-decisions: variable naming, change scope, code placement, whether to follow existing patterns or improve upon technical debt, and more. Factory’s approach involves:
Domain-specific decision criteria: For constrained domains (like code review or travel planning), explicit decision-making criteria can be defined. More open-ended agents require different approaches.
Contextual instruction: Since much decision-making happens in the reasoning layer of the underlying models, careful attention to how agents are instructed and what context they receive about their environment is critical.
Requirements synthesis: Droids are expected to evaluate user requirements, organizational standards, existing codebase patterns, and performance implications before arriving at final decisions.
The presentation describes customer expectations that droids can handle questions like “How do I structure an API for this project?”, which requires synthesizing multiple sources of context and constraints.
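As a sketch of what requirements synthesis might look like in practice (field names and example values are assumptions for illustration, not Factory’s prompt format), the sources a droid must weigh can be folded into a single instruction block for the model’s reasoning layer:

```python
def build_decision_context(question: str,
                           user_requirements: str,
                           org_standards: str,
                           codebase_patterns: str,
                           performance_notes: str) -> str:
    """Assemble the context sources a droid must weigh into one prompt,
    and ask the model to tie its decision back to those sources."""
    return "\n\n".join([
        f"Question: {question}",
        f"User requirements:\n{user_requirements}",
        f"Organizational standards:\n{org_standards}",
        f"Existing codebase patterns:\n{codebase_patterns}",
        f"Performance considerations:\n{performance_notes}",
        "Decide, and cite which source each part of the decision rests on.",
    ])

prompt = build_decision_context(
    "How do I structure an API for this project?",
    "REST endpoints serving both mobile and web clients",
    "All services publish OpenAPI specs and use versioned routes",
    "Existing services follow a controller/service/repository split",
    "P95 latency budget of 200 ms on read paths",
)
```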
This is where Factory reports spending most of their engineering effort. Environmental grounding refers to the agent’s connection with the real world through AI-computer interfaces. Key insights include:
Tool control is the primary differentiator: Factory explicitly states that “control over the tools your agent uses is the single most important differentiator in your agent reliability.”
Information processing is critical: When tools return large volumes of data (example given: 100,000 lines from a CLI command), the system must process this information intelligently—identifying what’s important, surfacing relevant details to the agent, and hiding extraneous information to prevent the agent from “going off the rails.”
Multi-source integration: The platform connects to GitHub (source code), Jira (project management), observability tools (like Sentry for error alerts), knowledge bases, and the broader internet.
A practical example provided shows a droid receiving a Sentry error alert and being tasked with producing a root cause analysis (RCA). The agent searches through repositories using multiple strategies (semantic search, glob patterns, APIs), reviews GitHub PRs from around the time of the error, and synthesizes all gathered information into an RCA document.
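A rough sketch of the tool-output processing described above, assuming a 200-line context budget and a simple keyword heuristic (both stand-ins for whatever processing Factory actually applies): wrap the CLI call in an interface that surfaces likely-relevant lines, keeps a little surrounding context, and tells the agent what was hidden.

```python
import subprocess

MAX_LINES = 200  # assumed context budget for raw tool output

def run_tool(cmd: list[str], keywords=("error", "fail", "warn")) -> str:
    """Run a CLI command and compress its output so the agent sees the
    important lines instead of, say, 100,000 raw ones."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    lines = result.stdout.splitlines()
    if len(lines) <= MAX_LINES:
        return result.stdout
    # Surface lines matching failure keywords, keep the head and tail
    # for context, and say explicitly how much was hidden so the agent
    # is not misled about the output's true size.
    hits = [ln for ln in lines if any(k in ln.lower() for k in keywords)]
    kept = lines[:20] + hits[: MAX_LINES - 40] + lines[-20:]
    omitted = len(lines) - len(kept)
    return "\n".join(kept) + f"\n[... {omitted} lines omitted by the tool interface ...]"
```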
Factory grapples with what they call a “meta problem”: regardless of agent reliability, where does the human fit in? Their philosophy is encapsulated in several design principles:
Blending delegation with control: The UX must allow humans to stay in “flow” in the outer loop while maintaining precision control when agents cannot accomplish tasks.
Transparency through showing work: As agents reason and make decisions, the platform shows their work, which builds user trust and enables debugging and iteration.
Human collaboration as climbing gear: Factory uses the metaphor that AI systems are like climbing gear for scaling Mount Everest (shipping high-quality software)—essential tools that enhance human capability rather than replace human judgment entirely.
The presentation includes a video showing the experience of triggering a droid from a project management system and watching it progress from ticket to pull request, emphasizing the visibility into the agent’s work.
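One plausible mechanism for this kind of transparency (a sketch under assumed event names, not Factory’s implementation) is for the agent runtime to emit structured trace events that a UI can render live as the droid works:

```python
import json
import sys
import time

def emit(event_type: str, **payload) -> None:
    """Write one structured trace event per agent action so a UI can
    show the droid's work (plan updates, tool calls, decisions)."""
    record = {"ts": time.time(), "type": event_type, **payload}
    sys.stdout.write(json.dumps(record) + "\n")

emit("plan_step_started", step="write and run tests")
emit("tool_call", tool="pytest", args=["-q"])
emit("decision",
     summary="reused existing feature-flag helper",
     rationale="matches established codebase pattern")
```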
The platform integrates across multiple systems in the engineering stack: GitHub for source code, Jira for project management, observability tools such as Sentry, internal knowledge bases, and the broader internet.
While Factory presents an ambitious vision, several aspects warrant skepticism:
Quantitative claims are vague: The 5-20x productivity improvement and 50%+ task delegation targets are aspirational statements without published benchmarks or detailed customer case studies with metrics.
Enterprise focus may limit applicability: The platform appears designed for very large engineering organizations, which may limit relevance for smaller teams.
Complexity of implementation: The presentation acknowledges that this transition “requires an active investment” and isn’t simply about switching tools or watching tutorials—suggesting significant organizational change management is required.
Technology maturity: The acknowledgment that the “current state of where we’re at” may prevent agents from accomplishing certain tasks is an admission of the technology’s present limitations.
Factory also mentions ongoing work to extend the platform beyond its current capabilities.
The Factory case study illustrates several principles relevant to production LLM systems: structured planning with subtask decomposition and continuous re-planning, explicit control over the tools an agent can use, intelligent processing of large tool outputs, multi-source context integration, and UX design that blends delegation with human oversight and transparency.