AI Agent-Driven Software Development Platform for Enterprise Engineering Teams

Factory 2025
Factory is building a platform to transition from human-driven to agent-driven software development, targeting enterprise organizations with 5,000+ engineers. Their platform enables delegation of entire engineering tasks to AI agents (called "droids") that can go from project management tickets to mergeable pull requests. The system emphasizes three core principles: planning with subtask decomposition and model predictive control, decision-making with contextual reasoning, and environmental grounding through AI-computer interfaces that interact with existing development tools, observability systems, and knowledge bases.

Industry

Tech

Overview

Factory is a company co-founded by Eno (CTO) that is building a platform for what they term “agent-driven software development.” The core thesis is that the industry is transitioning from human-driven to agent-driven software development, and that current approaches of adding AI capabilities to existing IDEs are fundamentally limited. Factory argues that to achieve transformative productivity gains (5-20x rather than 10-15%), organizations need to shift from collaborating with AI to delegating entire tasks to AI systems. Their platform, which powers autonomous agents they call “droids,” is designed specifically for enterprise organizations with large engineering teams (5,000-10,000+ engineers).

It’s worth noting that this presentation is essentially a product pitch, so claims about productivity gains (5-20x improvement, 50%+ task delegation targets) should be viewed with appropriate skepticism until validated by independent benchmarks or customer testimonials with concrete metrics.

The Platform Philosophy

Factory’s approach represents a departure from the current industry trend of augmenting existing developer tools with AI capabilities. The company argues that truly transformative AI-assisted development requires full task delegation to autonomous agents rather than incremental augmentation of existing workflows.

The platform is designed to allow engineers to operate primarily in the “outer loop” of software development—reasoning about requirements, working with colleagues, listening to customers, and making architectural decisions—while delegating the “inner loop” (writing code, testing, building, code review) to autonomous agents.

Agentic System Architecture

Factory identifies three core characteristics that define agentic systems, and they’ve built their platform around optimizing for each:

Planning

Factory emphasizes that “a droid is only as good as its plan.” The challenge of ensuring agents create high-quality plans and adhere to them is addressed through techniques borrowed from robotics and control systems, including subtask decomposition and model predictive control.

The planning system is designed to prevent wasted customer time and resources from agents executing on incorrect interpretations of tasks.
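Factory’s actual planning implementation is not public, but the subtask decomposition and model-predictive-control ideas described above can be sketched in a few lines. Everything here (`decompose`, `plan_has_drifted`, `replan`) is a hypothetical stand-in for LLM calls, not Factory’s API:

```python
from dataclasses import dataclass, field

@dataclass
class Subtask:
    description: str
    done: bool = False

@dataclass
class Plan:
    goal: str
    subtasks: list[Subtask] = field(default_factory=list)

def decompose(goal: str) -> Plan:
    # Stand-in for an LLM call that splits a task into concrete subtasks.
    steps = [f"step {i} of: {goal}" for i in range(1, 4)]
    return Plan(goal, [Subtask(s) for s in steps])

def plan_has_drifted(plan: Plan) -> bool:
    # Stand-in for an LLM critique comparing progress against the goal.
    return False

def replan(plan: Plan) -> Plan:
    # Keep completed work, regenerate the remainder of the plan.
    done = [t for t in plan.subtasks if t.done]
    fresh = decompose(plan.goal)
    fresh.subtasks = done + fresh.subtasks[len(done):]
    return fresh

def execute_with_replanning(goal: str, max_iters: int = 10) -> Plan:
    """MPC-style loop: after each subtask, re-evaluate the remaining
    plan against the goal and re-plan if execution has drifted."""
    plan = decompose(goal)
    for _ in range(max_iters):
        pending = [t for t in plan.subtasks if not t.done]
        if not pending:
            break
        pending[0].done = True       # stand-in for running one agent step
        if plan_has_drifted(plan):
            plan = replan(plan)
    return plan
```

The model-predictive-control analogy is the key design choice: rather than committing to the initial plan, the loop re-evaluates after every step, which is what limits wasted execution on a misinterpreted task.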

Decision-Making

This is identified as “probably the hardest thing to control” in agentic systems. When building software, agents must make hundreds or thousands of micro-decisions: variable naming, change scope, code placement, whether to follow existing patterns or improve upon technical debt, and more. Factory’s approach relies on contextual reasoning, grounding each decision in the surrounding codebase, organizational conventions, and task constraints.

The presentation describes customer expectations that droids can handle questions like “How do I structure an API for this project?”, which requires synthesizing multiple sources of context and constraints.
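One common way to implement this kind of contextual reasoning is to assemble the relevant sources (style guides, existing patterns, constraints) into a single grounded prompt before asking the model to decide. This is a minimal illustrative sketch, not Factory’s implementation; the function and section names are assumptions:

```python
def build_decision_context(question: str, sources: dict[str, str]) -> str:
    """Assemble a prompt that grounds a design question in multiple
    context sources, so the model answers from project conventions
    rather than generic pretrained knowledge."""
    sections = [f"## {name}\n{content}" for name, content in sources.items()]
    return "\n\n".join(
        [f"Question: {question}", *sections,
         "Answer using the conventions above."]
    )
```

The prompt produced here would then be sent to the model; the important property is that every micro-decision is anchored to explicit, inspectable context rather than left to the model’s defaults.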

Environmental Grounding

This is where Factory reports spending most of their engineering effort. Environmental grounding refers to the agent’s connection with the real world through AI-computer interfaces that interact with existing development tools, observability systems, and knowledge bases.

A practical example provided shows a droid receiving a Sentry error alert and being tasked with producing a root cause analysis (RCA). The agent searches through repositories using multiple strategies (semantic search, glob patterns, APIs), reviews GitHub PRs from around the time of the error, and synthesizes all gathered information into an RCA document.
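The RCA workflow above (multi-strategy search, PR review, synthesis) can be sketched as a simple evidence-gathering pipeline. All tool functions here are hypothetical stand-ins for real integrations (embedding search, filesystem globbing, the GitHub API), not Factory’s actual interfaces:

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    source: str   # which tool produced this finding
    snippet: str

def semantic_search(query: str) -> list[Evidence]:
    # Stand-in for embedding-based code search over the repositories.
    return [Evidence("semantic", f"code relevant to '{query}'")]

def glob_search(pattern: str) -> list[Evidence]:
    # Stand-in for filename/pattern search.
    return [Evidence("glob", f"files matching '{pattern}'")]

def recent_prs(window_days: int) -> list[Evidence]:
    # Stand-in for a GitHub API call listing PRs merged near the error.
    return [Evidence("github", f"PRs merged in the last {window_days} days")]

def produce_rca(alert: str) -> str:
    """Gather evidence via several search strategies, then synthesize
    the findings into a root cause analysis document."""
    evidence = (
        semantic_search(alert)
        + glob_search("**/*handler*.py")
        + recent_prs(window_days=3)
    )
    body = "\n".join(f"- [{e.source}] {e.snippet}" for e in evidence)
    return f"# Root Cause Analysis\nAlert: {alert}\n\nEvidence:\n{body}"
```

The point of the multi-strategy design is redundancy: semantic search, pattern matching, and change history each fail in different ways, so combining them makes the synthesized RCA more robust.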

Human-AI Interaction Design

Factory grapples with what they call a “meta problem”: regardless of agent reliability, where does the human fit in? Their philosophy is to keep engineers in the outer loop of requirements and architecture while preserving full visibility into the agents’ inner-loop work.

The presentation demonstrates the experience of triggering a droid from a project management system and watching it progress from ticket to pull request, emphasizing visibility into the agent’s work.

Integration Points

The platform integrates across multiple systems in the engineering stack, including project management tools (where tasks originate), version control and code review (GitHub repositories and pull requests), and observability systems such as Sentry.

Limitations and Honest Assessment

While Factory presents an ambitious vision, several aspects warrant skepticism: the headline productivity claims (5-20x improvement, 50%+ task delegation targets) come from a product pitch and have not yet been validated by independent benchmarks or customer testimonials with concrete metrics.

Future Directions

Factory mentions ongoing work to extend the platform, though the presentation does not detail specifics.

Key Takeaways for LLMOps Practitioners

The Factory case study illustrates several principles relevant to production LLM systems: invest in planning and plan adherence, treat agent decision-making as a control problem grounded in context, build robust environmental grounding through tool interfaces, and design the human’s role deliberately rather than as an afterthought.
