Factory.ai built an enterprise-focused autonomous software engineering platform using AI "droids" that can handle complex coding tasks independently. The founders met at a LangChain hackathon and developed a browser-based system that allows delegation rather than collaboration, enabling developers to assign tasks to AI agents that can work across entire codebases, integrate with enterprise tools, and complete large-scale migrations. Their approach focuses on enterprise customers with legacy codebases, achieving dramatic results like reducing 4-month migration projects to 3.5 days, while maintaining cost efficiency through intelligent retrieval rather than relying on large context windows.
Factory AI is a Sequoia-backed, enterprise-focused autonomous software engineering (A-SWE) platform founded in 2023 by Matan Grinberg and Eno Reyes after they met at a LangChain hackathon. The company builds what they call “droids”: specialized AI agents designed to handle different aspects of the software development lifecycle, including coding, knowledge retrieval, and reliability/incident response. Their focus is specifically on enterprise use cases, serving Fortune 500 companies with complex legacy codebases that may be 30+ years old.
The founding thesis was that code generation capability is fundamental to LLM intelligence: the better an LLM is at code, the better it tends to be at downstream tasks. Combined with the fact that code is one of the few domains where outputs can be rigorously validated (via execution, tests, etc.), this made autonomous coding agents a particularly promising application area.
Factory’s platform is organized around specialized “droids,” each handling a different category of tasks across the development lifecycle, such as coding, knowledge retrieval, and reliability/incident response.
The name “droid” was deliberately chosen to differentiate from the overloaded “agent” terminology that had become associated with unreliable “while loop” systems like Baby AGI and AutoGPT. The founders wanted to signal that their system was more constrained and goal-oriented while still being agentic underneath.
A key differentiator emphasized in the case study is Factory’s approach to context. The company argues that giving agents access only to code (as many IDE-based tools do) is analogous to onboarding a human engineer by just throwing them into a codebase. Real engineers need access to Slack, Notion, Linear, Jira, Datadog, Sentry, PagerDuty, and other tools to be productive.
Factory integrates with these enterprise systems natively, allowing droids to pull in relevant context from tickets, documentation, and monitoring systems. They also generate “synthetic insights” at indexing time, automatically analyzing code structure, module relationships, and setup procedures rather than requiring users to manually configure these details.
The platform automatically ingests configuration files from other tools (cursor rules, etc.) but intelligently parses them to extract only information the system doesn’t already know, avoiding redundant or conflicting guidance.
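A minimal sketch of what this kind of ingestion-with-deduplication could look like; the file list, the `known_facts` set, and the helper itself are hypothetical illustrations, not Factory’s actual pipeline:

```python
from pathlib import Path

# Hypothetical list of config files that other tools leave in a repository.
TOOL_CONFIG_FILES = [".cursorrules", "CLAUDE.md", "AGENTS.md"]

def ingest_tool_configs(repo_root: str, known_facts: set[str]) -> list[str]:
    """Collect guidance lines from third-party config files, keeping only
    statements the indexer has not already derived from the code itself."""
    new_guidance: list[str] = []
    for name in TOOL_CONFIG_FILES:
        path = Path(repo_root) / name
        if not path.is_file():
            continue
        for raw in path.read_text().splitlines():
            line = raw.strip()
            # Skip blanks, comments, and facts already captured at indexing
            # time, avoiding redundant or conflicting guidance.
            if not line or line.startswith("#") or line in known_facts:
                continue
            new_guidance.append(line)
    return new_guidance
```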
Despite the industry excitement around large context windows, Factory emphasizes efficient retrieval as critical for enterprise deployments. Even with million-token context windows, customers care deeply about cost efficiency. The company invests heavily in retrieval capabilities so that for each actual LLM call, they’re selecting only the most relevant context rather than dumping entire repositories into prompts.
Their demo showed a task completing with only 43% of context capacity used despite operating on a large monorepo, demonstrating this retrieval efficiency.
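The underlying idea can be illustrated with a toy token-budgeted selection loop; the scoring inputs and the chars-to-tokens heuristic below are placeholder assumptions, not Factory’s retrieval stack:

```python
def select_context(snippets: list[tuple[str, float]], max_tokens: int,
                   budget_fraction: float = 0.5) -> list[str]:
    """Greedily pack the highest-scoring snippets into a fraction of the
    context window instead of dumping an entire repository into the prompt."""
    budget = int(max_tokens * budget_fraction)
    chosen: list[str] = []
    used = 0
    # snippets are (text, relevance_score) pairs, e.g. from a reranker
    for text, _score in sorted(snippets, key=lambda s: s[1], reverse=True):
        cost = max(1, len(text) // 4)  # rough chars-to-tokens estimate
        if used + cost > budget:
            continue
        chosen.append(text)
        used += cost
    return chosen
```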
A notable feature is support for both local and remote execution. Users can start a task locally with high involvement, then hand it off to a remote droid to continue independently, avoiding the IDE as a bottleneck. The browser-based interface (rather than IDE-native) is a deliberate architectural choice based on the thesis that the optimal UI for AI-assisted development will be fundamentally different from traditional coding environments.
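One way to picture the handoff is as serializable task state that either a supervised local session or a remote droid can resume; the shape below is an illustrative guess, not Factory’s API:

```python
from dataclasses import dataclass, field

@dataclass
class DroidTask:
    """Illustrative task state that can move between execution venues."""
    goal: str
    repo: str
    transcript: list[str] = field(default_factory=list)  # steps taken so far
    venue: str = "local"  # "local" while the user is closely involved

    def hand_off(self) -> "DroidTask":
        # Freeze progress and continue remotely, so the user's browser
        # session stops being the bottleneck.
        self.venue = "remote"
        return self
```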
Factory initially gained recognition for strong SWE-Bench results but has since stopped running the benchmark. Their stated reasons align with broader industry concerns: SWE-Bench is Python-only and doesn’t reflect real-world enterprise tasks. They note that newer benchmarks like PutnamBench, Commit0, SWE-Lancer, MLE-bench, and PaperBench are emerging alternatives, but none has achieved the universal recognition that HumanEval or SWE-Bench once had.
Factory’s internal evaluation combines multiple approaches rather than relying on any single benchmark. The company emphasizes that “vibe-based” or sentiment-based internal assessment also matters significantly: the engineers working with these models daily develop an intimate understanding of behavioral changes that may not be captured in automated metrics.
A particularly interesting technical challenge is dealing with RL-induced biases in newer models. Claude 3.7 Sonnet and Codex show strong preferences for CLI-based tools (grep, glob) that appear to result from post-training on their respective agentic products (Claude Code, Codex). When Factory provides better search tools, models may still prefer the CLI approaches they were trained on, so Factory’s evals must specifically test whether models effectively utilize its superior tooling rather than falling back to familiar patterns.
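An eval for this bias can be as simple as counting which tools the model actually invokes when both options are exposed; a rough sketch, with the tool names standing in for whatever the harness records:

```python
from collections import Counter

# Hypothetical tool names: the platform's own search tooling versus the
# CLI utilities (grep/glob) that models were post-trained to prefer.
PLATFORM_TOOLS = {"semantic_search"}
CLI_FALLBACKS = {"grep", "glob"}

def platform_tool_rate(trajectories: list[list[str]]) -> float:
    """Fraction of search-related tool calls that use the platform's
    tooling rather than falling back to familiar CLI patterns."""
    counts = Counter(call for traj in trajectories for call in traj)
    platform = sum(counts[t] for t in PLATFORM_TOOLS)
    fallback = sum(counts[t] for t in CLI_FALLBACKS)
    total = platform + fallback
    return platform / total if total else 0.0
```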
The team explicitly chose not to fine-tune models, believing that freezing a model at a specific capability level is lower leverage than iterating on the systems around it. They view these model biases as bugs the labs will eventually fix rather than permanent constraints to work around.
Factory uses LangSmith for observability despite not using LangChain itself. However, they note significant remaining challenges in the observability space, particularly around what they call “semantic observability”: understanding user intent and satisfaction beyond simple thumbs up/down signals.
The challenge is especially acute with enterprise customers, where Factory cannot see code data directly. They suggest that product analytics companies (Amplitude, Statsig) may be closer to solving this than traditional observability tools, since the real need is to understand when users are having good versus bad experiences in natural-language, intent-driven interactions.
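A sketch of what attaching such a signal to traces might look like with the LangSmith SDK, assuming a placeholder sentiment classifier (a real system would likely use an LLM judge rather than keyword matching):

```python
from langsmith import Client

client = Client()  # assumes LANGSMITH_API_KEY is set in the environment

def classify_satisfaction(message: str) -> float:
    """Placeholder heuristic scoring how satisfied the user sounds
    (0.0 = frustrated, 1.0 = satisfied); a stand-in for a real classifier."""
    frustration_cues = ("that's wrong", "not what i asked", "undo", "revert")
    return 0.0 if any(cue in message.lower() for cue in frustration_cues) else 1.0

def log_semantic_feedback(run_id: str, followup_message: str) -> None:
    # Attach an inferred intent/satisfaction signal to the trace instead of
    # relying solely on explicit thumbs up/down.
    client.create_feedback(
        run_id,
        key="inferred_satisfaction",
        score=classify_satisfaction(followup_message),
        comment=followup_message[:200],
    )
```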
For enterprise customers, Factory tracks and reports primarily on developer sentiment and concrete deadline compression. They found that traditional metrics (commits, lines of code, DORA metrics) matter far less than these signals. Their most compelling ROI example was reducing a four-month migration to three and a half days.
The case study provides a detailed walkthrough of how Factory handles large migration projects:
The traditional process involves 4-10 people analyzing codebases, creating documentation, developing migration strategies, populating project management tools with epics and tickets, and coordinating dependent work over months.
With Factory, a single engineer can delegate that same analysis, planning, and implementation work to droids. The bottleneck shifts from skilled humans writing code to how fast humans can delegate tasks appropriately, as the sketch below illustrates.
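In code terms, the shift looks roughly like fanning a generated plan out to droids; both helper functions below are hypothetical stand-ins for the analysis and execution steps described above:

```python
from concurrent.futures import ThreadPoolExecutor

def plan_migration(repo: str) -> list[str]:
    """Hypothetical: a droid analyzes the codebase and emits the work items
    (the epics/tickets a team would otherwise write by hand)."""
    return [f"migrate module: {m}" for m in ("billing", "auth", "reports")]

def run_droid(task: str) -> str:
    """Hypothetical: a remote droid executes one work item end to end."""
    return f"PR opened for: {task}"

# A single engineer's job becomes delegation: generate the plan, then
# fan the tasks out to droids that work independently.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_droid, plan_migration("legacy-monorepo")))
```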
The team identified their primary model-capability wish: LLMs post-trained on general agentic trajectories over very long timespans (hours, not minutes). Current models lack the ability to maintain goal-directed behavior over extended problem-solving sessions. They point to OpenAI’s Operator benchmarking, where human testers worked for two hours and gave up, as the type of challenge they want solved.
They’re currently building benchmarks with post-training techniques in mind, suggesting they may eventually use these for their own post-training efforts if necessary, though they continue to prefer staying on frontier models.
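A benchmark built with post-training in mind might store entire long-horizon trajectories with periodic goal checks, so the same records could later double as RL training data; the schema below is speculative:

```python
from dataclasses import dataclass

@dataclass
class LongHorizonEpisode:
    """Speculative record for an hours-long agentic problem-solving session."""
    goal: str
    steps: list[dict]        # tool calls, observations, timestamps
    goal_checks: list[bool]  # was the agent still goal-directed at each check?
    solved: bool
    wall_clock_hours: float

def goal_persistence(ep: LongHorizonEpisode) -> float:
    """Fraction of checks at which the agent remained goal-directed, a proxy
    for maintaining coherence over extended timespans."""
    return sum(ep.goal_checks) / len(ep.goal_checks) if ep.goal_checks else 0.0
```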
Factory deliberately targets the enterprise segment, which they see as underserved: developers working on codebases that are 30+ years old, doing migrations and maintenance rather than greenfield projects. They argue that while COBOL migrations aren’t as viral as flashy demos, the value delivered is dramatic.
Their go-to-market has been primarily word-of-mouth within enterprise networks, with executives sharing experiences at CEO dinners. They’re now actively expanding sales and customer success teams, looking specifically for highly technical individuals who can both sell to CIOs and work side-by-side with developers in the platform.