ZenML

Building an AI-Native Development Platform at Scale

Kilo 2025

Kilo, an all-in-one agentic engineering platform founded in March 2025 and launched in May 2025, processed over 25 trillion tokens within its first year while serving 1.5 million developers. The company tackled the challenge of transforming traditional software development workflows by building a platform that enables developers to transition from manual coding to AI agent orchestration. By implementing multi-agent systems with context-aware capabilities, model routing strategies, and trust-building mechanisms, Kilo increased their internal team's feature shipping velocity from one feature every two to three weeks to one to two features per week with just 15 engineers, demonstrating the production-scale potential of agentic development platforms.

Industry

Tech

Overview

Kilo represents a comprehensive case study in building and operating an AI-native development platform at production scale. Founded in March 2025 and launched in May 2025, the company has processed over 25 trillion tokens and served 1.5 million developers within approximately one year. The platform positions itself as an all-in-one agentic engineering solution that fundamentally transforms how development teams operate by shifting developers from direct code writers to orchestrators of AI agents.

The case study is particularly valuable because it presents both the internal transformation of Kilo’s own development practices and the learnings derived from operating a platform serving millions of users. With a team of only 15 engineers, Kilo claims to have improved feature shipping velocity from one feature every two to three weeks to one to two features per week, representing roughly a 4-6x productivity improvement.

The Transformation Model: From Coding to Orchestration

A central thesis of Kilo’s approach is the fundamental evolution of the developer role. The platform advocates for a shift from what they call the “2023 paradigm” where developers wrote every line of code, engaged in extensive collaboration meetings, managed product requirement documents, and experienced constant context switching. The new model positions developers as orchestrators who manage AI agents, guide vision, and focus on high-value architectural decisions.

Kilo has adopted an explicitly anti-collaboration stance, influenced by frameworks like those from PostHog. They minimize default collaboration, requiring engineers to only collaborate when it truly adds value. This is coupled with end-to-end ownership where individual engineers own entire features from conception through deployment to user feedback iteration, rather than features being owned by teams. This organizational structure is designed to eliminate what they identify as velocity killers in traditional development processes.

The platform reports that this shift changes the cognitive load distribution for developers from approximately 20% deep thinking and 80% coding to 80% thinking and 20% or less coding. This represents a significant mental model change that requires developers to build new muscles around sustained architectural thinking rather than hands-on implementation work.

The Trust Ladder: Progressive AI Adoption Framework

One of the most significant LLMOps insights from Kilo is their “trust ladder” framework for AI adoption. They observed that developers don’t immediately jump to full AI orchestration but instead progress through four distinct stages, each requiring increasing levels of trust:

Autocomplete serves as the entry point or “gateway drug” where developers remain fully in control, simply tabbing through suggestions. Developers don’t even necessarily think of this as AI; it’s just enhanced autocomplete. The low commitment and high control make this the easiest trust hurdle.

Chat represents the next stage where developers ask questions and provide context while maintaining control over what gets implemented. They’re still the decision-maker but beginning to delegate research and suggestion tasks to AI.

Single Agent involves delegating complete tasks to an agent while closely monitoring execution. Kilo describes this as analogous to a student driver situation where the AI has the steering wheel but the developer has a secret steering wheel and brake pedal, ready to intervene at any moment. This was approximately where Kilo’s team operated around August-September 2025.

Orchestration is the final stage where developers operate hands-off, managing multiple parallel agents (Kilo developers report using two to four parallel agents simultaneously), with the developer’s role fundamentally shifted to being an orchestrator rather than implementer.

Critically, Kilo observed that trust can break at any point on this ladder, causing developers to fall back to lower levels. Trust breakage occurs through specific failure modes at each level: slow suggestions at the autocomplete stage, agents editing wrong file paths during single agent usage, or orchestrators requiring constant permission checks that slow down the developer.
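The ladder and its fallback dynamic can be sketched as a tiny state model (purely illustrative — the stage names come from the text, while the enum and transition functions are invented for this sketch):

```python
from enum import IntEnum

class TrustStage(IntEnum):
    """Stages of the trust ladder, ordered by increasing autonomy."""
    AUTOCOMPLETE = 1
    CHAT = 2
    SINGLE_AGENT = 3
    ORCHESTRATION = 4

def advance(stage: TrustStage) -> TrustStage:
    """Move one rung up the ladder, capped at full orchestration."""
    return TrustStage(min(stage + 1, TrustStage.ORCHESTRATION))

def on_trust_break(stage: TrustStage) -> TrustStage:
    """A trust-breaking event (slow suggestion, wrong file edit,
    excessive permission prompts) drops the user one rung."""
    return TrustStage(max(stage - 1, TrustStage.AUTOCOMPLETE))
```

The key property the sketch captures is asymmetry: climbing the ladder is gradual, but a single failure at any stage can knock a developer back down.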

Critical Trust Breakage Points and Mitigation

Kilo identified three primary categories where AI adoption consistently breaks down, which informed their platform architecture:

Context represents perhaps the most fundamental challenge. The platform must have the right information about what it’s working on. As developers progress up the trust ladder, context requirements expand dramatically. Autocomplete needs only the current file, maybe imports and relevant files. Chat requires multiple files, the broader repository, related files, and documentation. Single agents need the full repository, dependencies, and upstream/downstream ecosystem understanding. Orchestration demands access to all repositories across the organization.
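The widening context requirement can be expressed as a simple level-to-scope mapping (an illustrative sketch, not Kilo's schema; the scope labels paraphrase the progression described above):

```python
# Context each trust level needs, per the progression in the text.
# Purely illustrative — not Kilo's actual configuration format.
CONTEXT_SCOPE = {
    "autocomplete": ["current_file", "imports"],
    "chat": ["current_file", "imports", "related_files", "repo_docs"],
    "single_agent": ["full_repository", "dependencies", "upstream_downstream"],
    "orchestration": ["all_org_repositories"],
}

def context_for(level: str) -> list[str]:
    """Return the context sources an assistant at this level should load."""
    return CONTEXT_SCOPE[level]
```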

A concrete example involves Pedro, Kilo’s one-person data team, who built their entire dbt data model infrastructure in one to two weeks (versus an estimated six months traditionally). He achieved this by giving the Kilo platform access not just to the data transformation repository he was building but also to the application code repository that generated the source data. This cross-repository context allowed the agent to understand the origin story of the data and structure transformations appropriately without requiring constant clarification.

Model Selection is the second critical factor. Kilo initially made the mistake of routing everything to the most expensive, state-of-the-art model, assuming it would yield the best results. They discovered this was both cost-ineffective (metaphorically described as “heating your home with inference”) and introduced latency that damaged trust. Their evolved approach uses model routing based on task type: state-of-the-art models like Claude Opus for complex architecture tasks, but more cost-effective and faster models like Kimi, MiniMax, or GLM for coding and debugging, where speed matters more than maximum reasoning capability. The key insight is that optimal performance comes from combining models and using the right model for each job, not from using the single best model for everything.
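A routing table along these lines might be sketched as follows (hypothetical: the model names appear in the text, but the task categories, table, and function are invented for illustration):

```python
# Route each task type to a model tier, per the text's insight that
# architecture work justifies a frontier model while coding and
# debugging favor cheaper, faster models. The table is illustrative.
MODEL_ROUTES = {
    "architecture": "claude-opus",  # complex design: maximum reasoning
    "coding": "glm",                # speed and cost matter more here
    "debugging": "kimi",
    "autocomplete": "minimax",      # latency-critical path
}

def route_model(task_type: str) -> str:
    """Pick a model for the task, falling back to the cheap tier."""
    return MODEL_ROUTES.get(task_type, "glm")
```

In a scheme like this, only explicitly architecture-tagged work pays for the frontier model; everything else defaults to the fast tier, which also helps the latency concerns discussed below.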

Continuous Improvement is the third pillar. Given the rapid pace of AI advancement, Kilo positions their platform as something that improves daily, with the expectation that sharp edges discovered one day are eliminated within days. This requires robust feedback loops and rapid iteration cycles, which ties into their internal practice of shipping product updates every Friday.

Latency as a First-Class Trust Metric

Kilo discovered that latency has a measurable, direct impact on AI adoption and trust. They observed that when autocomplete latency spiked to 200 milliseconds, usage declined noticeably. This finding drove an obsessive focus on monitoring latency and optimizing for speed. The insight is that if suggestions arrive after a developer has already moved on, or have become irrelevant to the current context, trust erodes immediately.

This makes latency optimization not just a performance concern but a fundamental product and trust concern in LLMOps contexts. The platform continuously monitors how usage trends correlate with performance metrics to identify and address degradations before they cause widespread trust erosion.
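Treating latency as a first-class trust metric could look like the following sketch (assumptions: the 200 ms threshold comes from the text; the p95 window and alert function are invented for illustration):

```python
# Hypothetical sketch: alert when p95 autocomplete latency over a
# sample window crosses the ~200 ms level the text says visibly
# depressed usage. Window size and percentile choice are illustrative.
LATENCY_BUDGET_MS = 200

def p95(samples_ms: list[float]) -> float:
    """95th-percentile latency over a window of samples."""
    ordered = sorted(samples_ms)
    idx = max(0, int(0.95 * len(ordered)) - 1)
    return ordered[idx]

def latency_alert(samples_ms: list[float]) -> bool:
    """True when the window's p95 exceeds the trust budget."""
    return p95(samples_ms) > LATENCY_BUDGET_MS
```

Using a tail percentile rather than a mean matches the trust framing: a handful of slow suggestions does the damage, even if average latency looks healthy.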

Measuring Trust Across the Adoption Spectrum

Kilo developed different measurement strategies for trust at each level of the adoption ladder, recognizing that trust signals become progressively harder to capture as autonomy increases:

Autocomplete provides excellent signal-to-noise ratio through simple acceptance/rejection metrics. When a suggestion is made, did the developer tab to accept it or ignore it? This binary signal arrives immediately and in high volume since autocomplete is the highest-frequency request type.

Chat provides weaker signals with slightly delayed feedback. When a developer makes a request and receives a result, did they copy it? Did they ignore it? Are they taking action on it? The signal is measurable but requires seconds to manifest and is less clear-cut than autocomplete acceptance.

Agent and Orchestration levels present significant measurement challenges. Tasks may execute over minutes, hours, or even days. Understanding whether a request to an orchestrator or long-running agent was effective becomes difficult without sophisticated instrumentation. Kilo invested heavily in instrumentation specifically to capture signals about whether features at these advanced levels are trustworthy and being used effectively.

This graduated measurement approach allows them to identify which parts of the platform are building or breaking trust and to rapidly address issues, enabled by their high shipping velocity.

Multi-Agent Architecture and Out-of-the-Box Agents

Kilo provides five core agents as part of their platform architecture:

Orchestrator Agent breaks down complex tasks and delegates to appropriate specialized agents, managing the overall workflow without requiring developers to manually route requests.

Ask Agent serves as the knowledge discovery and onboarding tool. New engineers use it to learn about the codebase rather than interrupting colleagues. It’s particularly popular for production issue analysis, with developers asking it to analyze problems and point them toward relevant context.

Architect Agent helps design and construct features, focusing on high-level structural decisions.

Code Agent handles the actual implementation work based on architectural decisions.

Debug Agent addresses issues and sharp edges encountered during development.

Beyond these out-of-the-box agents, Kilo supports custom agents and modes. Developers have created personalized configurations like “Brian mode” containing Brian’s preferred settings, which other team members can then use. This customization capability recognizes that in a world where developers manage teams of agents, they need the right team configured for their specific context and preferences.

The orchestrator approach is designed to be seamless: developers simply tell the orchestrator what they want accomplished, it formulates a plan, kicks off multiple agents as needed, and executes without requiring the developer to manually coordinate multi-agent workflows.
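The delegation pattern can be sketched as follows (the agent roles mirror the five agents above; every implementation here is a stub invented for illustration, not Kilo's code):

```python
# Minimal sketch of the orchestrator pattern: break a request into
# planned steps and dispatch each to a specialized agent. All agent
# functions are placeholder stubs.
def ask_agent(task: str) -> str:       return f"context for: {task}"
def architect_agent(task: str) -> str: return f"design for: {task}"
def code_agent(task: str) -> str:      return f"implementation of: {task}"
def debug_agent(task: str) -> str:     return f"fixes for: {task}"

AGENTS = {
    "research": ask_agent,
    "design": architect_agent,
    "implement": code_agent,
    "debug": debug_agent,
}

def orchestrate(request: str, plan: list[str]) -> list[str]:
    """Run each planned step through the matching specialized agent."""
    return [AGENTS[step](request) for step in plan]
```

A call like `orchestrate("add dark mode", ["research", "design", "implement"])` would route the same request through the ask, architect, and code stubs in sequence, which is the workflow the developer no longer has to coordinate by hand.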

Integration Patterns: Slack as a Primary Interface

One of the most popular features is the Kilo Slackbot integration, which allows developers to interact with the platform without context switching. Examples include requests like “extend this promotion one more week” or “change the color on this page” for website modifications, or “@Kilo, analyze the problem” during production issues. This integration pattern recognizes that reducing friction and meeting developers where they already work (in Slack) is critical for adoption.

The Slackbot serves as both a convenience interface and an orchestration entry point, demonstrating how LLM-powered tools can integrate into existing workflows rather than requiring entirely new interaction paradigms.

Organizational Structure and the One-Engineer-One-Feature Model

Kilo has radically restructured their organization around AI-augmented productivity. With 15 engineers, they’ve implemented a strict one-engineer-one-feature ownership model. Each engineer owns their feature entirely: conception, architecture, coding, deployment, and direct user interaction for feedback and iteration. This contrasts sharply with traditional team-based ownership.

The company has essentially eliminated the Product Manager role, with only one PM in the entire organization focused on horizontal platform concerns that support all the vertical features owned by individual engineers. Each engineer effectively serves as their own PM, setting product direction for their domain.

This structure creates interesting operational challenges. When asked about handling vacations and time off, the response was candid: if an engineer is on vacation for a week, their feature might not progress that week, and they’re comfortable with that tradeoff given the productivity gains when the engineer is present. They maintain an on-call rotation for production issues but otherwise embrace the possibility of features pausing when their owner is unavailable.

Their product roadmap literally contains blank lines for features gated solely by hiring a person to own them, illustrating how deeply the one-to-one mapping between engineers and features is embedded in their operational model.

Production Monitoring and Proactive Issue Detection

Despite the autonomous agent approach, Kilo emphasizes that all work is reviewed by humans, though not necessarily every single line of code. They leave the level of review to the discretion of the individual developer based on what’s appropriate for the specific change.

They’ve implemented proactive monitoring systems designed to discover issues before they become significant problems. When production issues do occur, the on-call engineer’s first action is often to use the ask agent to analyze and understand the problem, providing immediate context without requiring manual searching through repositories or documentation.

This approach to monitoring and incident response demonstrates how AI agents can augment traditional DevOps and SRE practices, serving as first-line analysis tools that accelerate time-to-understanding during incidents.

Data and Scale Metrics

The scale of Kilo’s operation provides validation for their LLMOps approach: over 25 trillion tokens processed and 1.5 million developers served within roughly a year of launch, with a 15-engineer team shipping one to two features per week.

These metrics are offered as proof points for the “proof is in production” philosophy the company emphasizes. The token processing volume in particular represents massive-scale LLM operations requiring robust infrastructure, model management, and cost optimization.

Adoption Statistics and Industry Reality Check

While presenting their own success, Kilo also offers sobering context about broader industry adoption: despite the rapid advancement and availability of AI development tools, actual adoption and sophisticated usage remains limited outside of cutting-edge organizations and communities. Kilo positions themselves as serving users across this entire adoption spectrum, from those just beginning with autocomplete through to advanced orchestration users.

Critical Perspective and Potential Limitations

While the case study presents impressive metrics and a compelling vision, several aspects warrant balanced consideration:

The extreme anti-collaboration stance and one-engineer-one-feature model may not generalize well to all types of software projects, particularly those requiring deep domain expertise coordination, complex system integration, or regulated environments with mandatory review processes.

The 80% thinking / 20% coding shift is acknowledged as taxing, with new engineers needing to “warm up their muscle” for sustained architectural thinking. The long-term sustainability of this cognitive load and potential burnout risks aren’t fully explored.

The company’s small size (15 engineers) and presumably high talent density may make their results difficult to replicate at larger organizations with more varied skill levels. The approach seems to assume developers who are capable of effective end-to-end ownership and sophisticated AI orchestration.

The measurement challenges acknowledged at the agent and orchestration levels suggest that understanding true effectiveness and ROI of these advanced capabilities remains partially opaque even to platform builders, which could mask productivity impacts (positive or negative) that aren’t being captured.

The specific models mentioned (Kimi, MiniMax, GLM) alongside Claude Opus suggest a multi-provider strategy, but details about model evaluation, selection criteria, fallback strategies, and handling model updates or deprecations aren’t provided.

Technical Architecture Implications

While specific infrastructure details aren’t provided, the scale of operation implies significant technical capabilities. The platform appears to implement RAG (Retrieval-Augmented Generation) patterns for context provision, though specific details about embedding strategies, vector databases, or retrieval mechanisms aren’t discussed.

Key LLMOps Takeaways

This case study offers several valuable lessons for LLMOps practitioners:

Trust is not binary but progressive, requiring different support mechanisms at different adoption stages. Building trust requires matching AI autonomy with appropriate context, model selection, and performance characteristics for each level.

Latency matters profoundly and measurably impacts adoption and trust, requiring first-class optimization and monitoring as a product concern, not just an infrastructure concern.

Context provisioning must scale with autonomy, from single-file awareness for autocomplete to cross-repository understanding for orchestration, representing a significant technical and architectural challenge.

Model routing strategies that match models to tasks rather than defaulting to the most powerful model can provide better cost-performance tradeoffs and build more trust through reduced latency.

Measurement strategies must adapt to the level of autonomy, with more sophisticated instrumentation required as agent autonomy increases and immediate feedback signals become less available.

Integration into existing workflows (like Slack) can reduce adoption friction more effectively than requiring developers to adopt entirely new tools and interaction patterns.

Organizational structure and development practices may need to evolve alongside AI tooling adoption to fully realize productivity gains, though such radical restructuring (like Kilo’s one-engineer-one-feature model) may not be appropriate or feasible for all organizations.
