ZenML

Evolving AI Coding Agent Workflows from Research-Plan-Implement to CRISPY

HumanLayer 2026

HumanLayer developed an improved methodology for deploying AI coding agents in production environments, evolving from their original Research-Plan-Implement (RPI) approach to a new seven-stage framework called CRISPY (Context, Research, Iterate, Structure, Plan, sYnthesize, plus a final Implement stage). The original RPI methodology suffered from inconsistent results across teams: engineers not reading generated code, plans becoming too complex to review effectively, and reliance on "magic words" in prompts to elicit proper agent behavior. By decomposing monolithic 85+ instruction prompts into smaller focused stages (under 40 instructions each), implementing explicit human-agent alignment checkpoints through design discussions and structure outlines, and advocating that engineers read and own the actual code rather than lengthy plan documents, HumanLayer achieved more reliable 2-3x productivity gains while maintaining code quality and avoiding "slop" that would require future rework.

Industry: Tech

Overview

HumanLayer developed and evolved a comprehensive methodology for deploying AI coding agents in production software development environments. The company initially released the Research-Plan-Implement (RPI) methodology around August of the previous year, which gained significant traction with approximately 10,000 people downloading their open-source prompts for internal use across organizations ranging from small startups to Fortune 500 enterprises. However, after working with thousands of engineers over several months, HumanLayer discovered significant operational challenges with their initial approach and developed an evolved seven-stage framework called CRISPY to address these issues.

The Initial RPI Methodology and Its Challenges

The original Research-Plan-Implement approach was built around three main stages using AI coding agents. The research phase involved launching coding agent sessions that would send sub-agents through deep vertical slices of the codebase to gather compressed, objective context about the relevant parts of the system. The planning phase used a single monolithic prompt with 85 or more instructions to generate comprehensive implementation plans. The implement phase then executed the plan to generate code.

While expert engineers who spent extensive time (cited as 70 hours per week) with the system achieved excellent results, the methodology suffered from inconsistent outcomes when distributed to broader engineering teams. HumanLayer identified several critical problems through their hands-on work with users.

Research Quality Issues

The research phase was compromised when engineers simply passed their ticket or feature description directly to the research agent. This caused the model to generate opinions rather than objective facts about the codebase. Skilled engineers would instead decompose tickets into specific questions about how endpoints work, logic flows, and relevant workers or services, allowing the agent to gather purely factual information without being biased by implementation ideas.

Planning Inconsistencies and the “Magic Words” Problem

The planning prompt contained embedded steps for the agent to present design options, gather feedback on structure, and work iteratively with the user before writing the actual plan. However, approximately 50% of the time, agents would skip these crucial interaction steps and immediately generate complete plans without asking questions or gathering user input. This happened either because users didn’t include specific prompt instructions or because the model was inconsistent in following all 85+ instructions.

HumanLayer discovered that users had to include phrases like “work back and forth with me starting with your open questions and outline before writing the plan” to reliably get the desired interactive behavior. This requirement for “magic words” became embarrassing for the company when conducting enterprise workshops, as it indicated fundamental usability problems with their tooling rather than user error.

The Instruction Budget Constraint

A key technical insight emerged from research showing that frontier LLMs could only consistently follow approximately 150-200 instructions. When combining an 85-instruction planning prompt with CLAUDE.md system prompts, tool definitions, and MCP (Model Context Protocol) configurations, teams were significantly exceeding the instruction budget. This caused the model to only partially attend to instructions, resulting in unreliable workflow adherence and skipped steps.
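The budget argument above can be made concrete with a small tally. The 150-instruction threshold and the per-component counts below are illustrative assumptions (only the 85-instruction planning prompt is a figure from the talk), not measured values from HumanLayer's tooling:

```python
# Sketch: tallying the effective "instruction budget" before a session starts.
# Component counts other than the 85-instruction planning prompt are assumed.

INSTRUCTION_BUDGET = 150  # conservative end of the cited 150-200 range

def check_instruction_budget(components: dict[str, int]) -> tuple[int, bool]:
    """Sum per-component instruction counts and flag budget overruns."""
    total = sum(components.values())
    return total, total <= INSTRUCTION_BUDGET

total, ok = check_instruction_budget({
    "planning_prompt": 85,   # the original monolithic RPI planning prompt
    "system_prompt": 40,     # CLAUDE.md-style project instructions (assumed)
    "tool_definitions": 30,  # built-in tool descriptions (assumed)
    "mcp_servers": 45,       # instructions pulled in by MCP configs (assumed)
})
print(total, ok)  # 200 False: well over a 150-instruction budget
```

Even with modest assumed counts for the infrastructure components, the total lands past the threshold before any task-specific context is added, which is the talk's point.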

Plan Review Overhead Without Leverage

The original methodology advocated reading and reviewing the generated plans, which typically ran around 1,000 lines for features that would generate approximately 1,000 lines of code. Some teams even conducted pull request reviews on plans. However, this created several problems. First, the implemented code would often differ from the plan, requiring reviewers to read both the plan and the final code. Second, asking teammates to spend an hour reviewing a lengthy plan document didn’t provide true leverage—it was nearly equivalent work to reviewing the code itself. Third, engineers who invested significant review effort in the plan became psychologically attached to it even when the implementation diverged.

The Code Reading Debate

The presenter candidly acknowledged that in August they had advocated for not reading AI-generated code, suggesting that plans were sufficient for understanding what would be shipped. After six months of production experience, they reversed this position entirely. HumanLayer had to rip out and replace large parts of systems built without proper code review. While acknowledging impressive open-source projects like Beads (300,000+ lines of AI-generated code) and OpenClaw that were built without line-by-line code review, the presenter emphasized that production SaaS systems, particularly in regulated industries with paying customers depending on the code, required different quality standards. The talk positioned 2026 as “the year of no more slop,” emphasizing craft over pure speed.

The CRISPY Methodology: Seven Stages of Human-Agent Alignment

HumanLayer redesigned their workflow into seven distinct stages: Context (questions), Research, Iterate (design discussion), Structure (outline), Plan, sYnthesize (work tree), and Implement. The framework, whose CRISPY name evolves from the earlier RPI acronym, represents a fundamental rethinking of how to structure AI coding workflows.

Stage 1: Context - Question Generation

Rather than exposing the ticket or feature description to the research phase, the system now uses a separate context window to generate appropriate research questions. This deterministic step extracts questions like “how do endpoints work,” “trace the logic flow for components touching X,” and “find workers that handle Y” without allowing implementation opinions to contaminate the research phase.

Stage 2: Research - Objective Fact Gathering

With the ticket intentionally hidden from the research context window, agents gather purely objective facts about how the codebase currently works. This maintains the original RPI goal of compressing truth about code structure without injecting premature implementation decisions. The research serves as a vertical slice through relevant parts of the codebase.

Stage 3: Design Discussion - Early Human-Agent Alignment

The design discussion represents a major innovation for creating lightweight but high-value alignment checkpoints. This stage produces a roughly 200-line markdown document covering current state, desired end state, patterns to follow, resolved design decisions, and open questions requiring human input.

The design discussion addresses a critical problem where coding agents would discover and replicate outdated or deprecated patterns in the codebase. By surfacing the patterns the agent found and plans to follow, engineers can redirect it toward preferred implementations before any code is written. The stage forces the agent to “brain dump” its understanding and assumptions, enabling what the presenter called “brain surgery on the agent” before downstream code generation.

This concept draws from ideas by Matt from PCO about “design concepts”—the shared understanding locked in a context window between human and agent regarding what’s being built and how. The 200-line design discussion provides 5x better leverage than reviewing a 1,000-line plan, as it captures all the critical decision points at a much earlier stage.

Stage 4: Structure Outline - Vertical Planning

The structure outline addresses another persistent problem: LLMs’ tendency to generate “horizontal plans” that complete all database changes, then all service layer changes, then all API changes, then all frontend changes. This approach makes it difficult to verify correctness incrementally, as nothing fully works until all 1,200 lines of code are complete.

HumanLayer advocates for “vertical plans” that build complete thin slices of functionality iteratively—creating a mock API endpoint, wiring it to the frontend, implementing the service layer, adding the database migration, then connecting everything. This mirrors how experienced developers naturally work and creates checkpoints where functionality can be tested before proceeding.
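The contrast between the two plan shapes can be shown as task orderings for a hypothetical invoicing feature (all task names are invented for illustration):

```python
# Horizontal plan: finish each layer before the next; nothing is testable
# until every layer is done.
horizontal = [
    "db: add invoices table", "db: add line_items table",
    "service: invoice CRUD", "service: proration logic",
    "api: POST /invoices", "api: GET /invoices",
    "ui: invoice list", "ui: invoice form",
]

# Vertical plan: one thin end-to-end slice at a time, with a working
# checkpoint after each slice.
vertical = [
    "slice 1: mock POST /invoices + wire ui form    [checkpoint: form submits]",
    "slice 2: invoice CRUD + invoices table         [checkpoint: invoice persists]",
    "slice 3: GET /invoices + ui list               [checkpoint: list renders]",
    "slice 4: proration logic + line_items table    [checkpoint: totals correct]",
]
```

Both orderings contain the same work; only the vertical one yields an intermediate state that can actually be run and verified.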

The structure outline is deliberately kept high-level (approximately two pages versus eight pages for full plans), analogous to C header files showing function signatures and types without implementation details. It specifies the phases of work and testing checkpoints without diving into exact code changes. This provides sufficient detail for engineers to verify the agent’s approach and correct misconceptions while remaining lightweight enough to review quickly.

Stage 5: Plan - Tactical Implementation Document

With design and structure already aligned, the plan generation uses the same template and prompt as the original RPI approach but now builds on the previous artifacts. Because major decisions and structure have already been reviewed, engineers can spot-check the plan rather than conducting deep reviews, saving their detailed attention for the actual generated code.

Stages 6-7: Implementation and Pull Requests

The presentation didn’t cover implementation details in depth due to time constraints, noting this as a separate topic. However, the workflow culminates in generating code and creating pull requests with all the preceding context artifacts available for reference.

Technical Architecture and Context Engineering Principles

Prompt Decomposition and Control Flow

A fundamental principle articulated in the talk is “don’t use prompts for control flow if you can use control flow for control flow.” Rather than embedding complex conditional logic and multiple workflow paths into a single mega-prompt, HumanLayer’s architecture uses the LLM for classification and routing, then directs work to specialized prompts with narrow instruction sets.
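A minimal sketch of that principle: the model is asked only to classify the task, and ordinary code picks the focused prompt. The prompt file names and the `classify` stub are assumptions, not HumanLayer's actual layout:

```python
# "Control flow for control flow": routing lives in code, not in a mega-prompt.

FOCUSED_PROMPTS = {
    "bugfix": "prompts/plan_bugfix.md",    # each kept under ~40 instructions
    "feature": "prompts/plan_feature.md",
    "refactor": "prompts/plan_refactor.md",
}

def classify(task: str) -> str:
    # Placeholder for a cheap LLM classification call; here a keyword stub.
    return "feature" if "add" in task.lower() else "bugfix"

def route(task: str) -> str:
    """Deterministic routing replaces conditional logic inside one big prompt."""
    kind = classify(task)
    return FOCUSED_PROMPTS.get(kind, FOCUSED_PROMPTS["feature"])

print(route("Add CSV export to the reports page"))  # prompts/plan_feature.md
```

The LLM's job shrinks to a single narrow decision, and each downstream prompt only ever sees instructions relevant to its one task, which is what keeps every stage under the instruction budget.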

This approach references the presenter’s earlier work on “12 Factor Agents” and the concept of context engineering. While many practitioners interpret context engineering as RAG pipelines and information retrieval, the presenter emphasized that the more important interpretation involves better instructions, simpler tasks, and smaller context windows.

The decomposition strategy reduced all prompts from the original 85+ instruction monolith to under 40 instructions each, with some potentially going even smaller. This respects the instruction budget constraint and dramatically improves reliability of workflow adherence.

Context Window Management and the “Dumb Zone”

The presentation introduced the concept of the "dumb zone"—a degradation in model performance that occurs when context windows exceed approximately 40% utilization (around 80,000 tokens of Claude's 200,000-token window, of which roughly 168,000 are available for input after reserving space for output). While results can remain acceptable at 60% utilization depending on the ratio of user messages to files and other factors, performance degrades as context fills.

Engineers are flooding context windows not just with information but also with instructions from multiple MCP (Model Context Protocol) servers, system prompts, and tool definitions. Even before reaching the task-specific instructions and code context, significant context budget is consumed by infrastructure.

For newcomers to AI coding agents, HumanLayer recommends keeping context under 40% and considering workflow wrapping at 60%. Experienced users who work 60+ hours per week with coding agents develop intuition for when to push context higher or keep it aggressively low based on task complexity and the ratio of instructions to information.
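The 40%/60% guidance reduces to a simple utilization check. The 32,000-token output reservation is an assumption consistent with the figures above (200,000 total, ~168,000 available); a real check would use the model client's token counter:

```python
# Sketch: the talk's context-utilization thresholds as a simple advisory check.

CONTEXT_WINDOW = 200_000
OUTPUT_RESERVATION = 32_000                       # assumed reservation size
AVAILABLE = CONTEXT_WINDOW - OUTPUT_RESERVATION   # ~168,000 usable tokens

def utilization_advice(tokens_used: int) -> str:
    pct = tokens_used / CONTEXT_WINDOW
    if pct < 0.40:
        return "ok"                    # comfortably outside the "dumb zone"
    if pct < 0.60:
        return "consider wrapping up"  # results may still be acceptable
    return "start a fresh context"     # reload artifacts into a new window

print(utilization_advice(80_000))  # at the 40% boundary: "consider wrapping up"
```

Experienced users would treat these thresholds as defaults to override by feel, per the talk, rather than hard limits.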

An important architectural decision is that HumanLayer doesn’t rely on built-in context compaction. Instead, everything important goes into static markdown artifacts that can be reloaded into fresh context windows. This allows resuming work without depending on the quality of automatic or manual compaction and ensures critical decisions aren’t lost.

Multi-Stage Context Windows

The CRISPY workflow uses multiple context windows rather than maintaining a single long-running conversation. The question generation happens in one window, research in a fresh window without the ticket, design discussion builds on research plus the ticket, and so forth. While each stage may involve back-and-forth conversation (represented in presentation slides as single columns to indicate unified sessions), the transitions between major stages involve creating new context with selected artifacts from previous work.

This approach keeps individual stages focused, prevents context bloat, and ensures each stage has only the instructions and information relevant to its specific purpose.
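The artifact handoff between stages can be sketched as plain file I/O: each stage writes a markdown artifact, and the next stage's fresh context is seeded only with the artifacts it needs. The `thoughts/` directory name and file contents are invented for illustration:

```python
# Sketch: static markdown artifacts as the bridge between context windows,
# so nothing depends on automatic context compaction.

from pathlib import Path

ARTIFACT_DIR = Path("thoughts")  # hypothetical artifact location

def save_artifact(name: str, content: str) -> Path:
    """Persist a stage's output as a reloadable markdown artifact."""
    ARTIFACT_DIR.mkdir(exist_ok=True)
    path = ARTIFACT_DIR / f"{name}.md"
    path.write_text(content)
    return path

def load_context(*names: str) -> str:
    # A fresh context window is just these artifacts, not the prior chat.
    return "\n\n".join((ARTIFACT_DIR / f"{n}.md").read_text() for n in names)

save_artifact("research", "# Research\nFacts about the current endpoints...")
save_artifact("design_discussion", "# Design\nPatterns to follow, open questions...")

# The Structure stage sees research + design decisions, but not the raw
# conversations that produced them.
structure_context = load_context("research", "design_discussion")
```

Because the artifacts are ordinary files, work can be resumed in a brand-new session by reloading them, which is the stated reason for avoiding reliance on compaction quality.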

Organizational Adoption and Team Dynamics

Distributed Decision-Making and Code Ownership

The methodology emphasizes collaborative planning even within teams using AI coding agents extensively. The presenter described their own practice of sending design discussions to their co-founder (the code owner for most of their codebase) before implementing features. This front-loads alignment and feedback to the 200-line design discussion phase rather than discovering disagreements during code review of completed implementations.

This mirrors traditional architecture review processes but adapts them for AI-assisted development. Rather than replacing human collaboration with agent autonomy, the approach uses agents to formalize and document design thinking more explicitly, creating better artifacts for team alignment.

The Leverage Equation

HumanLayer’s framework explicitly considers where time is actually saved in the development lifecycle. For a typical two-day feature, traditional coding might consume 2-4 hours, with the remainder spent on alignment, design discussion, code review, cross-team coordination, and testing/verification. Simply using Claude Code or similar tools to reduce coding time from 4 hours to 20 minutes doesn’t change the two-day timeline because the other activities remain unchanged.

The CRISPY methodology applies AI assistance to the planning and alignment phases as well, reducing time spent in meetings and design discussions while simultaneously improving alignment quality through explicit artifact generation. With better upfront alignment, code review and rework are also reduced because reviewers already understand what’s coming and have had opportunities to provide input.

The stated goal is achieving 2-3x productivity improvements while maintaining near-human code quality, rather than pursuing 10x speed increases that generate “slop” requiring eventual replacement. This reflects a mature understanding that pure velocity without quality consideration creates technical debt and rework that negates apparent gains.

Training and Adoption Challenges

The evolution from three stages (RPI) to seven stages (CRISPY) potentially increases adoption complexity. The presentation acknowledged this concern but didn’t provide detailed solutions within the talk’s scope, noting it as a topic for future discussion. The tension between making workflows more reliable (through decomposition) and making them easier to learn represents an ongoing challenge in LLMOps.

Quality Philosophy and Production Standards

The “No Slop” Movement

The presentation strongly positioned itself within a 2026 movement toward code quality and away from rapid generation of low-quality output requiring eventual replacement. References to “slop versus craft” suggest a maturing industry conversation about sustainable AI-assisted development practices.

The presenter expressed skepticism toward agent swarms and highly autonomous systems without quality control mechanisms, arguing that 10x speed is meaningless if output must be discarded within six months. This represents a deliberate business and technical decision to optimize for sustainable velocity rather than headline-grabbing metrics.

Code Review as Professional Responsibility

The most emphatic points in the presentation centered on the necessity of reading AI-generated code when building production systems with paying customers, particularly in regulated industries. The presenter acknowledged impressive accomplishments in open-source AI-generated projects but distinguished between different risk profiles and stakes.

The argument positions code review not as a temporary measure until models improve but as a professional responsibility for engineers shipping production systems. This stance runs counter to some current industry messaging around fully autonomous coding but aligns with organizations prioritizing reliability and maintainability.

Measurement and Continuous Improvement

The presentation acknowledged, but did not resolve, several open questions around how to measure and continuously improve AI-assisted development across teams.

These questions reflect the organizational complexity of deploying LLMOps practices across diverse teams with varying needs and contexts.

Tools and Infrastructure

While the presentation focused primarily on methodology rather than specific tooling, HumanLayer is developing an IDE that orchestrates the CRISPY workflow. The presenter emphasized that the value can be achieved without their specific tooling—the principles of prompt decomposition, human-agent alignment checkpoints, and explicit artifact generation can be implemented with various tools.

References to Claude (Anthropic’s model) appeared throughout as the primary LLM being used, with mentions of Claude Opus 4.5 and Claude Code as specific tools. The methodology appears designed to work with frontier LLMs generally, with the instruction budget and context window management principles applying broadly.

Industry Context and Evolution

The presentation situates itself within a rapidly evolving landscape of AI coding tools and practices. References to competing approaches include the “software factory” concept from companies like StrongDM that advocate for never having humans read generated code, instead relying entirely on evaluation and formal verification. The presenter expressed interest in formal verification approaches like TLA+ and experimental TLA++ but maintained that current production needs require human code review.

The talk demonstrates a willingness to publicly revise earlier positions based on production experience—explicitly acknowledging that advice given in August and November of the previous year proved incorrect after several months of real-world usage. This intellectual honesty and emphasis on empirical learning from production deployments represents a mature approach to an immature and rapidly changing field.

The evolution from RPI to CRISPY reflects broader industry learning about the operational challenges of deploying LLMs in production software development, moving from initial enthusiasm about autonomous generation toward more nuanced workflows that optimize for sustainable quality and team collaboration alongside raw productivity gains.
