Harness Engineering: Structuring Context and Guardrails for AI Coding Agents in Production

OpenAI 2026

Ryan Leiulo from OpenAI presents the concept of "harness engineering," a novel approach to productionizing AI coding agents by systematically structuring context, guardrails, and feedback loops. The core problem addressed is that while modern LLMs have reached capabilities enabling significant parts of the software engineering lifecycle, they lack the durable memory and cultural osmosis that human engineers possess. The solution involves creating explicit, written documentation of non-functional requirements, implementing just-in-time context injection through tool calls and tests, and establishing reviewer agents with persona-based guardrails. Results demonstrate that teams can achieve headless operation with minimal human intervention by shifting quality controls rightward in the development process, enabling agents to self-correct through static guardrails, exhaustive tests, and automated review processes that continuously improve through systematic capture of all human feedback and failed builds.

Industry

Tech

Overview

Ryan Leiulo from OpenAI introduces the concept of “harness engineering” at AI Native DevCon, describing a comprehensive approach to integrating AI coding agents into production software development workflows. The talk originates from his personal experience starting in June of the previous year when he began using early reasoning models and the Codex CLI coding agent to automate his own engineering work. Initially attempting to have the agent read Slack alerts and triage pages, he evolved into presenting himself as a tool to the model, gradually building up powerful tooling and context structures that enabled the agent to solve problems, write code, and manage issues on his behalf.

The fundamental thesis is that software development has undergone a singularity-level disruption comparable to the introduction of cloud computing, but compressed into a much shorter timeframe. With the introduction of GPT 5.2 and Claude Opus 4.5 in December, code production capabilities reached a level where traditional software engineering constraints no longer apply. This requires teams to completely retool their stacks with every point release of models, constantly re-evaluating what’s possible.

Core Constraints in the Agent-Driven World

Leiulo identifies three foundational limits that remain when teams of humans and agents produce software together:

Human time is the fundamentally scarce resource. He notes that he personally maxes out at about three concurrent sessions on his laptop. To achieve higher throughput and parallelism, teams must remove their own synchronous attention from the process. This represents a shift from the old world where code production was the expensive bottleneck.

Human and model attention remain limited. Due to the architecture of LLMs where attention must sum to one, thrashing agents with conflicting and overbearing requirements will degrade performance. Teams need to structure work to be more parallel, accept many more PRs of varying sizes, and let agents explore the solution space.
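The "attention must sum to one" constraint can be illustrated with a toy softmax, the normalization a single attention head applies to its scores. This is a deliberately simplified sketch, not a model of any real LLM: the scores are made-up numbers, and `softmax` and `weight_on_first` are illustrative helper names.

```python
import math

def softmax(scores):
    """Normalize raw scores into weights that sum to one."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weight_on_first(n_competing, key_score=2.0, other_score=1.0):
    """Weight given to one key instruction among n competing ones."""
    return softmax([key_score] + [other_score] * n_competing)[0]

few = weight_on_first(3)    # key instruction plus 3 competing requirements
many = weight_on_first(30)  # key instruction plus 30 competing requirements
print(f"weight with 3 competitors:  {few:.3f}")
print(f"weight with 30 competitors: {many:.3f}")
```

Because the weights always total one, every extra requirement placed in front of the model takes weight away from the others, which is the intuition behind not overloading an agent with conflicting instructions.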

Model context windows, while growing larger over time, remain a scarce resource requiring protection. Leiulo shares that with GPT series models, auto-compaction is fantastic and he can let tasks run for 6, 12, even 36 hours while still getting good results. However, the context window being obliterated and rebuilt during these auto-compactions is something teams must actively contend with by continually resurfacing context to the model.
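One way to picture "continually resurfacing context" is a session that re-inserts a small set of pinned notes every time its history is compacted. The sketch below is an assumption-laden toy: real auto-compaction happens inside the model harness and is not exposed this way, and `Session`, `pinned`, and `max_messages` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    pinned: list[str]                      # context that must survive compaction
    messages: list[str] = field(default_factory=list)
    max_messages: int = 6                  # stand-in for a token budget

    def add(self, message: str) -> None:
        self.messages.append(message)
        if len(self.messages) > self.max_messages:
            self.compact()

    def compact(self) -> None:
        """Drop old messages, then resurface the pinned context."""
        dropped = len(self.messages) - 2
        recent = self.messages[-2:]
        summary = f"[summary of {dropped} earlier messages]"
        self.messages = [summary, *self.pinned, *recent]

session = Session(pinned=["RULE: every fetch needs a timeout"])
for step in range(20):
    session.add(f"step {step}")
```

However many compactions occur, the pinned rule is still present in the rebuilt window, while ordinary history is reduced to a summary line.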

The Core Problem: Making Quality Legible to Agents

A critical insight is that agents lack the cultural osmosis and durable memory that human engineers accumulate through standup meetings, code reviews, and accumulated battle scars. Teams must articulate what constitutes “doing a good job” and write it down explicitly. Since LLMs are driven by text, defining quality in written form becomes a net new function for software engineering teams in 2026.

However, writing things down is insufficient. The text must be pulled into context at the right time in ways that don’t thrash the agent while still allowing it to be creative and reason effectively. For example, telling an agent to “write reliable network code by making sure retries and timeouts are consistently applied” is useless if that text never makes it into the agent’s context during implementation.

The Harness Engineering Approach

Harness engineering is fundamentally about making context around quality legible and surfacing it just-in-time to the agent during its execution trajectories. This steers and refines output to ensure every PR adheres to what the team considers acceptable, high-quality, aligned software.

Interestingly, Leiulo advocates for the opposite of traditional DevOps “shift left” practices. Rather than pushing interventions as early as possible in the development process, he pushes interventions as far right as possible to minimize synchronous human time. The hierarchy of interventions from right to left includes:

The key is progressively making mistakes impossible rather than just unlikely. He never wants to give the same review feedback twice.

Three Phases of Context Delivery

Leiulo structures the agent workflow into three distinct phases, each requiring different context delivery strategies:

Phase 1: Planning and Grounding

The most important artifact is an agents.md file containing a numbered set of steps the model should follow during every rollout and session. The agent first grounds itself in documentation, the knowledge base, and the ticket. It spiders through the history of Architecture Decision Records (ADRs) and design docs to understand impacts on other features. It reviews critical user journeys to determine what screens and user surfaces are affected, keeping the QA plan in mind throughout execution. Some amount of slowness is expected and welcome during this phase because the agent is paging in all necessary context about how this feature slots into the broader system.

Phase 2: The Messy Middle of Implementation

During code writing, test running, and codebase exploration, the harness exploits the fact that agents call many tools and run many tests, and uses those interactions for just-in-time prompt injection. Tests and lints written for agents differ fundamentally from those written for humans. They account for agents truncating tool call output, they carry descriptive error messages that point to runbooks for remediation (which agents respond well to), and they can be fiddly and numerous in ways that would burden human developers.
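A lint message aimed at an agent might look like the following sketch. It puts the actionable lines first and caps its own length, on the assumption that the agent may truncate long tool output. The function name, rule ID, file paths, and runbook location are all hypothetical.

```python
def format_for_agent(rule_id, location, problem, fix, runbook, limit=400):
    """Render a lint failure so the actionable part survives truncation."""
    lines = [
        f"{rule_id} at {location}: {problem}",
        f"FIX: {fix}",
        f"RUNBOOK: {runbook}",
    ]
    message = "\n".join(lines)
    # Agents may truncate long tool output, so cap the message ourselves
    # and keep the most actionable lines first.
    return message if len(message) <= limit else message[: limit - 3] + "..."

print(format_for_agent(
    rule_id="net/require-timeout",               # hypothetical rule name
    location="src/billing/client.ts:42",         # hypothetical file
    problem="fetch call has no timeout or retry policy",
    fix="wrap the call in the shared HTTP helper",
    runbook="docs/runbooks/network-calls.md",    # hypothetical runbook path
))
```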

A concrete example Leiulo provides is the common failure mode of missing timeouts and retries on cross-service network calls. While there’s no standard ESLint plugin for this, with cheap code production teams can “vibe up” guardrails with 100% code coverage and exhaustive table-driven tests, migrate the entire codebase, and surface failures just-in-time whenever the model writes another fetch call. Because tool call outputs receive less weight during auto-compaction, this just-in-time correction doesn’t pollute the context window while still allowing complex work.
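A guardrail of this kind could be sketched as a small script with table-driven cases. This is not the check Leiulo's team uses; it is a text-based approximation that assumes timeouts are expressed by passing a `signal` option (for example `AbortSignal.timeout(ms)`) to `fetch`. A production version would more likely work on the syntax tree, since this one can be fooled by parentheses inside strings or comments.

```python
import re

FETCH_CALL = re.compile(r"\bfetch\s*\(")

def call_arguments(source, open_paren):
    """Return the text between a '(' and its matching ')'."""
    depth = 0
    for i in range(open_paren, len(source)):
        if source[i] == "(":
            depth += 1
        elif source[i] == ")":
            depth -= 1
            if depth == 0:
                return source[open_paren + 1 : i]
    return source[open_paren + 1 :]  # unbalanced: take the rest

def fetch_calls_without_timeout(source):
    """List 1-based line numbers of fetch calls that pass no `signal`."""
    offenders = []
    for match in FETCH_CALL.finditer(source):
        args = call_arguments(source, match.end() - 1)
        if not re.search(r"\bsignal\b", args):
            offenders.append(source.count("\n", 0, match.start()) + 1)
    return offenders

# Table-driven cases: (TypeScript snippet, expected offending lines).
CASES = [
    ("await fetch(url);", [1]),
    ("await fetch(url, { signal: AbortSignal.timeout(5000) });", []),
    ("const a = 1;\nfetch(url, { method: 'POST' });", [2]),
    ("prefetch(url);", []),
]
for snippet, expected in CASES:
    assert fetch_calls_without_timeout(snippet) == expected, snippet
```

Run as part of the lint step, a check like this fails at the moment the agent adds an unguarded call, so the correction arrives as tool output rather than as standing instructions in the prompt.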

Phase 3: Review and Merge

After implementation, the task becomes determining whether the code and diff are aligned. Static guardrails and multiple LLM-as-judge evaluators examine the code through various lenses: reliability, performance, adherence to team standards. Because LLMs crave text, these reviewer agents can collaborate with the implementation agent over the PR thread, providing detailed feedback that realigns the diff back to baseline.

Key Implementation Patterns

Agents.md as the Central Hub: This file doesn’t contain prescriptive guardrails but rather points to a curated set of review personas that are essentially bulleted lists of guardrails. This structure allows teams to cheaply refine agent output by having Slack conversations about performance regressions or bugs, then mentioning the agent to yoink the conversation and create a PR adding it to the static guardrails.

Review Personas: These persona-based guardrail files can be applied not just to code quality but also to product features, critical user journeys, and the fundamental user problems the application solves. All this context grounds the agent in what the team is trying to accomplish and how they think about working, producing more aligned output.
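Since a review persona is described as essentially a bulleted list of guardrails, adding one can be a one-line change. The sketch below assumes a persona file with a `#` title line and `-` bullets; that layout, the sample rules, and the helper names are illustrative assumptions, not the format used in the talk.

```python
def parse_persona(text):
    """Split a persona file into a title and its bulleted guardrails."""
    title, guardrails = "", []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("# ") and not title:
            title = stripped[2:]
        elif stripped.startswith(("- ", "* ")):
            guardrails.append(stripped[2:])
    return title, guardrails

def add_guardrail(text, guardrail):
    """Append a new bullet, e.g. one distilled from a Slack thread."""
    return text.rstrip("\n") + f"\n- {guardrail}\n"

PERSONA = """# Reliability reviewer
- Every cross-service call has a timeout and a retry policy.
- Errors are logged with enough context to debug from the log alone.
"""

updated = add_guardrail(PERSONA, "Retries use backoff with jitter.")
title, rules = parse_persona(updated)
print(title, len(rules))
```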

Structural Tests: Blunt hammers like file line count limits or requiring snapshot tests can be powerful. For example, requiring that every React component has a snapshot test with 100% branch coverage naturally causes the agent to decompose components, make them pure where possible, avoid prop drilling, and place hooks close to where data is used—all because it makes it easier to satisfy the snapshot test requirement.
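A file line count limit is simple enough to sketch in a few lines. The 300-line threshold and the `.ts`/`.tsx` suffixes below are arbitrary assumptions chosen for illustration, as is the function name.

```python
from pathlib import Path

def oversized_files(root, max_lines=300, suffixes=(".ts", ".tsx")):
    """Return (path, line_count) for source files over the limit."""
    offenders = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            count = len(path.read_text(encoding="utf-8").splitlines())
            if count > max_lines:
                offenders.append((str(path), count))
    return offenders
```

Wired into a test that fails when the list is non-empty, this gives the agent a blunt but unambiguous signal to split a file rather than keep growing it.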

Static Type Requirements: To combat type-shape probing, which leaves any and unknown types scattered through the code, Leiulo uses ESLint to statically disallow any function that uses these types, except in route handlers and database parsing code. Combined with a 100% code coverage requirement, this makes the probing behavior impossible: values of unknown type cannot exist outside those boundaries, so the defensive branches that probe them can never be exercised by a test and would fail the coverage gate.
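The talk describes doing this with ESLint. As a rough, text-based stand-in for the same idea, the sketch below flags `any` and `unknown` annotations everywhere except in an allowlist of boundary paths. It is broader than the rule described (it flags any annotation, not only function signatures), and the allowlisted paths are hypothetical.

```python
import re
from fnmatch import fnmatch

LOOSE_TYPE = re.compile(r":\s*(any|unknown)\b")
ALLOWED = ("src/routes/*", "src/db/parsers/*")  # hypothetical boundary code

def loose_type_violations(path, source):
    """Return (line_number, type_name) for each `any`/`unknown` annotation."""
    if any(fnmatch(path, pattern) for pattern in ALLOWED):
        return []
    return [
        (number, match.group(1))
        for number, line in enumerate(source.splitlines(), start=1)
        for match in LOOSE_TYPE.finditer(line)
    ]

print(loose_type_violations(
    "src/core/user.ts",
    "function load(raw: unknown) {}\nconst name: string = 'a';",
))
```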

All Code is Prompting: Since agents crave text, every bit of text fed to them informs token prediction and thus the produced code. This means all code in the repository, not just the documentation, is also prompting. Standardizing the entire stack on a single observability framework (like OpenTelemetry) means the model can translate context from one part of the repository to another without loss of quality. Having six different observability stacks forces the model to spend significantly more attention figuring out which to use and whether code has been migrated.

Treating Agents as Team Members

Leiulo emphasizes treating agents like human teammates who need to convince you to merge their code. Just as you don’t shoulder-surf teammates in their editors, you trust their attestation that they tested the code. When unsure, you might ask for logs from staging deploys or screenshots of exercising features. Agents should be required to do the same.

With computer use and browser use capabilities now available (he highly recommends the Codex app), agents can provide visual confirmation. Even without those tools, it is within reach to vibe up a headless X11 display in a Docker container and wire FFmpeg to the stream to record reproduction videos, especially since code quality doesn’t matter for these throwaway test harnesses and Codex excels at FFmpeg manipulation.
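A throwaway recording harness along these lines could start from two commands. The sketch assumes the headless display is an X11 virtual framebuffer (Xvfb) and that FFmpeg captures it with its `x11grab` input; the display number, resolution, frame rate, duration, and output filename are arbitrary choices. It only builds the argument lists, leaving execution (for example with `subprocess.Popen`) to the container entrypoint.

```python
def xvfb_command(display=":99", width=1280, height=720, depth=24):
    """Arguments to start a virtual X11 display with one screen."""
    return ["Xvfb", display, "-screen", "0", f"{width}x{height}x{depth}"]

def record_command(output, display=":99", width=1280, height=720,
                   framerate=15, seconds=30):
    """Arguments for FFmpeg to record the virtual display to a file."""
    return [
        "ffmpeg", "-y",                     # overwrite an existing output file
        "-f", "x11grab",                    # capture from an X11 display
        "-video_size", f"{width}x{height}",
        "-framerate", str(framerate),
        "-i", display,
        "-t", str(seconds),                 # stop after a fixed duration
        output,
    ]

print(" ".join(xvfb_command()))
print(" ".join(record_command("repro.mp4")))
```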

The review process should assume benefit of the doubt and bias toward merge. What are the P2-and-above issues necessary to accept this code? Use reviewer agents to surface those, get the coding agent to address them, verify the reviewers are satisfied, and ship. Observing which review feedback regularly surfaces provides signals about which guardrails need to be shifted left in the process.
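The "bias toward merge" rule can be sketched as a filter over reviewer findings. The sketch assumes a priority scale where lower numbers are more severe, so "P2 and above" means P0, P1, and P2; the `Finding` type and the sample findings are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    reviewer: str   # persona that raised it
    priority: int   # 0 is most severe, so P0 < P1 < P2 < P3
    summary: str

def blocking(findings, threshold=2):
    """Keep only findings at P2 or more severe."""
    return [f for f in findings if f.priority <= threshold]

def ready_to_merge(findings):
    """Bias toward merge: only blocking findings hold a PR back."""
    return not blocking(findings)

findings = [
    Finding("style", 3, "naming nit"),
    Finding("reliability", 1, "missing timeout on billing call"),
]
print([f.summary for f in blocking(findings)])
```

Non-blocking findings are not discarded in this model; they simply do not gate the merge, and they remain available as input when deciding which guardrails to add next.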

Systematic Capture of Feedback Loops

Every review comment, agent interruption, failed build, and production exception represents a signal that context was missing. The agent didn’t consider the full end-to-end consequences of its code. Teams should systematically capture all this human feedback and “slurp it up” to have sub-agents dream over it nightly, distilling whether humans can improve their prompting, whether missing guardrails should exist, and how to achieve more headless operation with less human interruption while enabling agents to handle increasingly complex tasks.
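The distillation step can be approximated by counting how often the same theme recurs across captured signals. In this sketch, tagging each signal with a theme is assumed to happen upstream (for example by a sub-agent); the signal sources, themes, and the `min_count` threshold are made up for illustration.

```python
from collections import Counter

def guardrail_candidates(signals, min_count=2):
    """Return themes seen at least `min_count` times, most frequent first."""
    counts = Counter(theme for _, theme in signals)
    return [(theme, n) for theme, n in counts.most_common() if n >= min_count]

# (source, theme) pairs; theme tagging is assumed to be done upstream.
SIGNALS = [
    ("review_comment", "missing timeout"),
    ("failed_build", "snapshot out of date"),
    ("review_comment", "missing timeout"),
    ("production_exception", "missing timeout"),
]

print(guardrail_candidates(SIGNALS))
```

A theme that keeps recurring is a candidate for a static guardrail, which is one way to act on the goal of never giving the same review feedback twice.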

Vibe Coding as a Philosophy

Leiulo advocates “vibe coding” for guardrails and tooling that only affect local development. Their implementation can be gross, and that is the point: it makes it possible to stop caring about parts of the software production function. This enables operating like a group tech lead or org lead, who has no visibility into every engineer’s keyboard activity. What matters are invariants, interfaces, and whether components do what they claim with high reliability.

Production Experience and Results

Leiulo shares concrete experience from his work maintaining open-source Rust crates, particularly the Artichoke Ruby interpreter project. He specifically mentions exploiting automations in the Codex app to take his hands off the wheel for maintenance tasks on projects like artichoke-randmt, a Mersenne Twister implementation. While he hasn’t yet implemented the full review agent layer in these projects, the patterns are being actively deployed.

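The overall result is a development process in which agents self-correct through static guardrails, exhaustive tests, and automated review, the harness keeps improving through systematic capture of human feedback and failed builds, and teams can operate largely headless with minimal human intervention.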

Critical Considerations and Balanced Assessment

While Leiulo’s presentation is compelling and clearly comes from real production experience, several considerations warrant attention:

The approach requires significant upfront investment in creating the harness infrastructure—the agents.md files, review personas, structural tests, and feedback capture systems. Teams need to assess whether their development velocity and scale justify this investment versus simpler prompt engineering approaches.

The reliance on auto-compaction and long-running agent sessions (6-36 hours) assumes access to cutting-edge models with these capabilities. Teams using different models or with tighter latency requirements may need to adapt these patterns significantly.

The “shift right” philosophy directly contradicts decades of DevOps wisdom about catching issues early when they’re cheapest to fix. While Leiulo’s reasoning is sound for the specific context of AI agents with cheap code production, teams need to carefully evaluate whether this applies to their specific circumstances or whether some hybrid approach better suits their needs.

The treatment of agents as teammates who deserve trust assumes a level of model capability and reliability that may not hold across all tasks and domains. Critical systems may require more extensive human oversight regardless of how well-structured the harness becomes.

Finally, the continuous refinement process of capturing all feedback and “dreaming over it” nightly requires significant automation infrastructure and likely substantial compute resources. The ROI of this investment depends heavily on team size, development velocity, and the repetitiveness of the work being automated.

Nevertheless, the harness engineering framework represents a sophisticated and thoughtful approach to productionizing AI coding agents, grounded in real experience and offering concrete patterns that teams can adopt incrementally as they scale their use of AI in software development.
