ZenML

Evolution from Context Engineering to Harness Engineering: Philosophical and Practical Approaches to Building Production LLM Systems

Boundary / LangChain / HumanLayer 2026

This case study presents a comprehensive discussion between engineers from LangChain and creators of the Ralph/Wim Loop system about the evolution of production LLM systems from basic agent loops to sophisticated harness engineering. The discussion addresses the fundamental shift from context engineering (where developers manually craft prompts and tool calls) to harness engineering (where models are reinforcement-learned to work optimally with specific tool sets and execution environments). The participants explore the tradeoffs between building custom harnesses versus using existing frameworks, the importance of evaluation-driven development, and the ongoing tension between automated code generation and deep systems understanding. They conclude that while newer abstraction layers provide faster time-to-value, understanding the underlying primitives remains essential for production engineering excellence.

Industry: Tech

Overview

This case study emerges from a live podcast discussion at AI Engineer Miami featuring engineers from LangChain (Viv) and creators of the Ralph/Wim Loop orchestration system (Jeff), along with other AI engineering practitioners. The conversation provides a detailed exploration of how production LLM systems have evolved from simple chat completions to sophisticated harness-engineered agents, with particular focus on the practical and philosophical considerations that teams face when deploying these systems at scale.

The discussion centers on a fundamental architectural evolution in LLMOps: the transition from agents built purely through context engineering to agents that leverage purpose-built harnesses with reinforcement-learned models. The participants represent both framework builders and practitioners, offering insights into when teams should build custom infrastructure versus leveraging existing solutions like Claude Code or LangChain.

Core Architectural Concepts

The conversation establishes a foundational model for understanding agent architectures. The basic 2024-era agent consisted of a recursive loop where developers would send a context window containing tool calls, system messages, and user messages to an LLM. The LLM would output the next step (potentially a tool call), the agent would execute that against a system, append the response to the context, and repeat until reaching a final answer. This architecture required significant custom code to handle tool ingestion, context management, and execution orchestration.
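That loop is small enough to sketch directly. Below is a minimal version with the model call stubbed out; the `call_llm` signature, the message shapes, and the tool registry are illustrative assumptions, not any vendor's actual API.

```python
# Minimal sketch of the 2024-era agent loop: send context, let the model pick
# the next step, execute any tool call, append the result, repeat until the
# model emits a final answer. All interfaces here are illustrative stand-ins.

def run_agent(call_llm, tools, user_message, max_steps=10):
    """Run the recursive tool-calling loop until a final answer or step limit."""
    messages = [
        {"role": "system", "content": "You are a helpful agent."},
        {"role": "user", "content": user_message},
    ]
    for _ in range(max_steps):
        step = call_llm(messages)  # model decides the next step
        if step["type"] == "tool_call":
            result = tools[step["name"]](**step["args"])  # execute against a system
            messages.append(
                {"role": "tool", "name": step["name"], "content": str(result)}
            )  # append the tool response to the context and loop again
        else:
            return step["content"]  # final answer terminates the loop
    raise RuntimeError("agent did not converge within the step budget")
```

Everything the article calls "significant custom code" lives around this core: tool registration, context trimming, and error handling as the loop grows.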

The key insight discussed is that modern harness engineering represents more than just wrapping this basic loop—it involves building or using systems where the model itself has been specifically reinforcement-learned to work optimally with a particular set of tools and execution patterns. Claude Code serves as the primary example: it includes not just the LLM, but batteries-included features like automatic context management, built-in file system operations, Model Context Protocol (MCP) support for extensibility, and specialized editing tools that the model has been specifically trained to use effectively.

The Harness Engineering Revolution

A critical technical distinction emerges around what makes harness engineering fundamentally different from earlier agent engineering. The participants identify that the breakthrough came when teams began reinforcement learning models specifically for their harness toolsets. Claude Code, for instance, uses a find-and-replace editing tool with old-string/new-string syntax, while Codex uses git patch-style editing with plus/minus lines. The models underlying these systems have been RL-trained in specific environments that reward successful use of these particular tools.

This creates a profound implication: if you attempt to use Claude Code’s model in the Codex harness, performance degrades significantly because the model’s weights have been specifically optimized for its native tool signatures. OpenAI’s GPT-OSS 120B model, for example, excels at applying patches but has no training on old-string/new-string operations. This specialization means that teams who control both the model and the harness have significant advantages—they can co-evolve the tool design and model behavior to achieve superior performance on their specific domain.
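The two editing-tool shapes can be contrasted in code. Both functions below are simplified illustrations of the signatures described above, find-and-replace versus patch hunks, not the real Claude Code or Codex implementations; the patch applier is also simplified to anchor its hunk at the start of the file.

```python
# Two editing-tool signatures a model can be RL-trained against. A model
# optimized for one shape has no learned behavior for the other.

def edit_find_replace(text: str, old_string: str, new_string: str) -> str:
    """Claude Code-style edit: replace exactly one occurrence of old_string."""
    if text.count(old_string) != 1:
        raise ValueError("old_string must match exactly once")
    return text.replace(old_string, new_string, 1)

def edit_patch(text: str, hunk: list[str]) -> str:
    """Codex-style edit: a hunk of ' ' context, '-' removed, '+' added lines.

    Simplified: the hunk is assumed to apply from the first line of the file.
    """
    src = text.splitlines()
    out, i = [], 0
    for line in hunk:
        tag, body = line[0], line[1:]
        if tag == " ":
            assert src[i] == body, "context line mismatch"
            out.append(body)
            i += 1
        elif tag == "-":
            assert src[i] == body, "removed line mismatch"
            i += 1
        elif tag == "+":
            out.append(body)  # inserted line; source cursor stays put
    out.extend(src[i:])  # keep the untouched remainder of the file
    return "\n".join(out)
```

Both calls below produce the same edit, but they demand entirely different output formats from the model, which is exactly why weights tuned for one harness transfer poorly to the other.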

The discussion draws an analogy to assembly language programming in high-performance computing. Most developers should trust the compiler (or in this case, frontier labs’ harness teams of 40-50 engineers working on Claude Code or similar systems). However, specific situations warrant custom optimization: when you understand something unique about your data patterns, when you have domain-specific constraints, or when you can measure through evaluations that your custom approach provides material improvement. The hardest part is not building the custom harness—it’s identifying which parts of your system genuinely warrant that investment.

Nested Orchestration and Abstraction Layers

The conversation introduces a compelling mental model for thinking about intelligence levels in LLM systems: nested while loops. At the most basic level, you have a while loop that calls an LLM and executes tools. The next level of abstraction adds another while loop around this, providing an execution environment with features like context compression and MCP integration—this is the harness layer exemplified by Claude Code. The third level adds sub-agents, where the harness can spawn disposable agents for specific subtasks, each with their own while loop. The fourth level introduces orchestrators like Ralph or Gas Town that manage multiple harness instances.
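The nesting can be rendered as toy loops. Everything below is a stand-in (the model call, the compaction step, the coordination strategy); the point is only the shape, with each level wrapping the one below in another loop.

```python
# Toy rendering of the nested-while-loop mental model. All names and logic
# here are illustrative stand-ins, not any real harness's internals.

def inner_loop(llm, steps):
    """Level 1: the bare agent loop, calling the model until the task ends."""
    return [llm(step) for step in steps]  # stand-in for "loop until final answer"

def harness_loop(llm, tasks):
    """Level 2: the harness wraps the inner loop and manages context."""
    context = []
    for task in tasks:
        context += inner_loop(llm, task)
        context = context[-4:]  # crude stand-in for context compaction
    return context

def orchestrator_loop(llm, workstreams):
    """Levels 3-4: spawn a disposable harness per workstream and coordinate."""
    return {name: harness_loop(llm, tasks) for name, tasks in workstreams.items()}
```

Each added layer buys more autonomous work per invocation at the cost of more compute per invocation, which is the tradeoff the participants warn about.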

Each additional layer of abstraction provides higher-level intelligence by enabling more autonomous work to happen within each invocation. However, the participants strongly emphasize that developers should exhaust the possibilities at each abstraction level before adding another layer. Adding orchestration layers is fundamentally about throwing more compute at a problem—sometimes necessary, but often a way to avoid the harder work of properly engineering the context, tools, and prompts at the current layer.

Jeff introduces the concept of the “outer harness”—orchestration that sits above even sophisticated systems like Claude Code. This might be as simple as a bash script that continuously runs the agent with specific prompts, or as complex as systems like Ralph/Wim Loop that handle memory allocation, instruction distribution, and coordination across multiple harness instances. The key insight is that the outer harness doesn’t need to resemble the inner harness at all; it’s a distinct engineering problem focused on coordination, permissioning, configuration management, and resource allocation.
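A deliberately minimal outer harness, in the spirit of the bash-script example above, can be sketched as a rerun loop around a CLI agent. The command, the "DONE" completion marker, and the retry policy below are illustrative assumptions, not any tool's documented interface.

```python
# Minimal "outer harness": rerun a CLI agent with a fixed prompt until a
# completion signal appears. The completion marker and retry policy are
# assumptions for illustration.

import subprocess
import time

def outer_harness(cmd, prompt, max_runs=20, pause_s=5.0):
    """Run `cmd + [prompt]` repeatedly until it exits cleanly and prints DONE."""
    for _ in range(max_runs):
        proc = subprocess.run(cmd + [prompt], capture_output=True, text=True)
        if proc.returncode == 0 and "DONE" in proc.stdout:
            return proc.stdout  # agent reported completion
        time.sleep(pause_s)  # back off, then rerun the agent from scratch
    raise TimeoutError("outer harness exhausted its run budget")
```

Systems like Ralph layer memory allocation, instruction distribution, and coordination on top of exactly this kind of rerun loop.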

The Bitter Lesson and Time Compression

A significant portion of the discussion addresses the “bitter lesson” from AI research—the observation that general methods leveraging computation tend to outperform hand-crafted solutions as resources scale. Multiple participants push back against overly rigid interpretations of this lesson in the context of LLMOps. Their core argument centers on time compression: while models inevitably improve and make custom code obsolete, teams that can effectively harness current models advance faster than model capabilities improve.

The analogy to performance engineering proves instructive. Hardware has dramatically improved over 10-15 years, yet performance engineers command higher salaries today than ever before, and finding skilled practitioners is harder. The reason is that as the baseline improves, the delta between good and excellent engineering becomes more valuable. Similarly, while Claude or GPT models improve every six months, teams that develop systematic approaches to context engineering, evaluation, and harness configuration can maintain a 5-10% advantage that compounds into significant business value.

The participants identify a crucial skill: recognizing which parts of your system warrant deep optimization versus which should use off-the-shelf solutions. Drawing on experience with compiler optimization, they note that even extreme experts rarely beat modern compilers, but occasionally someone understands a data pattern or cache locality issue that enables hand-written assembly to dramatically outperform. The same principle applies to harness engineering—most teams should use Claude Code or similar systems, but specific high-value use cases justify custom development when measurable through proper evaluation frameworks.

Evaluation-Driven Development

The conversation repeatedly returns to evaluation as the cornerstone of production LLM systems. The participants advocate for an evaluation-first approach where teams build metrics before writing optimization code. This enables objective assessment of whether custom context engineering or harness modifications actually improve performance versus just adding complexity.

Auto-research systems receive particular attention as an emerging pattern where evals encode desired behavior and systems automatically explore prompt and tool configurations to optimize performance. However, the participants caution against treating this as fully automated. They’ve observed cases where auto-research systems overfit to evaluation sets, essentially enumerating specific cases in system prompts rather than learning generalizable patterns. The solution is keeping humans in the loop—engineers should examine the actual data, review the prompts being generated, and ensure the system is learning principles rather than memorizing examples.
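One concrete guard against that overfitting failure mode is holding out evaluation cases that the search loop never sees and comparing scores. The configuration search, the scoring function, and the agent stub below are all illustrative assumptions, not a specific auto-research product.

```python
# Hold-out guard for auto-research over prompt/tool configurations: pick the
# best config on a training slice, then report its score on a held-out slice
# the search never saw. A large train/holdout gap suggests the search
# memorized cases rather than learning a generalizable pattern.

import random

def evaluate(config, cases, run_agent):
    """Fraction of eval cases the agent passes under this configuration."""
    return sum(run_agent(config, c["input"]) == c["expected"] for c in cases) / len(cases)

def auto_research(configs, cases, run_agent, holdout_frac=0.3, seed=0):
    """Search configs on a train split; return train and held-out scores."""
    shuffled = cases[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * holdout_frac)
    holdout, train = shuffled[:cut], shuffled[cut:]
    best = max(configs, key=lambda cfg: evaluate(cfg, train, run_agent))
    return best, evaluate(best, train, run_agent), evaluate(best, holdout, run_agent)
```

The human-in-the-loop step the participants describe sits on top of this: an engineer still reads the winning prompt and the traces behind both scores.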

The discussion references development practices from companies like Facebook and Google, where every feature deployed to production has an associated metric and engineers are expected to actively monitor at deployment time. This creates accountability and rapid feedback loops. The equivalent in LLM systems is having clear metrics for agent behavior, deploying incrementally, and having engineers engaged with production traces to understand when things break and why.

Practical Engineering Principles

Several concrete engineering recommendations emerge throughout the discussion. First, every engineer working with LLMs should build a basic agent from scratch at least once—registering tools, crafting system prompts, and implementing the execution loop. This foundational understanding proves essential even when using higher-level abstractions, because you cannot effectively debug or optimize systems whose primitives you don’t understand.

Second, the participants advocate for designing systems to be “easy to delete.” In a rapidly evolving field where models improve every six months, code that seemed essential often becomes obsolete. Teams should be thoughtful about what capabilities they expose to users as product features versus internal implementation details, because user-facing features create dependencies that prevent deleting technical debt when models improve.

Third, context engineering and harness configuration should be approached with the same rigor as traditional systems engineering. This means understanding file system abstractions, memory management, message passing patterns, and how to compose tools effectively. The participants note that frameworks like Claude Code are essentially rebuilding concepts from Erlang (disposable heaps of memory, message passing between processes) but in the context of LLM agents. Understanding these classical computer science patterns helps engineers reason about modern agent systems.

Fourth, flexibility and continuous experimentation matter more than locking into specific frameworks. The field evolves too rapidly to commit fully to any particular approach. Engineers should continuously try techniques that feel inefficient or futuristic, because this is how you maintain up-to-date understanding of model capabilities. What seems impossible today may be trivial in three months.

Context Management and Skill Engineering

The participants discuss sophisticated approaches to context engineering beyond basic prompting. They emphasize that the harness layer handles many context management decisions that previously required custom code: what happens when you run out of context (compaction strategies, offloading to memory systems), how to structure tool schemas for optimal model performance, and how to surface capabilities through extension points like MCP.

One compelling pattern involves making existing systems conform to tool signatures that models are already excellent at using, rather than fine-tuning models for your custom tools. If Claude Code has been RL-trained on file system operations, you can map your own data systems to present file-like interfaces rather than teaching the model entirely new tool patterns. This leverages the massive investment frontier labs have made in optimizing their models for specific tool schemas.
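A sketch of that pattern: rather than teaching the model a bespoke database tool, present the data system through the file operations the model is already trained on. The in-memory store, the hypothetical `/data/*.json` path scheme, and the method names below are illustrative assumptions.

```python
# Map a key-value data system onto file-like tool signatures, so a model
# RL-trained on file operations can use it without learning a new tool.

import json

class FileLikeStore:
    """Expose a key-value data system as list/read file operations."""

    def __init__(self, records):
        self._records = records  # e.g. {"cust_42": {...}}

    def list_files(self, prefix=""):
        """Tool-facing equivalent of `ls` over the mapped records."""
        return sorted(
            f"/data/{key}.json" for key in self._records if key.startswith(prefix)
        )

    def read_file(self, path):
        """Tool-facing equivalent of reading a file the model asked for."""
        key = path.removeprefix("/data/").removesuffix(".json")
        return json.dumps(self._records[key], indent=2)
```

The adapter is trivial code, but it borrows the frontier lab's training investment instead of requiring any fine-tuning on your side.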

The discussion also touches on skills and instruction sets that get bundled into the context. Rather than having the model rediscover approaches, teams encode proven patterns as skills (CLAUDE.md files or system prompts, depending on the framework) that get loaded automatically. This provides consistency and reduces the exploration space the model needs to navigate during execution.
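Skill loading itself can be sketched in a few lines: gather proven instruction files and prepend them to the system prompt so the model starts from them. The directory layout and prompt format below are assumptions for illustration.

```python
# Minimal skill bundling: concatenate CLAUDE.md-style markdown skills into the
# system prompt at startup, instead of letting the model rediscover them.

from pathlib import Path

def build_system_prompt(base_prompt: str, skills_dir: str) -> str:
    """Append every *.md skill in skills_dir to the base system prompt."""
    sections = [base_prompt]
    for skill in sorted(Path(skills_dir).glob("*.md")):
        sections.append(f"## Skill: {skill.stem}\n{skill.read_text()}")
    return "\n\n".join(sections)
```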

Domain Specialization vs. General Purpose

An interesting tension emerges around whether to build general-purpose agents or domain-specific systems. Viv argues strongly that the focus should be on building systems that do useful work for specific customer problems rather than pursuing generality. The measure of success is whether the system solves the target task well, not whether it performs adequately across unrelated domains.

This connects to the evaluation-driven approach: if you have high-quality evaluations derived from production traces that capture the actual work your customers need done, you can fit your harness configuration to pass those evaluations. As models improve, if your evaluations truly capture desired behavior, the system should automatically benefit. This requires accepting that your harness might be specialized and perform poorly on unrelated tasks—which is fine if it excels at what matters for your business.

Advice for New Engineers

The participants offer guidance for people entering the field. The fundamental primitives remain essential: understanding the tool calling loop, being able to draw sequence diagrams showing how inference works, and being able to design tools effectively. These skills define baseline competency in modern AI engineering.

Beyond that, they advocate for depth-driven learning—picking a specific area of AI engineering and going extremely deep on it, potentially writing detailed blog posts and sharing learnings publicly. In today’s environment with rapidly evolving best practices, someone who grinds on a narrow problem for 1-2 months can reach top-20% expertise in that specific area. This depth provides foundation for branching out and builds distribution for your ideas.

Traditional software engineering fundamentals remain relevant: understanding functional programming concepts, hexagonal architecture, property-based testing, library design, and modularity. These matter even more now because LLMs that generate code will replicate whatever patterns exist in your codebase, so thoughtful architecture prevents propagating bad patterns throughout your system.

Finally, they emphasize pair programming and learning from other builders. LLMOps involves significant tacit knowledge and intuition that transfers poorly through documentation but effectively through collaboration. Building in public, sharing approaches, and learning what others discover accelerates growth far beyond individual exploration.

Production Deployment Considerations

While less emphasized than architectural discussions, several production deployment considerations surface. The participants reference permissioning systems, sandboxing for safe execution environments, configuration management, and secret provisioning as harness responsibilities that sit alongside the core agent loop.

The conversation touches on the importance of looking at actual production data rather than relying purely on automated evaluation. Engineers should examine the prompts being generated, review the tool calls being made, and understand the failure modes directly. This human-in-the-loop validation catches issues that metrics miss and builds intuition about model behavior.

The discussion also acknowledges that code generation capability changes the economics of what’s worth optimizing. If you can rapidly generate new implementations to test against evaluations, the cost of writing code decreases substantially relative to the cost of designing the right architecture and metrics. This shifts engineering effort toward the “philosophy engineering” of defining what success means and how to measure it, rather than implementation details.
