Langchain's approach to production AI agents focuses on "harness engineering" - the practice of wrapping LLMs with context engineering, prompting, tools, verification systems, and orchestration logic to solve specific tasks. The team has developed open-source infrastructure including Deep Agents and comprehensive evaluation frameworks to help developers build task-specific agents that improve over time through continual learning loops. By treating agents as "model plus harness," they've achieved significant improvements on public benchmarks (for example, moving from top 30 to top 5 on Terminal Bench 2.0 through harness optimization alone) while emphasizing that production success requires custom harnesses tailored to specific customer use cases rather than relying solely on frontier model capabilities.
Langchain has developed a comprehensive approach to building production-ready AI agents that centers on the concept of “harness engineering.” This case study, presented through a technical discussion with a lead engineer on Langchain’s open-source agents team, provides deep insights into how the company thinks about LLM productionization, agent architecture, and continuous improvement systems for AI applications at scale.
The core philosophy articulated is that an agent equals a model plus a harness, where the harness encompasses every piece of code, configuration, and execution logic that isn’t the model itself. This framing helps teams systematically think about what needs to be optimized when building production agents, since the model is largely a black box computation unit that processes context and generates tokens, while everything surrounding it can be engineered and improved.
Langchain’s harness engineering approach rests on several key pillars that have emerged from extensive production experience and research. The first and most critical element is prompting and instruction design. Despite proclamations that “prompting is dead,” the team has found that prompting has become more important than ever with more capable models. This extends beyond system prompts to include tool descriptions, skill front matter that explains when and how to use various capabilities, and sub-agent specifications that guide orchestration decisions.
The second pillar is verification systems built into the harness itself. Drawing from their work on coding agents and Terminal Bench evaluations, they’ve found that agents are highly susceptible to taking shortcuts in self-verification, often testing only trivial cases rather than performing thorough validation. The solution involves teaching agents better verification primitives and injecting external verification signals through hooks that run evaluation suites before allowing agents to exit tasks. This creates feedback loops that force agents to confront failures and iterate toward correct solutions.
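To make this concrete, here is a minimal sketch of an exit gate in that spirit: a hook that runs an external test command before the agent is allowed to finish, feeding failures back into the loop. The function names and the `agent_step` callback are illustrative assumptions, not Langchain's actual API.

```python
import subprocess
from typing import Callable, Optional

def run_verification(test_cmd: list[str]) -> tuple[bool, str]:
    """Run an external test suite and report (passed, combined output)."""
    result = subprocess.run(test_cmd, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def exit_gate(agent_step: Callable[[Optional[str]], None],
              test_cmd: list[str], max_attempts: int = 3) -> bool:
    """Refuse to let the agent exit until external verification passes,
    injecting the failure output as feedback on each retry."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        agent_step(feedback)  # agent works on (or repairs) the task
        passed, output = run_verification(test_cmd)
        if passed:
            return True
        feedback = f"Verification failed. Fix these errors:\n{output[-2000:]}"
    return False
```

Called as, say, `exit_gate(step_fn, ["pytest", "-q"])`, this forces the agent to confront real test failures rather than relying on its own optimistic self-assessment.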
The third major pillar is orchestration and context engineering for problem decomposition. The team has found that effective long-horizon task completion requires breaking problems into manageable subtasks that can be handled within context windows of roughly fifty thousand to two hundred thousand tokens. Beyond this range, models enter what they call the “dumb zone” where performance degrades significantly. This necessitates intelligent problem decomposition with sub-agents serving as specialized compute units for different aspects of complex tasks.
One of the more surprising findings from Langchain’s production work is the critical importance of file systems as a harness primitive. Models have proven exceptionally good at using file system interfaces, likely due to extensive post-training on such patterns. File systems serve multiple crucial functions in production agents: they provide persistent storage that agents can access and write to outside the computational boundary of the context window, they offer excellent context management by allowing agents to offload information rather than polluting the context window, and they enable multi-agent collaboration by serving as shared workspaces where sub-agents can write outputs that main agents can selectively read.
The team has generalized this concept into “virtual file systems” that expose file system operations while using various backend storage systems like S3 or Postgres. This abstraction allows agents to use familiar file system patterns while the underlying storage can be optimized for specific use cases. The emphasis on file systems reflects a broader principle: agents need mechanisms to manage information flow across the computational boundary of their context window, and file systems provide an interface that models already understand well.
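A minimal sketch of the virtual file system idea follows, assuming a simple key-value backend; the class names are hypothetical and Langchain's actual implementation differs, but the pattern of exposing familiar file operations over pluggable storage is the point.

```python
from abc import ABC, abstractmethod

class StorageBackend(ABC):
    """Pluggable storage behind a file-style interface (memory, S3, Postgres, ...)."""
    @abstractmethod
    def get(self, key: str) -> str: ...
    @abstractmethod
    def put(self, key: str, value: str) -> None: ...
    @abstractmethod
    def keys(self) -> list[str]: ...

class InMemoryBackend(StorageBackend):
    def __init__(self):
        self._data: dict[str, str] = {}
    def get(self, key: str) -> str:
        return self._data[key]
    def put(self, key: str, value: str) -> None:
        self._data[key] = value
    def keys(self) -> list[str]:
        return list(self._data)

class VirtualFileSystem:
    """Exposes read/write/ls to the agent as tools, regardless of backend."""
    def __init__(self, backend: StorageBackend):
        self.backend = backend
    def read_file(self, path: str) -> str:
        return self.backend.get(path)
    def write_file(self, path: str, content: str) -> None:
        self.backend.put(path, content)
    def ls(self, prefix: str = "") -> list[str]:
        return [k for k in self.backend.keys() if k.startswith(prefix)]
```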
Langchain’s production experience has led to nuanced views on model selection that challenge simplistic assumptions about always using the most powerful models. Their work on Terminal Bench revealed that running the most advanced reasoning models throughout an entire task can actually yield worse results than strategic model selection. They found that extended reasoning modes used end-to-end can cause timeouts because the models think too long, and the approach is neither compute-efficient nor cost-efficient.
The solution is what they call a “reasoning sandwich” - using more powerful reasoning models for planning and verification phases while employing faster, cheaper models for execution. For example, they’ve found GPT models particularly strong at planning tasks, while Gemini Flash offers excellent performance for multimodal work at much better speed and cost characteristics. This selective model deployment based on task phases represents a more sophisticated approach than simply throwing the most powerful model at every problem.
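A sketch of the reasoning sandwich pattern is below, with placeholder model names and a generic `call_llm` callable standing in for whatever client your stack provides; none of these identifiers come from Langchain's libraries.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Phase:
    model: str
    effort: str  # reasoning effort / thinking budget

# Illustrative choices: strong reasoning models for planning and
# verification, a fast cheap model for the execution middle.
SANDWICH = {
    "plan":    Phase("strong-reasoning-model", "high"),
    "execute": Phase("fast-cheap-model", "low"),
    "verify":  Phase("strong-reasoning-model", "high"),
}

def run_task(task: str, call_llm: Callable[[str, str, str], str]) -> str:
    """call_llm(model, effort, prompt) is a stand-in for your LLM client."""
    plan = call_llm(SANDWICH["plan"].model, SANDWICH["plan"].effort,
                    f"Write a step-by-step plan for: {task}")
    result = call_llm(SANDWICH["execute"].model, SANDWICH["execute"].effort,
                      f"Execute this plan:\n{plan}")
    review = call_llm(SANDWICH["verify"].model, SANDWICH["verify"].effort,
                      f"Verify this result against the plan:\n{plan}\n\n{result}")
    return review
```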
A major focus of Langchain’s current work involves building infrastructure for continual learning - enabling agents to improve over time based on the data they generate during operation. This addresses a fundamental challenge: agents running in production generate massive amounts of trace data, potentially millions of tokens every few minutes, and this data needs to be systematically mined for insights that can improve future performance.
The continual learning approach operates on two parallel tracks. The first involves context engineering to selectively inject relevant learnings from past traces into agent context windows at appropriate times. This requires sophisticated search capabilities to sift through vast amounts of trace data and distill useful patterns into “memory nuggets” that can be retrieved when similar situations arise. The team emphasizes that not all trace data is valuable - much of it is noise that must be filtered out to extract actionable insights.
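As an illustration of the first track, a naive version might look like the following; a production system would use embeddings or a search index over trace data rather than keyword overlap, and the `MemoryNugget` structure is an assumption for the sketch.

```python
from dataclasses import dataclass

@dataclass
class MemoryNugget:
    situation: str   # when this learning applies
    lesson: str      # the distilled insight from past traces

def retrieve_nuggets(task: str, store: list[MemoryNugget], k: int = 3):
    """Naive keyword-overlap retrieval over distilled memories."""
    words = set(task.lower().split())
    scored = sorted(store,
                    key=lambda n: len(words & set(n.situation.lower().split())),
                    reverse=True)
    return scored[:k]

def inject_memories(system_prompt: str, task: str,
                    store: list[MemoryNugget]) -> str:
    """Append relevant past learnings to the agent's system prompt."""
    nuggets = retrieve_nuggets(task, store)
    if not nuggets:
        return system_prompt
    lessons = "\n".join(f"- {n.lesson}" for n in nuggets)
    return f"{system_prompt}\n\nRelevant learnings from past runs:\n{lessons}"
```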
The second track involves fine-tuning open models using feedback distilled from traces. Langchain has leaned heavily into open models like DeepSeek and the GLM series, which can be fine-tuned with task-specific data to create specialized agents that outperform general-purpose frontier models for particular use cases while offering better cost and latency characteristics. The process involves using trace data to generate training examples, often applying reinforcement learning techniques to imbue models with vertical-specific skills.
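A hedged sketch of the data preparation step for this second track: filtering traces down to verified successes and exporting them as chat-format examples for supervised fine-tuning. The trace fields (`verified`, `system_prompt`, `task`, `final_answer`) are assumptions about what a trace record might contain.

```python
import json

def traces_to_sft_examples(traces: list[dict], out_path: str = "sft_data.jsonl") -> None:
    """Keep only traces that passed verification and convert them into
    chat-format training examples for supervised fine-tuning."""
    with open(out_path, "w") as f:
        for trace in traces:
            if not trace.get("verified", False):
                continue  # noisy or failed runs are filtered out
            example = {"messages": [
                {"role": "system", "content": trace["system_prompt"]},
                {"role": "user", "content": trace["task"]},
                {"role": "assistant", "content": trace["final_answer"]},
            ]}
            f.write(json.dumps(example) + "\n")
```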
The team is careful to note that both approaches - context engineering and fine-tuning - will coexist rather than one replacing the other. Some capabilities like fact-based retrieval are better handled through search and context injection, while others benefit from being embedded in model weights through fine-tuning. The key is building infrastructure that makes both paths easy to execute based on the nature of the capability being optimized.
Langchain takes a pragmatic stance on evaluation that balances the use of public benchmarks with the critical importance of custom evaluations. While they actively track performance on benchmarks like Terminal Bench and SWE-bench, they emphasize that these serve primarily as rough proxies for general capabilities like problem-solving or long-horizon planning. The team tags all evaluations to capability axes like retrieval, planning, or tool use to understand what different benchmarks actually measure.
However, the real value comes from building custom evaluation suites that map directly to customer use cases. Public benchmarks rarely align precisely with what production agents need to accomplish, so teams building serious agent systems must invest in creating representative evaluation sets that reflect their actual tasks. These custom benchmarks serve multiple purposes: they ground auto-research and harness optimization loops, they provide verifiable specifications that define agent behavior, and they enable measuring improvement over time in ways that matter for actual customers.
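A minimal sketch of capability-tagged evaluation in the spirit described above; the `EvalCase` structure and axis names are illustrative rather than Langchain's schema.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalCase:
    name: str
    task: str
    check: Callable[[str], bool]  # passes/fails the agent's output
    capabilities: list[str] = field(default_factory=list)  # e.g. ["retrieval", "planning"]

def run_suite(cases: list[EvalCase],
              run_agent: Callable[[str], str]) -> dict[str, float]:
    """Report pass rates per capability axis, so a score says *what*
    improved (planning, tool use, ...) rather than just how much."""
    totals: dict[str, int] = {}
    passes: dict[str, int] = {}
    for case in cases:
        ok = case.check(run_agent(case.task))
        for cap in case.capabilities:
            totals[cap] = totals.get(cap, 0) + 1
            passes[cap] = passes.get(cap, 0) + int(ok)
    return {cap: passes[cap] / totals[cap] for cap in totals}
```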
The evaluation infrastructure at Langchain centers on Langsmith, their tracing and evaluation platform. All agent traces flow into Langsmith where they can be analyzed for errors, segmented by performance characteristics, and used to generate new evaluations. This creates a flywheel where production usage generates data that informs evaluation creation, which drives harness improvements, which improve production performance.
Beyond high-level architectural principles, Langchain has developed specific technical primitives that make harness engineering more practical. One of the most underrated features they’ve developed is middleware or hooks - the ability to inject deterministic code execution at specific points in the agent’s execution flow. This allows developers to control or augment model behavior in systematic ways.
Hooks prove particularly valuable for triggering verification steps, managing context flow, and preventing bad model behavior before it causes problems. For example, a hook might automatically trigger a self-verification routine when an agent attempts to complete a task, or it might implement context management strategies like taking only the head and tail of large tool outputs while storing the full output in a file system for later access if needed.
Tool call offloading represents another critical harness primitive. When agents execute shell commands or other tools that generate massive outputs, naively piping all that content into the context window can quickly lead to context pollution and degraded performance. Instead, harnesses can deterministically process tool outputs to extract key information, store full outputs externally, and inform the model about where to find additional details if needed. This protects the precious computational boundary of the context window from noise.
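A deterministic offloading rule of this kind can be implemented in a few lines. This sketch (illustrative names, not Langchain's middleware API) keeps the head and tail of a large output in context and writes the full text to disk for later access.

```python
import hashlib
import pathlib

def offload_tool_output(output: str, workdir: str = ".agent_outputs",
                        head: int = 2000, tail: int = 2000) -> str:
    """Deterministic context protection: keep head+tail of a large tool
    output in context, write the full output to a file the agent can
    read later if it needs the details."""
    if len(output) <= head + tail:
        return output
    pathlib.Path(workdir).mkdir(exist_ok=True)
    name = hashlib.sha1(output.encode()).hexdigest()[:12] + ".txt"
    path = pathlib.Path(workdir) / name
    path.write_text(output)
    return (output[:head]
            + f"\n... [{len(output)} chars total; full output saved at {path}] ...\n"
            + output[-tail:])
```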
As tasks become more complex and long-horizon, Langchain has invested heavily in multi-agent orchestration patterns. The fundamental insight is that long-horizon tasks are essentially sequences of shorter subtasks, and the key challenge is decomposing problems appropriately and verifying intermediate steps. Multi-agent systems allow this decomposition by assigning specialized sub-agents to handle bounded subtasks while main agents coordinate overall progress.
The file system primitive becomes especially important in multi-agent contexts, serving as a shared collaboration space where sub-agents can write outputs that main agents selectively consume. This prevents the main agent’s context window from being polluted with all the intermediate work while still allowing access to results when needed. The team has found that effective decomposition requires keeping subtasks within manageable context window sizes and selecting appropriate models for different types of work.
Skills represent another key abstraction for multi-agent coordination. Skills encode workflows with clear descriptions of when and how to use them, and the shareable nature of skills has unexpectedly led to better documentation than traditional tool descriptions. Developers seem more motivated to write clear skill descriptions when they know others will use them, resulting in better agent performance. The challenge with skills remains triggering the right skill at the right time, which again comes down to instruction following and is measured through skills-specific evaluations.
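Concretely, a skill is often just a markdown file whose front matter tells the agent when to trigger it. A hypothetical example follows, with a toy parser that separates the front matter from the workflow body; both the skill content and the parser are illustrative assumptions.

```python
SKILL = """\
---
name: triage-bug-report
description: Use when a new bug report arrives; reproduce the issue, label severity, and draft a fix plan.
---
1. Reproduce the bug using the steps in the report.
2. Assign a severity label based on user impact.
3. Draft a fix plan and link related issues.
"""

def parse_skill(text: str) -> tuple[dict[str, str], str]:
    """Split a skill file into front matter (used to decide *when* to
    trigger the skill) and its body (the workflow to follow)."""
    _, front, body = text.split("---", 2)
    meta: dict[str, str] = {}
    for line in front.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.strip()
```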
Langchain’s approach to product development reflects a strong open-source philosophy where learnings from open-source work inform commercial product development and vice versa. The team spends significant time on open-source projects like Deep Agents, Langchain, and Langgraph, publishing research and engaging with the community to understand what builders need. This generates valuable feedback that shapes both open-source libraries and commercial products like Langsmith.
The emphasis on developer experience and direct feedback channels is central to their growth strategy. Rather than focusing narrowly on revenue metrics, they prioritize making it easy for developers to build impressive applications using their tools. This manifests in comprehensive documentation, open research publications sharing evaluation results and methodologies, and active engagement on platforms like Twitter where developers share their experiences and challenges.
This philosophy extends to their stance on model openness. The team is highly bullish on open models, actively using and recommending models like GLM-5 and Minimax for production workloads. They see open models as increasingly viable alternatives to frontier closed models, especially when combined with fine-tuning for specific tasks. The ability to customize model weights through fine-tuning represents a significant advantage for teams with good data in vertical domains.
A recurring theme throughout Langchain’s production work is the challenge of context management and avoiding what they call “context rot” - the degradation of model performance as context windows fill up. Research has shown that models perform progressively worse as context utilization increases, making it critical to protect the context window from unnecessary information.
Langchain employs multiple strategies to combat context rot. Context compaction involves summarizing or extracting key information from longer texts before injecting into context. Tool call offloading stores large tool outputs externally while providing summaries to the model. Progressive disclosure reveals information incrementally as needed rather than front-loading everything. Sub-agent decomposition distributes cognitive load across multiple context windows rather than overloading a single agent.
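A sketch of the compaction strategy under simple assumptions (chat messages as role/content dicts, a generic `call_llm` summarizer injected by the caller):

```python
from typing import Callable

def compact_history(messages: list[dict], call_llm: Callable[[str], str],
                    max_chars: int = 50_000, keep_recent: int = 10) -> list[dict]:
    """When the running transcript gets too large, replace older turns
    with an LLM-written summary and keep only recent turns verbatim."""
    total = sum(len(m["content"]) for m in messages)
    if total <= max_chars or len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in old)
    summary = call_llm("Summarize the key facts, decisions, and open "
                       f"tasks in this transcript:\n{transcript}")
    return [{"role": "system",
             "content": f"Summary of earlier work:\n{summary}"}] + recent
```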
The team also emphasizes deterministic context management strategies that don’t rely on model intelligence. For example, a harness can automatically keep only the first and last portions of a bash command’s output, store the full output in a file, and tell the model where to find it. These simple rules protect against common failure modes while giving models escape hatches to access full information when truly needed.
Langchain’s infrastructure supports multiple deployment patterns reflecting different production needs. Some customers run background coding agents that work for hours on complex refactoring tasks, processing codebases and generating changes over extended periods. Others deploy customer service agents that need fast responses and integration with existing data systems. Still others build research agents that combine web search, document analysis, and synthesis over long time horizons.
The unifying pattern across successful deployments is the investment in custom harnesses tailored to specific tasks. Teams that succeed don’t simply deploy general-purpose agents and hope for good results. Instead, they carefully engineer prompts, select appropriate models for different phases of work, implement verification systems relevant to their domain, and build evaluation suites that measure performance on tasks that actually matter to their users.
The team has also observed that the most successful implementations involve tight integration between harness engineering and data flywheel construction. Production usage generates traces, traces are monitored for errors and edge cases, errors inform evaluation creation, evaluations drive harness improvements, and improved harnesses lead to better production performance. This cycle of continuous improvement, enabled by proper instrumentation and evaluation infrastructure, separates successful production deployments from experimental proofs of concept.
Looking forward, Langchain sees several key trends shaping the future of production AI agents. The first is continued growth in vertical specialization, with more companies building task-specific agents fine-tuned on domain data rather than relying solely on general-purpose models. This reflects the reality that today’s models, while impressive, still require significant customization to excel at particular tasks.
The second trend is the maturation of continual learning infrastructure. The team expects significant advances in systems that automatically mine trace data, generate evaluations, and improve agents with minimal human intervention. While current systems require human oversight, the vision is increasingly autonomous improvement loops that maintain human alignment while reducing manual effort.
The third trend is the convergence of harness optimization and model post-training. Projects exploring meta-harnesses that automatically optimize harness configurations, combined with systems that generate fine-tuning data from production traces, point toward a future where the boundaries between harness engineering and model training become increasingly blurred. Both serve the same goal of fitting systems to specific tasks, and tools that seamlessly integrate both approaches will likely emerge.
Finally, the team anticipates continued growth in simulation and environment-based evaluation. As agents tackle more complex, long-horizon tasks, evaluations will increasingly involve full simulated environments rather than simple input-output pairs. This shift requires better infrastructure for building, running, and analyzing results from environment-based evaluations, an area where Langchain is actively investing.
The Langchain experience offers several critical lessons for teams building production LLM systems. First, model capabilities alone are insufficient - successful production systems require thoughtful engineering of everything surrounding the model. Second, custom evaluations mapped to actual use cases matter far more than performance on generic benchmarks. Third, file systems and similar primitives for managing information flow across context boundaries are foundational to effective agent design. Fourth, strategic model selection based on task phases often outperforms using the most powerful model throughout. Fifth, building infrastructure for continual learning and improvement is essential for long-term success.
Perhaps most importantly, the team emphasizes starting simple with basic harness engineering before investing in more complex approaches like fine-tuning. Many problems can be solved with good prompting, appropriate tools, and basic verification systems. Only when these approaches prove insufficient should teams invest in heavier-weight solutions like model fine-tuning or elaborate multi-agent orchestration. This pragmatic, iterative approach to building production systems reflects hard-won lessons from extensive real-world deployment experience.
Notion, a knowledge work platform serving enterprise customers, spent multiple years (2022-2026) iterating through four to five complete rebuilds of their agent infrastructure before shipping Custom Agents to production. The core problem was enabling users to automate complex workflows across their workspaces while maintaining enterprise-grade reliability, security, and cost efficiency. Their solution involved building a sophisticated agent harness with progressive tool disclosure, SQL-like database abstractions, markdown-based interfaces optimized for LLM consumption, and a comprehensive evaluation framework. The result was a production system handling over 100 tools, serving majority-agent traffic for search, and enabling workflows like automated bug triaging, email processing, and meeting notes capture that fundamentally changed how their company and customers operate.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (from single-digit accuracy to roughly 80%), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models, which hallucinated non-existent insurance products 15-45% of the time.
This case study captures insights from Lance Martin, ML engineer at Langchain, discussing the evolution from traditional ML to LLM-based systems and the emerging engineering discipline of building production GenAI applications. The discussion covers key challenges including the shift from model training to model orchestration, the need to continuously rearchitect systems as foundation models rapidly improve, and the critical importance of context engineering to manage token usage and prevent context degradation. Solutions explored include workflow versus agent architectures, the three-part context engineering playbook (reduce, offload, isolate), and evaluation strategies that emphasize user feedback and tracing over static benchmarks. Results demonstrate that teams like Manus have rearchitected their systems five times since March 2025, and that simpler approaches with proper observability often outperform complex architectures, with the understanding that today's solutions must be rebuilt as models improve.