**Company:** Anthropic
**Title:** Building Production AI Agents: Lessons from Claude Code and Enterprise Deployments
**Industry:** Tech
**Year:** 2025

**Summary:** Anthropic's Applied AI team shares learnings from building and deploying AI agents in production throughout 2024-2025, focusing on their Claude Code product and enterprise customer implementations. The presentation covers the evolution from simple Q&A chatbots and RAG systems to sophisticated agentic architectures that run LLMs in loops with tools. Key technical challenges addressed include context engineering, prompt optimization, tool design, memory management, and handling long-running tasks that exceed context windows. The team transitioned from workflow-based architectures (chained LLM calls with deterministic logic) to agent-based systems where models autonomously use tools to solve open-ended problems, resulting in more robust error handling and the ability to tackle complex tasks like multi-hour coding sessions.

## Overview

This case study presents Anthropic's comprehensive learnings from building and operating AI agents in production environments throughout 2024 and into 2025. The speaker, Cal, leads Anthropic's Applied AI team, which helps enterprise customers build products on top of Claude models. The presentation draws from two critical sources of operational experience: building Claude Code (Anthropic's terminal-based coding assistant) and working directly with enterprise customers across various industries including finance, healthcare, legal, and technology sectors.

The case study tracks the evolution of LLM applications from simple single-turn prompts and RAG-based Q&A chatbots in early 2024 to sophisticated multi-agent systems capable of running for hours on complex tasks. This evolution was driven by rapid model improvements, with Claude 3 Opus marking a turning point as a frontier model, followed by Claude 3.5 Sonnet, which demonstrated strong coding capabilities, and culminating in Claude Opus 4.5, which scores 80% on the SWE-bench coding benchmark.

## Architectural Evolution: From Workflows to Agents

A central theme of this case study is the architectural shift from workflows to agents. In early 2024, when models were less capable, most production systems used workflow architectures consisting of multiple chained LLM calls interspersed with deterministic logic. Each LLM call handled a specific subtask with its own prompt. The speaker describes working with a customer who had built a system with 50 different prompts chained together for customer support use cases.

Workflows presented two fundamental limitations for production systems. First, they could only handle scenarios explicitly coded into the workflow structure, making them unsuitable for open-ended tasks. Second, they lacked robust error recovery mechanisms—if something unexpected occurred mid-workflow, the system would typically complete execution but produce poor final outputs.

Anthropic defines an agent as an architecture in which an LLM runs in a loop with access to tools, works on an open-ended problem, and decides for itself when the task is complete. This agentic architecture solves both workflow limitations: the model can handle unforeseen edge cases without explicit coding, and Claude demonstrates strong error recovery, recognizing unexpected tool results and adapting its approach accordingly.
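The loop itself is simple enough to sketch. Below is a minimal illustration using the Anthropic Python SDK; the `read_file` tool, the `run_tool` dispatcher, and the model alias are placeholder choices for illustration, not Claude Code's actual implementation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder tool; any domain-specific tools follow the same schema.
TOOLS = [{
    "name": "read_file",
    "description": "Read a UTF-8 text file and return its contents.",
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string", "description": "Path of the file to read."}},
        "required": ["path"],
    },
}]

def run_tool(name: str, args: dict) -> str:
    """Placeholder dispatcher: execute the named tool and return its result as text."""
    if name == "read_file":
        with open(args["path"]) as f:
            return f.read()
    return f"unknown tool: {name}"

def agent_loop(task: str) -> str:
    messages = [{"role": "user", "content": task}]
    while True:  # the model, not the harness, decides when the task is done
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=4096,
            tools=TOOLS,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            # No further tool calls: the model considers the task complete.
            return "".join(b.text for b in response.content if b.type == "text")
        # Execute every requested tool and feed the results back into the loop.
        results = [
            {"type": "tool_result", "tool_use_id": b.id,
             "content": run_tool(b.name, b.input)}
            for b in response.content if b.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
```

Error recovery falls out of the structure: a failed tool call simply becomes another tool result the model can read and react to on the next iteration.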

## Context Engineering: Beyond Prompt Engineering

The speaker emphasizes that while prompt engineering (optimizing individual prompts) was critical in 2024, the field has evolved toward "context engineering" for agentic systems. Context engineering encompasses everything that goes into the model's context window across multiple API calls in an agent loop, including system prompts, user messages, tool definitions, tool responses, and memory management.

The fundamental challenge is the context window limit. Claude supports up to 200,000 tokens, and the API enforces this limit strictly. Even before hitting the hard limit, the team observed "context rot": model performance begins to degrade somewhere between 50,000 and 150,000 tokens, depending on the task. This makes context engineering critical for production reliability and accuracy.

### System Prompt Design

For system prompts, the team advocates for finding a "Goldilocks zone" between being too specific and too vague. The speaker describes working with a customer who tried dumping a 32-page SOP PDF directly into the system prompt, which overwhelmed the model. Conversely, being too vague leaves the model without sufficient guidance. The recommended approach is the "best friend test"—if you gave these instructions to a friend unfamiliar with your domain and they couldn't understand what to do, the model likely won't either.

The team recommends an iterative approach that errs toward being too vague initially rather than too specific. This allows teams to test the agent, identify what breaks, and progressively add necessary instructions. Starting with overly specific instructions makes it difficult to determine which rules are actually useful versus noise that degrades performance.

### Tool Design and Progressive Disclosure

Tool design emerged as perhaps the most critical aspect of context engineering for production agents. The team emphasizes several key principles:

**Tool naming and descriptions**: Since tool descriptions are inserted into the system prompt behind the scenes, they should be treated as prompting exercises. The speaker describes a production issue where Claude.ai had both web search and Google Drive search tools, and the model would confuse them—searching the web for things obviously in Google Drive and vice versa. This was resolved by adding clear descriptions about what data lives where and when each tool should be used.

**Examples in tool descriptions**: Teams can include few-shot examples directly in tool descriptions to guide proper usage. Claude Code and Claude.ai extensively use this pattern, showing examples of correct parameter usage and appropriate invocation contexts.
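A hedged sketch of what such descriptions might look like, combining both patterns; these schemas are illustrative reconstructions, not Claude.ai's actual tool definitions.

```python
# Illustrative reconstruction of disambiguating descriptions with embedded
# examples; not Claude.ai's actual tool schemas.
SEARCH_TOOLS = [
    {
        "name": "web_search",
        "description": (
            "Search the public internet. Use for current events and facts that "
            "are NOT in the user's own documents.\n"
            "Example: 'latest EU AI regulations' -> web_search"
        ),
        "input_schema": {"type": "object",
                         "properties": {"query": {"type": "string"}},
                         "required": ["query"]},
    },
    {
        "name": "drive_search",
        "description": (
            "Search the user's Google Drive. Use for the user's own files, "
            "meeting notes, and internal documents.\n"
            "Example: 'the Q3 planning doc my team wrote' -> drive_search"
        ),
        "input_schema": {"type": "object",
                         "properties": {"query": {"type": "string"}},
                         "required": ["query"]},
    },
]
```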

**Progressive disclosure**: Rather than loading all potentially useful information upfront (the old RAG pattern of retrieving documents before the first prompt), production agents should discover information as needed through tool calls. Claude Code exemplifies this—it doesn't load all files in a directory into context at startup. Instead, it tells the model which directory it's in and provides file listing tools, letting Claude read files only when necessary.

The exception to progressive disclosure is information that's always useful. Claude Code always loads the `CLAUDE.md` file (user-specific instructions) upfront rather than making the agent call a tool to read it, because this information is universally relevant regardless of the task.
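Put together, startup for a Claude Code-style agent might look like the sketch below: state the working directory, load only the always-relevant instructions upfront, and expose everything else through tools. The tool schemas here are assumptions for illustration.

```python
from pathlib import Path

def build_context(cwd: Path) -> tuple[str, list[dict]]:
    """Sketch of progressive disclosure at agent startup."""
    system_prompt = f"You are a coding agent. Working directory: {cwd}"
    memory = cwd / "CLAUDE.md"  # always useful, so loaded upfront
    if memory.exists():
        system_prompt += f"\n\nUser instructions:\n{memory.read_text()}"
    # Everything else is discovered on demand through tools.
    tools = [
        {"name": "list_files",
         "description": "List files under a directory.",
         "input_schema": {"type": "object",
                          "properties": {"path": {"type": "string"}},
                          "required": ["path"]}},
        {"name": "read_file",
         "description": "Read a single file. Call only when the task requires it.",
         "input_schema": {"type": "object",
                          "properties": {"path": {"type": "string"}},
                          "required": ["path"]}},
    ]
    return system_prompt, tools
```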

**Skills architecture**: Anthropic developed a "skills" system implementing progressive disclosure for large instruction sets. Rather than including instructions for building artifacts, creating PowerPoints, and conducting deep research all in one massive system prompt, Claude.ai tells the model that if users ask about specific capabilities, relevant instructions and templates are available in specific locations the agent can access on-demand.
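In prompt terms, a skill is little more than a pointer the model can follow when needed. A hypothetical preamble (the paths and skill set are invented for illustration):

```python
# Hypothetical skills preamble; the paths and skill names are invented.
SKILLS_PREAMBLE = """\
You have skills stored on disk. Read the relevant SKILL.md only when the
task actually calls for it:
- Building artifacts:   /skills/artifacts/SKILL.md
- Creating PowerPoints: /skills/pptx/SKILL.md
- Deep research:        /skills/research/SKILL.md
"""
```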

## Long-Horizon Task Management

Production agents often need to work on tasks longer than their context window allows. The team has experimented with several approaches:

**Compaction**: When approaching the 200,000 token limit, the system sends a special user message asking Claude to summarize all progress. The conversation is then cleared, the summary is inserted, and work continues. The speaker notes this is extremely difficult to get right—Claude Code has iterated on compaction prompts approximately 100 times and users still find the experience of getting compacted frustrating.
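A minimal sketch of the mechanics, assuming an illustrative threshold and a deliberately simplified compaction prompt (the real prompts are, as noted, the product of many iterations):

```python
COMPACTION_PROMPT = (
    "Summarize all progress so far: the task, decisions made, files touched, "
    "open questions, and the immediate next step. Be specific; this summary "
    "will be the only context carried forward."
)

COMPACT_AT = 160_000  # illustrative: compact before the 200k limit and the rot zone

def maybe_compact(client, model: str, messages: list[dict]) -> list[dict]:
    """Sketch: near the limit, ask the model to summarize its progress, then
    restart the conversation from that summary alone."""
    used = client.messages.count_tokens(model=model, messages=messages).input_tokens
    if used < COMPACT_AT:
        return messages
    summary = client.messages.create(
        model=model, max_tokens=2048,
        messages=messages + [{"role": "user", "content": COMPACTION_PROMPT}],
    )
    text = "".join(b.text for b in summary.content if b.type == "text")
    return [{"role": "user", "content": f"Progress summary from earlier work:\n{text}"}]
```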

**Memory systems**: An alternative approach gives agents access to file systems where they can maintain their own memory. Claude Plays Pokemon (a Twitch stream where Claude plays Pokemon Red indefinitely) uses this pattern. Claude is prompted to update markdown files with its plan and learnings as it plays. When the conversation is cleared, Claude simply reads its own notes rather than receiving a summary from a compaction process. Anthropic is working to train this capability directly into models so it happens automatically.
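A sketch of that pattern as a pair of tools plus a file-backed handler; the tool names, description wording, and storage location are assumptions, not the Claude Plays Pokemon implementation.

```python
from pathlib import Path

MEMORY_DIR = Path("agent_memory")  # illustrative location

MEMORY_TOOLS = [
    {"name": "read_memory",
     "description": "Read your own notes from a previous session.",
     "input_schema": {"type": "object",
                      "properties": {"name": {"type": "string"}},
                      "required": ["name"]}},
    {"name": "write_memory",
     "description": ("Persist your current plan and learnings as markdown. Update "
                     "these notes regularly; after a context reset they are all "
                     "you will have."),
     "input_schema": {"type": "object",
                      "properties": {"name": {"type": "string"},
                                     "content": {"type": "string"}},
                      "required": ["name", "content"]}},
]

def run_memory_tool(name: str, args: dict) -> str:
    """File-backed handler for the memory tools above."""
    MEMORY_DIR.mkdir(exist_ok=True)
    path = MEMORY_DIR / f"{args['name']}.md"
    if name == "read_memory":
        return path.read_text() if path.exists() else "(no notes yet)"
    path.write_text(args["content"])
    return "saved"
```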

**Sub-agent architectures**: Claude Code includes a sub-agent capability, though not for the originally intended purpose. The team initially thought Claude would delegate work to sub-agents that would work concurrently and report back, but Claude proved poor at breaking tasks into concurrent atomic units. However, sub-agents proved valuable for exploration tasks. When Claude Code needs to understand a codebase, reading many files consumes enormous context. By delegating this research to a sub-agent that can "blow up its context window" and return just a final report, the main agent preserves its context capacity for actual implementation work.
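Reusing the `agent_loop` sketched earlier, the delegation pattern can be as simple as the following; the prompt wording is illustrative.

```python
def research_subagent(question: str) -> str:
    """Sketch: spend a separate, disposable context window on exploration and
    hand back only a compact report."""
    return agent_loop(
        f"Explore the codebase to answer: {question}\n"
        "Read as many files as you need, then reply with a short report "
        "covering the relevant files, key functions, and how they fit together."
    )

# Exposed to the main agent as a tool, only the returned report string ever
# enters the main agent's context; the sub-agent's file reads are discarded.
```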

## Performance Optimization and Cost Management

Context engineering directly impacts three production concerns beyond just getting correct results:

**Reliability**: Proper context management prevents API errors when hitting token limits and reduces context rot that degrades accuracy. Production systems must handle these limits gracefully to avoid crashes.

**Cost efficiency**: The speaker highlights an important finding from Claude Opus 4.5 evaluations—the more expensive model achieved higher SWE-bench scores in considerably fewer tokens than Claude Sonnet 4.5. For production deployments, teams must evaluate cost at the task level rather than just comparing per-token pricing. A model with higher list prices might actually cost less per completed task due to efficiency gains.
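A toy calculation makes the point; all prices and token counts below are invented for illustration.

```python
# All numbers invented for illustration; real prices and usage vary by model and task.
def cost_per_task(price_in: float, price_out: float,
                  tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of one completed task, given $/million-token prices."""
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000

# A cheaper-per-token model that loops longer...
cheap = cost_per_task(3.0, 15.0, tokens_in=400_000, tokens_out=60_000)   # $2.10
# ...versus a pricier model that finishes in fewer, better iterations.
pricey = cost_per_task(5.0, 25.0, tokens_in=150_000, tokens_out=20_000)  # $1.25
print(f"cheap per token: ${cheap:.2f}/task, pricey per token: ${pricey:.2f}/task")
```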

**Prompt caching**: Production agents make many sequential API calls. If the system prompt, tool definitions, and conversation history remain static between calls (only appending new content), prompt caching can dramatically reduce costs and latency. Context engineering practices must avoid inadvertently busting the cache, such as by unnecessarily swapping tools in and out of the available set or reordering static content.
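With the Anthropic API this amounts to placing a `cache_control` breakpoint on the last static block; the model alias and variable names below are placeholders.

```python
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=4096,
    tools=TOOLS,        # keep order stable; reordering or swapping tools busts the cache
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,  # keep byte-identical across calls
        "cache_control": {"type": "ephemeral"},  # caches the tools and system prompt
    }],
    messages=messages,  # append-only conversation history
)
```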

## The Claude Agent SDK: Productionizing Agentic Infrastructure

After releasing Claude Code, Anthropic received immediate feedback from customers and internal teams wanting programmatic access to the underlying agentic infrastructure without the terminal UI. This led to the Claude Agent SDK, which packages the agent loop, system prompts, tools, permission system, and memory management that power Claude Code into reusable primitives.

The SDK represents Anthropic's approach to LLMOps at scale—building battle-tested infrastructure that handles complex operational concerns (context management, error recovery, security, memory) while allowing teams to focus on domain-specific problems and user experience. Anthropic now uses this SDK internally for all new agentic products, ensuring it receives continuous production testing and improvement.
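A minimal sketch of SDK usage, assuming the `claude-agent-sdk` Python package; option names and defaults vary across SDK versions, so treat this as indicative rather than definitive.

```python
import asyncio
from claude_agent_sdk import query  # pip install claude-agent-sdk

async def main() -> None:
    # The SDK supplies the agent loop, built-in tools, permission system,
    # and context management that power Claude Code.
    async for message in query(prompt="Summarize the TODOs in this repo"):
        print(message)

asyncio.run(main())
```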

A key architectural insight is that giving agents access to computers (file systems and code execution environments) generalizes beyond software engineering. Claude creates PowerPoints and spreadsheets by writing code that uses Python and JavaScript libraries, not by calling specialized "create slide" tools. Similarly, financial analysis agents benefit from CSV reading capabilities, marketing agents from visualization tools, and research agents from web search—all framed as interactions with a computer environment rather than highly specialized verticalized tools.

## Model Evolution and Production Impact

The case study tracks how model improvements directly translated to production capabilities. Claude 2.1 had a 200,000 token context window (when competitors topped out at 32,000-64,000 tokens) but wasn't considered a frontier model. Claude 3 Opus became the first Anthropic model widely regarded as best-in-class, causing the Applied AI team's customer engagement to surge dramatically.

Claude 3.5 Sonnet marked a coding capability inflection point. The team noticed it excelled at writing HTML with embedded JavaScript and CSS, leading to the Artifacts feature in Claude.ai. However, early versions had significant limitations—Artifacts would rewrite entire HTML files from scratch rather than editing in place, demonstrating the model still couldn't handle complex state management.

The speaker describes a personal Friday evening experiment with an internal tool called Claude CLI, where they built a note-taking app without touching any code themselves, accomplishing what would have taken days of manual work. This experience convinced them to join the Claude Code team as the AI engineer responsible for system prompts, tool design, and context engineering.

Claude Opus 4.5, released eight days before the presentation, represents the current frontier. Beyond achieving 80% on SWE-bench (up from 49% just one year earlier with Sonnet 3.5 v2), Opus 4.5 demonstrates improved resistance to prompt injection attacks—a critical security concern for production agents that process untrusted user input. The team also emphasizes ongoing work on long-running agents (extending from hours to days or weeks), better computer use via browser and GUI interaction, and domain specialization in cybersecurity and financial services.

## Enterprise Deployment Considerations

Working with enterprise customers revealed several production patterns and anti-patterns. The most common failure mode Cal encounters is poorly written prompts—90% of the time when a system doesn't work as expected, the instructions simply don't make sense when read by someone unfamiliar with the domain. This remains the number one prompt tip for 2025-2026.

The transition from RAG-based Q&A chatbots to agents required significant architectural rethinking. Early 2024 was dominated by systems that retrieved help center articles upfront and stuffed them into prompts. Agent architectures instead provide search tools and let the model discover relevant information progressively as needed.

Enterprise customers working with Anthropic benefit from Claude's lower hallucination rates compared to competing models and Claude's willingness to say "I don't know" rather than fabricating answers—critical for domains like legal, healthcare, and financial services where accuracy is paramount. Anthropic's partnerships, particularly with AWS Bedrock, address enterprise deployment and compliance requirements.

The speaker emphasizes that successful production agents require the right level of abstraction from frameworks and SDKs. Many teams get into trouble using libraries they don't understand sufficiently. Production frameworks must provide control and flexibility—allowing teams to swap prompts, customize tools, and implement multi-agent architectures when needed—rather than imposing overly opinionated scaffolding.

## Future Directions and Predictions

Looking ahead to 2025-2026, the speaker predicts that if 2025 was "the year of agents," 2026 will be "the year of giving agents access to computers." Most business problems can be mapped to computational problems solvable by agents with file system access, code execution capabilities, and browser/GUI interaction tools. This generalizes agent utility far beyond software engineering to any domain where professionals use computers—legal document analysis, financial modeling, scientific research, and beyond.

The field is moving from viewing LLMs as collaborators (the 2025 paradigm) toward LLMs as pioneers capable of working on problems humans haven't solved or don't have time for, potentially making progress on fundamental questions in biology, mathematics, and physics. This vision, articulated in Anthropic CEO Dario Amodei's essay "Machines of Loving Grace," drives the company's research direction and safety focus.

For practitioners, the takeaway is that production LLMOps is maturing from prompt engineering for single calls toward systems engineering for agentic architectures. Success requires understanding context management, tool design, progressive disclosure, error recovery, and cost optimization at the system level rather than optimizing individual prompts in isolation. The infrastructure is becoming sufficiently robust that teams can focus on domain-specific problems rather than rebuilding fundamental agentic capabilities from scratch.
