Company
Anthropic
Title
Building and Scaling an AI Coding Agent Through Rapid Iteration and User Feedback
Industry
Tech
Year
2025
Summary (short)
Anthropic developed Claude Code, an AI-powered coding agent that started as an internal prototyping tool and evolved into a widely adopted product through organic growth and rapid iteration. The team faced challenges in making an LLM-based coding assistant that could handle complex, multi-step software engineering tasks while remaining accessible and customizable across diverse developer environments. Their solution involved a minimalist terminal-first interface, extensive customization capabilities through hooks and sub-agents, rigorous internal dogfooding with over 1,000 Anthropic employees, and tight feedback loops that enabled weekly iteration cycles. The product achieved high viral adoption internally before external launch, expanded beyond professional developers to designers and product managers who now contribute code directly, and established a fast-shipping culture where features often go from prototype to production within weeks based on real user feedback rather than extensive upfront planning.
## Overview

This case study examines Anthropic's development and operational approach for Claude Code, an AI-powered coding agent that represents a sophisticated implementation of LLMs in a production developer tool environment. The interview with Cat, the product lead for Claude Code, reveals an organizationally mature yet pragmatically agile approach to building and operating LLM-based products at scale.

Claude Code originated as an internal experimental project by Boris, who built it to better understand Anthropic's APIs and explore how much of software engineering could be augmented with AI. The tool gained organic traction within Anthropic, spreading virally from Boris's immediate team to the broader organization, eventually crossing into research teams and even tech-adjacent roles like data science, product management, and product design. This internal viral adoption—reaching approximately 1,000 internal employees who voluntarily opted into the feedback channel—provided the foundation for external launch and continues to serve as a primary testing ground for new features.

## Product Philosophy and Development Process

The team's development philosophy centers on radical simplicity combined with deep extensibility. The terminal-based form factor was chosen deliberately because it provides immediate access to nearly everything a developer can do—any CLI tool is immediately available to Claude Code. This minimalist interface constrains the product team to be "brutal" about feature prioritization due to limited screen real estate (ASCII characters only, no buttons), resulting in a lean, focused user experience that avoids overwhelming new users while supporting arbitrary depth for power users.

The core product principle is "no onboarding UX" for new features—features should be intuitive via their name and a one-line description, allowing users to get started immediately. This philosophy extends to the overall design goal: making the CLI very simple to onboard to while being extremely extensible to handle the complexity and heterogeneity of different developer environments. This extensibility manifests through hooks, custom slash commands, and sub-agents that allow individual engineers and developer productivity teams to customize Claude Code for their specific setups.
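To make that extensibility concrete, here is a minimal sketch of a custom slash command. Custom commands are ordinarily defined as Markdown prompt files under `.claude/commands/` (with `$ARGUMENTS` standing in for user-supplied text), though the exact layout can vary by version; the command name and contents below are invented for illustration rather than taken from the case study.

```markdown
<!-- .claude/commands/fix-lint.md: a hypothetical project-level slash command.
     Saving this file is intended to expose a /fix-lint command inside Claude Code. -->
Run the project's linter and fix every issue it reports.

1. Run `npm run lint` and collect the output.
2. Fix the findings one file at a time, re-running the linter after each file.
3. Do not touch files that have no lint findings.
4. Finish with a short summary of what changed.

Additional instructions from the user: $ARGUMENTS
```

Hooks and sub-agents extend the CLI in the same declarative spirit, letting developer productivity teams adapt Claude Code to their environments without waiting on the product team.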
## Operational Model and Iteration Speed

The operational model reflects a highly engineer-driven culture with strong product engineering talent maintaining end-to-end ownership of features. Rather than following traditional product management processes with extensive PRDs and specifications, the team operates through rapid prototyping cycles: engineers prototype ideas, ship them to internal dogfooding (the ~1,000-person internal Anthropic community), and gather immediate feedback on whether people understand the feature, find it buggy, or love it. Features that resonate strongly with internal users are fast-tracked for public release, often after two or three internal iterations. Many experimental features are tested internally but never shipped externally.

The PM role in this environment is notably different from traditional product management. Cat describes her function as setting broader directional principles (like how much customizability to expose), shepherding features through the process, and handling pricing and packaging—essentially ensuring that developers can focus on creating the best coding experience while the PM ensures accessibility.

For larger projects like IDE integrations with VS Code and IntelliJ, formal product review processes exist, but smaller features like the to-do list or plan mode bypass PRDs entirely because the challenge is finding the right form factor and prompt design, not solving integration problems. The team maintains an extremely lean documentation approach—they use very few Google Docs, with most documentation and rationale captured directly in GitHub pull requests. Given Claude Code's small codebase, asking Claude Code itself to search the repository and explain why something was built or how it works is often faster and more accurate than maintaining potentially outdated documentation. This represents a fundamental shift in knowledge management enabled by LLM capabilities: the code becomes the living documentation.

## Feedback Collection and Prioritization

Feedback collection operates at massive scale and high velocity. The internal Anthropic Slack channel receives new feedback approximately every 10 minutes from the 1,000+ opted-in employees who are highly vocal about their experiences. Cat emphasizes they "love negative feedback" and explicitly discourage platitudes, wanting to hear what doesn't work. Beyond internal feedback, the team works closely with about 10 early enterprise adopters who are encouraged to be "as loud as possible" about issues, with the team committing to fast service level agreements—typically turning around prioritized fixes within one to two weeks.

Prioritization emerges organically from feedback volume and convergence rather than through elaborate ranking systems. When something is truly high priority, it becomes "very very obvious"—a GitHub issue will receive 100 thumbs up, the sales team will escalate it from multiple customers, and the same request will appear repeatedly across channels. Claude Code itself serves as a tool for managing this feedback: the product team uses it to synthesize Slack feedback (asking "which other customers have asked for this?"), cross-reference GitHub issues, and automate maintenance tasks like deduplicating issues and updating documentation (with human review of the final 10%).
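This kind of feedback triage can itself be scripted against Claude Code's non-interactive mode. The sketch below is illustrative only: the exported files, the prompt, and the output path are assumptions rather than Anthropic's actual pipeline, with the `claude -p` flag used for a one-shot, headless run.

```bash
# Hypothetical feedback-triage run (not Anthropic's real workflow).
# Assumes a Slack feedback export and a GitHub issue dump already sit in the repo.
claude -p "Read slack-feedback-export.json and open-issues.json. \
Group the feedback into themes, list which customers raised each theme, \
flag likely duplicates of existing GitHub issues, and write the result \
to feedback-triage.md for human review."
```

The human review of the final 10% stays in the loop: the generated summary is a starting point, not something that gets filed or closed automatically.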
## Evaluation Strategy and Challenges

The team's evaluation approach is notably pragmatic and acknowledges significant challenges. They focus on two primary types of evaluations, both of which they describe as imperfect.

**End-to-end evaluations** involve running benchmarks like SWE-bench on new Claude Code harnesses to ensure performance isn't degrading when making large harness changes or testing new models. However, these evals lack granularity—when SWE-bench scores change, determining the root cause requires manually reading through "pretty gnarly transcripts" to identify themes and failure patterns. The team wants to improve this but acknowledges they don't have a silver bullet solution.

**Triggering evaluations** focus on situations where Claude Code must decide whether to use a particular tool (like web search). These are more straightforward to implement because triggering conditions can be clearly articulated and outcomes are binary and easily graded (a small sketch of such an eval appears at the end of this section). For example, the team tunes when web search should trigger—it shouldn't happen 100% of the time for any question, but should activate when users ask about the latest React release and its new functionality.

The harder evaluation challenge involves **capability assessments**—determining whether a harness is better at specialized work like data science tasks. This requires complex setups with large underlying datasets, the ability to write and iterate on queries, gold standard answers, and unambiguous success criteria—much more difficult than simple triggering evals.

Cat acknowledges a fundamental challenge in LLMOps: having "a really good grasp of what the models are capable of" is the hardest and rarest skill for AI PMs. When a model fails at a task, practitioners must diagnose whether the issue stems from wrong context, using an inappropriate model for the task, or the model fundamentally lacking the capability. If it's a capability gap, the further question is whether the model is 80% there (allowing prompt engineering to bridge the gap) or only 10% there (requiring a wait of three to six months for model improvements).

Despite these evaluation challenges, the team relies heavily on their engaged user community to "keep them honest"—real-world usage provides continuous signals about what's working and what isn't, complementing formal evaluation approaches.
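The triggering evals described above lend themselves to a very small harness. The code below is illustrative rather than Anthropic's internal tooling: the case list, the `runAgent` helper, and the `web_search` tool name are assumptions; the point is only that the expected behavior is binary and trivially graded.

```typescript
// Minimal triggering-eval sketch (illustrative; not Anthropic's internal harness).
// `runAgent` is a hypothetical helper that runs one prompt through the coding agent
// and reports which tools it invoked.

interface TriggerCase {
  prompt: string;
  shouldSearch: boolean; // expected behavior for the web-search tool
}

const cases: TriggerCase[] = [
  { prompt: "What's new in the latest React release?", shouldSearch: true },
  { prompt: "Rename this variable across the file", shouldSearch: false },
  { prompt: "What changed in the most recent TypeScript release?", shouldSearch: true },
  { prompt: "Explain what this regex does", shouldSearch: false },
];

async function runTriggerEval(
  runAgent: (prompt: string) => Promise<{ toolsUsed: string[] }>,
): Promise<number> {
  let passed = 0;
  for (const c of cases) {
    const { toolsUsed } = await runAgent(c.prompt);
    const searched = toolsUsed.includes("web_search");
    const ok = searched === c.shouldSearch;
    if (ok) passed += 1;
    console.log(`${ok ? "PASS" : "FAIL"} searched=${searched} "${c.prompt}"`);
  }
  // A binary, easily graded outcome: the pass rate alone is enough to catch regressions.
  return passed / cases.length;
}
```

Capability assessments resist this treatment because, as noted above, they need realistic datasets and unambiguous gold-standard answers rather than a yes/no check.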
## Notable Features and Technical Implementation

Several features illustrate the team's approach to LLMOps in practice.

**The To-Do List Feature**: This feature emerged from engineer Sid's observation that users employed Claude Code for large refactoring tasks involving 20-30 changes, but the agent would sometimes complete only the first five changes before stopping prematurely. Sid's solution was forcing the model to write down its tasks (mimicking human behavior) and reminding it that it couldn't stop until all tasks were completed. This simple intervention dramatically improved task completion rates. The feature evolved through multiple iterations—initially tasks would fly past the screen in the transcript, then the team made to-do lists persistent and accessible via a `/todo` command so users could check progress at any point without scrolling. The team experimented with various form factors, including embedding to-do lists in "thinking bubbles," but users preferred the persistent approach.

**Plan Mode**: Users repeatedly requested a way to have Claude Code explain what it would do without immediately writing code. The team initially resisted adding an explicit plan mode, hoping to teach users to express this desire in natural language (which the model could already handle). However, after one to two months of persistent user requests for an explicit shortcut, they "caved" and added plan mode. Cat notes that current models are "too eager to code"—even in plan mode, the agent will often suggest starting to code immediately after presenting a plan. The team speculates they might remove explicit plan mode in the future when models better follow user directions about planning versus execution.

**Web Search Integration**: Claude Code supports web search with careful attention to triggering behavior—the system shouldn't search the web 100% of the time but should activate for appropriate queries like asking about the latest React releases. This required careful tuning through triggering evaluations and represents the kind of tool-use decision-making that must be carefully calibrated in agentic systems.

**The Claude.md File**: This represents Claude Code's approach to context management and agent memory. The Claude.md file functions as the equivalent of memory—it's included in every Claude Code instance and serves as onboarding material, containing everything you'd tell someone new to the codebase. Users can initialize it with `/init` to scan the codebase automatically, or craft personalized versions. Designer Megan, for example, has a Claude.md that explicitly states "I'm a product designer, you need to overexplain everything to me," establishing a personal interaction style. The system supports multiple layers: repo-specific Claude.md files, individual personal Claude.md files for specific projects, and global personal Claude.md files that apply across all repos regardless of context.

**Hooks and Customization**: Hooks were originally inspired by users wanting Slack notifications when Claude Code was waiting for their response (since the agent might work for 10 minutes autonomously while users attend to other tasks). This evolved into a general customization system allowing users to configure behaviors like text message notifications or custom integrations. The team views this extensibility as essential for handling heterogeneous developer environments.
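As an illustration of that original notification use case, the sketch below shows roughly how such a hook could be wired up in a Claude Code settings file. The event name, the JSON shape, and the `notify-waiting.sh` script are assumptions based on the publicly documented hooks feature and may not match the exact schema of any given version.

```json
{
  "hooks": {
    "Notification": [
      {
        "hooks": [
          { "type": "command", "command": "./scripts/notify-waiting.sh" }
        ]
      }
    ]
  }
}
```

Here `./scripts/notify-waiting.sh` would be a small hypothetical script that posts to a Slack or SMS webhook, so a developer who has stepped away learns that the agent is waiting on them.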
## Cross-Functional Impact and Democratization

One of the most striking outcomes is how Claude Code has democratized code contribution within Anthropic. Cat herself, as PM, checks in code for smaller features where it's faster than assigning to engineers—she added a "vibe" command referencing Rick Rubin's writing about vibe coding. Designer Megan, who historically never committed code, now regularly makes pull requests to Console (Anthropic's API management product) and to Claude Code itself. This represents a fundamental shift in how cross-functional teams can contribute: designers and PMs directly impact end-user experience without writing Google Docs that engineers must interpret and implement.

Claude Code also facilitates product management work in unexpected ways. Cat uses it to audit complex branching logic across different subscription plans (Max plan, API plans, Claude for Enterprise, prosumer tiers) by asking it to trace code paths and explain exactly what happens in each scenario. This provides high confidence in flow accuracy without manually tracing through code.

## Strategic Direction and Ecosystem Vision

Looking forward, the team focuses on three strategic pillars over the "next few months" (Cat explicitly states a year or two is "a really long time" to plan in the current AI landscape):

**Maintaining CLI Leadership**: Ensuring the CLI remains the most powerful coding agent while becoming incredibly customizable for any developer environment, integrating with all existing tools, and creating an ecosystem around customizations where developers can share, review, and one-click install each other's custom slash commands, hooks, and status lines.

**Growing the SDK**: The team wants Claude Code's SDK to become the best way to build not just coding agents but agents of all types—legal assistants, executive assistants, health assistants, financial assistants. They're already seeing companies building non-coding agents on the SDK and plan to showcase these success stories as products come to market.

**Expanding Accessibility**: Moving beyond the core professional developer audience to serve tech-adjacent roles (data science, product management, design) and eventually expanding to marketing and sales teams. The terminal interface remains a barrier for non-technical users, but the core primitives are general enough that the team is exploring form factors that could serve these broader audiences.

The team's planning philosophy is notably pragmatic: they try to build products they wish they had today, and since model capabilities change so rapidly, they find it "really hard to predict more than 6 months in the future." They prioritize building for the next generation of models, but those capabilities don't become obvious until a few months ahead of time, making longer-term planning difficult.

## LLMOps Best Practices and Lessons

Several LLMOps best practices emerge from this case study.

**Demos over docs**: Claude Code makes prototyping so easy that the team asks "is there a way for Claude Code to prototype that feature?" before writing multi-page documentation. This rapid prototyping approach surfaces real usage patterns and user reactions much faster than specification-driven development.

**Treating the agent like an eager junior engineer**: Cat advises users to give Claude Code feedback the same way they'd guide a human new grad engineer. When Claude Code makes wrong assumptions, users shouldn't give up but should provide corrective feedback—the agent is "really receptive" and will normally change direction and incorporate input.

**Investing in context management**: The Claude.md file represents a systematic approach to providing persistent context. Just as you'd onboard a new hire with architecture details, gotchas, testing practices, and team conventions, Claude.md captures this institutional knowledge for the agent (a minimal illustrative example appears at the end of this section).

**Dogfooding at scale**: The 1,000+ internal users providing feedback every 10 minutes create an unparalleled testing environment. This internal community surfaces issues, validates features, and provides signal about what resonates before public release.

**Fast iteration over perfect planning**: Features often ship within weeks based on prototype feedback rather than following extensive planning cycles. The small codebase and engineer-driven culture enable this velocity.

**Leveraging the LLM for LLMOps**: Using Claude Code itself to synthesize feedback, update documentation, deduplicate issues, and explain implementation decisions represents a meta-application of the technology—the product helps operate itself.
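To make the context-management practice concrete, here is a small illustrative Claude.md. Every detail of the project it describes is invented; the point is the shape of the file: architecture, conventions, gotchas, and personal interaction preferences, exactly the things you would tell a new hire.

```markdown
# Claude.md (illustrative example for a hypothetical TypeScript service)

## What this repo is
An Express + Postgres API server. `api/` is the service, `jobs/` holds background
workers, and `infra/` is Terraform owned by another team: do not edit it.

## Conventions
- Use the repository pattern in `api/src/repos/`; never query the database from handlers.
- Every new endpoint needs an integration test in `api/test/`.
- Run `npm run lint && npm test` before declaring a task finished.

## Gotchas
- `jobs/` shares types with `api/` through `packages/shared`; update both sides together.
- Files under `api/src/generated/` are generated; never edit them by hand.

## How to work with me
- I'm not a professional engineer; over-explain your changes and ask before large refactors.
```

Layered on top of a repo-level file like this, personal and global Claude.md files carry individual preferences, as described in the feature section above.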
## Critical Assessment

While the case study presents an impressive operational model, several aspects warrant balanced consideration.

The evaluation approach, while pragmatic, is acknowledged by the team as having significant gaps. The lack of granular diagnostics for end-to-end performance changes and the difficulty in assessing capability improvements represent real operational challenges. The heavy reliance on user feedback as a de facto evaluation mechanism works well with an engaged community but may not scale or generalize to all LLM applications.

The resistance to formal product planning and documentation, while enabling speed, carries risks. Cat mentions they "don't have many Google Docs" and rely on PR descriptions and Claude Code itself to explain rationale. This works with a small codebase and stable team but could create knowledge transfer challenges as the team scales or during personnel transitions.

The "no onboarding UX" philosophy, while elegant, may limit accessibility for non-technical audiences. The team acknowledges the terminal interface remains a significant barrier and is exploring alternative form factors, but this represents an inherent tension between power-user optimization and broader accessibility.

The planning horizon of "a few months" reflects both pragmatism about AI's rapid evolution and the team's startup-like agility, but may present challenges for enterprise customers requiring longer-term roadmap visibility for adoption planning and integration work.

Overall, this case study illustrates a highly mature LLMOps practice that balances rapid iteration with systematic feedback collection, embraces the unique characteristics of LLM capabilities and limitations, and demonstrates how AI-native product development can fundamentally differ from traditional software development processes. The team's willingness to acknowledge evaluation challenges, resist premature feature addition, and let organic usage patterns guide development represents sophisticated thinking about operating LLMs in production environments.
