Anthropic: Building Claude Code: Scaling AI-Powered Development from Terminal Prototype to Production

Overview

Claude Code represents a significant case study in LLMOps, demonstrating how Anthropic built and scaled an AI-powered coding assistant from prototype to widespread production use. The product began in September 2024 as an accidental experiment by Boris Churnney, who simply wanted to understand how to use Anthropic’s API. What started as a basic terminal chat application evolved into a tool that fundamentally changed how engineers work, with some users like Churnney himself writing 100% of their code through Claude Code by the time Opus 4.5 was released.

The case study reveals several critical insights about building production LLM applications: the importance of building for future model capabilities rather than current ones, the value of rapid iteration driven by user feedback, and the necessity of minimizing technical scaffolding that becomes obsolete with each model improvement. At Anthropic internally, productivity per engineer grew 70% despite the team doubling in size, with overall productivity increasing 150% since Claude Code’s launch. External adoption reached remarkable scale, with Mercury reporting 70% of startups choosing Claude as their model of choice and Semi Analysis finding that 4% of all public commits globally were made by Claude Code.

Product Development Philosophy and LLMOps Approach

A core principle underlying Claude Code’s development was building for the model 6 months in the future rather than the model of today. Churnney emphasizes this repeatedly as critical advice for founders building on LLMs. When Claude Code first launched, it couldn’t actually write code effectively - in February 2024, it wrote perhaps only 10% of Churnney’s code. However, the team anticipated that scaling laws would continue to improve model capabilities, and their bet paid off. This forward-looking approach meant accepting that the product wouldn’t be fully functional at launch but would grow into its potential as models improved.

This philosophy directly influenced architectural decisions around scaffolding. The team continuously evaluated whether to build product features to compensate for model limitations or simply wait for the next model release. Churnney notes that scaffolding can typically improve performance 10-20% in a specific domain, but these gains are often wiped out with the next model release. Rather than investing heavily in elaborate workarounds, the team focused on building infrastructure that would remain relevant as models improved. This resulted in a codebase that was completely rewritten every few months, with no code from 6 months prior remaining in production at any given time. Tools were regularly deprecated and new ones added every couple weeks.

Technical Architecture and Tool Use

Claude Code began with the simplest possible architecture - a terminal application that accessed the Anthropic API. Churnney chose the terminal because it was the fastest way to prototype without building a UI, working alone initially. The first major capability came from implementing tool use, starting with a bash tool taken directly from Anthropic’s API documentation examples. The breakthrough moment came when Churnney asked the model what music he was listening to, and Claude wrote AppleScript to query his Mac’s music player. This demonstrated that the model inherently wanted to use tools to interact with the world.

The tool use architecture expanded organically based on observed user behavior. The terminal interface itself, built with React Terminal, went through extensive iteration - even something as simple as the loading spinner saw 50-100 iterations, with 80% never shipping. The verbosity of terminal output became a key design question, with the team constantly balancing user needs for visibility against information overload. Features like file reads and searches were eventually summarized rather than displayed in full, but this only became viable once model reliability improved sufficiently that users didn’t need to constantly verify every action.

A particularly sophisticated capability emerged with sub-agents - recursive instances of Claude Code that could be spawned by the main agent. The team refers to the main instance as “mama Claude” which prompts child agents. This became the primary way agents were launched in production, with users manually requesting parallel sub-agents for complex debugging tasks or the system automatically spawning them for tasks like searching code repositories in parallel. One user reported that Claude Insights recommended using more sub-agents for debugging to search logs and code paths simultaneously. The team built plugins entirely through a swarm topology where a main agent spawned multiple sub-agents that worked from an Asana board over a weekend with minimal human intervention.

Plan Mode and Iterative Prompting

Plan mode emerged as one of Claude Code’s most important features, discovered entirely through latent demand. Users were already asking Claude to create plans without writing code, either to think through ideas or create sophisticated specifications before implementation. Churnney observed this pattern in GitHub issues and internal Slack feedback on a Sunday night at 10pm, implemented the feature in 30 minutes, and shipped it Monday morning. Technically, plan mode simply adds one sentence to the prompt: “please don’t code.”

Despite its simplicity, plan mode became crucial for production use. Churnney reported using plan mode for 80% of his sessions, typically starting multiple plans in parallel across terminal tabs and the desktop app. The workflow involved getting a plan right through iteration, then having Claude execute it. With Opus 4.5 and especially 4.6, once a plan was approved, execution stayed on track almost every time without requiring the constant babysitting that earlier models needed. However, Churnney predicted plan mode might have a limited lifespan - potentially just a month - as models became capable enough to automatically determine when planning was necessary before coding.

The prompting philosophy emphasized minimalism and adaptability. Churnney’s personal Claude MD file contained just two lines: enable automerge on pull requests, and post PRs in the team’s internal Slack channel. Everything else lived in a shared repository-level Claude MD that the entire team contributed to multiple times weekly. When someone made a preventable mistake in a PR, the standard practice was to tag Claude asking it to add an instruction to Claude MD. This collaborative approach to prompt engineering distributed the work of maintaining effective prompts across the team while keeping them concise. Churnney recommended that users who hit token limits on their Claude MD should simply delete it and start fresh, adding back instructions incrementally only when the model went off track. Each new model required fewer instructions, reflecting improving capabilities.

User Feedback and Latent Demand

Latent demand served as perhaps the single most important product principle for Claude Code. Churnney emphasized repeatedly that users will only do things they’re already trying to do - you can make existing behaviors easier, but you cannot get people to adopt entirely new behaviors. This meant carefully observing how engineers actually used the tool rather than imposing a predetermined vision.

The observation methods were remarkably direct. Churnney would walk around Anthropic’s office floor and stand behind engineers watching them use Claude Code, giving feedback loops measured in hours rather than weeks. Internal adoption showed dramatic organic growth - when the team did a launch review in November/December 2024, Dario Amodei asked if engineers were being forced to use it because the internal usage chart was nearly vertical. No mandate existed; engineers simply told each other about it.

Specific features emerged directly from observed usage patterns. Quad MD arose when the team noticed users writing markdown files for themselves and having Claude read them. The verbose mode configuration came from shipping summarized file outputs, dog-fooding it internally for a month, then encountering strong negative feedback on GitHub from users who wanted full visibility. Rather than defending the design decision, the team iterated further to find the right balance. The terminal interface itself persisted much longer than anticipated - Churnney expected it to have a 3-month lifespan a year ago, yet it remained the primary interface because users kept finding it valuable.

The team also learned from edge cases and unexpected use patterns. Internal non-technical adoption was particularly revealing - by the time of the interview, roughly half the sales team used Claude Code, with many switching to Co-work (a Claude Code wrapper with a GUI and VM sandboxing) for easier access. The entire finance team coded, designers coded, product managers coded. This blurring of traditional role boundaries informed the development of Co-work, which was built in just 10 days as a 100% Claude Code-generated wrapper to serve non-technical users who were jumping through hoops to install terminal tools.

Production Operations and Monitoring

Claude Code’s production deployment demonstrated sophisticated operational practices. The team used Claude agents built with the Claude Agents SDK to automate nearly every aspect of development operations. These agents handled code review, security review, issue labeling, and shepherding code to production. One particularly sophisticated pattern involved Claude automatically messaging engineers on Slack based on git blame information when it needed clarification while building a feature, then continuing work after receiving responses.

Logging and observability became crucial for debugging. Users learned to provide detailed logs when reporting bugs, with Claude able to search through logs, examine code paths, create production tunnels, and even query production databases to diagnose issues. This transformed bug fixing from a tedious manual process into something users actually enjoyed. The workflow became: go to Sentry, copy the error markdown, paste it into Claude, and let it investigate. Users anticipated this would evolve toward fully automated bug fixing with direct MCP integration.

The team tracked productivity metrics rigorously. The simplest measure was pull requests, cross-checked against commits and commit lifetimes. These metrics showed 70% productivity growth per engineer even as the team doubled, and 150% total productivity improvement since Claude Code’s launch. For context, Churnney noted that in his previous role leading code quality at Meta, a 2% productivity improvement required a year of work by hundreds of people. The 100%+ improvements from Claude Code represented an entirely different magnitude of impact.

Deployment and release practices emphasized rapid iteration. The codebase was completely rewritten every few months, with perhaps 80% of code being less than a couple months old at any given time. Tools were added and removed every couple weeks. This extraordinarily short code lifetime reflected both the rapid model improvements requiring architectural changes and the team’s willingness to throw away scaffolding as it became obsolete. Features like the terminal spinner went through 50-100 iterations with 80% not shipping, but the speed of iteration using Claude Code itself made this exploration feasible.

Model Evolution and Capability Anticipation

The relationship between model improvements and product capabilities formed a central theme. Early versions using Claude 3.5 Sonnet could perform basic tasks like reading files with cat commands and writing AppleScript, but couldn’t effectively write production code. By February 2024, Claude wrote perhaps 10% of Churnney’s code. The breakthrough came with Opus 4.5 and especially 4.6, when the model became reliable enough for Churnney to uninstall his IDE and write 100% of his code through Claude Code, landing 20 PRs daily.

This model improvement trajectory validated the core bet of building ahead of current capabilities. The team knew vaguely they wanted to do something in coding but had no high-confidence specific vision. Early internal use focused on lower-risk activities like automating git commands and writing unit tests. Users developed workarounds like writing markdown instruction files that evolved into Claude MD. As reliability improved, the product could hide implementation details - file reads and searches could be summarized rather than displayed because the model rarely read the wrong thing.

The team also anticipated future capabilities would eliminate current features. Churnney predicted plan mode might disappear within a month as models became capable of automatically determining when to plan versus when to code directly. Claude Code already began entering plan mode automatically in some situations. The broader principle was that any scaffolding or feature designed to compensate for model limitations should be viewed as temporary, with a clear expectation that upcoming models would simply handle those cases natively.

This philosophy extended to how the team thought about competing products. When asked about IDEs like Cursor and Windsurf, Churnney noted the team was in explore mode without pressure to match specific form factors. They continuously experimented with different interfaces - Claude Code existed in terminal, web, desktop app, iOS, Android, Slack, GitHub, VS Code extensions, and JetBrains plugins. However, the terminal persisted as the primary interface despite expectations it would be replaced, suggesting that simpler interfaces might remain relevant even as capabilities grew.

Team Dynamics and Hiring

The team’s composition and hiring practices reflected the product philosophy. Anthropic’s Claude Code team valued extreme generalists - people who spanned product and infrastructure, product and design, product and user research, or product and business. Churnney described himself as an average engineer who didn’t use fancy tools like Vim, preferring VS Code for its simplicity. This meant Claude Code was designed for engineers like him rather than requiring deep expertise in terminal tools, Tmux, SSH, or other specialized knowledge.

The hiring approach looked for people who could think scientifically and from first principles rather than relying on accumulated opinions. A key behavioral interview question asked for examples of when candidates were wrong, evaluating whether they could recognize mistakes in hindsight, claim credit for them, and learn from them. Churnney acknowledged being wrong about half the time on his ideas, noting that trying things, giving them to users, learning from feedback, and iterating toward good ideas represented the essential skill.

The team also valued people who did “weird stuff” - work that might have seemed like a warning sign in traditional contexts but demonstrated creative thinking. One engineer, Daisy, transferred to the team after putting up a PR that first gave Claude Code a tool to test arbitrary tools, then used that to have Claude write its own tool rather than implementing it herself. This kind of meta-level thinking about enabling the model to help itself became highly valued. The team included both extreme specialists like Jared Sumar, who conducted a crusade eliminating memory leaks with deep systems knowledge, and generalists who moved fluidly across disciplines.

Internally, the team experimented with novel hiring evaluation methods. They discussed accepting Claude Code transcripts as part of the interview process, analyzing how candidates worked with the agent - whether they looked at logs, corrected the agent when it went off track, used plan mode appropriately, thought about systems, included tests, and demonstrated product sense. The idea was to create spider graphs visualizing different skill dimensions like systems thinking, testing, user behavior, design, and product sense based on transcript analysis.

Scaling Challenges and Tradeoffs

Several technical tradeoffs emerged as Claude Code scaled. The verbosity question remained contentious - displaying all bash output versus summarizing it. Early attempts to hide bash output led to user revolt because developers wanted to see Kubernetes job output and other detailed information. The team eventually shipped summarized file reads and searches but added a verbose mode in the configuration after negative GitHub feedback. Finding the right balance required continuous iteration based on user feedback rather than imposing a single vision.

Terminal design constraints posed unique challenges. Working within 80x100 characters, 256 colors, one font size, and no mouse interactions forced creative solutions. Mouse interactions were technically possible in terminals but required virtualizing scrolling, which felt bad in practice. The team had to discover UX principles for terminal applications that didn’t exist in modern documentation, as big terminal applications from the 1980s-2000s using ncurses looked janky by modern standards. This design exploration was only feasible because Claude Code itself enabled rapid prototyping - trying 20 variations in a couple hours rather than building 3 prototypes in two weeks with tools like Figma or Framer.

The team consciously limited some features to avoid premature optimization. Claude MD files were kept relatively short, perhaps a couple thousand tokens for the team’s shared file. Churnney recommended users delete their Claude MD entirely and start fresh when hitting token limits, adding back instructions incrementally only when the model went off track. This minimalist approach meant doing the minimal possible thing to keep the model on track rather than over-engineering comprehensive instruction sets that would become obsolete with the next model release.

Security and safety considerations varied by user type. The core Claude Code terminal application assumed technical users who understood what they were executing. Co-work, targeting non-technical users, added substantial protections including running all code in a virtual machine, deletion protections, permission prompting, and other guardrails. This segmentation allowed the terminal product to remain lean for expert users while still enabling broader adoption through a more protected alternative.

Business Impact and Adoption Metrics

The production impact extended far beyond Anthropic’s internal use. External metrics showed 70% of startups choosing Claude as their model of choice according to Mercury, and 4% of all public commits globally being made by Claude Code according to Semi Analysis. Notable users included NASA using it to plot the course for the Perseverance Mars rover, which the team found so exciting they printed commemorative posters.

Use cases spanned far beyond software development. Users deployed Claude Code for monitoring tomato plants, recovering wedding photos from corrupted hard drives, and various finance applications. This diversity of use validated the latent demand observation - people were finding ways to use the tool that the creators never anticipated. The finance team at Anthropic, entire data science team, and all designers used it extensively, often for non-coding tasks.

The productivity improvements represented transformational change in software development. Steve Yegge’s post claimed an Anthropic engineer averaged 1000x more productivity than a Google engineer at Google’s peak, though this comparison requires careful interpretation. The more grounded internal metrics of 70% per-engineer productivity growth with team doubling, and 150% total productivity improvement, still represented unprecedented gains compared to traditional developer productivity initiatives that might achieve 2% improvements after a year of work by hundreds of people.

Looking forward, Churnney predicted that coding would be “generally solved for everyone” within the year, with the title “software engineer” potentially disappearing in favor of “builder” or “product manager.” Engineers would spend less time coding and more time writing specs, talking to users, and working across functions. Every function at Anthropic already coded - PMs, designers, engineering managers, finance staff - and this pattern would spread broadly. The more extreme possibility involved hitting ASL4 (Anthropic Safety Level 4) where models became recursively self-improving, requiring stringent safety criteria before release.

Philosophical Foundations and Cultural Context

The development philosophy drew explicitly from Rich Sutton’s “The Bitter Lesson,” which Anthropic framed and hung on the wall in the Claude Code work area. The core insight is that more general models will always beat more specific models, and thus one should never bet against the model. This principle manifested in the constant tradeoff between building scaffolding now for 10-20% performance improvements versus waiting for the next model to handle the capability natively. The team consistently chose to wait when possible, leading to the regular rewriting of the entire codebase.

Churnney’s background provided relevant context. He had previously worked at Meta leading code quality and improving developer productivity, giving him deep expertise in measuring and improving engineering effectiveness at scale. Before Anthropic, he wrote a book about TypeScript in the early 2010s when the language was emerging. The TypeScript parallel proved instructive - TypeScript succeeded by adapting to how JavaScript programmers actually wrote code rather than forcing them to adopt rigid typing patterns. The type system incorporated novel concepts like literal types and conditional types that even academic languages like Haskell didn’t have, purely to handle real-world JavaScript patterns. Claude Code followed similar principles, building the tool that users wanted rather than imposing theoretical ideals.

The decision to join Anthropic came from mission alignment around AI safety. Living in rural Japan and reading Hacker News every morning, Churnney became convinced that AI capabilities were reaching a critical point. Meeting Ben Mann and the Anthropic team, he was struck by how the organization operated primarily as a research lab where the model mattered more than the product, and how deeply everyone internalized AI safety concerns. Overhearing lunchroom conversations about AI safety rather than product metrics convinced him this was the right place for this phase of AI development. As a science fiction reader, Churnney was acutely aware of how badly AI could go wrong, making the safety-focused culture essential.

The anthropic labs team structure revealed intentional design around the progression toward AGI. The team produced three products - Claude Code, MCP (Model Context Protocol), and the desktop app - that wove together around a unified vision of teaching models to code, use tools, and interact with computers. This progression reflected Anthropic’s core bet that the path to safe AGI runs through coding capabilities, making Claude Code both a product and a research platform for understanding and advancing AI capabilities in a controlled environment.

Lessons for LLMOps Practitioners

Several key lessons emerge for teams building production LLM applications. First, build for the model 6 months in the future rather than today’s model, even if that means the product won’t fully work at launch. This requires strong conviction in scaling laws and willingness to make early-stage users tolerate limitations. Second, minimize scaffolding and be prepared to completely rewrite code every few months as models improve. Any gains from elaborate workarounds will likely be wiped out by the next model release, making simpler architectures that can evolve more sustainable.

Third, latent demand should drive product decisions more than vision or roadmaps. Watch how users actually use the product, including ways that seem wrong or unexpected. The terminal interface persisted because users kept finding it valuable, Claude MD emerged from users writing their own instruction files, and plan mode came from users explicitly asking Claude not to code yet. Features that solve problems users are already working around will succeed; features that try to change user behavior will struggle.

Fourth, rapid iteration enabled by AI tools themselves creates a new product development paradigm. The ability to try 50-100 variations of something like a loading spinner, shipping only the best 20%, only became feasible because Claude Code could generate these variations quickly. This enables a level of polish and exploration previously impossible within reasonable time and budget constraints. Teams should use LLMs to build LLM products, accelerating iteration cycles dramatically.

Fifth, productivity measurement and operational metrics remain essential even with AI-powered development. The team rigorously tracked pull requests, commits, and commit lifetimes to validate productivity improvements and understand patterns. Logging and observability became more rather than less important, as users needed to understand what Claude was doing to provide effective feedback and correction. The ability to share transcripts of sessions for debugging and learning purposes added a new dimension to operational visibility.

Finally, the team composition and hiring criteria should emphasize adaptability, scientific thinking, and willingness to be wrong over accumulated expertise and strong opinions. In a rapidly evolving field, the ability to recognize mistakes, learn from them, and change direction matters more than having worked with specific tools or technologies. Generalists who can work across functions and think creatively about enabling models rather than replacing them represent the most valuable team members in this new paradigm of human-AI collaborative development.

Building Claude Code: Scaling AI-Powered Development from Terminal Prototype to Production

Industry

Technologies