Scaling AI-Driven Code Automation and Engineering Productivity at Spotify

Spotify / Anthropic 2026

Spotify faced a critical challenge where their codebase was growing seven times faster than their engineering headcount, creating an unsustainable maintenance burden that threatened their ability to ship new features. To address this, they developed a comprehensive AI-driven automation platform called Honk, built on Claude's agent SDK, which automates code migrations, modifications, and even enables non-engineers to build functional prototypes. The results have been dramatic: a 75% improvement in PR frequency, with 73% of PRs now being AI-authored, while maintaining quality metrics through robust test automation and verification loops. This transformation has fundamentally changed how Spotify's 2,900 engineers work, with deployments increasing to 4,500 per day and engineers shifting from manual implementation to higher-level problem-solving and rapid prototyping.

Industry

Media & Entertainment

Overview

Spotify’s move to production LLM usage is one of the most comprehensive engineering productivity transformations documented in the industry. The case study centers on a fundamental scaling challenge and on how the company evolved from traditional automation to AI-driven systems that now handle the majority of its code changes.

The core problem Spotify faced was straightforward but existential: their codebase was growing at seven times the rate of their engineering headcount. With approximately 2,900 engineers managing what became over 20 million lines of code in their backend monorepo alone, plus thousands of smaller polyrepos, the maintenance burden was becoming untenable. The company had an “endless source of ideas” for user-facing features but was increasingly bogged down by routine maintenance work like framework migrations, library updates, and API changes.

Evolution from Traditional Automation to LLM-Based Systems

Spotify’s approach began roughly five to six years ago with what they called “fleet management” - infrastructure designed to automate code changes across their entire codebase. The traditional approach involved creating deterministic scripts that would automatically generate PRs for routine migrations. Before this automation, teams would receive migration tutorials and manually update their components, with each migration taking months to complete across hundreds of teams and thousands of components. The company could barely manage ten migrations per year.

However, this deterministic script-based approach quickly hit a ceiling. The API surface of code is enormous, and even seemingly simple changes like switching out a method call become extremely complex when accounting for the five or more different ways that method might be invoked. Each script would balloon to thousands of lines of code just to handle edge cases, and more sophisticated operations requiring variable tracking and state analysis became impractical.
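The difficulty of covering every call form can be illustrated with a small sketch. It is not Spotify's migration tooling: `OldClient`, `fetch`, and the `legacy` module are made-up names, and Python's standard `ast` module stands in for whatever parser a real fleet-management script would use. A literal text match finds one of the five call sites and a syntax-tree match finds three; the remaining two need the kind of variable tracking described above.

```python
import ast

# Hypothetical input: five ways the same made-up method `fetch` can be invoked.
# None of these names come from Spotify's codebase.
SOURCE = '''
from legacy import OldClient
from legacy import OldClient as Client

client = OldClient()
client.fetch("a")                    # 1. plain method call
OldClient().fetch("b")               # 2. call on a temporary object
f = client.fetch; f("c")             # 3. bound method stored in a variable
getattr(client, "fetch")("d")        # 4. dynamic attribute lookup
Client.fetch(client, "e")            # 5. unbound call through an alias
'''


def naive_matches(source: str) -> int:
    """Count lines a text-based script would rewrite: literal 'client.fetch('."""
    return sum("client.fetch(" in line for line in source.splitlines())


def attribute_call_matches(source: str) -> int:
    """Count calls of the form <expr>.fetch(...) found by walking the syntax tree."""
    tree = ast.parse(source)
    return sum(
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr == "fetch"
        for node in ast.walk(tree)
    )


print(naive_matches(SOURCE), attribute_call_matches(SOURCE))
```

Forms 3 and 4 are missed by both matchers because the method name is separated from the call, which is where a deterministic script starts needing data-flow analysis.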

The transition to LLM-based automation began in the early days of GPT models, before Claude existed. Initial attempts were naive - essentially putting code in front of the model and asking for one-shot changes. These early experiments didn’t work well, but they provided crucial insights about the direction of the technology. The company iterated extensively, experimenting with LLMs-as-judges to verify outputs, breaking down problems into smaller pieces, and developing increasingly sophisticated prompting strategies.

The Honk Platform Architecture

What eventually became Honk went through approximately eight major iterations, though only V2 was publicly released as such. The platform’s architecture is notably straightforward, which speaks to the maturity of the underlying technologies. At its core, Honk runs Claude’s agent SDK in Kubernetes pods, with access to a curated set of tools.

In the original version of Honk, tools were a predefined, allowlisted set that the company trusted enough to provide to the agent. However, in V2, this restriction was removed - users can now add their own tools, and the agent has access to any of Spotify’s internal tooling. This represents a significant trust expansion in the system’s capabilities.
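One way to picture the difference between the two tool policies is a registry that either enforces an allowlist or accepts any tool. This is a hypothetical sketch: `ToolRegistry` and the tool names are invented for illustration and do not reflect Honk's implementation or the Claude Agent SDK's API.

```python
from typing import Callable, Dict, Iterable, Optional


class ToolRegistry:
    """Holds the tools an agent may call; optionally restricted to an allowlist."""

    def __init__(self, allowlist: Optional[Iterable[str]] = None) -> None:
        # None means "no restriction" (the open policy described for V2).
        self._allowlist = set(allowlist) if allowlist is not None else None
        self._tools: Dict[str, Callable[..., str]] = {}

    def register(self, name: str, fn: Callable[..., str]) -> None:
        if self._allowlist is not None and name not in self._allowlist:
            raise PermissionError(f"tool {name!r} is not on the allowlist")
        self._tools[name] = fn

    def call(self, name: str, *args: str) -> str:
        return self._tools[name](*args)


# Curated policy: only pre-approved tool names can be registered.
curated = ToolRegistry(allowlist={"run_ci"})
curated.register("run_ci", lambda target: f"build of {target} queued")
try:
    curated.register("team_linter", lambda path: "ok")
    curated_accepts_custom = True
except PermissionError:
    curated_accepts_custom = False

# Open policy: users can add their own tools.
open_registry = ToolRegistry()
open_registry.register("team_linter", lambda path: f"linted {path}")
```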

The most critical tool in Honk’s arsenal is the ability to run verification and CI builds on both Linux and macOS environments. The macOS capability is particularly important for iOS development. The system goes beyond simple compilation - in some cases, it integrates with iOS simulators, enabling workflows where Claude can go directly from Figma designs to UI implementations. This capability has been used for porting TV apps from iOS implementations.

The judge or verifier component that was crucial in early versions of Honk has since been removed. In early iterations, the LLM-as-judge pattern was transformative, taking success rates from 20-30% to approximately 80%. However, by the time Claude Opus 4.5 arrived around November-December of the previous year, the base models had become good enough that the additional verification step was no longer necessary. This is an important lesson about LLMOps architecture: what’s essential at one stage of model capability may become redundant as models improve.
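The LLM-as-judge pattern used in the early iterations can be sketched as a generate-then-review loop. This is a minimal illustration under stated assumptions, not Honk's code: `generate_with_judge`, `fake_generate`, and `fake_judge` are hypothetical names, and the two stand-in functions replace real model calls so the sketch runs without any API access.

```python
from typing import Callable, Optional


def generate_with_judge(
    task: str,
    generate: Callable[[str, Optional[str]], str],
    judge: Callable[[str, str], Optional[str]],
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first candidate the judge accepts, or None if all attempts fail.

    `generate(task, feedback)` proposes a change; `judge(task, candidate)`
    returns None to accept or a textual objection to reject.
    """
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        candidate = generate(task, feedback)
        feedback = judge(task, candidate)
        if feedback is None:
            return candidate
    return None


# Stand-ins for model calls (assumptions for illustration only).
def fake_generate(task: str, feedback: Optional[str]) -> str:
    return "patch-v2" if feedback else "patch-v1"


def fake_judge(task: str, candidate: str) -> Optional[str]:
    return None if candidate == "patch-v2" else "change touches unrelated files"


result = generate_with_judge("rename method", fake_generate, fake_judge)
print(result)
```

Removing the judge, as later versions of Honk did, amounts to returning the first candidate directly and relying on the build and test tools for verification.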

Infrastructure Investments and Engineering Practices

A critical theme throughout the case study is that successful LLM deployment in production required significant investments in traditional software engineering infrastructure. The company had to fundamentally strengthen their test automation practices. Previously, because every team was in the loop for changes to their owned components, test coverage could be somewhat lax - humans could always check PRs manually if needed. But once automated systems began generating and auto-merging PRs without human review, comprehensive test automation became non-negotiable.

This shift in engineering practices represents a broader insight: the trade-off between reliability and speed is often a false dichotomy. Going faster requires automating quality practices so they’re encoded in tests, scripts, and documentation rather than living in engineers’ heads. This investment in infrastructure is what enables true productivity gains, not simply working more hours.

The company’s emphasis on standardization and consistency proved equally important. Spotify had been driving toward more consistent codebases and alignment on frameworks and tools primarily to simplify things for human engineers. This investment paid unexpected dividends when agents entered the picture. When Claude looks at other code in the monorepo for inspiration - a capability that works remarkably well - having consistent patterns means the agent isn’t confused by ten different approaches to the same problem. The more standardized the codebase, the better the agents perform.

The Workflow Transformation

The personal workflow transformation described by Spotify’s CTO illustrates how dramatically LLM tools have changed day-to-day engineering work. The tipping point came specifically with Claude Opus 4.5 in late November or December. Before that moment, the workflow involved having models write perhaps 70-80% of the code, then switching to an IDE for final edits - essentially a smart autocomplete. After that inflection point, the “last mile” editing step largely disappeared.

The current workflow involves running multiple Claude sessions in tmux terminals - typically five to ten tabs with various panes, creating a matrix of Claude sessions matched with terminals across multiple Git worktrees. The bulk of work happens in Spotify’s monorepos, with temporary sessions spun up as needed for polyrepo work. At any given time, there might be five or more agents running in the background working on different aspects of a problem.
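A setup of this shape can be approximated with one Git worktree and one tmux window per task. The helper below only builds the command lines and does not run them; `session_commands`, the repository path, and the task names are hypothetical, and the source does not describe the exact commands used.

```python
from pathlib import PurePosixPath
from typing import List


def session_commands(repo: PurePosixPath, tasks: List[str]) -> List[List[str]]:
    """Build (but do not run) the commands for one worktree and tmux window per task."""
    commands: List[List[str]] = []
    for task in tasks:
        worktree = repo.parent / f"{repo.name}-{task}"
        # Create a new branch named after the task, checked out in its own directory.
        commands.append(
            ["git", "-C", str(repo), "worktree", "add", "-b", task, str(worktree)]
        )
        # Open a tmux window named after the task, starting in that worktree.
        commands.append(["tmux", "new-window", "-n", task, "-c", str(worktree)])
    return commands


plan = session_commands(
    PurePosixPath("/src/monorepo"), ["fix-flaky-test", "bump-library"]
)
for command in plan:
    print(" ".join(command))
```

Separate worktrees keep each agent's edits on its own branch and directory, so parallel sessions do not overwrite each other's files.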

This shift from interactive coding to orchestrating multiple parallel agents represents a fundamental change in how engineering work gets done. Engineers increasingly spend their time on problem decomposition, talking to customers, and rapid prototyping rather than implementation details.

Production Scale and Impact Metrics

The quantitative results of Spotify’s LLM deployment are striking. The company now sees a 75%+ improvement in PR frequency directly attributable to AI tooling. Approximately 73% of PRs are now AI-authored. The company executes roughly 4,500 production deployments per day, with typical time from idea to production deployment now measured in hours rather than the weeks or months it took previously.

These improvements represent a different order of magnitude from traditional engineering productivity gains, which historically might achieve single-digit percentage improvements if you were lucky enough to measure them accurately. The scale of improvement is now so obvious that it doesn’t require sophisticated measurement to detect, though the company is still investing in more precise attribution as the technology matures.

The cost-benefit analysis has evolved as well. Early ROI discussions were straightforward because the improvements were so dramatic and visible. As the technology matures and becomes more expensive to operate at scale, there’s increasing pressure for precision in measuring both the productivity gains and the token/compute costs required to achieve them. Spotify is working on connecting the chain from individual PRs to work items to A/B tests and rollouts, ultimately attributing back to user value and revenue impact.
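The attribution chain described here is essentially a series of joins. The sketch below rolls token cost up from PRs to experiments through work items; every class, field name, and figure is a hypothetical placeholder, since the source does not describe Spotify's data model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PullRequest:
    pr_id: str
    work_item_id: str
    ai_authored: bool
    token_cost_usd: float  # assumed to be recorded per PR


@dataclass(frozen=True)
class WorkItem:
    work_item_id: str
    experiment_id: str  # the A/B test or rollout this work item feeds


def cost_per_experiment(
    prs: List[PullRequest], items: List[WorkItem]
) -> Dict[str, float]:
    """Sum PR token cost per experiment by joining PRs to work items."""
    item_to_experiment = {i.work_item_id: i.experiment_id for i in items}
    totals: Dict[str, float] = {}
    for pr in prs:
        experiment = item_to_experiment.get(pr.work_item_id)
        if experiment is None:
            continue  # PR not linked to any experiment; cannot be attributed
        totals[experiment] = totals.get(experiment, 0.0) + pr.token_cost_usd
    return totals


# Illustrative data only.
items = [WorkItem("W1", "exp-1"), WorkItem("W2", "exp-1"), WorkItem("W3", "exp-2")]
prs = [
    PullRequest("PR1", "W1", True, 1.5),
    PullRequest("PR2", "W2", True, 2.5),
    PullRequest("PR3", "W3", False, 0.25),
    PullRequest("PR4", "W9", True, 3.0),  # unlinked work item
]
totals = cost_per_experiment(prs, items)
print(totals)
```

Unlinked PRs such as `PR4` are the gap that makes precise attribution hard: cost is incurred but cannot be tied to an experiment's outcome.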

Democratizing Development Through Prototyping

One of the most interesting developments has been the emergence of rapid prototyping capabilities accessible to non-engineers. As people throughout the organization - not just engineers - began discovering they could express ideas in natural language and have Claude implement them, they naturally tried to work in Spotify’s real production apps. These are complex codebases, but people started seeing signs they could be productive even without traditional engineering backgrounds.

In response, Spotify built infrastructure to make this simple and safe, creating easy pathways for anyone to build end-to-end prototypes in mobile apps and backends. They established an internal app store for these prototypes where people can share and test each other’s ideas. The user base for this capability extends from engineers unfamiliar with mobile development all the way up to one of Spotify’s co-CEOs, who has prototypes in the store.

This democratization of development capability represents a fundamental shift in how ideas get validated. Previously, getting an idea built required motivating and coordinating a team of engineers. Now, someone with an idea can produce a working prototype within an hour or two, using real data and running in the actual app environment, allowing for immediate user feedback. Ideas that would have taken weeks or months to validate can now be tested in a day.

Critical Success Factors and Architectural Decisions

Several architectural decisions proved critical to Spotify’s success with LLMs in production. The shift to monorepos, while initially concerning from a performance standpoint, actually worked extremely well. There were worries about indexing performance and other issues based on experience with prior tools, but Claude handles the 20+ million line backend monorepo remarkably well. The ability for the model to find inspiration from other code in the repository - seeing similar patterns and approaches - became a significant advantage of the consolidated codebase structure.

The company’s investment in breaking down their codebase into thousands of well-defined components, each with clear ownership, provided crucial structure for both human and AI-driven changes. Each component has comprehensive ownership - the owning team designs, implements, and operates it. This clarity of ownership and boundaries makes automated changes more tractable.

The verification loop through CI/CD infrastructure proved to be perhaps the single most important capability. While many organizations talk about verification in the context of agentic coding systems, there’s a tendency to underinvest in making that verification loop robust and comprehensive. Spotify’s case demonstrates that in closed-loop development where agents work autonomously on tasks without humans in the loop, the quality of the verification system becomes paramount.
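A closed verification loop of this kind can be sketched as: let the agent change the code, run the checks, and feed any failure output back until the checks pass. This is a simplified illustration, not Honk's implementation: `closed_loop`, `run_verification`, and `fake_agent` are hypothetical names, and a one-line Python check stands in for a real CI build.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple


def run_verification(command: List[str], timeout_s: int = 600) -> Tuple[bool, str]:
    """Run a build/test command and return (passed, combined output)."""
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    return proc.returncode == 0, proc.stdout + proc.stderr


def closed_loop(
    apply_change: Callable[[str], None],
    verify_command: List[str],
    max_rounds: int = 5,
) -> bool:
    """Alternate agent edits and verification until the checks pass or rounds run out."""
    failure_log = ""
    for _ in range(max_rounds):
        apply_change(failure_log)  # the agent sees the previous failure output
        passed, output = run_verification(verify_command)
        if passed:
            return True
        failure_log = output
    return False


# Demo with a stand-in agent and a stand-in "CI" check.
target = Path(tempfile.mkdtemp()) / "value.txt"


def fake_agent(failure_log: str) -> None:
    # Writes a wrong value first, then "fixes" it once it has seen a failure.
    target.write_text("42" if failure_log else "41")


check = [
    sys.executable,
    "-c",
    f"import sys; v = open({str(target)!r}).read(); "
    "sys.exit(0 if v == '42' else 'expected 42, got ' + v)",
]
ok = closed_loop(fake_agent, check)
print(ok)
```

The loop is only as trustworthy as `verify_command`: if the checks are weak, the agent converges on changes that pass without being correct, which is why test coverage matters more once no human is reviewing each PR.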

Evolving Nature of Engineering Work

The case study provides valuable perspective on how engineering work itself is changing. From a personal standpoint, Spotify’s CTO describes being initially worried about whether he would miss the “hard mental challenge of solving problems” - the aspect of coding he genuinely enjoyed, even doing competitive programming for fun in his spare time. The reality turned out quite different: what he actually enjoys is solving problems, and the specific method of solving them (writing code directly versus orchestrating agents) turned out to be less critical than anticipated.

The new mode of working enables solving problems that were previously inaccessible - jumping into unfamiliar codebases and contributing meaningfully in situations that would have taken days or weeks to ramp up on previously. The shift in how time is spent moves from implementation work to thinking about what’s next, customer conversations, and surprisingly, even more prototyping than before - both for external products and internal automations.

This represents important guidance for engineers navigating the transition: focus on the types of problems you’re able to solve rather than fixating on the specific methods of solving them. The fundamental satisfactions of engineering work - problem-solving, creating value, building things that work - remain intact even as the implementation details change dramatically.

Lessons and Broader Implications

The Spotify case study offers several important lessons for organizations deploying LLMs in production. First, fundamental engineering practices - test automation, standardization, clear ownership boundaries - remain as important as ever and in some ways become more critical when AI agents are operating in your codebase. These aren’t optional nice-to-haves but essential infrastructure for successful LLM deployment.

Second, the iterative nature of the journey matters. Spotify didn’t achieve their current state in one leap; it required many cycles of experimentation, failure, learning, and refinement over several years. The willingness to start experimenting early, even when results were poor, built institutional knowledge and positioned them to take advantage of improvements in model capabilities.

Third, the combination of improved models and improved system design compounds in powerful ways. The removal of the judge component from Honk wasn’t just about models getting better - it was models getting better in the context of an entire system that had been refined through multiple iterations. Organizations should expect to continuously re-evaluate their architectures as capabilities improve.

Finally, the case demonstrates that truly transformative productivity gains from LLMs require rethinking workflows and processes, not just adding AI tools to existing practices. Spotify’s success came from fundamentally restructuring how code changes flow through their organization, how prototypes get built, and how engineers spend their time - not from simply giving engineers better autocomplete.

The scale of impact - 75% improvement in PR frequency, 73% AI-authored PRs, 4,500 daily deployments - suggests that organizations successfully deploying LLMs in production can expect gains far beyond the single-digit improvements typical of earlier productivity efforts, but achieving those results requires comparably large investments in supporting infrastructure and process changes.
