Spotify faced the challenge of migrating approximately 1,800 direct downstream data pipelines across multiple frameworks away from two deprecated user datasets, work that would have required an estimated 10 engineering weeks of manual effort. The company deployed their internal background coding agent "Honk" (built on Claude) in conjunction with their Backstage developer platform and Fleet Management tools to automate the migration process. The solution successfully generated 240 automated migration pull requests, particularly for standardized frameworks like BigQuery Runner and dbt, though it encountered challenges with less standardized frameworks like Scio and revealed the importance of comprehensive context engineering and automated testing infrastructure for successful agent-driven migrations.
This case study examines Spotify’s deployment of “Honk,” their internal background coding agent built on Claude (from Anthropic), to automate large-scale dataset migrations across their data infrastructure. As part 4 in Spotify’s series on background coding agents, this piece focuses specifically on how LLMs in production were used to handle the deprecation of two heavily-used user datasets affecting approximately 1,800 direct downstream data pipelines and several thousand more indirectly. The migration needed to be completed within six months and would have consumed an estimated 10 engineering weeks of manual effort.
The deployment represents a practical application of LLMs in production for software maintenance at enterprise scale, integrating agent-based code generation with existing developer tooling (Backstage, Fleet Management) to tackle repetitive engineering work. However, the case study is also notably transparent about limitations encountered and lessons learned, particularly around the importance of standardization and testing infrastructure for successful agent deployment.
Honk operates as a background coding agent that integrates with Spotify’s existing developer platform infrastructure. The system works in conjunction with Backstage (Spotify’s open-source developer portal) and their Fleet Management tools, particularly the Fleetshift plugin designed for orchestrating migrations across multiple repositories.
The workflow began with Backstage’s endpoint lineage and Codesearch plugins, which provided visibility into dataset dependencies across Spotify’s GitHub Enterprise landscape. The endpoint lineage feature displayed all downstream consumers of deprecated datasets, giving immediate insight into migration scope. Codesearch enabled query-based identification of target repositories that would need code changes. The Fleetshift plugin then orchestrated the actual migration process across the identified repositories, providing a centralized UI for monitoring progress, viewing automated pull requests, and facilitating communication with repository owners.
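To make that flow concrete, the sketch below shows how such targeting logic might look; the dataset names and the lineage, Codesearch, and Fleetshift helpers are illustrative stand-ins, not real Backstage or Fleetshift APIs.

```python
# Illustrative sketch of the migration-targeting flow described above. Dataset names
# and the lineage/codesearch helpers are placeholders, not real Backstage APIs.

from dataclasses import dataclass

DEPRECATED_DATASETS = ["deprecated_user_dataset_a", "deprecated_user_dataset_b"]  # placeholders


@dataclass
class Consumer:
    repository: str
    pipeline: str


def fetch_downstream_consumers(dataset: str) -> list[Consumer]:
    """Stand-in for Backstage endpoint lineage: direct downstream consumers of a dataset."""
    return []  # in reality, populated from the lineage plugin


def codesearch_repositories(dataset: str) -> list[str]:
    """Stand-in for the Codesearch plugin: repositories whose code references the dataset."""
    return []  # in reality, populated from a Codesearch query


def plan_fleetshift_targets(datasets: list[str]) -> list[str]:
    """Collect the set of repositories Fleetshift should run Honk against."""
    targets: set[str] = set()
    for dataset in datasets:
        targets.update(c.repository for c in fetch_downstream_consumers(dataset))
        targets.update(codesearch_repositories(dataset))
    return sorted(targets)


if __name__ == "__main__":
    # Fleetshift would then orchestrate one automated Honk PR per target repository.
    print(plan_fleetshift_targets(DEPRECATED_DATASETS))
```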
At the time of these migrations, Honk had intentional design constraints: it did not have access to Claude skills, external tools (such as Model Context Protocol/MCP servers), or custom configurability during execution. This was a deliberate guardrailing decision to establish boundaries around possible outcomes during migrations. The agent could not, for instance, dynamically fetch dataset schemas or read external documentation beyond what was provided in its context. This placed significant emphasis on upfront context engineering.
The case study identifies context engineering as “the part of the build that took the most time and iteration to get right, and also where we learnt the most.” This represents a key LLMOps insight: the quality and comprehensiveness of context provided to LLM agents directly determines their effectiveness in production scenarios.
Spotify faced the challenge of handling three different data pipeline frameworks with varying degrees of standardization: BigQuery Runner and dbt, which follow relatively standardized patterns, and Scio, which is far less standardized.
The initial approach attempted to use Claude to repurpose human-written migration guides into agent context. This proved insufficient—Honk made incorrect assumptions about field mappings when context was not comprehensive enough. The team learned that without access to external skills or tools, the prompt and context had to be exhaustively detailed.
For the more standardized frameworks (BigQuery Runner and dbt), success came from creating fine-grained instructions using tables to explicitly map field transformations, removing ambiguity about how to migrate from one dataset to another. The context also included explicit instructions about when not to attempt automated migration—for cases requiring use-case-specific judgment, Honk was instructed to leave fields unchanged but add comments with links to human migration guides, effectively creating a hybrid automation approach.
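A minimal sketch of how such an explicit mapping and the "leave unchanged plus comment" rule might be encoded is shown below; the field names, the NEEDS_HUMAN sentinel, and the guide URL are hypothetical, not taken from Spotify's actual context files.

```python
# Hypothetical encoding of the fine-grained migration instructions described above.
# Field names, the NEEDS_HUMAN sentinel, and the guide URL are illustrative only.

NEEDS_HUMAN = object()  # marks fields whose migration requires use-case-specific judgment

FIELD_MAPPING = {
    "old_user_id": "new_account_id",          # unambiguous one-to-one rename
    "country_code": "registration_country",   # unambiguous one-to-one rename
    "activity_score": NEEDS_HUMAN,            # no direct equivalent in the new dataset
}

HUMAN_GUIDE_URL = "https://example.internal/migration-guide"  # placeholder link


def migrate_field(field: str) -> tuple[str, str | None]:
    """Return the migrated field name and an optional comment for the pull request."""
    target = FIELD_MAPPING.get(field)
    if target is None:
        return field, None  # field is unaffected by the deprecation
    if target is NEEDS_HUMAN:
        # Hybrid approach: leave the field unchanged, but surface a pointer for humans.
        return field, f"TODO: migrate manually, see {HUMAN_GUIDE_URL}"
    return target, None


print(migrate_field("old_user_id"))     # ('new_account_id', None)
print(migrate_field("activity_score"))  # ('activity_score', 'TODO: migrate manually, ...')
```

Spelling out every field in a table of this kind is tedious, but it is precisely what removed the ambiguity that had caused Honk to guess at field mappings.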
For Scio pipelines, the lack of standardization combined with Honk’s inability to access external context made comprehensive prompts “very unwieldy.” Rather than force a suboptimal solution, the team made the pragmatic decision to exclude Scio migrations from the automated approach at that time. This represents sound LLMOps judgment: recognizing when agent limitations make human intervention more appropriate.
A significant LLMOps challenge emerged around automated verification. Honk includes built-in capability to verify its work and adjust based on results—a critical feature for production LLM systems to ensure quality and correctness. However, this feature depends on the existence of build-time unit tests in target repositories.
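The sketch below illustrates what a verify-and-adjust loop of this kind could look like, assuming the target repository exposes build-time tests through a shell command; the test command, the apply/revise helper, and the retry limit are assumptions rather than Honk's actual implementation.

```python
# Conceptual sketch of an agent verify-and-adjust loop, assuming the target repository
# exposes build-time tests via a shell command. The test command, the apply/revise
# helper, and the retry limit are assumptions, not Honk's actual implementation.

import subprocess

TEST_COMMAND = ["make", "test"]  # placeholder for the repo's build-time test entry point
MAX_ATTEMPTS = 3


def apply_migration(repo_path: str, feedback: str | None = None) -> None:
    """Stand-in for the agent generating (or revising) the migration edit."""
    # In practice this is where the LLM produces code changes, optionally informed
    # by test output from the previous attempt.
    ...


def tests_pass(repo_path: str) -> tuple[bool, str]:
    """Run the repository's build-time tests and capture their output."""
    result = subprocess.run(TEST_COMMAND, cwd=repo_path, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def migrate_with_verification(repo_path: str) -> bool:
    feedback = None
    for _ in range(MAX_ATTEMPTS):
        apply_migration(repo_path, feedback)
        ok, feedback = tests_pass(repo_path)
        if ok:
            return True  # open the automated PR with passing tests
    return False  # escalate to the owning team for manual review
```

Without build-time tests in the target repository, a loop like this has nothing to check against, which is exactly the gap Spotify ran into.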
Unlike Scio pipelines, the BigQuery Runner and dbt repositories across Spotify rarely implemented build-time unit testing. This meant Honk’s self-verification capability was unavailable, forcing reliance on downstream teams to perform manual testing before merging automated pull requests. This limitation highlights an important LLMOps principle: the effectiveness of autonomous coding agents depends heavily on existing engineering infrastructure and practices.
The case study identifies this as a lesson for the future—Spotify recognizes the need to “enforce requirements for testing and validation across repositories so that agents like Honk can verify their work in an automated fashion.” This represents mature thinking about LLMOps deployment: successful agent adoption requires organizational investment in complementary infrastructure, not just the agent itself.
Despite the limitations, the deployment successfully generated 240 automated migration pull requests using Fleetshift. While the text doesn’t provide detailed merge rates or quality metrics (which would strengthen claims about effectiveness), the system demonstrably cut into the estimated 10 engineering weeks of manual effort.
The Backstage and Fleetshift integration proved particularly valuable for operational management. The Fleetshift plugin provided a centralized overview UI showing migration progress across all repositories, with easy navigation to individual automated PRs. The case study emphasizes this was “invaluable for troubleshooting, progress monitoring, and facilitating communication with the owning teams”—highlighting that LLMOps success requires not just generation capability but also tooling for monitoring, management, and human oversight.
The case study articulates clear strategic insights about what’s needed for background coding agents to succeed at scale:
Standardization requirements: The stark difference in success between standardized frameworks (BigQuery Runner, dbt) and variable frameworks (Scio) demonstrates that LLM agent effectiveness in production depends significantly on underlying codebase consistency. Spotify identifies “the strategic push to consolidate and standardise our data landscape” as critical for future agent success.
Testing infrastructure: The inability to use Honk’s self-verification features due to lack of unit tests reveals that autonomous agent deployment requires investment in automated testing infrastructure across the organization.
Agent capability evolution: The Honk roadmap includes features for agents to gather their own context (reading JIRA tickets, documentation) before making code changes, reducing the burden of comprehensive upfront context engineering. This represents evolution toward more autonomous LLM systems that can navigate information environments more independently.
The case study mentions underlying improvements in “Claude Code capabilities” over time, acknowledging that LLMOps effectiveness will improve as foundation models themselves improve—a factor somewhat outside Spotify’s direct control but important to account for in long-term planning.
While this case study comes from Spotify and includes promotional elements (mentioning the Fleetshift product and related webinars), it demonstrates notable technical honesty about limitations:
The estimated “10 engineering weeks” saved should be viewed as approximate—the text doesn’t provide methodology for this calculation or account for time spent on context engineering, prompt iteration, PR review overhead, or fixing failed migrations. The 240 automated PRs represent a subset of the 1,800 total pipelines needing migration, suggesting significant manual work likely remained.
The case study represents a maturing LLMOps practice that recognizes coding agents as tools requiring careful deployment within appropriate contexts, rather than universal solutions. The emphasis on organizational changes (standardization, testing requirements) alongside agent improvements reflects sophisticated understanding that successful LLMOps requires sociotechnical system design, not just deploying powerful models.
This case study exemplifies several important LLMOps patterns for production deployment:
Context engineering as a discipline: The significant time investment in creating comprehensive context files, the iteration required to get them right, and the explicit use of structured formats (tables) to reduce ambiguity all point to context engineering as a specialized skill for LLMOps practitioners.
Guardrails and constraints: The deliberate choice to limit Honk’s capabilities (no external tools/skills) during migration represents a risk management approach—constraining agent behavior to predictable patterns even if it reduces flexibility.
Hybrid automation: Rather than attempting full automation everywhere, the approach of having Honk add helpful comments and links for human engineers in complex cases represents pragmatic human-agent collaboration.
Infrastructure dependencies: The case clearly demonstrates that LLM agent success depends on surrounding infrastructure—lineage tracking, code search, testing frameworks, monitoring UIs, and standardized codebases all contribute to effectiveness.
Knowing when not to automate: The decision to exclude Scio migrations represents mature judgment about agent limitations and appropriate use cases.
The integration with Backstage and Fleet Management also demonstrates the importance of connecting LLM capabilities with existing developer workflows rather than creating isolated agent systems. The ability to monitor progress, view PRs, and communicate with teams through familiar tooling likely contributed significantly to adoption and trust.