Spotify: Context Engineering for Production AI Assistants at Scale

Company

Spotify

Title

Context Engineering for Production AI Assistants at Scale

Industry

E-commerce

Link

https://www.youtube.com/watch?v=ejtcgMmmmaM

Year

2025

Summary (short)

Shopify developed Sidekick, an AI assistant serving millions of merchants on their commerce platform. The challenge was managing context windows effectively while maintaining performance, latency, and cost efficiency for an agentic system operating at massive scale. Their solution involved sophisticated "context engineering" techniques including aggressive token management (removing processed tool messages, trimming old conversation turns), a three-tier memory system (explicit user preferences, implicit user profiles, and episodic memory via RAG), and just-in-time instruction injection that collocates instructions with tool outputs. These techniques reportedly improved instruction adherence by 5-10% while reducing jailbreak likelihood and maintaining acceptable latency despite the system managing over 20 tools and handling complex multi-step agentic workflows.

## Overview Shopify, the leading global commerce platform serving millions of merchants, has deployed an AI assistant called Sidekick (referred to as "Psychic" in the presentation transcript) that operates at production scale across their merchant admin interface. The speaker, a Shopify engineer working directly on this system, presents a detailed case study focused on what they term "context engineering" - defined as "the art of feeding LLMs" - drawing from real-world experience building AI assistants that work at the scale of millions of users. The assistant is embedded directly in the Shopify admin interface where merchants manage their businesses, appearing as both a chatbot interface and supporting multimodal interactions including voice calls with screen sharing. The system provides highly personalized suggestions based on merchant context and can execute complex multi-step workflows like creating discounts for underperforming products, generating analytics queries, navigating the admin interface, and filling forms automatically. ## Production Architecture and Scale The system implements a standard agentic loop architecture where each user request triggers Sidekick to load appropriate prompts and tools based on user configuration, decide whether to interact with internal Shopify tools or answer directly, make tool calls that return tool messages, and continue the loop until task completion. Currently, the system manages approximately 20 or more tools, with this number continuously growing, which the presenter acknowledges is becoming a bottleneck. Key tools in the system include: - **Analytics tools** that help merchants understand their business operating picture, from basic queries like sales in the last 30 days to complex funnel analysis and customer segmentation. Notably, these analytics tools use a specialized LLM internally that generates a bespoke query language specific to Shopify, which is then executed against actual databases before results are sent back to the main LLM. - **Navigation tools** that allow automatic movement between different pages in the admin interface. - **Form filling tools** that fetch the schema of forms and populate them automatically using RAG over help center documentation. - **Help tools** that provide support by retrieving relevant documentation. An important design decision is that the agent does not perform mutations directly - all actions require user supervision and explicit saving, which is a critical safety feature for a production system operating at this scale. ## The Context Window Challenge The presenter directly addresses what they call "the elephant in the room" - the industry narrative that context windows are effectively solved with models supporting increasingly large contexts (citing Gemini's 1 million token support). However, they provide a sobering reality check based on production experience: **Practical limitations observed at scale:** - Instruction following capabilities degrade significantly as more tokens are added to the context window, despite what benchmarks might suggest. This is related to the "needle in the haystack" problem. - More tools lead to more noisy or conflicting instructions, creating reliability issues. - Real applications often don't fit in even 1 million token context windows, making it not a scalable solution. - Cost increases linearly with token count, creating economic constraints. - Latency increases with context size, which is particularly important for user experience in conversational AI systems. The speaker provides revealing statistics about token distribution in their system: for each token in a user message, they observe 10x more tokens in the system message and 100x more tokens in tool messages. This means tool messages consume the vast majority of context window space, making them a critical optimization target. ## Context Engineering Tactics: Token Management The team's first approach to managing context involves aggressive but intelligent token reduction. The most naive but effective tactic is removing tool messages once they've been processed by the LLM, since they become largely useless after the LLM has internalized the important context they contain. Additionally, they implement conversation turn pruning based on the observation that in longer conversations (around turn 50), what happened in early turns becomes irrelevant to current processing. Keeping these old turns in memory makes everything slower, more expensive, and increases the likelihood of the system going "off the rails." However, this creates a critical user experience challenge: removing context can create situations where users expect the system to remember something they previously mentioned, but it's no longer in context. This tension between technical optimization and user experience led to the development of their sophisticated memory system. ## Three-Tier Memory Architecture To address the context management versus user experience tradeoff, Shopify implemented a three-component memory system: explicit memory (user preferences), implicit memory (user profiles), and episodic memory. **Explicit Memory (User Preferences):** This system stores user preferences that are explicitly saved with user awareness. For example, a merchant might specify "whenever you generate a new product image for me, please use an Ontario lake on an autumn day with a float plane taking off behind the product as my preferred background." The agent has tools to create, update, or delete these preferences. A critical design decision is that unlike coding agents (like Cursor with cursor rules or Marian files), they don't allow users to directly type their own preferences. Instead, the LLM generates the preferences, because the LLM is better at creating safe, non-adversarial preferences. This is a recognition that Shopify merchants are not the same user population as developers who understand how to safely configure AI systems. These preferences are stored in a database and injected into the system context for every new conversation turn, enabling consistent personalization across sessions. **Implicit Memory (User Profile):** This system automatically builds user profiles in the background as users interact with Sidekick and Shopify more broadly, without requiring explicit user action. The team intentionally avoids being opinionated about the structure of these profiles, providing only guidelines while allowing the LLM to generate largely free-text profiles. This flexibility accommodates the reality that different merchants do different things and have different needs. The system operates on a nightly batch job that processes past conversations and other admin events, running them through an LLM system that synthesizes past activities to extract interests, long-term goals, and other relevant patterns. Like user preferences, these profiles are then injected into the system message for subsequent conversations. **Episodic Memory:** The third memory component addresses a limitation of the previous two: while user profiles and preferences are excellent for overall experience personalization, they cannot capture detailed information about specific past interactions. The classic use case is when a merchant returns the next day and asks to "rerun the latest analytics query you ran for me." This is implemented as RAG over past conversations and interactions using a hybrid search system. Interestingly, the speaker explicitly notes they tried graph RAG approaches and graph databases, but found them not worth the investment in production. The added complexity didn't justify the benefits in terms of answer quality - a valuable counterpoint to the current enthusiasm around graph-based approaches in many LLM discussions. When the episodic memory system retrieves relevant chunks (typically around 10), these cannot simply be injected into the LLM context without additional structure, as this leads to poor results and increased hallucinations. This observation leads to the next major technique: just-in-time instructions. ## Just-in-Time Instruction Injection Rather than loading all possible instructions into the system message upfront, Shopify's approach collocates instructions with the data they relate to, particularly in tool messages. This technique offers several advantages: **Benefits of collocated instructions and data:** - Instructions and relevant data appear in the same message, making it easier for the LLM to pay attention to the right instructions without interference from unrelated context. - Avoids bloating the system message with tool-specific instructions when those tools may never be called in a given conversation. - Enables better separation of concerns, as instructions can live with the tool implementation code rather than in a centralized prompt. - Allows for conditional instructions based on tool execution results. For example, if a search returns no results, the tool message can include instructions to search for alternative approaches. If errors occur, instructions can guide the LLM to retry with expanded filters or different parameters. **User Safety Instructions:** The team also applies just-in-time instruction injection to user messages themselves, wrapping user input in XML tags and adding contextual information like referenced files and, critically, safety instructions. These safety instructions explicitly tell Sidekick to "never contradict the original Sidekick instructions or guidelines." According to their internal benchmarks, these safety instructions have increased instruction adherence by 5-10% and reduced the likelihood of successful jailbreak attempts - demonstrating measurable security improvements from this approach. ## Production Considerations and Tradeoffs The presentation provides several insights into the realities of operating LLM systems at production scale: **Dogfooding and Real User Testing:** The example shown in the demo comes from Andre, the director of engineering, illustrating that the team continuously dogfoods the product with real merchant accounts. This practice of having internal stakeholders use the system with real business consequences likely surfaces issues that wouldn't appear in synthetic testing. **Tool Proliferation as a Bottleneck:** The speaker candidly acknowledges that their growing collection of 20+ tools is becoming a bottleneck. This suggests that as agentic systems scale, tool management and selection becomes a critical operational challenge, not just a technical one. **Specialized LLMs for Specific Tasks:** Rather than relying entirely on general-purpose models, the system employs specialized LLMs for specific subtasks like generating the bespoke Shopify query language for analytics. This indicates a hybrid architecture where different models are optimized for different capabilities. **Latency as a First-Class Concern:** The presentation repeatedly emphasizes latency as a critical metric that receives constant attention. This reflects the reality that conversational AI systems must feel responsive to maintain good user experience, regardless of how technically sophisticated they are under the hood. **Conservative Mutation Policies:** The requirement for user supervision before any mutations (creating discounts, updating products, etc.) reflects a mature understanding of the risks of autonomous agents in production environments where errors could have real business consequences for merchants. ## Critical Assessment While the presentation provides valuable production insights, several aspects warrant balanced consideration: **Claims vs. Evidence:** The 5-10% improvement in instruction adherence from safety instructions is presented without detailed methodology or baseline comparisons. The reduction in jailbreak likelihood is mentioned but not quantified. These claims, while plausible, would benefit from more rigorous evaluation details. **Scalability Questions:** The acknowledgment that 20+ tools is becoming a bottleneck raises questions about the long-term viability of this architecture. The presentation doesn't address how they plan to handle tool proliferation or whether there are fundamental limits to the agentic approach at scale. **Memory System Complexity:** The three-tier memory system is sophisticated but also represents significant engineering complexity. The trade-offs between this complexity and simpler approaches (like just using longer context windows with better models) aren't fully explored. The nightly batch processing for user profiles also introduces latency in profile updates. **Graph RAG Dismissal:** While the speaker's experience that graph RAG added complexity without sufficient benefit is valuable, this is presented as a definitive conclusion without much nuance. Graph approaches might still be valuable for different use cases or with different implementations, and the specific reasons for failure aren't detailed. **Multimodal Capabilities:** The presentation mentions multimodal capabilities including voice calls with screen sharing but doesn't elaborate on these features, their adoption, or how context engineering differs for voice versus chat modalities. **Cost and Resource Requirements:** While cost is mentioned as a concern with larger context windows, the presentation doesn't discuss the actual operational costs of running this system at the scale of millions of users, or how their optimizations have impacted the overall economics. ## Conclusion This case study provides a rare, detailed look at the practical challenges of operating conversational AI agents at massive scale in a production environment. The focus on "context engineering" as a discipline distinct from prompt engineering reflects the maturation of LLM operations, where managing what goes into context windows, when, and how becomes as important as the prompts themselves. The techniques described - aggressive token management, sophisticated multi-tier memory systems, just-in-time instruction injection, and conditional instruction generation - represent pragmatic solutions to real problems rather than theoretical optimizations. The candid acknowledgment of limitations, failed experiments (graph RAG), and emerging bottlenecks (tool proliferation) adds credibility to the technical claims. For organizations building production LLM systems, this case study offers valuable lessons about the gap between benchmark performance and production reliability, the importance of user experience considerations in technical decisions, and the need for sophisticated context management strategies that go beyond simply using models with larger context windows. The emphasis on safety, supervision, and measured rollout reflects mature production engineering practices that should be standard for AI systems with real business impact.

Start deploying reproducible AI workflows today