## Overview
This case study describes how Amazon's Alexa team completely rearchitected their voice assistant infrastructure to incorporate large language models and transform it from a scripted, command-based system into Alexa Plus, a generative AI-powered conversational assistant. The presentation was delivered by Brittany Hurst (AWS relationship lead), Sairu Pauguti (Alexa AI product lead), and Lou Tran (engineer on Alexa Plus) at what appears to be an AWS conference.
The fundamental challenge was extraordinarily complex: evolve a production system serving over 600 million devices from deterministic, scripted commands to natural, generative AI-powered conversations without breaking any existing integrations or compromising the reliability customers expect. This wasn't a greenfield project but rather a transformation of a decade-old, massive-scale production system that customers depend on daily. The original Alexa launched in 2014 with just 13 skills from a single developer (Amazon) and has grown to integrate with over a billion connected devices globally.
The presentation is notable for its candor about the challenges encountered and the iterative, sometimes counterintuitive nature of working with LLMs in production at scale. The team emphasized that many traditional assumptions about software engineering don't apply when working with non-deterministic LLMs, and they had to invent new techniques and approaches along the way.
## Core Technical Challenges
The team identified four major technical challenges that they believe are relevant to anyone building generative AI applications with LLMs in production: accuracy, latency, determinism, and model flexibility. These weren't isolated concerns but deeply interconnected problems that often required balancing competing trade-offs.
### Accuracy Challenges and Solutions
Getting LLMs to consistently do what's needed in a production voice assistant proved far more difficult than initially anticipated. While LLMs excel at understanding natural language and determining intent—which might be sufficient for a chatbot—Alexa takes real-world actions like controlling smart home devices, playing music, and booking tickets. The stakes are much higher when real-world actions are involved, and in a complex system like Alexa with many sequential steps, errors compound quickly. This made it critical to drive up accuracy at every stage, especially during LLM inference cycles.
The first major accuracy challenge involved routing. When a customer makes a request like "let me know when Daisy doesn't get fed by noon," the LLM must select from a universe of tools, experts (Alexa's term for domain-specific integrations), and agents to determine which can handle the request. This might be a notification expert, reminder expert, or calendar expert. The team emphasized this is the most critical and difficult step because getting routing wrong makes downstream recovery nearly impossible.
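As a hedged illustration of this routing step, the sketch below builds a routing prompt over a small expert catalog and asks the model to pick exactly one expert. The expert names, the `call_llm` callable, and the JSON answer format are assumptions for illustration, not Alexa's actual interfaces.

```python
import json

# Illustrative expert registry; the real Alexa experts and their descriptions are not public.
EXPERTS = {
    "notification_expert": "Sends one-off or conditional notifications to the customer.",
    "reminder_expert": "Creates time- or event-based reminders.",
    "calendar_expert": "Reads and writes calendar events.",
    "music_expert": "Plays music on household speakers.",
}

def build_routing_prompt(utterance: str) -> str:
    catalog = "\n".join(f"- {name}: {desc}" for name, desc in EXPERTS.items())
    return (
        "Select the single expert best suited to handle the request.\n"
        f"Available experts:\n{catalog}\n"
        f'Request: "{utterance}"\n'
        'Answer with JSON: {"expert": "<name>"}'
    )

def route(utterance: str, call_llm) -> str:
    """call_llm is a placeholder for whatever inference client is actually in use."""
    raw = call_llm(build_routing_prompt(utterance))
    expert = json.loads(raw)["expert"]
    if expert not in EXPERTS:
        raise ValueError(f"Model selected unknown expert: {expert}")
    return expert
```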
After routing, the LLM must determine which specific APIs that expert offers should be invoked and what parameters those APIs require. For a reminder, this includes when to fire it, what conditions trigger it, the frequency, the target for notification, and the message content. All these values must be retrieved at runtime by the system as orchestrated by the LLM. This represents complex planning that the LLM performs through multiple inference cycles, each of which can add to or reduce overall accuracy.
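To make the planning step concrete, here is a hypothetical parameter schema a conditional-reminder API might expose, along with one argument set the LLM would have to assemble at runtime. All names and fields are invented for illustration.

```python
# Hypothetical parameter schema for a conditional-reminder API; the real Alexa
# expert interfaces are not public. The LLM's planning must produce a complete,
# valid argument set like this across one or more inference cycles.
CREATE_CONDITIONAL_REMINDER = {
    "name": "create_conditional_reminder",
    "parameters": {
        "condition": "event that must NOT have occurred (e.g. 'dog fed')",
        "deadline_local_time": "when to evaluate the condition, e.g. '12:00'",
        "recurrence": "how often to repeat, e.g. 'daily'",
        "notify_target": "device or app profile to notify",
        "message": "text to deliver if the condition is unmet",
    },
}

# One possible argument set the LLM might produce for the Daisy request.
example_call = {
    "condition": "Daisy has been fed",
    "deadline_local_time": "12:00",
    "recurrence": "daily",
    "notify_target": "owner_mobile_app",
    "message": "Daisy hasn't been fed yet today.",
}
```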
The team's experience with improving accuracy revealed several counterintuitive lessons. Initially, they provided exemplars showing how to invoke APIs, which helped when state-of-the-art LLMs weren't as capable. However, as they encountered bugs and kept adding more examples, accuracy actually decreased. They learned that overloading the context and prompts with too much information causes LLMs to overfit, acting too specifically for particular use cases. Like humans, LLMs have limited attention spans—overloading them with information, especially irrelevant information, makes them "forgetful" and prone to errors. Examples added for different bug fixes could also contradict one another, creating particularly tricky situations. Ultimately, the team had to remove many of these exemplars to improve accuracy.
The team also discovered a "balloon squeezing" effect where fixing problems in one area created problems elsewhere. For smart home use cases, providing context about which lights exist in the household helps the LLM understand which light the customer means when they say "turn on the lights." However, this same context about lights is irrelevant when the customer says "play some music" and actually reduces accuracy for music playback. This led to extensive work on context engineering—carefully determining what context to provide for different use cases.
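A minimal sketch of per-use-case context selection: smart home device lists help "turn on the lights" but only dilute attention for "play some music", so context is chosen based on the predicted use case. The use-case labels and fetcher functions below are illustrative assumptions.

```python
def fetch_smart_home_devices(customer_id):
    return {"lights": ["kitchen", "bedroom", "porch"]}   # stub data

def fetch_music_preferences(customer_id):
    return {"recently_played": ["jazz playlist"]}        # stub data

def fetch_recent_conversation(customer_id):
    return {"topic": "travel destinations"}              # stub data

CONTEXT_BY_USE_CASE = {
    "smart_home": [fetch_smart_home_devices],
    "music": [fetch_music_preferences],
    "chitchat": [fetch_recent_conversation, fetch_music_preferences],
}

def gather_context(use_case, customer_id):
    # Only context relevant to this use case is injected into the prompt;
    # everything else is deliberately left out to avoid diluting attention.
    return [fetch(customer_id) for fetch in CONTEXT_BY_USE_CASE.get(use_case, [])]
```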
API refactoring helped address some accuracy issues by making APIs more obvious so the LLM could figure them out without extensive examples. However, the team learned that being too obvious could be harmful. For instance, if an API or expert is called "create reminder," adding a prompt instruction saying "use this API to create a reminder" is redundant and actually harmful due to overfitting and prompt overload.
The presentation emphasized that the early excitement when they first got the system working gave way to the realization that far more work lay ahead than they had completed. This honest assessment reflects the reality that LLMOps at scale involves continuous iteration and problem-solving that extends well beyond initial prototypes.
### Latency Optimization
After achieving acceptable accuracy, the team faced a new challenge: everything was too slow. For a voice assistant, customers expect near-instantaneous responses, especially for actions like turning on lights. Unlike chatbots where users can tolerate gradual text streaming and "thinking" indicators, Alexa needed to respond within a second or two to avoid awkward pauses that break the conversational flow.
The team started with traditional latency reduction techniques: parallelization (calling independent APIs simultaneously rather than sequentially), streaming (changing finish-to-start dependencies into start-to-start dependencies so processing can begin as soon as possible), and prefetching (loading the right context proactively). With streaming, Alexa could begin selecting an appropriate expert as soon as it heard "let me know when" without waiting for the rest of the utterance. Prefetching allowed Alexa to gather device information, timezone configuration, account details for personalization, and other context as soon as the wake word "Alexa" was spoken, before the actual request was completed.
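A minimal sketch of the prefetching and parallelization ideas, assuming an asyncio-based orchestrator and stand-in fetch functions: as soon as the wake word arrives, independent context lookups run concurrently rather than serially.

```python
import asyncio

# The fetch functions and their return values are illustrative stand-ins for
# real downstream lookups triggered when the wake word is detected.
async def fetch_device_info(customer_id):
    await asyncio.sleep(0.05)   # simulated downstream call
    return {"devices": ["kitchen_speaker"]}

async def fetch_timezone(customer_id):
    await asyncio.sleep(0.05)
    return "America/Los_Angeles"

async def fetch_personalization(customer_id):
    await asyncio.sleep(0.05)
    return {"favorite_team": "Red Sox"}

async def on_wake_word(customer_id):
    # Parallelization: all three lookups run concurrently instead of serially,
    # before the customer has finished speaking the actual request.
    devices, tz, prefs = await asyncio.gather(
        fetch_device_info(customer_id),
        fetch_timezone(customer_id),
        fetch_personalization(customer_id),
    )
    return {"devices": devices, "timezone": tz, "personalization": prefs}

# asyncio.run(on_wake_word("customer-123"))
```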
However, these traditional techniques quickly reached their limits. The team learned a crucial lesson about LLM inference: there's a massive difference between processing input tokens and generating output tokens. Generating an output token is multiple orders of magnitude more time-consuming than processing an input token. This insight led them to be meticulous about output tokens.
Chain-of-thought reasoning, where you prompt the LLM to "think out loud" by generating reasoning steps in its output, can improve accuracy and is invaluable for debugging during development. However, generating all those reasoning tokens has enormous latency impacts in production. The team compared leaving chain-of-thought on in production to turning on trace-level logging in services and flushing to disk on every request—great for troubleshooting and development, but unacceptable for production systems at scale.
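A back-of-the-envelope model makes the trade-off tangible. The per-token timings below are assumptions chosen only to illustrate the shape of the cost, not measured Alexa numbers: because output tokens are generated one at a time, a verbose chain-of-thought answer dwarfs the cost of even a long prompt.

```python
# Illustrative per-token costs (assumptions, not measurements): prefill processes
# input tokens in parallel, while decode emits output tokens sequentially.
PREFILL_MS_PER_TOKEN = 0.1
DECODE_MS_PER_TOKEN = 20.0

def estimate_latency_ms(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * PREFILL_MS_PER_TOKEN + output_tokens * DECODE_MS_PER_TOKEN

# A 2,000-token prompt with a terse 20-token answer vs. a 200-token chain-of-thought answer:
print(estimate_latency_ms(2000, 20))    # ~600 ms
print(estimate_latency_ms(2000, 200))   # ~4200 ms
```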
For input tokens, caching became essential. Many customer utterances are similar (like "Alexa, stop"), and much of the prompt remains constant across utterances (identity, capabilities, available tools and experts, instructions). The team had to invent prompt caching early in their development because it wasn't yet available from model providers—though it's now ubiquitous and taken for granted. When implementing caching, ordering matters critically because the LLM takes different paths depending on input tokens. The most stable information must come first, with changing information pushed toward the end to maximize cache hits.
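The sketch below illustrates only the ordering rule behind cache-friendly prompt assembly: stable content first, volatile content last, with the expensive prefill work keyed on the stable prefix. Real prompt caching reuses the model's KV cache inside the inference stack; the `run_prefill` and `run_decode` callables here are placeholders.

```python
import hashlib

def assemble_prompt(system_identity, tool_catalog, instructions, customer_context, utterance):
    # Stable information first so identical prefixes can hit the cache;
    # customer-specific and per-utterance content goes at the end.
    stable_prefix = "\n".join([system_identity, tool_catalog, instructions])
    volatile_suffix = "\n".join([customer_context, f"Customer: {utterance}"])
    return stable_prefix, volatile_suffix

prefix_cache = {}

def cache_key(prefix: str) -> str:
    return hashlib.sha256(prefix.encode()).hexdigest()

def infer(prefix, suffix, run_prefill, run_decode):
    key = cache_key(prefix)
    if key not in prefix_cache:
        prefix_cache[key] = run_prefill(prefix)   # expensive; done once per stable prefix
    prefill_state = prefix_cache[key]
    return run_decode(prefill_state, suffix)      # only the volatile tail is processed fresh
```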
The team also optimized prompts through minification and instruction tuning. Minification compresses input tokens without affecting LLM behavior—for example, replacing long unique identifiers with shorter ones (since the LLM doesn't care about the specific value) and restoring them on output. This also helped with caching because you don't want "Alexa stop" to cache miss due to customer-specific identifiers. However, minification requires caution because different models and even different versions from the same vendor may use different tokenizers, and tight coupling to specific tokenization approaches can create fragility.
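A minimal sketch of minification, assuming made-up device identifier formats: long, customer-specific IDs are swapped for short stable placeholders before prompting and restored in the model's output.

```python
# The identifier format below is invented for illustration; the point is that
# the LLM doesn't care about the literal ID value, so a short placeholder saves
# tokens and keeps the prompt cache-friendly.
def minify(prompt: str, long_ids: list[str]):
    mapping = {long_id: f"dev{i}" for i, long_id in enumerate(long_ids)}
    for long_id, short_id in mapping.items():
        prompt = prompt.replace(long_id, short_id)
    return prompt, mapping

def restore(output: str, mapping: dict) -> str:
    for long_id, short_id in mapping.items():
        output = output.replace(short_id, long_id)
    return output

prompt, mapping = minify(
    "Turn on device-9f8e7d6c-5b4a if the customer asks for the kitchen light.",
    ["device-9f8e7d6c-5b4a"],
)
# prompt -> "Turn on dev0 if the customer asks for the kitchen light."
```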
Instruction tuning involves refining the instructions sent to the LLM, sometimes using the LLM itself to provide feedback on how to convey the same meaning with fewer words or fewer examples, thereby reducing input tokens and processing time.
Model-level techniques like speculative execution provided additional latency benefits. This involves using a high-recall but potentially lower-accuracy model with fewer parameters (and thus faster inference) to get an initial answer quickly while simultaneously querying a higher-accuracy model with more parameters and higher latency. If the results match, you've gotten a head start on calling the identified expert's APIs. If they differ, you discard the initial result and use the high-accuracy model's answer. The customer still gets what they want, and in many cases where the fast model was correct, latency is significantly reduced. A critical nuance here is ensuring that APIs called speculatively are idempotent or don't have harmful side effects, since you might need to undo them if the speculation was wrong.
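A sketch of this speculative pattern with asyncio, where `fast_model`, `accurate_model`, and `start_expert_call` are placeholder callables; the comments note the idempotency caveat from the talk.

```python
import asyncio

async def speculative_route(utterance, fast_model, accurate_model, start_expert_call):
    # Launch both models concurrently; the fast model's answer is acted on
    # speculatively while the accurate model finishes.
    fast_task = asyncio.create_task(fast_model(utterance))
    accurate_task = asyncio.create_task(accurate_model(utterance))

    fast_answer = await fast_task
    # Head start: only safe if the called APIs are idempotent or reversible.
    speculative_work = asyncio.create_task(start_expert_call(fast_answer))

    accurate_answer = await accurate_task
    if accurate_answer == fast_answer:
        return await speculative_work          # speculation paid off: latency saved
    speculative_work.cancel()                  # discard (and, if needed, undo) the guess
    return await start_expert_call(accurate_answer)
```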
The single most impactful latency optimization was reducing the number of times the system invoked LLM inference cycles. API refactoring played a crucial role by combining sequences of fine-grained APIs into single or small numbers of coarse-grained APIs that could be predicted in fewer inference cycles. Fine-tuning foundational models for Alexa's specific use cases also created specialized models that could operate more quickly given expected customer traffic patterns.
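As a hedged illustration of why coarser APIs reduce inference cycles, compare a hypothetical fine-grained plan (three tool calls, each requiring its own planning step) with a single coarse-grained call the model can emit in one cycle; all API names are invented.

```python
# Fine-grained: three separate calls the LLM must plan across multiple inference cycles.
FINE_GRAINED_PLAN = [
    {"api": "create_trigger", "args": {"condition": "dog not fed", "time": "12:00"}},
    {"api": "attach_schedule", "args": {"recurrence": "daily"}},
    {"api": "configure_notification", "args": {"target": "owner", "message": "Daisy wasn't fed."}},
]

# Coarse-grained: one call carrying the same information, predictable in a single cycle.
COARSE_GRAINED_PLAN = [
    {
        "api": "create_conditional_notification",
        "args": {
            "condition": "dog not fed",
            "deadline": "12:00",
            "recurrence": "daily",
            "target": "owner",
            "message": "Daisy wasn't fed.",
        },
    }
]
```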
### Balancing Determinism and Creativity
As the team tuned the system to be more accurate and efficient, they encountered an unexpected problem: Alexa became increasingly robotic and lost the personality and creativity that makes LLM-driven AI engaging. Traditional Alexa systems were reliable and consistent but lacked conversational naturalness. With LLMs' inherent non-determinism and statistical nature, the team faced the challenge of building a system that's both deterministic and creative.
For use cases that function like tools—turning on lights, playing specific music on specific speakers—Alexa must work correctly 100% of the time, not "most of the time." Customers expect absolute reliability for these functional interactions. However, for other interactions like "Alexa, I'm bored," there's room for variety. Sometimes Alexa might offer to play music, other times suggest revisiting a previous conversation about travel destinations. This variability makes the assistant feel more human and engaging.
The team had to deliberately dial back some determinism to re-inject creativity and personality while maintaining perfect reliability for critical functions. This represents one of the core tensions in LLMOps for production systems: balancing reliability and consistency with the engaging, creative qualities that make LLM-based systems compelling.
Context engineering emerged as critical for achieving this balance. The team learned that models exhibit recency bias, giving more weight to instructions toward the end of prompts, just as humans tend to remember recently told information better than earlier information. Ordering within prompts matters not just for caching but for accuracy and behavioral balance. Deciding what context to include and exclude required extensive iteration—past conversations for continuity, smart home device information, personalization data, all while avoiding negative impacts on latency and accuracy. Summarization helps manage context length, but there's much more subtlety involved.
Parametric answers (answers from the model based on its training data) are only as good as the data available during training, so they don't include current events or updated information like personalization data or real-time knowledge bases. For these cases, the team used standard retrieval augmented generation (RAG) techniques for grounding. However, they had to carefully balance straightforward factual answers with embellishments that add personality—like trolling the Yankees if you're a Red Sox fan or responding to "I'm bored" with the dad joke "Hi, bored. I'm Alexa."
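A minimal RAG-style grounding sketch, assuming a toy fact store and helper names: real facts are retrieved at request time and placed in the prompt so answers stay grounded while leaving room for personality.

```python
# Toy fact store standing in for real personalization and device registries.
FACT_STORE = {
    "speakers": ["kitchen_speaker", "living_room_speaker"],
    "favorite_team": "Red Sox",
}

def retrieve(keys):
    return {k: FACT_STORE[k] for k in keys if k in FACT_STORE}

def grounded_prompt(utterance: str) -> str:
    facts = retrieve(["speakers", "favorite_team"])
    return (
        f"Known facts about this household (do not invent others): {facts}\n"
        "Answer factually; light personality is welcome, but never reference "
        "devices or facts that are not listed above.\n"
        f"Customer: {utterance}"
    )
```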
A critical constraint is preventing hallucinations about real-world entities. The LLM cannot hallucinate that a household has 67 speakers when it really has 5 and then fail to play music because it's attempting to use a non-existent speaker. This requires careful context engineering to ensure the model has accurate information about actual devices and capabilities without overwhelming it with irrelevant details.
Safety proved non-negotiable throughout the system. The team took a "belts and suspenders" approach with guardrails at multiple layers. They didn't trust that everything going into the model was safe or that everything coming out would be safe. While they prompted models to behave safely, they also implemented additional guardrails to catch issues the models missed. This defense-in-depth approach is essential for production systems serving hundreds of millions of users.
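A sketch of the layered, belts-and-suspenders idea, with placeholder classifier callables standing in for whatever moderation models or rule engines are actually deployed: the input is screened, the model is prompted to behave safely, and the output is screened again before anything reaches the customer.

```python
def respond_safely(utterance, input_classifier, call_llm, output_classifier):
    # Layer 1: never trust that the input is safe.
    if not input_classifier(utterance):
        return "Sorry, I can't help with that."

    # Layer 2: prompt the model toward safe behavior.
    safe_prompt = (
        "You are a helpful assistant. Refuse unsafe requests.\n"
        f"Customer: {utterance}"
    )
    draft = call_llm(safe_prompt)

    # Layer 3: never trust that the output is safe, even from a well-prompted model.
    if not output_classifier(draft):
        return "Sorry, I can't help with that."
    return draft
```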
### Multi-Model Architecture
Preserving model flexibility emerged as perhaps the most important architectural decision. Early on, the team decided not to rely on a single model for all use cases, and this turned out to be crucial. In a system as diverse as Alexa—with an enormous customer base and wide range of experiences—no single model could optimally handle everything while balancing accuracy, latency, determinism, personality, capacity, and GPU cost.
The multi-model architecture was initially born from necessity because early models weren't as capable. It was convenient to swap out models during development. However, the breakthrough insight was realizing they didn't need to turn this off for production—they could leave it on and dynamically select the right model for each use case at runtime. This eliminated the need to find a one-size-fits-all model and enabled selecting the optimal model for each specific job.
Working with Amazon Bedrock made this multi-model approach practical by making it easy to swap the underlying model on the backend at runtime whenever needed. The team also recognized that not every challenge requires an LLM. Simple use cases like "Alexa, stop" don't need the sophistication of an LLM—using one would be overkill. Similarly, when users email PDFs of their child's school schedule and later ask "When is my son's homework due?", an LLM could handle this by including the PDF in input prompts, but that creates excessive tokens and latency concerns. Instead, a purpose-built, bespoke non-LLM traditional ML model handles this more efficiently.
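As a rough illustration of runtime model selection, the sketch below uses the Bedrock Converse API via boto3 and picks a model ID per use case. The use-case labels and model IDs are illustrative assumptions and do not reflect which models Alexa Plus actually uses.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

# Illustrative mapping only; the real routing logic and models are not public.
MODEL_BY_USE_CASE = {
    "routing": "amazon.nova-micro-v1:0",          # small and fast
    "open_conversation": "amazon.nova-pro-v1:0",  # larger and more capable
}

def invoke(use_case: str, prompt: str) -> str:
    model_id = MODEL_BY_USE_CASE.get(use_case, MODEL_BY_USE_CASE["routing"])
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 256, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]
```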
The system incorporates multiple LLMs for different use cases plus multiple non-LLM ML models, all orchestrated to handle the full range of Alexa capabilities. Amazon SageMaker facilitated building these bespoke models. However, having many models creates a new challenge: how to choose the right one? One approach is using all models simultaneously (like speculative execution) and selecting the fastest, most accurate, or cheapest answer. But calling all models in parallel improves latency at the cost of capacity and GPU resources. The optimal approach involves using a combination of techniques with multiple models strategically selected based on the specific request.
## Example Use Case: Complex Multi-Step Planning
The presentation walked through a sophisticated example that illustrates the complexity of what Alexa Plus must accomplish. A user with a dog named Daisy (a cream golden retriever) has someone drop by to feed her while the user is at work. The user wants to be notified if Daisy isn't fed by noon each day, so they simply tell Alexa: "Let me know if Daisy is not fed by noon every day."
Executing this seemingly simple request requires multiple complex steps:
- Speech recognition to understand the utterance
- Intent understanding to comprehend what the user actually wants
- Retrieving personal context (the user has a dog named Daisy, a cream golden retriever)
- Understanding "if she's not fed" requires visual recognition of a dog eating
- Identifying relevant devices (Ring cameras pointed at the food bowl area)
- Setting up daily monitoring that checks before noon
- Computer vision processing to detect a cream golden retriever eating
- Sending a notification if the condition isn't met by the deadline
These represent significant challenges for modern LLMs. The system must excel at accuracy to correctly identify all components, achieve extremely low latency (acknowledging the request within a couple of seconds to maintain conversational context), handle complex multi-step planning and orchestration, integrate multiple tools and services (reminders, notifications, computer vision, smart home devices), and execute reliably day after day without human intervention.
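One hedged way to picture the planning output is as a structured, orchestrator-executable plan; the step names, tools, and arguments below are hypothetical and meant only to convey how much work is packed into a single utterance.

```python
# Hypothetical decomposition of the Daisy request into orchestrator steps.
DAISY_PLAN = [
    {"step": "resolve_entity", "tool": "personal_context",
     "args": {"name": "Daisy", "resolved": "dog, cream golden retriever"}},
    {"step": "select_camera", "tool": "smart_home",
     "args": {"hint": "camera covering the food bowl area"}},
    {"step": "schedule_check", "tool": "reminder_expert",
     "args": {"deadline_local_time": "12:00", "recurrence": "daily"}},
    {"step": "detect_feeding", "tool": "computer_vision",
     "args": {"event": "cream golden retriever eating"}},
    {"step": "notify_if_unmet", "tool": "notification_expert",
     "args": {"target": "owner", "message": "Daisy hasn't been fed yet today."}},
]

def execute(plan, tool_registry):
    for step in plan:
        # Each tool call is a real-world action and must succeed reliably, every day.
        tool_registry[step["tool"]](**step["args"])
```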
This example demonstrates that Alexa Plus goes far beyond chatbot-style conversation to actual real-world action execution—what the team called "getting things done." While many AI assistants stumble when trying to make reservations, book experiences, or interact with the real world, Alexa Plus was designed from the ground up to bridge the gap between conversation and action reliably at massive scale.
## Implementation Patterns and Learnings
Several key patterns and learnings emerged from the Alexa Plus development that have broad applicability to LLMOps:
**Iterative Experimentation is Essential**: Traditional optimization techniques weren't sufficient. The team needed to continuously experiment with new approaches like prompt caching, API refactoring, speculative execution, and context engineering. What works in theory doesn't always work in production, and what works in production at small scale may not work at the scale of 600 million devices. Building experimentation into process and mental models is critical.
**Order Matters Everywhere**: Token ordering in prompts affects caching effectiveness, accuracy, and behavioral characteristics due to recency bias. Stable information should come first for caching, but critical instructions might need to come last for accuracy. This creates tensions that require careful balancing.
**Less Can Be More**: Counterintuitively, providing fewer examples and less context often improves accuracy by avoiding overfitting and prompt overload. The team had to remove many examples they had added to fix bugs because they collectively reduced overall accuracy.
**Output Tokens Are Expensive**: The multiple orders of magnitude difference between input and output token processing costs means techniques like chain-of-thought reasoning, while valuable for development and debugging, must be disabled in production despite their accuracy benefits.
**Context Engineering Is Critical**: Carefully curating what context to provide—past conversations, device information, personalization data—without overwhelming the model or introducing latency requires extensive iteration and experimentation. Different use cases need different context, and providing wrong or irrelevant context actively harms accuracy.
**Defense in Depth for Safety**: Multiple layers of guardrails are essential. Don't trust that inputs are safe, don't trust that model outputs are safe, and include guardrails even when models are prompted to behave safely.
**Peeling the Onion**: Fixing one problem consistently uncovered new challenges. The team described this as peeling layers of an onion—each solution revealed another set of issues requiring attention. This is the nature of complex production LLM systems at scale.
**Right Model for Right Job**: Rather than seeking one perfect model, finding or building specialized models for specific use cases and orchestrating them yields better results across multiple dimensions (accuracy, latency, cost, capacity).
**Not Everything Needs an LLM**: Purpose-built traditional ML models or even simpler rule-based systems are sometimes more appropriate, efficient, and reliable than using LLMs for every task.
## Production Considerations
The presentation emphasized several production-specific considerations that distinguish development from deployment at scale:
**Backward Compatibility**: With 600 million devices in the field and countless existing integrations (skills, routines, smart home setups), breaking changes weren't acceptable. The rearchitecture had to maintain perfect compatibility with everything customers already relied on daily.
**Reliability Over Novelty**: While creativity and personality are valuable, core functional use cases must work 100% of the time. Customers depend on Alexa for daily routines, accessibility needs, and home automation that must be absolutely reliable.
**Latency as a Feature**: For voice interactions, sub-2-second response time isn't just a nice-to-have—it's essential to maintaining conversational flow and meeting user expectations. Every millisecond matters at this scale.
**Scale Changes Everything**: Techniques that work in development or small-scale production may not work at 600 million devices. Capacity, GPU costs, and infrastructure considerations become first-order concerns rather than afterthoughts.
**Continuous Evolution**: The field of LLMs is moving extremely fast. Techniques like prompt caching that the team had to invent are now ubiquitous and taken for granted. Production systems must be architected for continuous evolution and model updates without disruption.
## AWS Services and Tools
The case study highlights several AWS services that enabled the Alexa Plus transformation:
- **Amazon Bedrock**: Facilitated the multi-model architecture by making it easy to swap underlying models at runtime without extensive re-engineering
- **Amazon SageMaker**: Enabled building and deploying the numerous purpose-built, bespoke ML models used throughout the system for specialized tasks
The presentation positioned these tools as essential enablers of the flexible, multi-model architecture that proved critical to success, though one should note the presenters work for AWS and have natural incentives to highlight their own platform's capabilities.
## Assessment and Balanced Perspective
This case study offers valuable insights into real-world LLMOps challenges at massive scale, and the presenters demonstrated commendable transparency about difficulties, false starts, and counterintuitive learnings. The technical depth around latency optimization, accuracy challenges, and the balance between determinism and creativity provides genuinely useful patterns for others building production LLM systems.
However, several caveats deserve consideration. The presentation was delivered at an AWS conference and naturally emphasizes how AWS services enabled success while not discussing potential limitations, challenges with those services, or alternative approaches. The business model creates an incentive to present AWS tools as essential rather than merely helpful. Additionally, while the team describes "inventing" prompt caching and speculative execution, these weren't necessarily novel techniques—they may have implemented them before they were widely available from model providers, but framing this as pure invention rather than early adoption or independent discovery may overstate originality.
The case study doesn't address several important questions: What were the actual accuracy metrics achieved? How much did latency improve from baseline to final system? What was the cost (computational and financial) of the transformation? How long did the rearchitecture take, and how many engineers were involved? What percentage of use cases actually use LLMs versus traditional ML versus rules? These quantitative details would help others assess applicability to their own contexts.
The presentation also doesn't discuss failure modes, rollback strategies, A/B testing approaches, or gradual rollout strategies in depth. For a system serving 600 million devices, understanding the risk management and deployment strategy would be highly valuable but isn't covered in detail.
Despite these limitations, the case study provides genuinely useful technical patterns and honest reflections on the complexity of transforming an established production system to incorporate LLMs. The emphasis on multi-model architectures, context engineering, and the interplay between accuracy, latency, and determinism offers practical guidance for others working on similar challenges. The candid acknowledgment that fixing one problem often creates others and that counterintuitive solutions (like removing examples to improve accuracy) sometimes work best reflects real-world engineering rather than idealized marketing narratives.