Company
WHOOP
Title
AI-Powered Transformation of AWS Support for Mission-Critical Workloads
Industry
Tech
Year
2025
Summary (short)
AWS Support transformed from a reactive firefighting model to a proactive, AI-augmented support system to handle the increasing complexity of cloud operations. The transformation involved building autonomous agents, context-aware systems, and structured workflows powered by Amazon Bedrock and Amazon Connect to deliver faster incident response and proactive guidance. WHOOP, a health wearables company, used AWS's new Unified Operations offering to launch two new hardware products, scaling mobile traffic 10x and e-commerce traffic 200x over baseline, achieving 100% availability in May 2025, cutting critical case response times from 8 minutes to under 2.5 minutes, and improving quarterly availability from 99.85% to 99.95%.
## Overview

This case study documents AWS Support's comprehensive transformation from a reactive support model to an AI-powered proactive support system, featuring a detailed customer implementation by WHOOP, a health wearables company. The presentation covers both AWS's internal LLMOps journey in operationalizing AI for a support organization of 15,000 people across 50 countries supporting 250+ AWS services, and how WHOOP leveraged these AI-enhanced support capabilities to achieve mission-critical uptime during a major product launch in May 2025.

## Business Context and Problem Statement

AWS Support identified that the traditional "firefighting" support model was becoming unsustainable for several reasons. The complexity of modern cloud workloads has evolved dramatically over the past ten years - today's applications are distributed, multi-region, and microservices-based, with dependencies across accounts, regions, ISVs, and partners. Only a quarter of support interactions involve actual break-fix scenarios where AWS services fail; the majority involve customers either using services incorrectly or making regrettable configuration errors that could have been prevented.

The key challenges identified included migration delays, skills gaps, slow recovery times, limited visibility into potential issues, high incident recurrence rates, multiple handoffs between teams, manual troubleshooting processes, and security alert fatigue. Customers were demanding near-zero downtime, maximum resilience, zero-impact maintenance windows, and the ability to handle scale without manual intervention. AWS recognized that human-driven operations alone could not deliver this proactive experience at scale.

## AI Architecture and Technical Implementation

### Multi-Tiered AI Approach

AWS Support implemented a three-tiered AI strategy: AI assistance (chatbots with knowledge bases), AI augmentation (where AI supports humans who remain in control), and autonomous AI (where AI acts independently within configurable confidence thresholds). The architecture emphasizes that context is paramount - AI is only as effective as the data and context behind it. Rather than just training on historical data, AWS built systems to capture and hydrate application-specific context from customer workloads and feed it to both human experts and AI agents.

### Core Technology Stack

The foundation relies on Amazon Bedrock for agent capabilities and Amazon Connect for customer interaction management. Bedrock AgentCore provides runtimes and gateways that convert tools into the Model Context Protocol (MCP) format, which became critical for tool integration. A custom context service was built to ensure agents can assemble the context needed to solve customer issues in a personalized, secure, and privacy-conscious manner.

### SOP to Agent Conversion Pipeline

AWS invested heavily in converting unstructured Standard Operating Procedures (SOPs) and runbooks into structured, machine-executable formats. The process involved taking informal documentation with commands and tribal knowledge and transforming it into complete, deterministic workflows and tools that agents could reliably execute. This required not just format conversion but ensuring completeness and adding integration tests. They built an authoring platform that allows experts to describe workflows in natural language, which are then converted into formats optimized for model consumption.

Critically, each converted SOP includes a comprehensive evaluation set. For example, an EKS node issue evaluation might use infrastructure-as-code to set up a test scenario with permission problems in an AWS account, have the agent attempt resolution, implement the proposed solution, and verify whether the problem is actually solved. This creates a complete, executable test that validates the agent's capability.
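To make the SOP-to-agent pipeline more concrete, here is a minimal Python sketch of what a structured SOP with an attached executable evaluation could look like. The schema, tool names, and helper callables (`provision`, `run_agent`) are hypothetical illustrations of the idea described above, not AWS's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    """One deterministic action in a converted SOP, bound to an executable (MCP) tool."""
    description: str      # what the expert would do, in plain language
    tool: str             # tool the agent is expected to call
    arguments: dict       # required inputs, so the workflow is complete
    success_check: str    # how the agent confirms the step worked

@dataclass
class StructuredSOP:
    """A runbook rewritten as a complete, machine-executable workflow."""
    symptom: str
    steps: list[Step]

@dataclass
class EvalCase:
    """Executable evaluation: provision a deliberately broken environment via
    infrastructure-as-code, let the agent attempt a fix, then verify the problem
    is genuinely resolved rather than merely answered plausibly."""
    setup_template: str                  # IaC template that injects the fault
    verify: Callable[[str], bool]        # returns True only if the issue is actually fixed

# Hypothetical example in the spirit of the EKS node-join scenario above.
eks_node_join_sop = StructuredSOP(
    symptom="EKS worker nodes fail to join the cluster",
    steps=[
        Step("Check the node IAM role trust policy", "get_node_role",
             {"cluster": "demo-cluster"}, "role trusts ec2.amazonaws.com"),
        Step("Verify required managed policies are attached", "list_attached_policies",
             {"role": "demo-node-role"}, "AmazonEKSWorkerNodePolicy is present"),
    ],
)

def run_evaluation(sop: StructuredSOP, case: EvalCase,
                   provision: Callable[[str], str],
                   run_agent: Callable[[StructuredSOP, str], None]) -> bool:
    """Full integration test: success means the fault is fixed in the test account."""
    account = provision(case.setup_template)   # stand up the broken scenario
    run_agent(sop, account)                    # agent diagnoses and applies its fix
    return case.verify(account)                # check the real outcome, not the wording
```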
### Graph-Based Knowledge Retrieval

AWS moved beyond simple RAG (Retrieval Augmented Generation) to graph RAG for richer knowledge retrieval. They scan corpuses including existing knowledge bases, partner documentation, historical cases and tickets, and existing SOPs and automation to create data structures (graphs or weighted probability matrices) that map from symptoms to root causes. This mirrors how experienced on-call engineers troubleshoot - seeing a symptom, pulling diagnostic information, and following hunches down different decision trees until reaching the root cause.

This graph-based approach, linked via MCP to executable tools, enables confirmation from tool outputs that the correct path is being followed. The system can validate whether the diagnosed root cause matches actual data signals and customer inputs based on probabilities derived from past issues. This significantly improves complex issue resolution.

### Fine-Tuning and Model Optimization

For specific use cases, AWS employed fine-tuning, using LoRA (Low-Rank Adaptation) as an accessible starting point. A key challenge they addressed was MCP tool selection latency and accuracy - when hundreds of tools are available, models can struggle with both speed and choosing the correct tool. Fine-tuning helps optimize tool selection for AWS's specific context with hundreds of available tools. They also utilize reinforcement learning approaches, mentioning Nova Forge for training models with reinforcement learning, alongside fine-tuning for specialized improvements.

The emphasis throughout is on "durable investments" - while technologies, models, and frameworks change rapidly, evaluations and context remain valuable long-term assets that can continuously improve systems.

### Orchestration and Multi-Agent Systems

Rather than a single monolithic agent, AWS built an orchestration layer that selects appropriate agents for each job. For complex issues, multiple agents may be needed to provide comprehensive responses - one agent might resolve the immediate problem while another provides optimization recommendations to prevent recurrence. The orchestration layer broadcasts requests across multiple agents and reconciles their outputs into coherent, rich responses for customers.

### Guardrails and Safety

Automated reasoning checks using mathematical proofs help ensure responses are safe and accurate. This works particularly well for non-overlapping rules and provides near-deterministic validation that responses don't violate safety boundaries. Bedrock Guardrails incorporates these automated reasoning capabilities. AWS also launched Bedrock AgentCore evaluations to provide out-of-the-box evaluation capabilities.

## Production Operations and Integration

### AWS Support Console Experience

When customers come to AWS Support Center and select an issue type, the system captures context from the case, surfaces any similar past issues, prompts for missing critical information (such as specific resource identifiers), and then visibly shows agents checking relevant resources. For example, with an ECS out-of-memory issue, the system demonstrates checking cluster health and using agent outputs to reach a root cause diagnosis and provide actionable recommendations.
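A minimal Python sketch of the graph-RAG idea, using the ECS out-of-memory example: candidate root causes are weighted by how often they explained the symptom in past cases, and each hop is confirmed against live diagnostic data before it is accepted. The graph contents, probabilities, and check functions are invented for illustration and are not taken from AWS's system.

```python
# Symptom -> candidate-cause edges weighted by how often each cause explained the
# symptom in past cases; each candidate carries a confirming check (an executable
# tool call, reached via MCP in the system described above).
CAUSE_GRAPH = {
    "ecs_task_oom_killed": [
        # (candidate cause, prior probability from historical cases, confirming check)
        ("memory_hard_limit_too_low", 0.55, lambda ctx: ctx["task_memory_mib"] < ctx["peak_rss_mib"]),
        ("memory_leak_in_app", 0.30, lambda ctx: ctx["rss_growth_per_hour_mib"] > 50),
        ("container_instance_overcommitted", 0.15, lambda ctx: ctx["host_mem_reserved_pct"] > 95),
    ],
}

def diagnose(symptom: str, ctx: dict) -> str | None:
    """Walk candidates from most to least probable, keeping only causes the live
    diagnostic data actually confirms - mirroring how an on-call engineer follows hunches."""
    for cause, prior, confirms in sorted(CAUSE_GRAPH.get(symptom, []), key=lambda e: -e[1]):
        if confirms(ctx):
            return cause
    return None  # no confirmed cause: escalate to a human expert with the evidence gathered

# Example: telemetry pulled by diagnostic tools for a task that was OOM-killed.
context = {"task_memory_mib": 512, "peak_rss_mib": 900,
           "rss_growth_per_hour_mib": 10, "host_mem_reserved_pct": 60}
print(diagnose("ecs_task_oom_killed", context))  # -> memory_hard_limit_too_low
```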
### AWS DevOps Agent Launch

Launched during the event (December 2025 timeframe), the AWS DevOps agent integrates deeply with support, incorporating lessons from AWS's support experience. It extracts application topology to understand component relationships, integrates with signals from AWS and third-party partners (Dynatrace, Splunk, Datadog, ServiceNow), and combines these signals with historical issue patterns to diagnose root causes and provide preventive recommendations. Customers can escalate directly to human support experts from within the DevOps agent interface.

### Human-AI Collaboration Model

AWS emphasizes that AI augments rather than replaces human experts. Internal support staff have access to the same AI capabilities customers do, but with additional context and diagnostic tools. When customers like WHOOP need expert assistance, those experts are equipped with AI-enhanced diagnostics that help them reach root causes faster. This creates a tiered support model where AI handles what it can reliably solve, augments human troubleshooting for complex issues, and seamlessly escalates when human expertise is needed.

## WHOOP Customer Case Study

### Company and Challenge Context

WHOOP is a 24/7 health wearables company focused on continuous health monitoring and behavior change. Its platform team manages high-scale infrastructure with strict uptime requirements while maintaining cost efficiency, developer experience, and security. In May 2025, WHOOP launched two new hardware products (the WHOOP 5.0 and MG straps) with three subscription tiers - a major expansion from the single-hardware-purchase model it had operated since 2021.

The previous launch (WHOOP 4.0 in September 2021) saw demand the company couldn't handle - systems couldn't load for purchases, and shipping delays followed. While the launch eventually established a new baseline, there was a significant "valley of woe" in which sales momentum was lost. For the 2025 launch, the goal was to eliminate this valley and capture the full market excitement.

### Scaling Requirements and Preparation

WHOOP's analytics and product teams determined they needed to scale mobile traffic (from existing members constantly uploading heartbeat data) by 10x and the e-commerce sites by 200x above baseline. The preparation timeline began with an AWS Countdown Premium engagement (part of Unified Operations) involving comprehensive risk assessment exercises. These structured sessions helped identify risks across the launch and inspired WHOOP to apply similar exercises to mobile and internal systems. They initiated a trial of AWS's new Unified Operations offering specifically for its quick response-time SLOs - on launch day, every minute matters.

Six days before the May 8th launch, WHOOP conducted a 15-hour marathon load-testing session, repeatedly finding and fixing bottlenecks. Crucially, AWS provided RDS Postgres and container experts who remained available throughout the session, giving real-time guidance on configuration changes and optimization. The load testing ran against real production workloads (baseline member traffic), requiring careful design to simulate organic traffic while staying sensitive to system strain.
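The 10x and 200x multipliers translate into concrete load-test targets. In the sketch below, the baseline request rates are placeholders (only the multipliers come from the talk), and the stepped ramp reflects the need to surface bottlenecks one at a time while sharing infrastructure with live member traffic.

```python
# Baseline request rates are invented placeholders; the 10x / 200x multipliers are from the case study.
BASELINE_RPS = {"mobile_api": 20_000, "ecommerce": 150}
MULTIPLIER = {"mobile_api": 10, "ecommerce": 200}

def ramp_schedule(target_rps: int, steps: int = 5) -> list[int]:
    """Step up gradually so bottlenecks surface one at a time instead of all at once,
    which matters when the load test runs alongside real production traffic."""
    return [round(target_rps * i / steps) for i in range(1, steps + 1)]

for service, base in BASELINE_RPS.items():
    target = base * MULTIPLIER[service]
    print(service, "target:", target, "rps; ramp:", ramp_schedule(target))
```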
### Launch Day Execution and Results

On May 8th, 2025, WHOOP's team assembled in their war room wearing WHOOP jerseys, with a dedicated AWS room where Technical Account Managers (TAMs) connected with multiple Unified Operations experts throughout the day. Any concern could be addressed immediately by opening the door, whether a technical question or a quota increase.

The launch achieved 100% availability not just on launch day but for the entire month of May 2025. This enabled an excellent experience for customers during the initial spike, when friends heard about the product and visited the website, and during unboxing and first connection. More significantly, WHOOP didn't just reach a new baseline - they achieved a new growth-rate trajectory. WHOOP also handled Black Friday traffic seamlessly and gained confidence heading into the Christmas season and the New Year's resolution period.

### Ongoing Value and Migration Projects

The partnership continued delivering value beyond launch. WHOOP migrated a self-managed Kafka cluster (their most critical one, the first stop for all data streaming from wearable straps) to MSK Express, achieving approximately $100K in monthly savings and reducing broker recovery time from hours (rehydrating terabytes of data) to under one minute, thanks to MSK Express's separation of data from compute. Unified Operations provided expert guidance for plan review and solution design validation, and influenced the MSK roadmap with WHOOP's specific requirements.

An Aurora migration project currently in progress involves over 200 RDS Postgres databases and is expected to save $50-100K monthly. More importantly, it eliminates the four-hour maintenance windows for minor version upgrades that required company-wide coordination (including avoiding conflicts with activities like Patrick Mahomes photo shoots); Aurora enables near-zero-downtime upgrades. The same Postgres expert who assisted with load testing became deeply involved in this migration, again demonstrating context retention and feeling like an extension of WHOOP's team.

### Quantified Operational Improvements

In H2, WHOOP's availability was 99.85%. By focusing on Unified Operations capabilities, they achieved 99.95% availability in Q3 - roughly a 70% reduction in downtime. Critical case response time from AWS dropped from just under 8 minutes to under 2.5 minutes. Faster response, combined with context-aware experts who already understood WHOOP's systems, led to faster resolution times. The proactive support model also meant migration projects were less likely to cause downtime at all.
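Translating those availability figures into time makes the improvement tangible. A quick back-of-the-envelope calculation, assuming a roughly 91-day quarter, lands near the ~70% downtime reduction cited:

```python
QUARTER_MINUTES = 91 * 24 * 60            # ~131,040 minutes in a quarter

def downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime implied by an availability percentage over one quarter."""
    return QUARTER_MINUTES * (1 - availability_pct / 100)

before = downtime_minutes(99.85)          # ~197 minutes per quarter
after = downtime_minutes(99.95)           # ~66 minutes per quarter
reduction = 1 - after / before            # ~0.67, i.e. roughly a two-thirds cut
print(round(before), round(after), f"{reduction:.0%}")
```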
## Key LLMOps Lessons and Principles

### Context as the Foundation

The case study repeatedly emphasizes that context is king. AI systems are only as good as the data and context behind them. AWS invested heavily in systems to capture, maintain, and hydrate context about customer workloads, feeding it to both human experts and AI agents. This context includes application topology, business logic, failure modes, runbooks, and ongoing updates as systems evolve. Without it, AI recommendations remain generic and potentially harmful.

### Durable Investments Over Transient Technology

Tipu emphasized making "durable investments" - while models, frameworks, and technologies change rapidly, evaluations and context remain valuable over time. Building comprehensive evaluation sets and maintaining rich context about systems and workloads provides lasting value that transcends specific AI implementations. This long-term perspective guides AWS's investment decisions.

### Evaluation-Driven Development

Complete, executable evaluations are central to AWS's approach. Rather than just testing whether an agent produces reasonable-sounding output, they create full integration tests in which agents must actually solve problems in test environments. Success means the problem is genuinely resolved, not just that the response looks plausible. This rigorous evaluation approach builds trust in autonomous AI capabilities.

### Structured SOPs Enable Automation

Converting tribal knowledge and informal documentation into structured, complete SOPs is essential for reliable agentic execution. This requires not just reformatting but ensuring completeness, adding integration tests, and creating machine-optimized formats while maintaining human readability for the authoring process.

### Multi-Agent Orchestration for Complexity

No single agent can handle all scenarios across 250+ services and countless customer contexts. Orchestration layers that select appropriate specialized agents and reconcile outputs from multiple agents enable comprehensive responses that address both immediate problems and longer-term optimization opportunities.

### Human-AI Symbiosis Not Replacement

The most effective model combines AI automation for what it can reliably handle, AI augmentation for complex scenarios where human expertise is needed, and seamless escalation paths. Human experts equipped with AI-enhanced diagnostics and rich context can resolve issues faster than either humans or AI alone.

### Graph-Based Knowledge for Complex Troubleshooting

Moving from simple RAG to graph RAG with probability-weighted paths from symptoms to root causes better mirrors how human experts troubleshoot. Linking these graphs to executable tools via MCP enables validation that the diagnostic path is correct based on actual system outputs.

### Proactive Over Reactive

The entire transformation centers on shifting left - preventing issues rather than responding to them. This requires AI systems that can detect architectural gaps, resilience issues, and configuration problems before they impact production. While reactive support will always be needed, proactive guidance and prevention deliver vastly better customer outcomes at lower cost.
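Tying together the tiered model (assist, augment, autonomous) and the human-AI symbiosis principle above, here is a minimal sketch of confidence-threshold routing; the thresholds, fields, and decision categories are illustrative assumptions, not AWS's actual policy.

```python
from dataclasses import dataclass

@dataclass
class Diagnosis:
    root_cause: str
    confidence: float                 # agent's calibrated confidence in its own diagnosis
    remediation_is_reversible: bool   # only low-blast-radius fixes are candidates for autonomy

def route(d: Diagnosis, autonomy_threshold: float = 0.95, assist_threshold: float = 0.70) -> str:
    """Decide whether the agent acts alone, drafts a fix for a human, or escalates outright."""
    if d.confidence >= autonomy_threshold and d.remediation_is_reversible:
        return "autonomous: apply fix, notify customer"
    if d.confidence >= assist_threshold:
        return "augment: present diagnosis and proposed fix to a support engineer"
    return "escalate: hand full context and gathered evidence to a human expert"

print(route(Diagnosis("expired TLS certificate", 0.98, True)))    # autonomous path
print(route(Diagnosis("intermittent packet loss", 0.40, False)))  # escalation path
```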
## Critical Assessment and Tradeoffs

While the case study presents impressive results, several considerations warrant examination. The WHOOP results are compelling but represent a single customer implementation, albeit a substantial one. The 100% availability claim for May 2025 is notable but covers only a one-month period - longer-term data would provide more confidence in sustained outcomes. The reduction in downtime and response times is significant, but the analysis doesn't fully weigh the cost of the Unified Operations service itself against the value delivered.

The technical approach is sophisticated but also represents substantial engineering investment. Converting SOPs to structured formats, building authoring platforms, creating comprehensive evaluation sets, developing context management systems, and orchestrating multi-agent systems all require significant resources. Organizations considering similar approaches should realistically assess whether they have the engineering capacity and LLMOps maturity for this level of sophistication.

The emphasis on context is well-founded but creates dependencies - the system's effectiveness relies on continuously updated, accurate context about customer workloads. Context drift, incomplete information, or customers who don't engage in thorough onboarding may limit AI effectiveness. The case study doesn't deeply explore failure modes or situations where the AI provided incorrect guidance. The move from simple RAG to graph RAG and the use of fine-tuning are advanced techniques, but the case study doesn't provide quantified comparisons showing the incremental improvement from each. Claims about automated reasoning providing "near-deterministic" safety guarantees should be viewed carefully - the caveat about non-overlapping rules suggests limitations in complex scenarios with interdependent rules.

The integration of so many components (Bedrock, Connect, custom context services, orchestration layers, MCP, partner integrations) creates a complex system with many potential failure points, and the operational overhead of maintaining this infrastructure isn't discussed. The partnership model between AWS and WHOOP appears very hands-on, with experts available during load-testing sessions and launches - a level of support that may not scale to all customers.

Nevertheless, the case study demonstrates a thoughtful, comprehensive approach to LLMOps at scale. The emphasis on evaluation, context, durable investments, and human-AI collaboration reflects mature thinking about production AI systems. The quantified improvements for WHOOP and the evolution of AWS Support's entire operating model suggest this is more than marketing narrative - it represents genuine operational transformation enabled by carefully implemented AI systems.
