## Overview and Context
This case study presents a comprehensive technical analysis of how Amazon has operationalized advanced fine-tuning techniques across multiple business units to deploy production-grade multi-agent LLM systems. Published in January 2026, the article articulates a critical finding from Amazon's enterprise AI work: while many use cases can be addressed through prompt engineering, RAG systems, and turnkey agent deployment, approximately one in four high-stakes applications demand advanced fine-tuning and post-training techniques to achieve production-grade performance. The piece serves both as technical documentation of Amazon's internal approaches and as a positioning document for AWS services, so claims should be evaluated with appropriate skepticism while recognizing the concrete production metrics presented.
The three primary case studies span healthcare (Amazon Pharmacy), engineering operations (Amazon Global Engineering Services), and e-commerce content (Amazon A+ Content), each representing distinct production challenges where model customization proved essential. The article traces the evolution from basic SFT through PPO and DPO to cutting-edge reasoning optimizations like GRPO, DAPO, and GSPO, demonstrating how technical sophistication should align with use case requirements rather than being pursued for its own sake.
## Amazon Pharmacy: Healthcare Safety Through Fine-Tuning
Amazon Pharmacy's journey illustrates the progression from initial experimentation to mission-critical production deployment. The team began roughly two years earlier (around 2024) with a RAG-based customer service Q&A system that, using off-the-shelf foundation models, initially achieved only 60-70% accuracy. The breakthrough came through fine-tuning the embedding model specifically for pharmaceutical domain knowledge, which elevated accuracy to 90% and reduced customer support contacts by 11%. This early success established the foundation for more ambitious applications.
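The article does not describe Amazon Pharmacy's training setup, but the general pattern of adapting a retrieval embedding model to a domain is well established. The sketch below is a minimal, hypothetical illustration using the open-source sentence-transformers library with a contrastive (in-batch negatives) loss over invented question-passage pairs; it is not Amazon's implementation.

```python
# Hypothetical sketch: fine-tuning a retrieval embedding model on
# domain-specific (question, passage) pairs with a contrastive loss.
# The base model, data, and hyperparameters are illustrative only.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Invented pharmacy-domain pairs: a customer question and the knowledge-base
# passage that answers it (in-batch negatives are supplied by the loss).
train_pairs = [
    InputExample(texts=[
        "Can I take ibuprofen with lisinopril?",
        "NSAIDs such as ibuprofen may reduce the effectiveness of ACE inhibitors...",
    ]),
    InputExample(texts=[
        "How should amoxicillin suspension be stored?",
        "Reconstituted amoxicillin suspension should be refrigerated and discarded after 14 days...",
    ]),
]

model = SentenceTransformer("all-MiniLM-L6-v2")        # illustrative base model
loader = DataLoader(train_pairs, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)      # contrastive, in-batch negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("pharmacy-domain-embedder")
```

The fine-tuned encoder then replaces the generic embedding model in the RAG retrieval step, which is typically where domain-specific terminology (drug names, dosing language) yields the largest retrieval gains.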
The medication safety challenge represents the high-stakes scenario that justified advanced fine-tuning investment. Medication direction errors cost up to $3.5 billion annually to correct and pose serious patient safety risks. Amazon Pharmacy created an agent component that validates medication directions against pharmacy logic and safety guidelines by fine-tuning models with thousands of expert-annotated examples. The technical approach combined SFT with reinforcement learning from human feedback (RLHF) via PPO, along with other advanced RL techniques, to encode domain-specific constraints and safety guidelines into the model. The result was a 33% reduction in near-miss events, validated through publication in Nature Medicine, a notable external validation of the approach.
The evaluation criteria for this system demonstrate the complexity of production healthcare AI: drug-drug interaction detection accuracy (percentage of known contraindications correctly identified), dosage calculation precision (correct dosing adjustments for age, weight, and renal function), near-miss prevention rate (reduction in medication errors that could cause patient harm), FDA labeling compliance (adherence to approved usage, warnings, and contraindications), and pharmacist override rate (how often licensed pharmacists modify an agent recommendation, the complement of the share accepted without modification). These metrics go far beyond simple accuracy measures and reflect the multi-dimensional nature of healthcare safety evaluation.
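Operationally, most of these metrics reduce to straightforward aggregations once each agent recommendation has been reviewed against ground truth. The sketch below is a hypothetical illustration of that bookkeeping; the `ReviewedCase` schema and field names are invented, not Amazon Pharmacy's data model.

```python
# Hypothetical aggregation of reviewed cases into the safety metrics named
# above. The schema is invented for illustration.
from dataclasses import dataclass

@dataclass
class ReviewedCase:
    interaction_flagged: bool   # agent flagged a drug-drug interaction
    interaction_present: bool   # pharmacist-confirmed contraindication exists
    dose_correct: bool          # dosing adjusted correctly for age/weight/renal function
    pharmacist_modified: bool   # licensed pharmacist changed the recommendation

def safety_metrics(cases: list[ReviewedCase]) -> dict[str, float]:
    n = len(cases)
    known_interactions = [c for c in cases if c.interaction_present]
    return {
        # share of known contraindications the agent actually caught
        "ddi_detection_accuracy": sum(c.interaction_flagged for c in known_interactions)
        / max(len(known_interactions), 1),
        "dosage_precision": sum(c.dose_correct for c in cases) / n,
        # override rate and its complement, the acceptance rate
        "pharmacist_override_rate": sum(c.pharmacist_modified for c in cases) / n,
        "pharmacist_acceptance_rate": sum(not c.pharmacist_modified for c in cases) / n,
    }
```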
By 2025, Amazon Healthcare Services was expanding these capabilities and transforming separate LLM-driven applications into a holistic multi-agent system to enhance patient experience. The fine-tuned models serve as domain expert tools addressing specific mission-critical functions within the broader pharmaceutical services architecture, illustrating the architectural pattern of specialized components within orchestrated agent systems.
## Amazon Global Engineering Services: Operational Efficiency at Scale
The Amazon GES team oversees hundreds of Amazon fulfillment centers worldwide and embarked on their generative AI journey with a sophisticated Q&A system designed to help engineers access design information from vast knowledge repositories. Their initial approach focused on fine-tuning a foundation model using SFT, which improved accuracy, measured by semantic similarity score, from 0.64 to 0.81—a substantial improvement but still insufficient for production requirements.
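The article does not specify how the semantic similarity score is computed; one common approach is to embed each generated answer and its reference answer and average their cosine similarity, as in this hypothetical sketch using sentence-transformers.

```python
# Hypothetical semantic-similarity evaluation: embed predictions and
# references, take the cosine similarity of each pair, and average.
# The embedding model choice is illustrative.
from sentence_transformers import SentenceTransformer, util

scorer = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(predictions: list[str], references: list[str]) -> float:
    pred_emb = scorer.encode(predictions, convert_to_tensor=True)
    ref_emb = scorer.encode(references, convert_to_tensor=True)
    # diagonal = similarity of each prediction to its own reference
    return util.cos_sim(pred_emb, ref_emb).diagonal().mean().item()

print(semantic_similarity(
    ["Conveyor clearance must be at least 3 meters."],
    ["Design guidelines require a minimum conveyor clearance of 3 m."],
))
```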
To better align with subject matter expert (SME) feedback, the team applied PPO incorporating human feedback data, which boosted LLM-judge scores from 3.9 to 4.2 out of 5 and cut the effort required from domain experts by a reported 80%, a concrete operational efficiency gain that justified the investment in advanced fine-tuning. The progression from SFT to PPO demonstrates the value of reinforcement learning techniques in capturing nuanced human preferences that go beyond what labeled examples alone can provide.
In 2025, the GES team ventured into applying agentic AI systems to optimize business processes, using fine-tuning methodologies to enhance reasoning capabilities in AI agents. The focus was on enabling effective decomposition of complex objectives into executable action sequences that align with predefined behavioral constraints and goal-oriented outcomes. This architectural evolution from specialized Q&A tools to comprehensive business process optimization illustrates how organizations can progressively expand AI capabilities as they gain experience and validation.
## Amazon A+ Content: Quality Assessment at Massive Scale
Amazon A+ Content powers rich product pages across hundreds of millions of annual submissions, requiring content quality evaluation at unprecedented scale. The challenge involved assessing cohesiveness, consistency, and relevancy—not merely surface-level defects—in a context where content quality directly impacts conversion rates and brand trust.
The A+ team built a specialized evaluation agent powered by a fine-tuned model, applying feature-based fine-tuning to Amazon Nova Lite on SageMaker. Rather than updating full model parameters, they trained a lightweight classifier on vision language model (VLM)-extracted features. This approach, enhanced by expert-crafted rubric prompts, improved classification accuracy from 77% to 96%, delivering an AI agent that evaluates millions of content submissions and provides actionable recommendations.
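Feature-based fine-tuning of this kind keeps the large model frozen and trains only a small head on its representations. The sketch below illustrates the pattern with synthetic features standing in for VLM embeddings and a scikit-learn logistic-regression head; it is a generic illustration, not the A+ team's pipeline or the SageMaker workflow.

```python
# Minimal sketch of feature-based fine-tuning: the VLM stays frozen and only
# a lightweight classifier head is trained on its extracted features.
# Synthetic data stands in for real VLM embeddings and quality labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 512))      # stand-in for VLM-extracted feature vectors
y = rng.integers(0, 2, size=1000)     # stand-in for content-quality labels (pass/fail)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
head = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # only the head is trained
print("classification accuracy:", accuracy_score(y_test, head.predict(X_test)))
```

Because only the head is updated, training is cheap enough to re-run as rubrics evolve, which suits a workload of millions of submissions better than full-parameter fine-tuning.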
This case study demonstrates a critical principle articulated in the maturity framework: technique complexity should match task requirements. While the A+ use case operates at massive scale and is high-stakes in terms of business impact, it is fundamentally a classification task well-suited to feature-based fine-tuning rather than requiring the most sophisticated reasoning optimization techniques. Not every agent component needs GRPO or DAPO—selecting the right technique for each problem delivers efficient, production-grade systems.
## Technical Evolution of Fine-Tuning Approaches
The article provides a comprehensive technical overview of the evolution from basic to advanced fine-tuning techniques. SFT established the foundation by using labeled data to teach models to follow specific instructions, but faced limitations in optimizing complex reasoning. Reinforcement learning addressed these limitations with reward-based systems providing better adaptability and alignment with human preferences.
PPO represented a significant advancement with its workflow combining a value (critic) network and policy network. The reinforcement learning policy adjusts LLM weights based on reward model guidance, scaling well in complex environments though introducing challenges with stability and configuration complexity. The technical architecture requires managing multiple networks and careful hyperparameter tuning, which can be operationally demanding.
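For reference, PPO's standard clipped surrogate objective (standard notation, not specific to Amazon's implementation) shows where the critic's advantage estimates and the clipping hyperparameter enter:

```latex
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\;
\operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

Here $\hat{A}_t$ is the advantage estimated with the value (critic) network and $\epsilon$ is the clip range; maintaining both networks and tuning these quantities is the operational burden described above.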
DPO, introduced in 2023 and widely adopted through 2024, addressed PPO's stability issues by eliminating the explicit reward model and working directly with preference data consisting of preferred and rejected responses for given prompts. DPO optimizes LLM weights by comparing preferred and rejected responses, allowing models to learn and adjust behavior accordingly. This simplified approach gained widespread adoption, with major language models incorporating DPO into their training pipelines. Alternative methods including Odds Ratio Preference Optimization (ORPO), Relative Preference Optimization (RPO), Identity Preference Optimization (IPO), and Kahneman-Tversky Optimization (KTO) emerged as computationally efficient methods for human preference alignment, incorporating comparative and identity-based preference structures, with KTO grounded in behavioral economics.
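The DPO objective itself is compact; in standard notation (not tied to any vendor implementation), for preferred response $y_w$ and rejected response $y_l$:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) =
-\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
\log\sigma\!\Big(
\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
-\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
\Big)\right]
```

The frozen reference policy $\pi_{\mathrm{ref}}$ (typically the SFT model) and the scaling parameter $\beta$ replace the separate reward model and RL loop, which is the source of DPO's stability and simplicity.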
As agent-based applications gained prominence in 2025, demand increased for customizing reasoning models in agents to encode domain-specific constraints, safety guidelines, and reasoning patterns aligned with agents' intended functions including task planning, tool use, and multi-step problem solving. The objective was improving agents' performance in maintaining coherent plans, avoiding logical contradictions, and making appropriate decisions for domain-specific use cases.
GRPO was introduced to enhance reasoning capabilities, becoming particularly notable through its use in DeepSeek's models, most visibly DeepSeek-R1. The core innovation lies in its group-based comparison approach: rather than comparing individual responses against a fixed reference, GRPO generates groups of responses and evaluates each against the group's average score, rewarding above-average performance while penalizing below-average results. This relative comparison mechanism creates competitive dynamics encouraging models to produce higher-quality reasoning. GRPO is particularly effective for improving chain-of-thought reasoning, which is the critical foundation for agent planning and complex task decomposition. By optimizing at the group level, GRPO captures inherent variability in reasoning processes and trains models to consistently outperform their own average performance.
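The group-relative advantage at the heart of GRPO is simple to state; the sketch below is a minimal illustration of that normalization step only (sampling, reward scoring, and the policy-gradient update are out of scope).

```python
# Minimal sketch of GRPO's group-relative advantage: sample G responses to
# the same prompt, score them, and normalize each reward against the group
# mean and standard deviation. Above-average responses receive positive
# advantage, below-average ones negative.
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """rewards: shape (G,), scores for G sampled responses to one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. rewards for a group of 4 sampled reasoning traces
print(group_relative_advantages(np.array([0.2, 0.9, 0.4, 0.5])))
# each advantage is then applied to every token of its response in the update
```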
For complex agent tasks requiring more fine-grained corrections within long reasoning chains, DAPO builds upon GRPO's sequence-level rewards through several key innovations: employing a higher clip ratio (approximately 30% higher than GRPO) to encourage more diverse and exploratory thinking processes, implementing dynamic sampling to eliminate less meaningful samples and improve training efficiency, applying token-level policy gradient loss to provide more granular feedback on lengthy reasoning chains rather than treating entire sequences monolithically, and incorporating overlong reward shaping to discourage excessively verbose responses that waste computational resources.
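Schematically, and in notation adapted from the published DAPO formulation rather than anything Amazon-specific, the asymmetric clip range and token-level averaging look like this:

```latex
\mathcal{L}_{\mathrm{DAPO}}(\theta) =
-\,\frac{1}{\sum_{i=1}^{G}|y_i|}\sum_{i=1}^{G}\sum_{t=1}^{|y_i|}
\min\!\Big(
r_{i,t}(\theta)\,\hat{A}_i,\;
\operatorname{clip}\big(r_{i,t}(\theta),\,1-\epsilon_{\mathrm{low}},\,1+\epsilon_{\mathrm{high}}\big)\,\hat{A}_i
\Big),
\qquad \epsilon_{\mathrm{high}} > \epsilon_{\mathrm{low}}
```

Dividing by the total token count across the group, rather than averaging per sequence, is what gives long reasoning chains token-level rather than monolithic feedback, and the larger upper clip bound is what the "higher clip ratio" refers to; dynamic sampling and overlong reward shaping act on the batch and reward sides respectively.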
For agentic use cases that require long text outputs, particularly in Mixture-of-Experts (MoE) model training, Group Sequence Policy Optimization (GSPO) shifts optimization from GRPO's token-level importance weights to the sequence level. Together, these refinements enable more efficient and sophisticated agent reasoning and planning strategies while maintaining computational efficiency and appropriate feedback resolution.
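The key change, in the notation of the published GSPO work (again not Amazon-specific), is a length-normalized sequence-level importance ratio that replaces per-token ratios, with clipping and advantages then applied per sequence:

```latex
s_i(\theta) =
\left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)}\right)^{1/|y_i|}
= \exp\!\left(\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}
\log\frac{\pi_\theta(y_{i,t}\mid x,\, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t}\mid x,\, y_{i,<t})}\right)
```

Because no single noisy token dominates the importance weight, training on very long sequences and sparsely activated MoE models tends to be more stable.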
## Production Architecture and AWS Service Integration
The reference architecture presented demonstrates how fine-tuned models integrate within comprehensive multi-agent systems. Post-trained LLMs play two crucial roles: first, serving as specialized tool-using components and sub-agents within broader agent architectures, acting as domain experts optimized for specific functions; second, serving as core reasoning engines where foundation models are specifically tuned to excel at planning, logical reasoning, and decision-making for agents in highly specific domains.
The architecture encompasses four major component groupings:
**LLM Customization for AI Agents**: Builders can leverage various AWS services for fine-tuning and post-training. For models on Amazon Bedrock, customization approaches include distillation and SFT through parameter-efficient fine-tuning with LoRA for simple tasks, Continued Pre-training (CPT) to extend foundation model knowledge with domain-specific corpora, and Reinforcement Fine-Tuning (RFT) launched at re:Invent 2025. RFT includes two approaches: Reinforcement Learning with Verifiable Rewards (RLVR) using rule-based graders for objective tasks like code generation or math reasoning, and Reinforcement Learning from AI Feedback (RLAIF) using AI-based judges for subjective tasks like instruction following or content moderation.
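The "rule-based graders" behind RLVR are, conceptually, deterministic reward functions over model outputs. The sketch below is a generic, hypothetical example for a math-reasoning task; it is not the Amazon Bedrock RFT grader interface.

```python
# Hypothetical rule-based grader of the kind RLVR relies on: a deterministic
# check of a verifiable property of the output, returning a scalar reward.
# Generic illustration only; not an AWS API.
import re

def math_answer_grader(model_output: str, expected_answer: str) -> float:
    """Reward 1.0 if the last number in the output matches the expected answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == expected_answer else 0.0

print(math_answer_grader("Adding the items gives a total of 42.", "42"))  # 1.0
```

RLAIF swaps this deterministic check for an AI judge scoring subjective qualities, which is why the article pairs the two approaches with objective and subjective tasks respectively.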
For deeper control over customization infrastructure, Amazon SageMaker provides comprehensive capabilities. SageMaker JumpStart accelerates customization with one-click deployment of popular foundation models and end-to-end fine-tuning notebooks handling data preparation, training configuration, and deployment workflows. SageMaker Training jobs provide managed infrastructure for executing custom fine-tuning workflows, automatically provisioning GPU instances, managing training execution, and handling cleanup. This supports custom Docker containers and code dependencies housing any ML framework, training library, or optimization technique.
SageMaker HyperPod, enhanced at re:Invent 2025, introduced checkpointless training reducing checkpoint-restart cycles from hours to minutes, and elastic training automatically scaling workloads to use idle capacity and yielding resources when higher-priority workloads peak. HyperPod supports resilient distributed training clusters with automatic fault recovery for multi-week jobs spanning thousands of GPUs, supporting NVIDIA NeMo and AWS Neuronx frameworks.
For infrastructure-free customization, SageMaker AI serverless customization (launched at re:Invent 2025) provides fully managed UI- and SDK-driven experiences for model fine-tuning, automatically selecting and provisioning appropriate compute resources based on model size and training requirements. Through SageMaker Studio UI or Python SDK, users can customize popular models using SFT, DPO, RLVR, and RLAIF with pay-per-token pricing, automatic resource cleanup, integrated MLflow experiment tracking, and seamless deployment to both Amazon Bedrock and SageMaker endpoints.
Amazon Nova Forge, launched at re:Invent 2025, enables building custom frontier models from early model checkpoints, blending proprietary datasets with Amazon Nova-curated training data and hosting custom models securely on AWS.
**AI Agent Development Environments and SDKs**: Development occurs in IDEs such as SageMaker Studio, Amazon Kiro, or local machines using specialized SDKs and frameworks abstracting orchestration complexity. Strands provides a Python framework purpose-built for multi-agent systems with declarative agent definitions, built-in state management, and native AWS service integrations handling low-level details of LLM API calls, tool invocation protocols, error recovery, and conversation management.
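As a rough sketch of what this declarative style looks like (based on the open-source Strands Agents examples; the tool, prompt, and default model configuration here are assumptions, so consult the Strands documentation for current APIs):

```python
# Hedged sketch of a declarative agent definition in the style of the
# open-source Strands Agents SDK. The tool and prompt are invented for
# illustration; model/provider configuration is omitted and left to defaults.
from strands import Agent, tool

@tool
def lookup_design_spec(facility_id: str) -> str:
    """Hypothetical tool: return the design specification for a fulfillment center."""
    return f"Design spec for facility {facility_id}: conveyor clearance 3 m, ..."

agent = Agent(
    tools=[lookup_design_spec],
    system_prompt="You answer facility design questions using the available tools.",
)

agent("What is the conveyor clearance requirement for facility BFI4?")
```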
**AI Agent Deployment and Operation**: Amazon Bedrock AgentCore handles agent execution, memory, security, and tool integration without requiring infrastructure management. AgentCore Runtime offers purpose-built environments abstracting infrastructure management while container-based alternatives (SageMaker jobs, Lambda, EKS, ECS) provide more control for custom requirements. AgentCore Memory enables agents to remember past interactions for intelligent, context-aware conversations, handling both short-term context and long-term knowledge retention. AgentCore Gateway provides scalable tool building, deployment, discovery, and connection with observability into usage patterns, error handling for failed invocations, and integration with identity systems. AgentCore Observability provides real-time visibility into agent operational performance through CloudWatch dashboards and telemetry for metrics including session count, latency, duration, token usage, and error rates using the OpenTelemetry protocol standard.
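Because AgentCore Observability emits telemetry over the OpenTelemetry protocol, the same signals can be produced or enriched from application code. The snippet below is a generic OpenTelemetry illustration, not the AgentCore integration itself; the attribute names and the agent callable are assumptions.

```python
# Generic OpenTelemetry illustration: wrap an agent invocation in a span and
# attach the kinds of attributes the article lists (span timing covers
# latency/duration). Attribute names and the agent callable are invented.
from opentelemetry import trace

tracer = trace.get_tracer("agent.telemetry")

def invoke_agent_with_telemetry(agent, prompt: str, session_id: str):
    with tracer.start_as_current_span("agent.invocation") as span:
        span.set_attribute("agent.session_id", session_id)
        result = agent(prompt)                    # hypothetical agent callable
        span.set_attribute("agent.token_usage", int(getattr(result, "tokens_used", 0)))
        return result
```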
**LLM and AI Agent Evaluation**: Continuous evaluation and monitoring ensure high quality and performance in production. For models on Amazon Bedrock, Bedrock evaluations generate predefined metrics and human review workflows. For advanced scenarios, SageMaker Training jobs fine-tune specialized judge models on domain-specific evaluation datasets. AgentCore Evaluations (launched at re:Invent 2025) provides automated assessment tools measuring agent or tool performance on completing specific tasks, handling edge cases, and maintaining consistency across different inputs and contexts.
## Phased Maturity Approach and Decision Framework
The article presents a critical decision framework demonstrating that organizations using a phased maturity approach achieve 70-85% production conversion rates compared to 30-40% industry average, with 3-fold year-over-year ROI growth. The 12-18 month journey from initial agent deployment to advanced reasoning capabilities delivers incremental business value at each phase, with the key being allowing use case requirements, available data, and measured performance to guide advancement rather than pursuing technical sophistication for its own sake.
**Phase 1: Prompt Engineering** (6-8 weeks) suits starting agent journeys, validating business value, and simple workflows. It achieves 60-75% accuracy while identifying failure patterns, requiring minimal prompts and examples with investment of $50K-$80K (2-3 FTEs).
**Phase 2: Supervised Fine-Tuning** (12 weeks) addresses domain knowledge gaps, industry terminology issues, and situations requiring 80-85% accuracy. It achieves 80-85% accuracy with 60-80% SME effort reduction, requiring 500-5,000 labeled examples with investment of $120K-$180K (3-4 FTE and compute).
**Phase 3: Direct Preference Optimization** (16 weeks) handles quality/style alignment, safety/compliance critical situations, and brand consistency needs. It achieves 85-92% accuracy with over 20% CSAT improvement, requiring 1,000-10,000 preference pairs with investment of $180K-$280K (4-5 FTE and compute).
**Phase 4: GRPO and DAPO** (24 weeks) addresses complex reasoning requirements, high-stakes decisions, multi-step orchestration, and situations where explainability is essential. It achieves 95-98% accuracy for mission-critical deployment, requiring 10,000+ reasoning trajectories with investment of $400K-$800K (6-8 FTE and HyperPod).
This phased approach emphasizes strategic patience, building reusable infrastructure, collecting quality training data, and validating ROI before major investments. The framework explicitly counters the tendency to pursue advanced techniques prematurely, advocating for alignment between technical sophistication and actual business needs.
## Critical Assessment and Production Considerations
While this case study presents impressive production metrics, several considerations merit attention. The article originates from AWS and serves partly as positioning for their service portfolio, so claims about capabilities and ease of use should be evaluated with appropriate skepticism. However, the concrete production metrics from named Amazon business units (33% medication error reduction, 80% effort reduction, 77% to 96% accuracy improvement) provide substantive validation beyond marketing claims.
The emphasis on "one in four" high-stakes applications requiring advanced fine-tuning is notable—this suggests that 75% of enterprise use cases may be adequately served by simpler approaches, which aligns with broader industry observations about the importance of starting simple before increasing complexity. The phased maturity framework with associated timelines and investment levels provides valuable benchmarking data for organizations planning their own LLMOps journeys, though actual results will vary significantly based on organizational capability, data availability, and use case complexity.
The architectural patterns presented—specialized fine-tuned components within broader agent orchestrations, dual roles as domain experts and reasoning engines, integration of evaluation throughout the lifecycle—reflect emerging best practices in production LLM systems. The emphasis on evaluation criteria that extend far beyond simple accuracy (as demonstrated in the Amazon Pharmacy example) illustrates the maturity of production thinking these teams have developed.
The technical evolution from SFT through PPO, DPO, GRPO, DAPO, and GSPO demonstrates real progression in the field's capabilities for training reasoning-optimized models. However, the operational complexity of implementing these advanced techniques should not be underestimated—the investment levels cited for Phase 4 ($400K-$800K with 6-8 FTE and HyperPod infrastructure) represent substantial commitments that only make sense for truly high-stakes applications where the business value justifies the investment.
The integration with AWS services is comprehensive but creates vendor lock-in considerations—organizations should evaluate whether the convenience of integrated tooling outweighs the flexibility of more open approaches. The recent launches at re:Invent 2025 (RFT, serverless customization, HyperPod enhancements, Nova Forge, AgentCore Evaluations) demonstrate AWS's commitment to reducing operational burden in LLMOps, though production validation of these new capabilities remains limited given their recent introduction.
Overall, this case study provides valuable production learnings from Amazon's internal deployment of advanced fine-tuning techniques across diverse high-stakes domains, offering both technical depth and practical guidance for organizations pursuing similar capabilities.