## Overview
Amazon's Compliance Screening system represents a sophisticated production deployment of LLM-powered multi-agent systems operating at massive scale. The system processes approximately 2 billion transactions daily across more than 160 global businesses, screening against sanctions lists including OFAC's SDN list and the UK's HMT list. This case study demonstrates how Amazon has operationalized GenAI technology to transform compliance operations from a human-intensive bottleneck into an automated, highly accurate system that maintains regulatory adherence while improving customer experience.
The business context is critical: sanctions enforcement has intensified significantly, with US regulators alone imposing $2 billion in penalties since 2023. Manual review processes created substantial operational challenges, with review cycles taking days and directly impacting customer experience through delayed transactions, account holds, and order fulfillment disruptions. The scale of operations required thousands of human experts, creating an unsustainable model as transaction volumes continued growing.
## System Architecture and Three-Tier Approach
The production system implements a three-tier architecture that balances the competing requirements of speed, accuracy, and thoroughness. The first tier employs a screening engine using advanced fuzzy matching algorithms combined with custom vector embedding models developed and deployed on Amazon SageMaker. This foundation layer prioritizes high recall to capture all potential matches, accepting higher false positive rates to ensure no genuine risks slip through.
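As a minimal sketch of what such a recall-first screen might look like, the snippet below unions fuzzy string similarity with embedding cosine similarity. The character-trigram `embed` stand-in and the thresholds are illustrative assumptions; the actual system uses custom embedding models hosted on SageMaker.

```python
# Illustrative tier-1 screen: flag a listed name if EITHER signal fires.
# Thresholds and the trigram embedding are assumptions, not Amazon's values.
import math
from collections import Counter
from difflib import SequenceMatcher

def embed(text: str) -> Counter:
    """Stand-in for the custom SageMaker embedding model: a padded
    character-trigram bag, sufficient to illustrate cosine screening."""
    t = f"  {text.lower()}  "
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def screen(party: str, sanctions_list: list[str],
           fuzzy_t: float = 0.75, cos_t: float = 0.80) -> list[str]:
    """Return every listed name that trips either signal (recall-first)."""
    hits = []
    for listed in sanctions_list:
        fuzzy = SequenceMatcher(None, party.lower(), listed.lower()).ratio()
        if fuzzy >= fuzzy_t or cosine(embed(party), embed(listed)) >= cos_t:
            hits.append(listed)
    return hits

print(screen("Abel Hernandez Jr", ["Hernandez, Abel", "Jane Roe"]))
```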
The second tier applies traditional machine learning models to filter low-quality matches and reduce noise, significantly decreasing false positives by analyzing match quality signals. This allows compliance teams to focus resources on genuine risks rather than wading through obvious false alarms.
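A hedged sketch of what a tier-2 filter could look like follows: a conventional classifier scores match-quality signals and discards obvious noise before any agent runs. The feature set, model choice, and toy training data are assumptions for exposition.

```python
# Tier-2 filter sketch: score match-quality signals, drop obvious noise.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

FEATURES = ["fuzzy_score", "embedding_cosine", "country_match",
            "dob_match", "list_entry_completeness"]

# Placeholder data; in production this would be historical matches labeled
# by whether investigators ultimately confirmed them.
X_toy = np.array([[0.95, 0.92, 1, 1, 0.8],
                  [0.40, 0.35, 0, 0, 0.2]])
y_toy = np.array([1, 0])

clf = GradientBoostingClassifier().fit(X_toy, y_toy)

def keep_for_investigation(row: list[float], threshold: float = 0.10) -> bool:
    """Forward to tier 3 unless the model is confident the match is noise;
    a low threshold preserves recall at the cost of extra agent work."""
    return clf.predict_proba([row])[0][1] >= threshold

print(keep_for_investigation([0.88, 0.90, 1, 0, 0.7]))
```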
The third tier, which forms the focus of this case study, deploys specialized AI agents built on Amazon Bedrock AgentCore Runtime to conduct comprehensive investigations of remaining high-quality matches. This tier represents the production LLMOps innovation, systematically gathering relevant information, analyzing it holistically according to established Investigation Standard Operating Procedures (SOPs) curated by Amazon's Compliance team, and generating detailed case summaries with recommendations.
## Agent Architecture and Production Design
The production system consists of multiple specialized AI agents, each designed to handle specific investigation aspects. The agents are built using the Strands agent framework, powered by large language models hosted on Amazon Bedrock, and deployed on Amazon Bedrock AgentCore Runtime. Each agent uses specialized tools to interact with external systems, access data, and perform designated functions.
The **name matching agent** analyzes name variations, transliterations, and cultural naming conventions across multiple languages and scripts. This agent must handle complex real-world scenarios like recognizing that "Abel Hernandez, Jr." and "Hernandez Abel" may reference the same individual, while also processing non-Latin scripts including Arabic, Chinese, Japanese, and Cyrillic. The agent understands cultural naming conventions such as different ordering of given names and surnames across cultures. For example, when investigating matches between "李明" (Li Ming) and "Ming Li," the agent recognizes both transliteration variation and different name order conventions between Chinese and Western formats.
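To picture this agent in code, here is a hedged sketch of declaring it with the Strands SDK. The Bedrock model ID, system prompt, and output schema are illustrative assumptions, not the production configuration.

```python
# Hypothetical declaration of a name matching agent with Strands.
from strands import Agent

name_matching_agent = Agent(
    model="anthropic.claude-3-5-sonnet-20241022-v2:0",  # any Bedrock model ID
    system_prompt=(
        "You compare a transaction party name against a sanctions list entry. "
        "Reason explicitly about transliteration (Arabic, Cyrillic, Chinese "
        "romanization), name-order conventions, nicknames, and honorifics per "
        "the investigation SOP. Return JSON: "
        '{"match": true|false, "confidence": 0-1, "rationale": "..."}'
    ),
)

result = name_matching_agent("Party: 'Ming Li' vs list entry: '李明 (Li Ming)'")
print(result)
```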
The **address matching agent** focuses on geographic data, understanding address variations, abbreviations, and international formatting differences. It identifies when "123 Main St., New York, NY" and "123 Main Street, New York City, New York" refer to the same location. The production implementation handles international address formats and local abbreviations, validates addresses against geospatial data, and detects potential address obfuscation techniques. The agent can determine that complex addresses like "Flat No. 502, Sai Kripa Apartments, Plot No. 23, Linking Road, Bandra West, Mumbai – 400050, Maharashtra, India" and "Sai Kripa Apts., 5th Floor, Flat 502, 23 Linking Rd., Bandra (W), Mumbai, MH 400050" reference the same location despite formatting differences.
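A deliberately simplified sketch of the canonicalization step such an agent performs is shown below; in production the agent combines LLM reasoning with the maps tool rather than a static abbreviation table.

```python
# Toy address canonicalization: expand common abbreviations, then compare.
import re

ABBREV = {"st": "street", "rd": "road", "apts": "apartments",
          "no": "number", "w": "west", "mh": "maharashtra"}

def canonicalize(address: str) -> str:
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(ABBREV.get(t, t) for t in tokens)

a = "Flat No. 502, Sai Kripa Apartments, Linking Road, Bandra West, Mumbai 400050"
b = "Sai Kripa Apts., Flat 502, Linking Rd., Bandra (W), Mumbai, MH 400050"
print(canonicalize(a))
print(canonicalize(b))  # token overlap is now high enough to flag same location
```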
The **entity type inference agent** determines whether entities are individuals or organizations by first examining data from Amazon's internal systems using a data aggregation tool. When internal evidence proves insufficient, it analyzes naming patterns and corporate indicators (LLC, Inc., Ministry) across multiple languages to classify entity types appropriately.
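The fallback heuristic might look something like the following sketch; the marker list is illustrative and far from exhaustive, and the real agent reasons over naming patterns rather than doing a token scan.

```python
# Fallback classification when internal data is inconclusive (illustrative).
import re

ORG_MARKERS = {"llc", "inc", "ltd", "gmbh", "ministry", "corp", "plc", "co"}

def infer_entity_type(name: str) -> str:
    tokens = set(re.findall(r"[a-z0-9]+", name.lower()))
    if tokens & ORG_MARKERS or name.endswith("有限公司"):  # Chinese "Co., Ltd."
        return "organization"
    return "individual"  # default; the real agent would report low confidence

print(infer_entity_type("Ministry of Industry"))    # organization
print(infer_entity_type("Abel Hernandez, Jr."))     # individual
print(infer_entity_type("北京华信科技有限公司"))     # organization
```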
The **Verified Customer Information (VCI) agent** examines customer-provided information including identification documents, business registrations, and account verification records, providing critical evidence for investigation conclusions.
The **recommendation agent** synthesizes findings from all investigative agents, applying risk-weighted analysis to generate final recommendations with detailed justifications based on gathered evidence. It aggregates evidence from all agents, applies risk scoring based on multiple factors, and generates comprehensive case summaries including all evidence, analysis from each agent, risk assessment, and clear recommendations with supporting rationale. Cases are classified as false positive or true positive with confidence levels.
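A minimal sketch of what risk-weighted aggregation could look like follows; the weights, confidence floor, and decision threshold are assumptions for illustration.

```python
# Illustrative risk-weighted synthesis across investigative agents.
from dataclasses import dataclass

@dataclass
class Finding:
    agent: str
    match_score: float   # 0-1 risk signal from the investigative agent
    confidence: float    # the agent's own confidence in that signal

WEIGHTS = {"name_matching": 0.35, "address_matching": 0.25,
           "entity_type": 0.15, "vci": 0.25}

def recommend(findings: list[Finding], threshold: float = 0.6,
              confidence_floor: float = 0.7):
    score = sum(WEIGHTS[f.agent] * f.match_score for f in findings)
    if min(f.confidence for f in findings) < confidence_floor:
        return "ESCALATE_TO_HUMAN", score   # inconclusive evidence
    verdict = "TRUE_POSITIVE" if score >= threshold else "FALSE_POSITIVE"
    return verdict, score

print(recommend([Finding("name_matching", 0.9, 0.95),
                 Finding("address_matching", 0.8, 0.90),
                 Finding("entity_type", 1.0, 0.99),
                 Finding("vci", 0.2, 0.85)]))
```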
The **orchestration agent** coordinates workflow between all agents, managing dependencies, optimizing investigation sequences based on case complexity, handling parallel execution where appropriate, and monitoring agent progress while handling exceptions. The orchestration agent is implemented using the Strands Graph Multi-Agent Pattern, which enables deterministic execution order based on graph structure—a critical requirement for compliance and auditability.
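A hedged sketch of such a deterministic investigation graph using Strands' multi-agent graph builder appears below. The node wiring mirrors the agents described above; prompts are placeholders and the exact builder API may vary across Strands versions.

```python
# Deterministic investigation graph (sketch): edges encode execution order.
from strands import Agent
from strands.multiagent import GraphBuilder

entity_type = Agent(system_prompt="Classify the entity type per the SOP.")
name_match = Agent(system_prompt="Analyze name variations per the SOP.")
addr_match = Agent(system_prompt="Analyze address evidence per the SOP.")
vci = Agent(system_prompt="Review verified customer information per the SOP.")
recommend = Agent(system_prompt="Synthesize all findings into a recommendation.")

builder = GraphBuilder()
builder.add_node(entity_type, "entity_type")
builder.add_node(name_match, "name_match")
builder.add_node(addr_match, "addr_match")
builder.add_node(vci, "vci")
builder.add_node(recommend, "recommend")

# Entity typing gates both matchers; the recommendation node cannot fire
# until every upstream investigative node has finished.
builder.add_edge("entity_type", "name_match")
builder.add_edge("entity_type", "addr_match")
builder.add_edge("name_match", "recommend")
builder.add_edge("addr_match", "recommend")
builder.add_edge("vci", "recommend")

graph = builder.build()
result = graph("Investigate screening hit: party vs. SDN entry")
```

Because the execution order comes from the graph structure rather than model output, the same case always yields the same investigation sequence, which is what makes the audit trail reproducible.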
## Tools and External Integration
The production system extends agent capabilities through carefully designed tools that enable interaction with external systems and data access. The **data aggregator tool** serves as the critical bridge between AI agents and Amazon's extensive data ecosystem, retrieving and consolidating information from multiple internal sources including Know Your Customer (KYC) systems, transaction history systems, account verification records, party profile information, and historical compliance data.
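A sketch of how such a tool might be exposed to agents via the Strands `@tool` decorator follows; the `_stub` helper stands in for hypothetical internal service calls and is not part of Amazon's actual integration.

```python
# Data aggregator tool sketch: one call consolidates internal sources.
from strands import tool

def _stub(source: str, party_id: str) -> dict:
    """Placeholder for an internal service call (hypothetical)."""
    return {"source": source, "party_id": party_id, "records": []}

@tool
def aggregate_party_data(party_id: str) -> dict:
    """Consolidate internal records for one party: KYC, transaction
    history, account verification, party profile, prior compliance cases."""
    sources = ["kyc", "transactions", "verification", "profile", "prior_cases"]
    return {s: _stub(s, party_id) for s in sources}
```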
The **maps tool** provides geospatial intelligence and address validation capabilities, performing address validation and standardization, geocoding and reverse geocoding, jurisdiction analysis, and distance calculations between locations. It integrates multiple geographical APIs to help agents verify location-based information and detect address discrepancies—essential functionality for the address matching agent.
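For the distance-calculation piece specifically, a self-contained haversine sketch illustrates the kind of computation involved; production would call managed geospatial APIs instead.

```python
# Great-circle distance between two coordinates, in kilometers.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Two geocodes of the same Mumbai address should land within a few hundred meters.
print(haversine_km(19.0607, 72.8362, 19.0611, 72.8368))
```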
The **open source data tool** aggregates publicly available information from multiple third-party data providers, enabling comprehensive open source intelligence gathering during investigations. This tool connects to various specialized data sources, including corporate registry databases, providing agents with external context beyond Amazon's internal data.
## Compliance-First Design and Governance
The production deployment implements rigorous compliance-first design principles. All agents receive strict guidance and follow Investigation Standard Operating Procedures (SOPs) curated by Amazon's Compliance team, ensuring AI reasoning aligns with regulatory requirements and company policies. Every decision and reasoning step is logged and traceable, creating complete investigation trails that enable auditing and understanding of exactly how AI reached conclusions.
The system implements rigorous checks to verify that agents strictly adhere to assigned SOPs, maintaining consistency across all investigations regardless of case volume or complexity. For example, the recommendation agent cannot execute until it has received results from all investigative agents (name matching, address matching, entity type inference, and VCI agents). When agent confidence scores fall below defined thresholds, cases are automatically escalated to human investigators through human-in-the-loop tasks.
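In code, the gating and escalation logic could be as simple as the following sketch; the required-agent set mirrors the text, while the confidence floor and field names are assumptions.

```python
# Guardrail sketch: block recommendation until all upstream results exist,
# and escalate low-confidence cases to human investigators.
REQUIRED = {"name_matching", "address_matching", "entity_type", "vci"}
CONFIDENCE_FLOOR = 0.85  # assumed value

def route_case(results: dict[str, dict]) -> str:
    missing = REQUIRED - results.keys()
    if missing:
        raise RuntimeError(f"SOP violation: recommendation blocked, missing {missing}")
    if any(r["confidence"] < CONFIDENCE_FLOOR for r in results.values()):
        return "HUMAN_IN_THE_LOOP"   # create a review task for an investigator
    return "AUTO_RECOMMEND"
```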
This compliance-first approach reflects thoughtful LLMOps practice, recognizing that autonomous AI systems operating in regulated domains require strong governance frameworks, complete auditability, and clear escalation paths. The system reserves human involvement exclusively for cases where agents cannot reach conclusive determinations due to insufficient evidence or exceptional circumstances falling outside standard investigation parameters.
## Technology Stack and Infrastructure Choices
The agents are built using Strands and deployed on Amazon Bedrock AgentCore Runtime, a combination providing powerful capabilities for agentic development, session management, security isolation, and infrastructure management. This technology choice enables builders to focus on solving core problems and multi-agent orchestration while abstracting complex implementation details.
Strands offers seamless integration with AWS services and external tools through the standardized Model Context Protocol (MCP) and Agent-to-Agent (A2A) constructs, with access to the full selection of models available on Amazon Bedrock. The AgentCore Runtime serverless architecture proves critical for handling unpredictable workloads: when sanctions lists update daily, case volumes spike dramatically. AgentCore Runtime automatically scales up during surges and down during normal operations, providing efficient compute resource usage without manual intervention.
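A hedged sketch of wrapping an agent for AgentCore Runtime with the `bedrock-agentcore` Python SDK is shown below; the payload shape and handler body are illustrative assumptions.

```python
# Sketch: expose a Strands agent as an AgentCore Runtime entrypoint.
from bedrock_agentcore.runtime import BedrockAgentCoreApp
from strands import Agent

app = BedrockAgentCoreApp()
investigator = Agent(system_prompt="Investigate screening hits per the SOP.")

@app.entrypoint
def invoke(payload: dict) -> dict:
    """AgentCore invokes this per request; the runtime handles session
    isolation and scales capacity up and down with case volume."""
    result = investigator(payload.get("case_description", ""))
    return {"result": str(result)}

if __name__ == "__main__":
    app.run()
```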
This infrastructure choice reflects mature LLMOps thinking about production requirements beyond model performance alone. The serverless architecture addresses real operational challenges of variable workloads, while the standardized integration protocols (MCP and A2A) enable maintainable, extensible agent systems.
## Production Results and Performance
The production system achieves 96% overall accuracy, with 96% precision and a 100% recall rate measured against historical decisions, reportedly outperforming human reviewers. The system automates decision-making for over 60% of case volume, representing substantial operational impact at the scale of 2 billion daily transactions. The AI agents analyze hundreds of data points simultaneously while following complex procedures, enabling thorough investigation at speeds impossible for human reviewers.
While these results appear impressive, it's important to maintain a balanced perspective. The case study is authored by Amazon team members and published on AWS blogs, serving promotional purposes for AWS services. The claimed "outperformance" of humans should be understood in context: the system achieves 100% recall by design (catching all true positives that humans identified historically), and the 96% precision means roughly 4% of automated positive determinations are false positives, which may still represent substantial absolute numbers at this transaction volume. The 60% automation rate means 40% of cases still require human review, suggesting significant complexity remains in compliance screening that current AI systems cannot fully handle autonomously.
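To make the reported metrics concrete, the standard definitions give:

```latex
\text{precision} = \frac{TP}{TP + FP} = 0.96
  \;\Rightarrow\; \frac{FP}{TP + FP} = 0.04,
\qquad
\text{recall} = \frac{TP}{TP + FN} = 1.00
  \;\Rightarrow\; FN = 0.
```

That is, 4 of every 100 automated positive determinations are false positives, and no historically confirmed true positive is missed relative to the human-labeled baseline.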
## Key LLMOps Learnings and Best Practices
The team shares several valuable LLMOps learnings from production deployment. First, they emphasize starting with clear SOPs, noting that success depends heavily on having well-defined Standard Operating Procedures before implementing AI agents. They invested significant time working with compliance teams to document and formalize investigation procedures upfront. This investment proved crucial, as agents can only be as effective as the procedures they follow—a critical insight for operationalizing LLMs in regulated domains.
Second, they advocate iterative agent development rather than deploying all agents simultaneously. They started with the name matching agent, validated its performance, then progressively added complexity with additional agents. This iterative approach allowed early issue identification and resolution, built stakeholder confidence, and provided valuable lessons about agent design before scaling to more complex multi-agent scenarios.
Third, they learned the importance of balancing autonomy and control. Too much autonomy creates unpredictable behavior: in early iterations, agents skipped address verification when name similarity seemed low, reasoning it was unnecessary. Too much control limits effectiveness. They therefore apply different approaches based on function and risk. For the orchestration agent, they implemented the Strands Graph Multi-Agent Pattern to enforce deterministic execution order (entity type inference before name matching), guaranteeing SOP compliance and complete audit trails. For investigative agents such as the name matching agent, model-driven approaches grant reasoning autonomy to handle complex scenarios (Arabic-to-English transliterations, Cyrillic variations, Chinese romanization) that demand dynamic reasoning over linguistic patterns and cannot be captured by rigid rules.
Fourth, they emphasize that tool design is critical. The quality and reliability of tools directly impact agent performance. They learned to design tools with clear interfaces, comprehensive error handling, and detailed documentation so agents can use them effectively. Well-designed tools make the difference between agents that consistently produce valuable insights and those that struggle with basic tasks.
Fifth, they stress building comprehensive observability from day one. Understanding why agents make specific decisions is necessary for compliance and continuous improvement. They can trace every agent action, every tool call, and every reasoning step, which has proven invaluable for troubleshooting, auditing, and optimizing agent behavior. This observability requirement is particularly acute in regulated domains where decisions must be explainable to auditors and regulators.
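As a minimal illustration of step-level tracing, the sketch below wraps a tool call in an OpenTelemetry span, a pattern agent frameworks such as Strands commonly integrate with; the span names and attributes are illustrative assumptions.

```python
# Trace every tool invocation so each investigation step is replayable.
from opentelemetry import trace

tracer = trace.get_tracer("compliance.investigation")

def traced_tool_call(case_id: str, tool_name: str, fn, *args):
    """Wrap a tool invocation in a span for audit and troubleshooting."""
    with tracer.start_as_current_span(f"tool.{tool_name}") as span:
        span.set_attribute("case.id", case_id)
        span.set_attribute("tool.name", tool_name)
        result = fn(*args)
        span.set_attribute("tool.result_chars", len(str(result)))
        return result
```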
## Critical Assessment and Trade-offs
While the case study presents impressive results, several considerations merit balanced assessment. The system still requires human intervention for 40% of cases, indicating substantial investigative complexity that current AI systems cannot fully handle. The compliance domain presents unique challenges where errors carry severe consequences—the $2 billion in regulatory penalties since 2023 mentioned in the case study illustrates the stakes involved.
The reliance on documented SOPs as the foundation for agent behavior creates both strengths and limitations. SOPs provide essential structure and auditability, but they also constrain agents to codified procedures that may not capture all investigative nuance. The automatic escalation to humans when confidence scores fall below thresholds is prudent risk management, but determining appropriate thresholds involves inherent trade-offs between automation rates and risk tolerance.
The technology stack choice of Amazon Bedrock and AgentCore Runtime naturally serves AWS promotional interests, though the technical rationale (serverless scaling, standardized integration protocols, session management) appears sound. Organizations evaluating similar systems should consider whether equivalent capabilities might be achieved with alternative technology stacks, potentially at different cost structures or with different operational characteristics.
The claimed performance comparison to humans warrants careful interpretation. The 100% recall rate means the system catches all true positives that human reviewers identified in historical data—essentially learning to replicate human decisions rather than necessarily improving upon them. The real innovation lies in achieving this performance at scale and speed while automating 60% of cases, not necessarily in superior judgment compared to human experts.
## Broader LLMOps Implications
This case study demonstrates several important patterns for operationalizing LLMs in production at scale. The multi-agent architecture with specialized agents reflects growing recognition that complex tasks benefit from decomposition into sub-tasks handled by specialized components rather than monolithic prompt engineering. The orchestration agent using graph-based deterministic execution patterns shows how to combine the flexibility of LLM reasoning with the predictability required for regulated domains.
The emphasis on tools and external integration highlights that production LLM systems rarely operate in isolation—they must interact with existing data systems, APIs, and organizational infrastructure. The quality of these integrations significantly impacts overall system effectiveness, suggesting that LLMOps involves substantial systems engineering beyond prompt engineering and model selection.
The serverless deployment architecture addresses real operational challenges of variable workloads and scaling requirements. As LLM-powered systems move from prototypes to production, infrastructure choices that enable automatic scaling, efficient resource utilization, and operational simplicity become increasingly important. The case study demonstrates mature thinking about these operational requirements.
The compliance-first design with comprehensive logging, tracing, and escalation mechanisms illustrates how high-stakes domains require additional governance layers around LLM systems. Organizations deploying LLMs in regulated domains should expect significant investment in auditability, explainability, and human oversight mechanisms beyond the core model capabilities. This governance overhead is not optional—it's fundamental to responsible deployment in domains where errors carry severe consequences.