## Overview
Coinbase is a cryptocurrency exchange platform with a mission to expand economic freedom to over a billion people by providing secure infrastructure for trading and transacting crypto assets globally. The company serves millions of users across more than 100 countries and manages billions of dollars in trading volume. This case study, presented at AWS re:Invent 2024/2025, details how Coinbase scaled their Gen AI capabilities across three critical operational domains: customer support, compliance investigations, and developer productivity.
The presentation was delivered jointly by Joshua Smith (Senior Solutions Architect at AWS Financial Services) and Varsha Mahadevan (Director of Machine Learning and AI at Coinbase), providing both the infrastructure provider's perspective and the practitioner's real-world implementation details.
## Strategic Context and ML Foundation
Before diving into their Gen AI initiatives, it's important to understand that Coinbase already had extensive machine learning infrastructure in place. Traditional ML models power critical security functions including account takeover detection at login, credit default risk assessment for fiat currency transfers, and fraud detection for blockchain transactions. Additionally, ML drives personalization features like search results, news feeds, recommendations, and price alerts. This ML infrastructure runs on Anyscale, a managed platform built on the open-source Ray framework, deployed on AWS EKS clusters.
Coinbase has also developed innovative blockchain-specific AI solutions including graph neural networks for adaptive risk scoring of blockchain addresses, smart contract auditing combined with ML for ERC20 scam token detection, and predictive models for database scaling ahead of market volatility surges. This established ML foundation provided the technical maturity and infrastructure basis for their Gen AI expansion.
## Gen AI Platform Architecture
Coinbase designed their Gen AI platform with two guiding principles: breadth and depth. For breadth, they built an internal platform that provides standardized access to multiple LLMs and data sources. The platform uses OpenAI's API standards for LLM access and Model Context Protocol (MCP) standards for data endpoint access. This standardization allows any team at Coinbase to leverage and extend AI capabilities for their specific use cases without reinventing integration patterns.
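The "breadth" pattern can be illustrated with a short sketch. The idea is that every team builds requests in the OpenAI chat-completions shape and points its client at the internal gateway, so swapping the backing model is a gateway-side change rather than a client rewrite. The model alias and endpoint semantics below are hypothetical illustrations, not Coinbase's actual internals.

```python
# Sketch of a shared OpenAI-compatible interface. The gateway behind it can
# route to any backing model; the request shape never changes for callers.

def build_chat_request(model: str, system: str, user: str,
                       temperature: float = 0.0) -> dict:
    """Build a request body in the OpenAI chat-completions format,
    which an internal gateway could accept regardless of backing model."""
    return {
        "model": model,  # e.g. an internal alias for a Bedrock-served model
        "temperature": temperature,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }

req = build_chat_request(
    model="claude-internal-alias",  # hypothetical model alias
    system="You are a Coinbase support assistant.",
    user="Why is my transaction pending?",
)
```

Because every team speaks the same schema, model evaluation and replacement become platform concerns rather than per-application migrations.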
The depth aspect involves making targeted, high-impact investments in specific domains rather than spreading efforts thinly across all possible applications. The three chosen domains—customer support, compliance, and developer productivity—were selected based on business impact potential and technical feasibility given the evolving capabilities of LLMs.
## Customer Support: Multi-Layered Agentic System
### Problem Context
Customer support at Coinbase faces unique challenges. Crypto market volatility can cause user activity to swing up or down by 50% within a single month, making it impossible to scale human support teams quickly enough. The company operates globally with diverse languages and regulatory requirements, and trust is paramount—customers need to feel safe and supported regardless of market conditions. Chat has become the preferred support channel for over 50% of customers.
### Three-Layer Chatbot Architecture
Coinbase built their AI-powered chatbot iteratively in three distinct layers, each adding more sophisticated capabilities:
**Layer 1: RAG-Based FAQ System**
The foundational layer provides simple FAQ-style responses using retrieval-augmented generation. This layer handles straightforward queries about sign-in issues, two-factor authentication, and general how-to questions. The architecture centers on a RAG retriever using Amazon Bedrock Knowledge Bases, where Coinbase help articles are vectorized and stored. They employ Cohere's re-rank models to improve retrieval accuracy. The system includes a vector database serving as the memory layer to maintain conversation history and context.
Response generation uses a mixture of LLMs, prominently featuring Anthropic's Claude models served through Bedrock. Notably, the response generation involves a sub-agent built with an actor-critic architecture to refine outputs. The entire system is bookended by input and output guardrails powered by Bedrock Guardrails to protect against harmful content and PII leakage, supplemented by custom domain-specific filters to minimize prompt injection and reduce hallucination through grounding rules.
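The Layer 1 flow (retrieve, re-rank, generate, guard) can be sketched as a minimal pipeline. Retrieval and re-ranking are stubbed with keyword matching here, where the production system uses Bedrock Knowledge Bases and Cohere re-rank models, and the guardrail is a toy PII redactor standing in for Bedrock Guardrails; the article store and grounding rule are illustrative assumptions.

```python
# Minimal sketch of the Layer-1 pipeline: retrieve -> re-rank -> generate,
# bookended by an output guardrail. All model calls are stubbed.
import re

HELP_ARTICLES = {  # stand-in for the vectorized help-article store
    "2fa": "To reset two-factor authentication, open Settings > Security...",
    "signin": "If you cannot sign in, first reset your password...",
}

def retrieve(query: str) -> list[str]:
    # Stub: keyword match in place of vector similarity search.
    return [text for key, text in HELP_ARTICLES.items() if key in query.lower()]

def rerank(query: str, docs: list[str]) -> list[str]:
    # Stub: naive term overlap in place of a Cohere re-rank call.
    def score(doc: str) -> int:
        return sum(word in doc.lower() for word in query.lower().split())
    return sorted(docs, key=score, reverse=True)

def output_guardrail(text: str) -> str:
    # Toy PII guardrail: redact email-shaped strings from the response.
    return re.sub(r"\S+@\S+", "[REDACTED]", text)

def answer(query: str) -> str:
    docs = rerank(query, retrieve(query))
    if not docs:
        return "Escalating to a human agent."  # grounding rule: no source, no answer
    draft = f"Per our help center: {docs[0]}"   # generation step stubbed
    return output_guardrail(draft)
```

The grounding rule at the end mirrors the hallucination-reduction idea: if retrieval finds nothing to cite, the bot declines rather than improvising.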
**Layer 2: Business Procedure Automation**
As LLM capabilities improved, Coinbase enhanced the chatbot to autonomously execute business procedures beyond simple information retrieval. This layer can conversationally collect information from users and take direct actions on their behalf. For example, it can answer account-specific queries or investigate pending transaction statuses.
The architecture introduces a Business Procedure Classifier that routes queries to specialized sub-agents, each emulating a specific business procedure. The original RAG agent becomes one specialized sub-agent among many, executing a procedure that involves knowledge base lookup. This design creates a single source of truth for business procedures used by both human agents and AI systems, providing exceptional adaptability for training and updates. All data access is standardized through MCP servers.
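The routing design can be sketched as a registry of procedure handlers behind a classifier. The classifier is a keyword stub standing in for an LLM-based one, and the procedure names and handler bodies are hypothetical, but the structure shows the key point: the Layer 1 RAG agent becomes just one registered sub-agent.

```python
# Sketch of Layer 2: a classifier routes each query to a specialized
# sub-agent, one per business procedure.

PROCEDURES: dict = {}

def procedure(name: str):
    """Register a sub-agent under a business-procedure name."""
    def wrap(fn):
        PROCEDURES[name] = fn
        return fn
    return wrap

@procedure("pending_transaction")
def investigate_pending_tx(query: str) -> str:
    # Would call transaction-status tools via MCP servers.
    return "Your transaction is awaiting network confirmations."

@procedure("faq_lookup")
def faq_lookup(query: str) -> str:
    # The original Layer-1 RAG agent, now one sub-agent among many.
    return "Here is the relevant help article..."

def classify(query: str) -> str:
    # Stub classifier; production would prompt an LLM with procedure docs.
    return "pending_transaction" if "pending" in query.lower() else "faq_lookup"

def handle(query: str) -> str:
    return PROCEDURES[classify(query)](query)
```

Because human agents and sub-agents draw from the same procedure definitions, updating a procedure updates both at once, which is the single-source-of-truth benefit described above.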
**Layer 3: Proactive Issue Resolution**
The most advanced layer enables the chatbot to anticipate and resolve issues before users explicitly ask. By tapping into user signals and monitoring active platform incidents, the system can proactively address common problems. This capability is implemented as another ReAct agent that attempts proactive resolution first, falling back to the business procedure classifier if needed.
### Agent Assist for Human Support
For complex cases that escalate to human agents, Coinbase built an Agent Assist tool that provides real-time assistance. The tool draws from account signals, ongoing incident data, past support tickets, and other sources to help agents diagnose issues and suggest precise responses in multiple languages. This is particularly valuable given Coinbase's global operation across 100+ geographies.
### Design Principles and Monitoring
Several factors were central to Coinbase's design decisions. Model selection balanced accuracy, latency, and scalability, and was treated as an ongoing evaluation process rather than a one-time decision as model capabilities evolved. Tool standardization through MCP provided a strong foundation not just for customer support but for other domains. The focus on business procedures as a single source of truth provided crucial business adaptability.
Factual correctness and grounding were paramount. To ensure quality, every chatbot response undergoes "LLM as a judge" evaluation, assessing relevancy, accuracy, potential bias, and hallucinations. These quality metrics are actively tracked and monitored for trends, allowing the team to quickly spot anomalies and intervene as needed.
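The evaluation loop can be sketched as follows. The judge call is stubbed with a canned JSON verdict and the rubric weights are illustrative; the real system prompts a judge model per response, but the shape of the loop (score every response against a rubric, then track rolling averages for anomaly detection) is the same.

```python
# Sketch of the "LLM as a judge" loop: each chatbot response is scored by a
# second model against a rubric, and scores are tracked for trend monitoring.
import json

RUBRIC = ["relevancy", "accuracy", "bias", "hallucination"]

def judge(question: str, answer: str) -> dict:
    # Production would prompt a judge model with the rubric; stubbed here
    # with a canned verdict in the expected JSON shape.
    raw = '{"relevancy": 0.9, "accuracy": 0.85, "bias": 0.0, "hallucination": 0.05}'
    verdict = json.loads(raw)
    assert set(verdict) == set(RUBRIC), "judge must cover the full rubric"
    return verdict

class QualityMonitor:
    """Rolling averages per metric so anomalies stand out in dashboards."""
    def __init__(self):
        self.totals = {m: 0.0 for m in RUBRIC}
        self.n = 0

    def record(self, verdict: dict) -> None:
        self.n += 1
        for m in RUBRIC:
            self.totals[m] += verdict[m]

    def averages(self) -> dict:
        return {m: self.totals[m] / self.n for m in RUBRIC}
```

Tracking per-metric averages over time is what turns one-off judgments into the monitored trend lines the team uses to spot regressions.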
### Customer Support Results
The impact has been substantial: approximately 65% of customer contacts are now handled automatically by AI systems, saving nearly 5 million employee hours annually. Critically, these automated cases are resolved in a single interaction, typically in under 10 minutes, compared to up to 40 minutes for cases handled by human agents. This represents not just productivity gains and cost savings, but a significant enhancement to user experience.
## Compliance: Automating Complex Investigations
### Compliance Challenges
As a regulated financial entity, Coinbase must uphold strict standards for anti-money laundering (AML), countering the financing of terrorism (CFT), and anti-bribery and corruption (ABC). They implement Know Your Customer (KYC), Know Your Business (KYB), and Transaction Monitoring Systems (TMS) processes. These compliance workflows are human-intensive and difficult to scale with market volatility. Regulatory bodies demand thorough investigations with full explainability for all cases. Operating across multiple countries means adapting to diverse regulatory requirements—one size does not fit all.
### Breadth and Depth Strategy
Similar to customer support, Coinbase applied breadth through their Gen AI platform's standardized LLM and MCP interfaces. However, compliance has a distinguishing feature: it also relies on traditional deep learning models for risk detection. These models, built on their Anyscale/Ray ML platform, detect high-risk cases across KYC, KYB, and TMS workflows.
For depth, Coinbase built advanced deep learning models for risk detection and used Gen AI to automate and accelerate the investigation process following detection. The investigations involve gathering and synthesizing data from diverse sources including internal systems and open-source intelligence.
### Compliance Assist Tool and Holistic Review
The Compliance Assist tool provides compliance agents with AI-generated investigation reports. When deep learning risk models trigger alerts on high-risk cases, they initiate a "holistic review"—a comprehensive investigation. The Compliance Auto Resolution (CAR) engine orchestrates this agentic workflow.
The architecture coordinates human-in-the-loop processes with two personas: internal compliance operations agents who review AI findings and provide feedback, and end customers who may be contacted through Requests for Information (RFI) when additional data is needed. Throughout the process, the engine aggregates and synthesizes data from multiple sources via standardized MCP data connectors.
The output is a robust AI-generated narrative summary that presents the evidence and reasoning. However, the final decision—including whether to file a Suspicious Activity Report (SAR) with government authorities—always rests with human compliance agents. This approach combines AI's speed and depth with essential human oversight and accountability, which is critical in a regulated environment.
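The holistic review workflow described above can be sketched as a small state machine: gather evidence through MCP-style connectors, synthesize a narrative, and hard-gate any SAR filing behind an explicit human decision. Connector names, fields, and method names here are hypothetical illustrations of the pattern, not the CAR engine's actual API.

```python
# Sketch of a CAR-style holistic review with human-in-the-loop gating.
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class HolisticReview:
    case_id: str
    evidence: list = field(default_factory=list)
    narrative: str = ""
    human_decision: Optional[str] = None  # set only by a compliance agent

    def gather(self, connectors: "dict[str, Callable[[str], str]]") -> None:
        # Each connector stands in for an MCP data endpoint
        # (internal systems, open-source intelligence, etc.).
        for name, fetch in connectors.items():
            self.evidence.append(f"[{name}] {fetch(self.case_id)}")

    def synthesize(self) -> None:
        # Production uses an LLM to write the narrative; joined here.
        self.narrative = " ".join(self.evidence)

    def file_sar(self) -> bool:
        # Hard rule: no filing without an explicit human decision.
        if self.human_decision is None:
            raise PermissionError("SAR filing requires a human decision")
        return self.human_decision == "file"
```

The key design choice is that `file_sar` raises rather than defaulting: the AI can assemble and argue, but the accountable action is structurally impossible without a human in the loop.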
## Developer Productivity: AI-Powered SDLC
### Code Authoring
Coinbase recognizes that developers are passionate and opinionated about their tools, so rather than mandating a single solution, they offer best-in-class coding assistants as "paved paths." Developers can choose tools like Anthropic's Claude Code (which integrates into IDEs or works from the command line) or Cursor (a context-aware intelligent IDE). These tools are powered by Anthropic models served through Bedrock. This approach respects developer preferences while standardizing on the underlying infrastructure.
### Pull Request and Code Review Automation
Coinbase developed a homegrown tool adapted from open-source software and enhanced with Claude models from Bedrock. Implemented as an AI-powered GitHub Action, it automates several aspects of PR review:
- Summarizes the pull request and underlying code changes, addressing a common pain point where PRs contain dozens of changed files without clear context
- Generates clear, natural language review comments similar to what a senior engineer would provide
- Enforces coding conventions automatically, freeing human reviewers from explaining basic standards to newer developers
- Highlights gaps in unit testing coverage
- Provides debugging tips for CI/CD failures
Importantly, this doesn't eliminate human code review but rather handles routine aspects automatically, allowing human reviewers to focus on nuanced architectural and logic issues that provide higher value.
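The reviewer's two routine duties (summarizing the diff, flagging convention violations) can be sketched as the script a GitHub Action might run. The LLM summary is stubbed with a simple diff parse, and the convention check is a toy trailing-whitespace rule; the production tool sends the diff to Claude models on Bedrock and applies Coinbase's actual conventions.

```python
# Sketch of an AI PR reviewer's core steps, as run inside a GitHub Action.

def summarize_diff(diff: str) -> str:
    # Stub for the LLM summary; a real call would send the diff to Bedrock.
    files = [line.split(" b/")[-1] for line in diff.splitlines()
             if line.startswith("diff --git")]
    return f"Touches {len(files)} file(s): {', '.join(files)}"

def convention_comments(diff: str) -> list:
    # Toy convention check: flag added lines with trailing whitespace.
    comments = []
    for i, line in enumerate(diff.splitlines(), 1):
        if line.startswith("+") and line.rstrip() != line:
            comments.append(f"line {i}: trailing whitespace on added line")
    return comments

DIFF = (
    "diff --git a/app.py b/app.py\n"
    "+def hello():   \n"
    "+    return 'hi'\n"
)
summary = summarize_diff(DIFF)        # posted as the PR summary comment
comments = convention_comments(DIFF)  # posted as inline review comments
```

In practice the Action would post `summary` as a PR comment and `comments` as inline review annotations, leaving architectural and logic concerns to the human reviewer.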
### Quality Assurance: Automated UI Testing
Coinbase built a homegrown AI-powered tool for automated end-to-end UI testing for web and mobile interfaces. The system converts natural language test descriptions directly into autonomous browser actions, essentially testing the UI as a human would. These actions are executed across different form factors using services like BrowserStack and frameworks like Playwright.
When issues are found, the system captures screenshots and generates structured reports, making it easy for development teams to address bugs. This brings significant scale and agility to UI testing, which traditionally requires substantial manual effort.
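The natural-language-to-action step can be sketched as a parser that maps an English test step to a structured browser action. The parser here is a regex stub standing in for the LLM, and the step grammar is an assumption; the resulting action dicts are what a Playwright/BrowserStack runner would then execute.

```python
# Sketch: turn an English test step into a structured browser action.
import re

def parse_step(step: str) -> dict:
    """Map an English test step to a structured action (stubbed with regexes;
    production would use an LLM to interpret arbitrary phrasing)."""
    step = step.strip().lower()
    m = re.match(r"click (?:the )?['\"]?(.+?)['\"]? button", step)
    if m:
        return {"action": "click", "role": "button", "name": m.group(1)}
    m = re.match(r"type ['\"](.+)['\"] into (?:the )?(.+)", step)
    if m:
        return {"action": "fill", "target": m.group(2), "value": m.group(1)}
    return {"action": "unknown", "raw": step}

# A Playwright runner would translate these into calls such as
#   page.get_by_role("button", name=action["name"]).click()
actions = [parse_step(s) for s in
           ['Click the "Sign In" button', 'Type "alice" into the email field']]
```

Keeping the action schema separate from the executor is what lets the same parsed test run across form factors via BrowserStack.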
### Developer Productivity Results
The results demonstrate strong adoption and impact: approximately 40% of all code written daily at Coinbase is now AI-generated or AI-influenced, with a goal to exceed 50%. The automated PR reviewer saves an estimated 75,000 hours annually while raising overall code quality through consistent convention enforcement.
The QA automation results are particularly impressive: it achieves accuracy on par with human testers while detecting 3× as many bugs in the same timeframe. New tests can be introduced in as little as 15 minutes compared to hours of training required for human testers. Cost efficiency shows approximately 86% reduction compared to traditional manual testing. While Coinbase acknowledges that every AI-generated line still needs human review and that AI isn't suitable for every business context, the productivity gains are substantial where appropriately applied.
## Infrastructure and AWS Services
The implementation leverages several AWS services, with Amazon Bedrock playing a central role as a fully managed service for building, deploying, and operating Gen AI applications including agents. Bedrock provides access to foundation models from Anthropic, Meta, Mistral, and Amazon through a single API, with tools for private model customization, safety guardrails, and cost/latency optimization.
Amazon Bedrock AgentCore, a relatively new offering, addresses the operational challenges of production agentic systems. AgentCore provides:
- **Runtime**: Serverless, purpose-built runtime for deploying and scaling agents regardless of framework, protocol, or model choice, supporting long-running workloads up to 8 hours with checkpointing and recovery capabilities
- **Gateway**: Integration with MCP servers and APIs to provide agents with diverse tools
- **Browser and Code Interpreter**: Allow agents to act autonomously in browsers or execute code with controlled rules
- **Identity**: Standards-based authentication with existing identity providers, OAuth support, and secure token vault for frictionless user experiences
- **Memory**: Short and long-term memory storage for complex workflows and continuous learning
- **Observability**: Centralized observability combining logs, traces, and metrics for Gen AI applications
Coinbase's traditional ML infrastructure runs on AWS EKS (Kubernetes) with Anyscale/Ray for training and inference of deep learning models.
## Critical Assessment and Balanced Perspective
While Coinbase's presentation highlights impressive results, several considerations warrant balanced assessment:
**Claimed Impact Verification**: The metrics presented (65% automation, 5 million hours saved, 40% AI-generated code) are company-provided figures without independent verification. The actual calculation methodologies for these savings aren't detailed. For instance, "AI-generated or AI-influenced" code is a broad category that could include minor suggestions alongside complete function generation.
**Complexity and Maintenance**: The multi-layered architecture with numerous specialized sub-agents, business procedure classifiers, and custom guardrails represents significant engineering complexity. The presentation acknowledges the need to modernize systems built just 1-2 years ago due to rapid AI advancement, suggesting ongoing maintenance burden and potential technical debt accumulation.
**Human Oversight Requirements**: While automation handles 65% of customer contacts, the 35% requiring human intervention likely represents the most complex, sensitive, or problematic cases. The presentation doesn't detail false positive rates, escalation patterns, or cases where AI assistance was counterproductive. In compliance especially, the human-in-the-loop requirement means automation provides efficiency gains but doesn't eliminate the fundamental human workload for high-risk decisions.
**Vendor Lock-in and Standardization Claims**: While Coinbase emphasizes standardization through OpenAI API and MCP protocols, the deep integration with AWS Bedrock and specific model providers (particularly Anthropic's Claude) suggests potential switching costs. The true portability of their multi-agent architectures across different infrastructure providers remains unclear.
**Quality Metrics and Hallucination Risks**: The "LLM as a judge" evaluation approach for chatbot quality uses another LLM to evaluate LLM outputs, which introduces potential for correlated errors or blind spots. The presentation doesn't discuss false negative rates where the quality assessment might miss problematic responses, or specific incidents where hallucinations caused customer harm.
**Developer Productivity Nuances**: The claim that QA automation detects 3× as many bugs might reflect different testing strategies rather than pure superiority—automated tests may catch more minor UI variations while missing critical logic errors that experienced human testers would identify. The 86% cost reduction claim likely compares against fully manual testing rather than traditional automated testing approaches, potentially inflating the perceived benefit specific to AI.
**Regulatory and Compliance Validation**: While Coinbase describes their compliance AI systems, the presentation doesn't detail regulatory approval processes, audits by financial regulators, or any compliance incidents related to AI decision-making. For a regulated financial entity, the gap between technical capability and regulatory acceptance is critical.
**Implementation Timeline Reality**: References to work spanning 18-24 months suggest substantial investment timeframes. The presentation format at a vendor conference (AWS re:Invent) inherently emphasizes successes while potentially downplaying failed experiments, architectural dead-ends, or abandoned approaches.
Despite these considerations, Coinbase's implementation represents a substantial real-world deployment of LLMs in production across multiple high-stakes domains. The technical architecture demonstrates thoughtful layering, appropriate human oversight, and practical standardization approaches. The scale of deployment (millions of users, 100+ countries) provides valuable insights into operating Gen AI at production scale in regulated industries.
## Future Direction and Agent Core Adoption
Coinbase's vision extends beyond current implementations to democratizing AI capabilities across the entire organization, empowering every employee to create, experiment, and innovate with AI agents. They're particularly interested in Amazon Bedrock AgentCore for their next wave of expansion, citing its secure agent deployment, robust identity and authentication management, powerful memory capabilities, and advanced interoperability as key enabling features.
The acknowledgment that systems built just 1-2 years ago need modernization underscores the breakneck pace of AI advancement and the operational challenge of maintaining Gen AI systems in rapidly evolving landscapes. This modernization need presents both opportunity and risk—opportunity to leverage newer, more capable platforms like AgentCore, but risk of perpetual refactoring cycles that divert resources from new capabilities.
## LLMOps Maturity Indicators
This case study demonstrates several hallmarks of mature LLMOps practice:
- **Multi-modal deployment**: Successfully operating LLMs across diverse use cases (chatbots, investigation assistance, code generation) with domain-specific architectures
- **Standardization and abstraction**: Using OpenAI API standards and MCP for consistent interfaces across the platform
- **Guardrails and safety**: Implementing input/output guardrails, custom filters, and grounding rules to manage hallucination and security risks
- **Human-in-the-loop design**: Appropriate human oversight especially in high-stakes compliance decisions
- **Continuous evaluation**: "LLM as a judge" monitoring and quality metric tracking for production systems
- **Iterative development**: Building chatbot capabilities in three layers rather than attempting full functionality immediately
- **Infrastructure leverage**: Using managed services (Bedrock) rather than building everything from scratch
- **Memory and state management**: Vector databases for conversation history and context maintenance
- **Observability focus**: Centralized logging, tracing, and metrics for Gen AI applications
The combination of breadth (platform approach) and depth (targeted high-impact implementations) represents a pragmatic strategy for enterprise Gen AI adoption that balances innovation with operational sustainability.