ZenML

AI-Powered Multi-Agent Decision Support System for Strategic Business Decisions

Coinbase 2025

Coinbase developed RAPID-D, an internal AI-powered decision support tool designed to augment their existing RAPID (Recommender, Agree, Perform, Input, Decider) decision-making framework. The system addresses the challenge of cognitive bias and unseen risks in critical strategic decisions by deploying a multi-agent architecture where specialized AI agents analyze proposals, retrieve contextual information from enterprise knowledge bases, challenge assumptions through adversarial analysis, and synthesize recommendations. The solution uses Claude 3.7 Sonnet as the underlying model and implements an asynchronous architecture for complex decisions, with human review benchmarks showing strong accuracy compared to actual decision outcomes. The system incorporates real-time feedback loops where stakeholder comments are analyzed and used to optimize subsequent recommendations within the same decision flow.

Industry

Finance

Overview

Coinbase built RAPID-D, an AI-powered decision support system that operates in production to assist with critical strategic decisions at the cryptocurrency exchange company. The case study presents an interesting application of LLMs in an enterprise decision-making context, where the goal is not automation but augmentation of human judgment. The case study was published in November 2025 and represents Coinbase's broader commitment to becoming an "AI-native" organization.

The core business problem being addressed is the limitation of traditional decision frameworks. While Coinbase had established the RAPID framework (Recommender, Agree, Perform, Input, Decider) for structured decision-making on critical choices, they identified opportunities to systematically surface unseen risks, mitigate cognitive biases, and provide transparent, auditable analysis. The RAPID framework itself involves designated roles where a Recommender proposes a decision, various stakeholders provide input or agreement/disagreement, a Performer executes, and a Decider makes the final call. This structured approach is used for both Type 1 (irreversible) and Type 2 (reversible) decisions.

Architecture and System Design

RAPID-D implements a sophisticated multi-agent architecture rather than a monolithic single-model approach. This design choice is notable from an LLMOps perspective as it represents a more complex orchestration pattern that requires careful coordination between specialized agents. The system consists of four primary agents, each with distinct responsibilities:

The Single Shot Recommender Agent (described as "The Analyst") performs an initial thorough review of the RAPID document itself, generating a baseline recommendation strictly from the facts and arguments presented in the primary decision document. It appears to function as a straightforward document-analysis component without external context retrieval.

The Contextual Recommender Agent (described as “The Seeker”) adds sophistication through enterprise knowledge integration. This agent first generates critical questions about the RAPID document, then leverages Coinbase’s enterprise search tool to find answers across all internal knowledge sources. This represents a retrieval-augmented generation (RAG) pattern where the agent dynamically queries organizational knowledge to inform its analysis. The agent synthesizes these findings to provide recommendations enriched with broader organizational context that might not be explicitly mentioned in the original decision document.

The Contrarian Agent (described as “The Devil’s Advocate”) serves as a deliberate bias mitigation mechanism. Its sole purpose is to construct the strongest possible argument against the initial recommendation. This agent actively probes for weaknesses, unstated assumptions, potential risks, and unintended consequences. From an LLMOps perspective, this represents an interesting application of adversarial prompting in a production system, where the adversarial component is built into the architecture rather than used solely for testing.

The Debate and Decide Agent (described as “The Synthesizer”) acts as the final synthesis layer. This agent evaluates arguments from the Analyst, broader context from the Seeker, and challenges from the Devil’s Advocate. It produces a comprehensive final recommendation for the human Decider, complete with detailed reasoning and trade-off analysis. This orchestration pattern mirrors ensemble approaches but with sequential rather than parallel processing of specialized perspectives.
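
The four-stage pipeline described above can be sketched as a sequential orchestration. This is a hypothetical illustration, not Coinbase's implementation: the helper names, prompts, and stubbed `call_llm` and `enterprise_search` functions are all assumptions.

```python
# Hypothetical sketch of the four-stage RAPID-D pipeline. The prompts and
# stubs below are illustrative assumptions, not Coinbase's actual code.

def call_llm(system: str, user: str) -> str:
    # Stub: a real implementation would call a chat-completion API here.
    return f"[{system.split()[0]}] analysis of {len(user)} chars of input"

def enterprise_search(query: str) -> str:
    # Stub for Coinbase's internal enterprise search tool.
    return f"(top internal result for: {query})"

def run_rapid_d(rapid_doc: str) -> str:
    # 1. The Analyst: baseline recommendation from the document alone.
    baseline = call_llm(
        "Recommend strictly from the facts in this document.", rapid_doc
    )
    # 2. The Seeker: generate critical questions, answer via enterprise search.
    questions = call_llm("List critical questions this document raises.", rapid_doc)
    context = "\n".join(enterprise_search(q) for q in questions.splitlines())
    contextual = call_llm(
        "Recommend using the document plus retrieved context.",
        rapid_doc + "\n\nContext:\n" + context,
    )
    # 3. The Devil's Advocate: strongest case against the recommendations.
    contrarian = call_llm(
        "Argue as strongly as possible against these recommendations.",
        baseline + "\n" + contextual,
    )
    # 4. The Synthesizer: weigh all three perspectives into a final answer.
    return call_llm(
        "Synthesize a final recommendation with trade-off analysis.",
        "\n".join([baseline, contextual, contrarian]),
    )
```

The sequential hand-off (rather than parallel fan-out) mirrors the "ensemble with sequential processing" framing in the paragraph above: each later agent sees the outputs of the earlier ones.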

Technical Implementation Details

The system incorporates a Document Context Retrieval tool that includes a “key question generator” block. This component examines each decision document and formulates targeted questions around critical business areas including security, market impact, cost, user experience, and scalability. For each generated question, the system retrieves relevant information by searching Coinbase’s enterprise knowledge base. This demonstrates a structured approach to RAG where question generation is an explicit intermediate step rather than direct embedding similarity search.
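
The question-generation-then-retrieval pattern can be sketched as a loop over the named business areas. The `FOCUS_AREAS` list comes from the case study; the function names and stub behavior are assumptions for illustration.

```python
# Hypothetical sketch of the "key question generator" block: structured
# question generation per business area, then one retrieval per question.

FOCUS_AREAS = ["security", "market impact", "cost", "user experience", "scalability"]

def generate_questions(doc: str, area: str) -> list[str]:
    # Stub: a real implementation would prompt an LLM, e.g.
    # "What questions about {area} does this decision raise?"
    return [f"What is the {area} impact of this decision?"]

def search_knowledge_base(question: str) -> str:
    # Stub for the enterprise knowledge-base search.
    return f"(top result for: {question})"

def retrieve_context(doc: str) -> dict[str, list[str]]:
    """Map each focus area to the answers retrieved for its questions."""
    return {
        area: [search_knowledge_base(q) for q in generate_questions(doc, area)]
        for area in FOCUS_AREAS
    }
```

Making question generation an explicit step, rather than embedding the raw document and running similarity search, keeps the retrieval targeted and auditable: each retrieved passage is traceable to a named question in a named business area.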

The case study mentions that RAPID-D implements an asynchronous architecture to handle complex decisions that require more processing time. This is a critical LLMOps consideration for production systems where response latency can vary significantly based on the complexity of reasoning, number of retrieval operations, and length of multi-agent exchanges. The asynchronous design allows users to be kept informed without blocking on completion, which is essential for user experience in systems with variable latency.
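
The non-blocking pattern described can be sketched with `asyncio`: the user receives an immediate acknowledgement while the multi-agent analysis runs in the background and posts a notification on completion. All names here are hypothetical; this is the general pattern, not Coinbase's implementation.

```python
# Minimal asyncio sketch of the asynchronous submit/notify pattern.
import asyncio

async def analyze_decision(doc_id: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for slow multi-agent analysis
    return f"recommendation for {doc_id}"

async def notify_user(doc_id: str, result: str) -> None:
    print(f"[{doc_id}] ready: {result}")  # stand-in for a user notification

async def submit(doc_id: str) -> str:
    # Kick off the analysis without blocking; acknowledge immediately.
    async def run() -> None:
        result = await analyze_decision(doc_id)
        await notify_user(doc_id, result)
    asyncio.create_task(run())
    return f"analysis of {doc_id} started"

async def main() -> None:
    ack = await submit("RAPID-42")  # hypothetical document id
    print(ack)
    await asyncio.sleep(0.05)  # keep the loop alive for the background task

asyncio.run(main())
```

In a production system the background task would typically be a durable job (queue plus worker) rather than an in-process task, so that a crashed process does not silently drop an in-flight analysis.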

Model selection settled on Claude 3.7 Sonnet after evaluation. The case study presents a benchmark comparison across what appear to be multiple leading models (though specific competitors aren’t named), with Claude 3.7 Sonnet chosen for “its strong balance of quality, stability, and reliability.” From an LLMOps perspective, this highlights the practical tradeoffs beyond raw accuracy that matter in production deployments, including model stability and reliability over time.

Evaluation and Quality Assurance

The evaluation methodology described involves human review processes where RAPID-D’s final recommendations are compared against actual decisions made by Coinbase’s RAPID Deciders. The case study mentions “benchmark scores across leading models” but doesn’t provide specific numerical accuracy metrics, only stating that evaluations were conducted. This human-in-the-loop evaluation approach is appropriate for decision support systems where the ground truth is the actual human decision rather than an objective correct answer.
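
At its simplest, a benchmark of this kind reduces to an agreement rate between the system's recommendations and the Deciders' actual calls. A minimal sketch, using invented data (the case study publishes no numbers):

```python
# Toy agreement-rate computation for comparing AI recommendations against
# actual Decider outcomes. The benchmark records below are illustrative.

def agreement_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of decisions where the AI recommendation matched the outcome."""
    matches = sum(1 for ai, human in pairs if ai == human)
    return matches / len(pairs)

benchmark = [
    ("approve", "approve"),
    ("reject", "approve"),
    ("approve", "approve"),
    ("reject", "reject"),
]
print(f"agreement: {agreement_rate(benchmark):.0%}")  # prints "agreement: 75%"
```

A raw agreement rate also hides important structure, such as whether disagreements cluster on Type 1 (irreversible) decisions, which is where the cost of a wrong recommendation is highest.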

More interestingly, RAPID-D implements real-time feedback incorporation directly into its decision process. Comments or corrections provided by users during an active session, or later by any stakeholder in the RAPID document, are captured and analyzed against the assistant’s original recommendation. This evaluation is then used to optimize subsequent recommendations within the same decision flow. This represents a form of online learning or in-context adaptation where the system adjusts based on stakeholder feedback during a single decision episode. It’s unclear from the case study whether this feedback also contributes to longer-term model fine-tuning or prompt optimization, though that would be a natural extension.
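
The simplest mechanism consistent with this description is in-context adaptation: stakeholder comments accumulate in the session context and condition the next recommendation. The `DecisionSession` class and `call_llm` stub below are assumptions for illustration, not Coinbase's design.

```python
# Hypothetical sketch of in-flow feedback incorporation: comments are
# appended to the session context before each subsequent recommendation.

def call_llm(prompt: str) -> str:
    # Stub: a real implementation would call the underlying model.
    return f"recommendation conditioned on {len(prompt)} chars of context"

class DecisionSession:
    def __init__(self, rapid_doc: str):
        self.rapid_doc = rapid_doc
        self.feedback: list[str] = []

    def add_feedback(self, stakeholder: str, comment: str) -> None:
        self.feedback.append(f"{stakeholder}: {comment}")

    def recommend(self) -> str:
        prompt = self.rapid_doc
        if self.feedback:
            prompt += "\n\nStakeholder feedback so far:\n" + "\n".join(self.feedback)
        return call_llm(prompt)

session = DecisionSession("Proposal: migrate service X to region Y.")
first = session.recommend()
session.add_feedback("Decider", "Cost estimate looks low; re-check vendor quotes.")
revised = session.recommend()  # now conditioned on the comment as well
```

Note that in this simple form, feedback is plain text injected into the prompt, which is precisely why the prompt-injection concern raised later in this piece applies.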

Development Process and Iteration

The development journey described follows a thoughtful iterative approach centered on user feedback from organizational leaders. The initial version started with a single agent that analyzed RAPID documents and presented reasoning. Early feedback was gathered manually to understand what Deciders valued most. This user-centered development approach is important for LLMOps in internal tools, where adoption depends heavily on trust and perceived value from actual users.

The evolution to the current multi-agent version with debate mechanisms and deeper explainability represents significant architectural expansion. The case study notes this as a “far more dynamic and interactive experience” compared to the initial single-agent approach. This evolution suggests that the initial single-agent system may have been perceived as too simplistic or not sufficiently thorough in its analysis, leading to the more complex multi-perspective architecture.

LLMOps Considerations and Tradeoffs

Several LLMOps considerations emerge from this case study that warrant balanced assessment:

Transparency and Explainability: The multi-agent debate structure is explicitly designed to make the decision-making process “transparent, consistent, and reproducible.” By surfacing different perspectives (baseline analysis, contextual information, contrarian view, synthesis), the system makes its reasoning more inspectable than a black-box recommendation. However, the actual implementation details of how these agent outputs are presented to users and whether users can drill into the reasoning of individual agents isn’t fully described.

Complexity vs. Value Tradeoff: The multi-agent architecture with enterprise search integration represents significant engineering complexity compared to a simple single-agent approach. This complexity involves orchestrating multiple LLM calls, managing retrieval operations, handling asynchronous workflows, and synthesizing diverse perspectives. The case study presents this as valuable but doesn’t provide quantitative evidence of improved decision quality compared to simpler approaches or the baseline human-only RAPID process. The absence of specific accuracy metrics or decision outcome improvements means we should view the claimed benefits with some healthy skepticism.

Bias Mitigation Claims: The contrarian agent is positioned as directly combating bias, but this claim deserves scrutiny. While adversarial analysis can surface overlooked considerations, it’s not clear that an LLM-generated contrarian view actually reduces the biases inherent in LLM outputs themselves. The contrarian agent may still reflect the biases present in its training data or prompt design. True bias mitigation would require evaluation against diverse real-world decision outcomes and analysis of whether the system actually helps avoid systematic decision errors.

Knowledge Retrieval Quality: The effectiveness of the Contextual Recommender Agent depends entirely on the quality of enterprise search and the relevance of retrieved documents. The case study doesn’t discuss how retrieval quality is evaluated, how the system handles contradictory information from different sources, or how it manages outdated or incorrect information in the knowledge base. These are critical concerns for RAG systems in production.

Human-AI Collaboration Model: The positioning as “augmentation not replacement” is appropriate for this use case, but the actual collaborative workflow isn’t fully detailed. Questions remain about whether users tend to anchor on AI recommendations, how often human Deciders diverge from RAPID-D suggestions, and whether the system changes the social dynamics of the RAPID process (for instance, whether other stakeholders feel their input is less valued when AI analysis is available).

Scalability and Cost: A multi-agent system with enterprise search integration potentially involves many LLM API calls per decision. The asynchronous architecture suggests some decisions may take considerable processing time. The case study doesn’t discuss cost management, rate limiting, or whether all decisions warrant this level of AI analysis or if some threshold determines when to invoke RAPID-D.

Feedback Loop Implementation: The real-time feedback incorporation sounds sophisticated but raises questions about prompt injection risks (could adversarial feedback manipulate subsequent recommendations?) and whether feedback from different stakeholders is weighted differently. The mechanism for “optimizing subsequent recommendations within the same decision flow” isn’t technically detailed—it could be anything from simple context addition to in-context learning patterns.

Production Deployment Maturity

The case study indicates RAPID-D is deployed and used internally, making this a genuine LLMOps production system rather than an experiment. The mention of asynchronous architecture, user notification systems, and feedback mechanisms suggests thoughtful production engineering. However, typical LLMOps concerns like monitoring, observability, error handling, graceful degradation, and version management aren’t discussed.

The evaluation approach using human review and comparison to actual decisions represents a reasonable quality assurance strategy for this domain. However, without ongoing metrics about system performance, user satisfaction, decision quality outcomes, or adoption rates, it’s difficult to assess the true production maturity.

Business Impact and Claims

The case study makes strong claims about building "a more intelligent and resilient organization" and operating "at the cutting edge of technology" but provides limited concrete evidence of business impact; no supporting metrics are shared.

This is understandable for an internal tool where sharing detailed metrics might reveal confidential business information, but it means we should interpret the success claims cautiously. The case study functions partly as corporate communications about Coinbase’s AI capabilities, which naturally emphasizes positive framing.

Technical Innovation Assessment

From an LLMOps perspective, RAPID-D represents solid engineering of a multi-agent system with RAG capabilities for an internal enterprise use case. The architectural choices—specialized agents, adversarial analysis, knowledge retrieval, asynchronous processing—are reasonable and reflect good understanding of LLM capabilities and limitations. The iterative development approach based on user feedback demonstrates product thinking appropriate for internal tools.

However, the technical innovation is primarily in application and orchestration rather than novel techniques. Multi-agent systems, RAG, and debate-based architectures are established patterns in LLM applications. The specific application to decision support in a corporate framework is valuable but not technically groundbreaking. The case study would be stronger with more technical detail about prompt engineering strategies, retrieval algorithms, agent coordination mechanisms, and failure handling.

Conclusion

RAPID-D represents a thoughtful production deployment of LLMs for enterprise decision support, demonstrating how multi-agent architectures can be applied to augment human judgment in critical business processes. The system shows attention to important LLMOps concerns including explainability, asynchronous processing, model selection, and feedback incorporation. However, the case study provides limited technical depth and quantitative evidence of impact, making it more of a high-level architectural overview than a detailed technical case study. The strong promotional framing should be balanced against the absence of concrete metrics and the inherent limitations of LLM-based systems in handling biased training data, retrieval quality issues, and the complexities of human-AI collaboration in high-stakes decisions.

More Like This

Building a Microservices-Based Multi-Agent Platform for Financial Advisors

Prudential 2025

Prudential Financial, in partnership with AWS GenAI Innovation Center, built a scalable multi-agent platform to support 100,000+ financial advisors across insurance and financial services. The system addresses fragmented workflows where advisors previously had to navigate dozens of disconnected IT systems for client engagement, underwriting, product information, and servicing. The solution features an orchestration agent that routes requests to specialized sub-agents (quick quote, forms, product, illustration, book of business) while maintaining context and enforcing governance. The platform-based microservices architecture reduced time-to-value from 6-8 weeks to 3-4 weeks for new agent deployments, enabled cross-business reusability, and provided standardized frameworks for authentication, LLM gateway access, knowledge management, and observability while handling the complexity of scaling multi-agent systems in a regulated financial services environment.


AI-Powered Content Curation for Financial Crime Detection

LSEG 2025

London Stock Exchange Group (LSEG) Risk Intelligence modernized its WorldCheck platform—a global database used by financial institutions to screen for high-risk individuals, politically exposed persons (PEPs), and adverse media—by implementing generative AI to accelerate data curation. The platform processes thousands of news sources in 60+ languages to help 10,000+ customers combat financial crime including fraud, money laundering, and terrorism financing. By adopting a maturity-based approach that progressed from simple prompt-only implementations to agent orchestration with human-in-the-loop validation, LSEG reduced content curation time from hours to minutes while maintaining accuracy and regulatory compliance. The solution leverages AWS Bedrock for LLM operations, incorporating summarization, entity extraction, classification, RAG for cross-referencing articles, and multi-agent orchestration, all while keeping human analysts at critical decision points to ensure trust and regulatory adherence.


Building Observable, Debuggable, and Durable Agentic Systems with Orchestration

Union 2026

Union's Chief ML Engineer shares lessons learned from productionizing agentic systems at scale, addressing the critical infrastructure challenges that arise when deploying LLM agents in production environments. The presentation introduces six design principles for building crash-proof, durable agents using the Flyte 2.0 orchestration platform, focusing on how agents can recover from multi-layer failures (infrastructure, network, logical, semantic) through proper context engineering and durability mechanisms. A key case study with Dragonfly demonstrates these principles in action, where a tiered agent architecture processes 250,000+ software products with 200+ steps and 100+ LLM calls each, achieving 2,000+ concurrent runs, 50% reduction in failure recovery time, 30% increased development velocity, and 12 hours per week saved on infrastructure maintenance.
