Company
Stripe
Title
AI Agent-Powered Compliance Review Automation for Financial Services
Industry
Finance
Year
2024
Summary (short)
Stripe developed an AI agent-based solution to address the growing complexity and resource intensity of compliance reviews in financial services, where enterprises spend over $206 billion annually on financial crime operations. The company implemented ReAct agents powered by Amazon Bedrock to automate the investigative and research portions of Enhanced Due Diligence (EDD) reviews while keeping human analysts in the decision-making loop. By decomposing complex compliance workflows into bite-sized tasks orchestrated through a directed acyclic graph (DAG), the agents perform autonomous investigations across multiple data sources and jurisdictions. The solution achieved a 96% helpfulness rating from reviewers and reduced average handling time by 26%, enabling compliance teams to scale without linearly increasing headcount while maintaining complete auditability for regulatory requirements.
## Overview

Stripe, a global payment processing platform handling $1.4 trillion in volume annually (representing 1.38% of global GDP), developed a sophisticated AI agent system to transform their compliance review operations. The presentation was delivered by Hassan Tariq (AWS Principal Solutions Architect), Chrissy, and Christopher (Data Scientist at Stripe) and provides an extensive technical deep-dive into deploying LLM-powered agents at enterprise scale in a highly regulated environment.

The business context is critical: Forrester research shows enterprises globally spend approximately $206 billion annually on financial crime operations, with compliance requirements growing by 35% year-over-year in some European jurisdictions. Experian surveys indicate up to one-third of compliance tasks could be automated, potentially returning 8-12 hours per week to compliance analysts. Stripe's compliance function serves two dimensions: ecosystem integrity (KYC/KYB, anti-money laundering, sanctions screening) and user protection (GDPR, UDAAP compliance). The challenge was scaling Enhanced Due Diligence (EDD) reviews without linearly scaling headcount, while maintaining the operational excellence and auditability required in regulated financial services.

## Problem Statement and Manual Review Challenges

The presenters identified two primary blockers in the manual review process. First, expert reviewers were spending excessive time as "navigators" rather than analysts: gathering and locating information across fragmented systems rather than making high-value decisions. Second, the cognitive overhead of jurisdiction-switching created significant scalability challenges. Reviewers might move from assessing an entity in California (relatively straightforward) to evaluating complex corporate structures in UAE or Singapore, where risk definitions, ownership transparency requirements, and regulatory thresholds vary dramatically. This constant context-switching across ever-shifting regulatory rulesets created a demanding, complex, and error-prone environment.

The presentation emphasizes that simply scaling up the workforce linearly with complexity was not a viable solution. The team needed a way to maintain operational excellence while navigating this fragmented jurisdictional landscape, handle growing case volumes, reduce review turnaround time, and leverage technology innovation to transform the manual process fundamentally.

## Solution Architecture: ReAct Agents with Rails

Rather than attempting to automate entire workflows end-to-end (which the presenters explicitly describe as a "fairy tale" that wouldn't work), Stripe took a measured approach. They decomposed the complex compliance review workflow into a directed acyclic graph (DAG) of bite-sized tasks, with agents operating within "rails" defined by this structure. This approach prevents agents from spending excessive time on low-priority areas while ensuring regulatory requirements are comprehensively addressed.

The team selected ReAct (Reasoning and Acting) agents as their core architecture. Christopher explains the ReAct pattern clearly: given a query (e.g., "10 divided by pi"), the agent enters a thought-action-observation loop. It thinks about what it needs, calls an action (like a calculator tool), receives an observation (the result), and determines whether it has enough information to provide a final answer or needs additional iterations.
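The loop can be made concrete with a minimal sketch, assuming a hypothetical `call_llm` helper and a toy `calculator` tool; none of these names come from Stripe's actual implementation.

```python
import json
import math
from typing import Callable

# Hypothetical tool registry for illustration; Stripe's agents call internal APIs,
# database queries, and MCP clients rather than a toy calculator.
TOOLS: dict[str, Callable[[str], str]] = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}}, {"pi": math.pi})),
}

def react_loop(query: str, call_llm: Callable[[list[dict]], dict], max_iterations: int = 8) -> str:
    """Thought-action-observation loop. `call_llm` is a placeholder for model inference
    (e.g., via an internal LLM proxy) that returns either
    {"thought": ..., "action": ..., "action_input": ...} or {"final_answer": ...}."""
    messages = [{"role": "user", "content": query}]
    for _ in range(max_iterations):
        step = call_llm(messages)                                   # Thought: decide what is needed next
        if "final_answer" in step:                                  # Enough information gathered
            return step["final_answer"]
        observation = TOOLS[step["action"]](step["action_input"])   # Action: invoke the chosen tool
        messages.append({"role": "assistant", "content": json.dumps(step)})
        messages.append({"role": "user", "content": f"Observation: {observation}"})  # Observation fed back
    return "Stopped after reaching the iteration limit."
```

The bounded `max_iterations` reflects the variable, nondeterministic execution of such loops, which matters for the infrastructure discussion later in this case study.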
For analytics problems, this might involve multiple query iterations, progressively refining understanding through repeated loops.

A critical architectural decision was maintaining humans in the driver's seat. Agents perform investigation and data analysis, but human reviewers make all final decisions. This human-centric validation approach includes configurable approval workflows, with agents serving as assistants rather than decision-makers. The presenters repeatedly emphasize this design choice as fundamental to operating in a regulated environment where decision outcomes carry significant weight.

## Infrastructure: Agent Service Development

A fascinating aspect of the case study is the infrastructure journey. Initially, the team attempted to integrate agentic workflows into Stripe's existing traditional ML inference system, but this was "shot down quickly for very good reasons." The requirements for agent workloads differ fundamentally from traditional ML:

**Traditional ML inference characteristics:**
- Compute-bound (requires GPUs for LLMs, multiple CPUs for XGBoost)
- Consistent latency profiles
- Short timeouts
- Deterministic control flow (model runs the same way every time)
- Expensive machines requiring minimization

**Agent workload characteristics:**
- Network I/O bound (waiting for LLM vendor responses)
- Requires many concurrent threads/lanes for waiting
- Long timeouts (5-10 minutes vs. 30 seconds)
- Nondeterministic execution (variable loop iterations)
- Can run on smaller machines but needs high concurrency

Recognizing these fundamental differences, Stripe built a dedicated Agent Service. The development timeline is instructive:

- **Early Q1**: Service didn't exist; the attempted hack into the ML inference system failed
- **Q1 (within ~1 month)**: Bootstrapped a minimal viable service as a monolith with a primitive synchronous API (similar to a traditional "predict" endpoint)
- **Q2**: Added evaluation capabilities, tracing for debugging, and, remarkably, a no-code UI for building agents with custom tools, enabling mass proliferation
- **Q3**: Hit capacity limits; decomposed the monolith to allow each use case to spin up dedicated services, solving the "noisy neighbor" problem
- **Q4**: Extended the API to support stateful, streamed interactions for chatbot use cases beyond the original synchronous model

This rapid evolution resulted in over 100 agents deployed across Stripe, though Christopher notes with a data scientist's skepticism that this number may partially reflect the ease of spinning up new agents rather than fundamental necessity, suggesting that a few well-designed agent types (shallow ReAct, deep ReAct, to-do list with sub-agents) might suffice for most use cases.

## LLM Infrastructure: Bedrock Integration and Proxy Pattern

Stripe uses Amazon Bedrock as their LLM provider, accessed through an internal "LLM Proxy Service." The proxy architecture provides several critical capabilities:

**Noisy neighbor mitigation**: Centralizing LLM access prevents one team's testing or scaling from crowding out bandwidth needed by production compliance workloads. This becomes especially important during high-traffic periods (like Black Friday/Cyber Monday, when Stripe processes over 500 million requests daily).

**Authorization and routing**: The proxy ensures appropriate LLMs are used for specific use cases, potentially routing sensitive data away from models deemed unsuitable while allowing less sensitive workloads to use them.

**Model fallbacks**: Automatic failover if a primary model provider experiences outages or capacity constraints (see the sketch after this list).

**Standardized security and privacy**: By vetting AWS/Bedrock once, Stripe avoids the overhead of security reviews for each individual LLM vendor, a significant friction point in large enterprises.
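As a rough illustration of the fallback behavior, the sketch below tries a primary model and fails over to alternates. The class, model IDs, and backend callables are assumptions for illustration, not the API of Stripe's LLM Proxy Service.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger("llm_proxy")

class SimpleLLMProxy:
    """Illustrative proxy: tries models in order and falls back when a backend raises
    (e.g., throttling or a provider outage). A real proxy would also handle
    authorization, routing, rate limits, and usage accounting."""

    def __init__(self, backends: Dict[str, Callable[[str], str]], fallback_order: List[str]):
        self.backends = backends            # model id -> callable that invokes that model
        self.fallback_order = fallback_order

    def complete(self, prompt: str) -> str:
        last_error = None
        for model_id in self.fallback_order:
            try:
                return self.backends[model_id](prompt)
            except Exception as exc:        # backend failure triggers fallback to the next model
                logger.warning("model %s failed (%s); trying next fallback", model_id, exc)
                last_error = exc
        raise RuntimeError("All configured models failed") from last_error

# Hypothetical usage; in practice the callables would wrap Bedrock invocations.
# proxy = SimpleLLMProxy(
#     backends={"primary-model": invoke_primary, "backup-model": invoke_backup},
#     fallback_order=["primary-model", "backup-model"],
# )
# summary = proxy.complete("Summarize the ownership structure of entity X.")
```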
The choice of Bedrock specifically offers several advantages highlighted in the presentation:

**Prompt caching**: This feature proved crucial for managing costs in the agent's iterative loops. Christopher illustrates how the thought-action-observation loop creates quadratic cost growth: each iteration re-reads the entire conversation history, so input tokens accumulate roughly as 1 + 2 + 3 + 4 + ... across iterations. Prompt caching effectively makes this linear by avoiding re-reading unchanged context, paying primarily for the incremental prompt rather than the full history each time. Given that input tokens dominate agent costs, this represents significant savings.

**Fine-tuning capabilities**: While recent vendor models have been strong, fine-tuning offers a strategic advantage around deprecation control. Rather than scrambling to maintain performance when vendor models deprecate on their schedule, Stripe can fine-tune to maintain quality and deprecate on their own timeline, allowing teams to focus on developing new capabilities rather than maintaining old ones.

**Multi-model access through a unified API**: A single integration providing access to multiple vendors and models reduces overhead.

## Orchestration and Workflow Integration

The agent system integrates into Stripe's existing review tooling, which serves as the orchestrator for the entire DAG-based workflow. Critically, agents can "front-run" research before reviews begin, operating asynchronously while analysts work on other cases, analogous to a Roomba cleaning while you're away. As reviewers progress through cases, additional context becomes available, triggering deeper investigations orchestrated by the review application.

The ReAct agents interact with the ecosystem through two primary interfaces:

- **LLM Client**: Connects to the LLM proxy for model inference with caching, fallbacks, and resource management
- **Tools**: Agents call various tools, including MCP (Model Context Protocol) clients, Python functions, database queries, and internal APIs, to access the signals used in investigations

Tool calling is emphasized as the primary value proposition of agents over simple LLM queries. The ability to dynamically select and invoke appropriate data sources across "almost an infinite amount of signals" that might inform an answer makes agents particularly valuable for compliance investigations spanning multiple jurisdictions and data systems.
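A minimal sketch of how such tools might be declared and surfaced to an agent follows; the tool names, signatures, and placeholder responses are hypothetical stand-ins for Stripe's internal data sources.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Tool:
    name: str
    description: str              # shown to the LLM so it can pick the right tool
    run: Callable[[Dict], str]    # executes the call against the underlying system

# Hypothetical tools; real agents would wrap internal APIs, database queries, or MCP clients.
def lookup_beneficial_owners(args: Dict) -> str:
    return f"Registered owners for {args['entity_id']}: (placeholder)"

def screen_sanctions_lists(args: Dict) -> str:
    return f"Sanctions screening for {args['name']}: (placeholder)"

TOOLS: List[Tool] = [
    Tool("lookup_beneficial_owners",
         "Return registered beneficial owners for a given entity ID.",
         lookup_beneficial_owners),
    Tool("screen_sanctions_lists",
         "Screen a person or entity name against sanctions lists.",
         screen_sanctions_lists),
]

def tool_manifest() -> List[Dict]:
    """Manifest handed to the LLM so it can select tools by name during the ReAct loop."""
    return [{"name": t.name, "description": t.description} for t in TOOLS]
```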
## Quality Assurance and Evaluation

Operating in a regulated environment demands rigorous QA, though the presentation notes this critical component isn't well-represented in architectural diagrams. Stripe employs a "very rigorous QA process" where everything must pass a human quality bar. While LLM-as-judge approaches are popular, the team maintains that human evaluators must remain involved for this use case, at least to determine if quality is "really good enough to ship."

This quality focus proved essential to adoption. Christopher emphasizes that if the agent is helpful only 20%, 40%, or even 80% of the time, reviewers will learn not to trust it, conduct research themselves, and the system provides zero value despite its complexity and cost. Achieving the 96% helpfulness rating required extensive collaboration with ops teams and control owners, understanding what human reviewers themselves struggle with, and iterative prompt refinement. The evaluation approach aims for greater systematization: LLM judges may be useful for quickly failing obviously bad models, but human evaluation remains necessary for final quality determination in this regulated context.

## Results and Business Impact

The quantitative results are substantial:

- **96% helpfulness rating** from compliance reviewers, indicating strong trust and adoption
- **26% reduction in average handling time** across reviews, even with agents only handling front-run research questions (not yet leveraging in-review context)
- **Complete auditability** for regulatory requirements, with detailed trails showing what agents found, how they found it, what tool calls were made, and the tool call results
- **Human reviewers maintained in control** of all decision-making, ensuring regulatory compliance and accountability

The 26% efficiency gain is explicitly described as "just scratching the surface," with expectations of much larger improvements as the system expands deeper into reviews and begins leveraging contextual information that emerges during the review process.

## Development Timeline and Journey

The overall development journey provides valuable insights for other organizations:

**Early Q1**: Notebook scripts demonstrated proof-of-concept, establishing that agents would work and justifying investment in dedicated infrastructure. This "fail fast" approach succeeded quickly (under a month), providing conviction to proceed.

**Q1**: Bootstrapped the Agent Service as a monolith, integrated with enterprise LLMs, and established minimum viable infrastructure.

**Q2**: Launched the first question into production, a significant milestone requiring not just technical capability but operational alignment. Extensive work with ops teams ensured the agent output met the quality bar where human reviewers would actually depend on it. Development of rails and specific questions (rather than open-ended research interfaces) proved critical, as reviewers need to understand what the agent is actually good at answering.

**Q3**: Scaled to multiple questions and focused on cost optimization through features like caching as the system moved beyond proof-of-concept to production scale.

**Q4**: Expanded into context-aware orchestration, enabling agents to leverage information that emerges during reviews for deeper investigation, adding complexity to DAG orchestration but unlocking additional efficiency gains.

## Lessons Learned and Design Principles

The presenters distill several key lessons:

**Don't try to automate everything immediately**: The instinct to fully automate with agents "is just not how it's gonna work." Keeping humans in the driver's seat and using agents as tools rather than replacements proves more tractable and still achieves significant value.

**Decompose into bite-sized tasks**: Breaking complex workflows into tasks that fit in agent working memory is critical for evaluation, quality assurance, and incremental progress. Small, judgeable tasks enable the human evaluation approach and allow building upon context through orchestration.

**Agents need rails**: Without structure, agents may "rabbit hole" on unimportant areas while neglecting regulatory requirements. The DAG-based orchestration provides these rails, ensuring comprehensive coverage while allowing deep investigation where valuable.
**Tool calling is the key value**: The ability to dynamically access an effectively infinite number of potential signals differentiates agents from simpler LLM applications in this compliance context.

**Don't fear new infrastructure**: The dedicated Agent Service proved essential and was delivered quickly (one month for an MVP). Traditional ML infrastructure simply cannot efficiently support agent workloads.

**Quality bar determines adoption**: Sub-95% helpfulness likely results in zero adoption and value, making rigorous QA and human evaluation essential investments.

## Future Directions

The team outlines several areas for continued development:

**Deeper orchestration**: Expanding into more complex portions of the review workflow, leveraging contextual information that emerges during reviews to unlock efficiency gains beyond the current 26%.

**Streamlined evaluation**: While maintaining human quality bars, exploring whether LLM judges can accelerate failing obviously poor models earlier in development cycles.

**Fine-tuning for control**: Achieving independence from vendor deprecation schedules and focusing engineering effort on new capabilities rather than maintaining compatibility with evolving base models.

**Reinforcement learning**: The verifiable nature of compliance answers creates opportunities for end-to-end training loops that might learn superior "brains" with fewer tool calls, reduced context windows, and improved efficiency.

## Critical Assessment and Balanced View

The presentation represents a strong case study in pragmatic LLMOps, with several aspects worth highlighting for balanced assessment:

**Strengths**: The human-in-the-loop approach, emphasis on auditability, focus on quality over automation speed, infrastructure investment, and transparent discussion of challenges (like the failed ML inference integration) all demonstrate mature engineering thinking appropriate for regulated environments.

**Vendor relationship considerations**: This is an AWS-sponsored presentation featuring Stripe as a Bedrock customer, which may influence the emphasis on Bedrock-specific features. However, the technical details appear substantive, and the challenges discussed (prompt caching for cost, fine-tuning for control) represent genuine LLMOps concerns rather than purely marketing content.

**Generalizability questions**: The "over 100 agents" claim is immediately qualified by Christopher's skepticism that it reflects ease of creation rather than necessity, suggesting the no-code approach may have enabled proliferation beyond optimal design. The comment that "a few well-designed agent types" might suffice is refreshingly honest.

**Metrics limitations**: While the 26% efficiency gain and 96% helpfulness are strong, the presentation doesn't detail cost metrics, false positive/negative rates, or comparative performance against alternative approaches. The claim that efficiency gains are "just scratching the surface" is plausible but remains somewhat aspirational.

**Evaluation approach**: The reliance on human evaluation, while appropriate for regulated compliance, may limit iteration speed and scalability of quality assurance. The desire for more systematic evaluation suggests this remains a partially unsolved challenge.

Overall, this represents a thoughtful, production-grade implementation of LLM agents in a high-stakes environment, with architectural decisions clearly motivated by real operational constraints rather than following AI hype.
The emphasis on incremental value, human oversight, and infrastructure investment provides a valuable counter-narrative to fully autonomous agent visions while demonstrating substantial practical impact.
