Stripe, which processes approximately 1.3% of global GDP, has evolved from traditional ML-based fraud detection to a transformer-based payments foundation model that scores every transaction in under 100ms. The model treats charges as tokens and behavior sequences as context windows, was trained on tens of billions of transactions, and improved card-testing detection accuracy from 59% to 97% for large merchants. Stripe also launched the Agentic Commerce Protocol (ACP) jointly with OpenAI to standardize how agents discover and purchase from merchant catalogs. Internally, 8,500 employees use LLM tools daily, 65-70% of engineers use AI coding assistants, and the company has seen concrete productivity gains such as cutting a payment method integration from two months to two weeks.
Stripe, as the financial infrastructure provider processing $1.4 trillion annually (about 1.3% of global GDP), has made significant investments in transitioning from traditional machine learning to production-grade LLM systems and foundation models. Emily Glassberg Sands, Stripe’s Head of Data & AI, leads efforts to build what the company calls “economic infrastructure for AI,” encompassing both internal AI adoption and external products that enable the AI economy.
One of Stripe’s most significant LLMOps achievements is building a domain-specific foundation model for payments data. This represents a fundamental shift from single-task ML models to a unified transformer-based approach that produces dense payment embeddings to power multiple downstream applications.
The architecture treats each charge as a token and behavior sequences as the context window, creating a payments-specific language model. The system ingests tens of billions of transactions at a rate of 50,000 per minute, incorporating comprehensive feature sets including IPs, BINs (Bank Identification Numbers), transaction amounts, payment methods, geography, device fingerprints, and merchant characteristics. A critical design decision involves focusing on “last-K relevant events” across heterogeneous actors and surfaces, including login attempts, checkout flows, payment retries, and issuer responses.
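The "last-K relevant events" idea can be made concrete with a small sketch. Everything below is illustrative: the event fields, the relevance keys (IP, BIN, merchant), and the value of K are assumptions, since Stripe's actual feature set and sequence construction are not public.

```python
def last_k_relevant_events(events, charge, k=32):
    """Return up to k most recent events sharing any key with the charge.

    Relevant events need not be consecutive in the raw stream: a login
    from the same IP last week and a retry on the same card yesterday
    both qualify for the context window.
    """
    keys = {("ip", charge["ip"]), ("bin", charge["bin"]),
            ("merchant", charge["merchant"])}
    relevant = [
        e for e in events
        if {("ip", e.get("ip")), ("bin", e.get("bin")),
            ("merchant", e.get("merchant"))} & keys
    ]
    # Order by time and keep the most recent k as the model's context.
    relevant.sort(key=lambda e: e["ts"])
    return relevant[-k:]
```

The selection step is what distinguishes this from a plain sliding window: the context is assembled per entity across heterogeneous surfaces (logins, checkouts, retries, issuer responses), not taken as the last K events globally.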
The challenge in payments data differs from natural language in that relevant sequences aren’t consecutive—the meaningful context could be transactions from a specific IP address, charges on Friday nights with particular credit cards, or patterns across different retailers. This requires the model to capture a broad range of relevant sequences to understand behavioral patterns.
Production Impact: The foundation model is deployed on the critical path, processing every single Stripe transaction in under 100 milliseconds. For card-testing fraud detection on large merchants, accuracy improved dramatically from 59% to 97%. Card testing involves fraudsters either enumerating through card numbers or randomly guessing valid cards, then selling them or using them for fraud. Traditional ML struggled with sophisticated attackers who hide card-testing attempts within high-volume legitimate traffic at large e-commerce sites—for example, sprinkling hundreds of small-value transactions among millions of legitimate ones. The foundation model’s dense embeddings enable real-time cluster detection that identifies these patterns.
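A minimal sketch of the cluster-detection idea: card-testing probes tend to embed almost identically, so a charge becomes suspicious when unusually many recent charges sit within a tight similarity radius of it, even when sprinkled among a large volume of legitimate traffic. The thresholds and the embedding source here are assumptions; Stripe's production logic is not public.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def looks_like_card_testing(embedding, recent_embeddings,
                            sim_threshold=0.95, min_cluster=5):
    """Flag a charge if many recent charges embed almost identically,
    regardless of how much legitimate traffic surrounds them."""
    neighbors = sum(
        1 for e in recent_embeddings if cosine(embedding, e) >= sim_threshold
    )
    return neighbors >= min_cluster
```

The point of dense embeddings is that the hundreds of small-value probe transactions land near each other in embedding space, while the millions of legitimate charges do not, so a simple density test separates them in real time.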
Beyond fraud, the model enables rapid development of new capabilities. When AI companies requested detection of “suspicious” transactions that don’t result in fraudulent disputes but represent bot traffic or other undesirable activity, Stripe deployed this capability in days by leveraging the foundation model’s embeddings, clustering techniques, and textual alignment to label different suspicious patterns that merchants could selectively block.
Stripe’s fraud detection system, called Radar, has evolved significantly with AI capabilities. The company faces new fraud vectors emerging from the AI economy itself. Traditional “friendly fraud” (using legitimate credentials but not creating revenue, such as free trial abuse or refund abuse) has become existentially threatening for AI companies due to high inference costs. Where previously a fraudster might abuse a free SaaS trial with minimal marginal cost to the provider, AI service abuse can rack up tens of thousands of dollars in GPU costs through virtual credit cards or chargebacks.
Stripe observed that 47% of payment leaders cite friendly fraud as their biggest challenge. One AI founder claimed to have “solved” fraud by completely shutting down free trials and dramatically throttling credits until payment ability was proven—effectively choking their own revenue to avoid fraud losses. This prompted Stripe to build Radar extensions specifically targeting AI economy fraud patterns.
The foundation model enables a more sophisticated approach than deterministic rule-based systems. Instead of binary “block/don’t block” decisions, the system can generate human-readable descriptions of why specific charges are concerning. The vision is to have humans today, and agents tomorrow, reasoning over these model outputs to make more nuanced decisions. This is particularly important for emerging fraud patterns like virtual credit cards enabling free trial abuse—Stripe cited an example of Robin Hood marketing “free trial cards” that work for 24 hours then expire, which are useful for consumers but devastating for AI businesses when used by fraudsters.
In collaboration with OpenAI, Stripe launched the Agentic Commerce Protocol to create standardized infrastructure for agent-driven commerce. The protocol emerged from observing a gap in the market: consumers wanted to buy through agents, agents were ready to facilitate purchases, and merchants wanted to enable agent-driven sales, but no standard existed for making this work efficiently.
Core Components of ACP: At the center of the protocol is a shared payment token, coupled with Stripe Link, whose more than 200 million consumers serve as the wallet for credentials. When processed via Stripe, transactions carry Radar risk signals covering fraud likelihood, card-testing indicators, and stolen-card signals.
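A rough shape for a token with attached risk signals might look like the following. The field names are hypothetical and do not reflect the actual ACP schema or Radar API:

```python
from dataclasses import dataclass, asdict

@dataclass
class RadarRiskSignals:
    fraud_likelihood: float    # model-estimated probability of fraud
    card_testing_score: float  # likelihood the charge is a card-testing probe
    stolen_card_score: float   # likelihood the credential is compromised

@dataclass
class SharedPaymentToken:
    token_id: str              # opaque credential reference, not a raw card number
    merchant_id: str
    amount_cents: int
    risk: RadarRiskSignals     # attached when the token is processed via Stripe

tok = SharedPaymentToken("spt_123", "acct_42", 1999,
                         RadarRiskSignals(0.02, 0.01, 0.00))
```

The design point is that the agent and merchant never handle raw card credentials; they exchange an opaque token that arrives pre-scored.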
Strategic Design Decisions: Stripe deliberately designed ACP as an open protocol rather than a proprietary Stripe-only solution. The shared payment token works with any payment service provider, not just Stripe. The protocol is also agent-agnostic—while it launched with ChatGPT Commerce featuring instant checkout from OpenAI, it’s designed to enable merchants to integrate once and work with all future agent platforms that adopt the standard.
Early traction includes over 1 million Shopify merchants coming online, major brands like Glossier and Vuori, Salesforce's Commerce Cloud integration, and critically, Walmart and Sam's Club adoption, representing the world's largest retailer validating agent-driven commerce. Stripe notes that 58% of Lovable's revenue flows through Link, demonstrating the density and network effects in the AI developer community.
Alternative Payment Flows: Beyond the shared payment token, Stripe has implemented other mechanisms for agent payments. A year before ACP, Perplexity launched travel search and booking powered by Stripe's issuing product, which creates one-time-use virtual cards scoped to specific amounts, time windows, and merchants, comparable to how DoorDash issues virtual cards to drivers. Stripe expects multiple payment mechanisms to coexist: shared payment tokens, virtual cards, agent wallets, stablecoins, and stored balances, particularly for microtransactions, where traditional card economics don't work for the 5-50 cent purchases common in AI services.
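The scoped virtual card pattern can be sketched as an authorization check against the card's amount, time window, and merchant constraints. The class and field names below are illustrative, not Stripe Issuing's actual API:

```python
from datetime import datetime, timedelta

class ScopedVirtualCard:
    """One-time-use card scoped to an amount, a time window, and a merchant,
    as in the Perplexity travel-booking pattern described above."""

    def __init__(self, max_amount_cents, merchant, ttl_hours, issued_at):
        self.max_amount_cents = max_amount_cents
        self.merchant = merchant
        self.expires_at = issued_at + timedelta(hours=ttl_hours)
        self.used = False

    def authorize(self, amount_cents, merchant, now):
        """Approve only within scope; one use, then the card is dead."""
        ok = (not self.used
              and now <= self.expires_at
              and merchant == self.merchant
              and amount_cents <= self.max_amount_cents)
        if ok:
            self.used = True
        return ok
```

Scoping the credential this tightly bounds the blast radius if an agent is compromised or misbehaves: the card cannot be reused, overspent, or pointed at a different merchant.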
Stripe has achieved remarkable internal AI adoption with 8,500 employees (out of approximately 10,000) using LLM tools daily and 65-70% of engineers using AI coding assistants in their day-to-day work.
Internal AI Stack Components:
Toolshed (MCP Server): A central tool layer providing LLM access to Slack, Google Drive, Git, the data catalog (Hubble), query engines, and more. Agents can both retrieve information and take actions using these tools. This is preferred over pure RAG approaches because it enables not just information retrieval but actual interaction with systems.
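The tool-layer idea, as opposed to pure retrieval, can be sketched as a registry the model both queries and acts through. The tool names and handlers below are hypothetical stand-ins for Toolshed's actual integrations:

```python
class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, name, description, handler):
        self._tools[name] = {"description": description, "handler": handler}

    def list_tools(self):
        # What gets surfaced to the model so it can choose an action.
        return {n: t["description"] for n, t in self._tools.items()}

    def call(self, name, **kwargs):
        # Unlike pure RAG, tools can take actions, not just retrieve text.
        return self._tools[name]["handler"](**kwargs)

toolshed = ToolRegistry()
toolshed.register("hubble_search", "Search the data catalog",
                  lambda query: [f"table matching '{query}'"])
toolshed.register("slack_post", "Post a message to a channel",
                  lambda channel, text: f"posted to {channel}")
```

In an MCP deployment, `list_tools` corresponds to the server advertising its tools and `call` to the client invoking one on the model's behalf.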
Text-to-SQL Assistants: Hubert, Stripe's internal text-to-SQL assistant, translates natural-language questions into SQL over the company's cataloged datasets.
Semantic Events and Real-Time Canonicals: Stripe is re-architecting payments and usage billing pipelines so the same near-real-time feed powers the Dashboard, Sigma analytics, and data exports (like BigQuery) from a single source of truth that LLMs can reliably consume.
LLM-Built Integrations: A pan-European local payment method integration that historically took approximately 2 months with traditional development was completed in approximately 2 weeks using LLMs, with trajectory toward completing such integrations in days. The LLM essentially reads Stripe’s codebase, understands the local payment method’s integration documentation, and generates the hookup code—machine-to-machine integration accelerated by AI.
Productivity Measurement Challenges: Stripe acknowledges difficulty in measuring true impact beyond superficial metrics. Lines of code and number of PRs are explicitly rejected as meaningful measures—Emily noted receiving three LLM-generated documents in a single week that “sounded good” but weren’t clearly connected to reality, requesting just the bullet points the authors fed to ChatGPT rather than the eight-page expansions. The concern is that LLMs can produce high volumes of content that appear thoughtful without the deep reasoning that comes from careful manual construction. The company is working toward measuring impact more empirically through developer-perceived productivity rather than output volume metrics.
Stripe also monitors costs carefully, noting that many AI coding models carry non-trivial expenses, and the company is planning how to balance value extraction with cost management going forward.
Engineering Culture and LLM Usage: There’s an ongoing tension around LLM-generated content in organizational workflows. Stripe has established a “non-negotiable” requirement: if LLM was used in content generation, it must be cited. The concern is that writing traditionally forced deep thinking and reasoning from first principles—reading and revising a document 50 times to ensure logical consistency—whereas LLMs allow people to skip that cognitive work while producing polished-sounding output. The fear is that over-reliance on LLMs for generation risks under-investing in depth at precisely the moment when depth becomes most valuable.
Early LLM Infrastructure (2023): When GPT-3.5 launched, Stripe immediately recognized the need for enterprise-grade, safe, easy access to LLMs. They built “GoLLM,” an internal ChatGPT-like interface with model selection and a preset/prompt-sharing feature that enabled employees to save and share effective prompts. Hundreds of presets emerged overnight for use cases like customer outreach generation and rewriting marketing content in Stripe’s tone.
This was followed by an LLM proxy providing production-grade access for engineers to build customer-facing experiences. Early production use cases focused on merchant understanding—when thousands of merchants onboard daily, LLMs help determine what they’re selling, whether it’s supportable through card networks, creditworthiness, and fraud risk.
Evolution to Modern Stack: Stripe has since deprecated GoLLM in favor of open-source solutions like LibreChat, which integrates with the Toolshed MCP servers. This represents their "buy then build" philosophy: initially building when no suitable solution existed, then migrating to open-source or commercial solutions as the ecosystem matures.
Experimental Projects Team: A cross-functional group of approximately two dozen senior engineers tackles opportunities that don’t naturally fit within any single product vertical but are being pulled from the market. This team produced both the Agentic Commerce Protocol and Token Billing. The team operates with an “embedded” model—pairing experimental projects engineers with product or infrastructure team members to ensure smoother handoffs when projects reach escape velocity and need ongoing ownership.
Recognizing that many AI companies are “wrappers” building on underlying LLM providers (a term Stripe uses descriptively, not pejoratively), Stripe built Token Billing to address unit economics volatility. When a service depends on upstream LLM costs that can drop 80% or triple unexpectedly, static pricing creates serious risks—competitors can undercut when costs drop, or businesses can go underwater when costs spike.
Token Billing provides an API to track and price inference costs in real-time using a formula: effective_price_per_unit = base_margin + (model_tokens_used * provider_unit_cost) + overhead. This allows AI companies to dynamically adjust pricing in response to underlying model cost changes while maintaining healthy margins.
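The formula above translates directly into a pricing helper. The variable names follow the article's formula; the actual Token Billing API semantics may differ:

```python
def effective_price_per_unit(base_margin, model_tokens_used,
                             provider_unit_cost, overhead):
    """Pass-through pricing: the margin stays fixed while the inference
    cost component floats with the upstream provider's rate."""
    return base_margin + (model_tokens_used * provider_unit_cost) + overhead

# If the provider cuts token prices 80%, the charge drops but margin holds.
before = effective_price_per_unit(0.05, 1_000, 0.00002, 0.01)   # ~0.08
after = effective_price_per_unit(0.05, 1_000, 0.000004, 0.01)   # ~0.064
```

Because the margin term is separated from the cost term, an 80% drop in provider pricing flows through to the customer automatically instead of leaving the seller either overpriced or underwater.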
Beyond token billing, Stripe supports various monetization models including usage-based billing, outcome-based pricing (like Intercom's $1 per resolved support ticket), and stablecoin payments. Stablecoins are particularly relevant for AI companies with global reach and high-value transactions: Shadeform reports 20% of volume now comes through stablecoins, with half being fully incremental revenue (customers who wouldn't have purchased otherwise). The cost advantage is significant: 1.5 percentage points for stablecoins versus 4.5 percentage points for international cards on large transactions.
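The fee gap cited above is easy to make concrete; the $10,000 transaction size below is an illustrative assumption, not a figure from the article:

```python
def processing_fee(amount_cents, rate):
    return round(amount_cents * rate)

amount = 1_000_000  # a $10,000 transaction, in cents
card_fee = processing_fee(amount, 0.045)    # international card at 4.5%
stable_fee = processing_fee(amount, 0.015)  # stablecoin at 1.5%
savings = card_fee - stable_fee             # difference on a single charge
```

On a single $10,000 charge the gap is $300, which is why the advantage matters most for the high-value, cross-border transactions common among AI infrastructure buyers.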
Stripe employs a nuanced approach to build-versus-buy decisions. The general philosophy is “buy then build if a buy exists,” but they actively validate that buy options truly don’t exist before building. They learned from Emily’s previous experience at Coursera, where a team of brilliant first-time employees built custom experimentation platforms, analytics platforms, and ML platforms that had to be painfully ripped out and replaced years later—infrastructure that wasn’t core to Coursera’s competitive advantage.
Spotlight Program: When needing new tooling, Stripe runs RFPs (requests for proposals) through their “Spotlight Program.” For example, when seeking an evaluation platform, they received over two dozen applications, evaluated all of them, narrowed to two finalists, ran POCs, and ultimately selected Braintrust. They employ similar processes for other infrastructure needs.
Preferred Vendors: Stripe's current stack includes Weights & Biases, Flyte, and Kubernetes. For feature engineering, they initially used a homegrown platform, evaluated Tecton but couldn't justify it on the critical charge path due to latency and reliability requirements (needing six nines of reliability and tens of milliseconds of latency), and ultimately partnered with Airbnb to build Shepard (open-sourced as Chronon).
Multi-Provider Strategy: Stripe deliberately avoids vendor lock-in, particularly for LLM providers. Rather than enterprise-grade wrappers around single providers, they prefer solutions that work across many providers and models, enabling flexible swapping as capabilities and costs evolve.
Stripe is investing heavily in semantic events infrastructure and near-real-time, well-documented canonical datasets, anticipating that within nine months, users won’t want static dashboards but will expect to be fed insights by agents or query real-time data directly.
The re-architecture focuses initially on payments and usage-based billing, ensuring the same data feed powers the Dashboard, Sigma (for queryable analytics), and Stripe Data Pipeline (for exports to systems like BigQuery where customers combine Stripe data with other sources). Historically, these different products consumed different data feeds, creating confusion. The new architecture provides a single source of truth.
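The single-source-of-truth pattern can be sketched as one canonical feed fanned out to every consumer, instead of each product ingesting its own feed. The consumer names mirror the products above; the mechanics are illustrative:

```python
class CanonicalFeed:
    def __init__(self):
        self._consumers = []

    def subscribe(self, consumer):
        self._consumers.append(consumer)

    def publish(self, event):
        # Dashboard, Sigma, and Data Pipeline all receive the identical
        # record, so their numbers cannot drift apart.
        for consumer in self._consumers:
            consumer(event)

feed = CanonicalFeed()
dashboard, sigma, pipeline = [], [], []
feed.subscribe(dashboard.append)
feed.subscribe(sigma.append)
feed.subscribe(pipeline.append)
feed.publish({"type": "charge.succeeded", "amount": 2500})
```

In the historical architecture, each consumer would instead pull from a separate feed, which is exactly how the confusing discrepancies between products arose.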
Data Quality Challenges: The biggest challenge with Hubert (text-to-SQL) isn't query generation but data discovery: identifying the right tables and fields. Stripe is addressing this through its investment in semantic events and well-documented canonical datasets. Text-to-SQL works well when data is well-structured and well-documented (like revenue data in Sigma), but most enterprise data lacks this quality, making data discovery the critical bottleneck for LLM-powered analytics.
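A small sketch of why discovery, not generation, is the bottleneck: before any SQL is written, the assistant must rank candidate tables against the question. The naive keyword-overlap scoring and table names below are hypothetical:

```python
def rank_tables(question, catalog):
    """catalog maps table_name -> documentation string. Well-documented
    tables win; undocumented ones are effectively invisible."""
    q_terms = set(question.lower().split())
    scored = []
    for table, docs in catalog.items():
        overlap = len(q_terms & set(docs.lower().split()))
        scored.append((overlap, table))
    return [t for score, t in sorted(scored, reverse=True) if score > 0]

catalog = {
    "revenue_daily": "daily gross revenue by merchant and currency",
    "tmp_jd_v2": "",  # undocumented: never surfaces, whatever it contains
}
candidates = rank_tables("gross revenue by merchant last week", catalog)
```

Production systems would use embeddings rather than keyword overlap, but the failure mode is the same: an undocumented table scores zero no matter how relevant its contents are, which is why documentation quality gates text-to-SQL accuracy.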
Stripe’s unique position processing payments for the top AI companies provides visibility into AI economy dynamics. Analyzing the top 100 highest-grossing AI companies on Stripe compared to top 100 SaaS companies from five years prior reveals:
Business Model Insights: AI companies are much more likely to use Stripe’s full stack (payments, billing, tax, revenue recognition, fraud) than SaaS companies were, enabling them to scale with lean teams. They’re also iterating across monetization models more rapidly—fixed subscriptions, usage-based, credit burn-down, outcome-based—because the market is still discovering optimal supply-demand intersections.
The vertical specialization pattern mirrors SaaS (horizontal infrastructure first, then vertical solutions) but is happening faster because companies can build on underlying LLM providers without research investments, and because borderless AI solutions make niche verticals viable at global scale.
Unit Economics Question: While people costs are exceptionally low (enabling strong revenue per employee), inference costs remain the open question. Stripe believes it would be unwise to evaluate these businesses assuming current costs, instead modeling reasonable expectations of declining costs based on historical trends. They note that models improving in capability and declining in cost happens “quite a bit, quite quickly,” making two-to-three-year ROI horizons more appropriate than in-year returns.
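A toy model of that evaluation argument: the same business looks marginal on in-year numbers but clearly positive over a multi-year horizon with declining inference costs. The 50% annual decline rate and the revenue/cost figures are assumptions for illustration, not Stripe's numbers:

```python
def cumulative_margin(revenue_per_year, inference_cost_year0,
                      annual_cost_decline, years):
    """Sum yearly margin while inference cost decays at a fixed rate."""
    total = 0.0
    cost = inference_cost_year0
    for _ in range(years):
        total += revenue_per_year - cost
        cost *= (1 - annual_cost_decline)
    return total

# In-year view: barely break-even. Three-year view: clearly positive.
year_one = cumulative_margin(100, 95, 0.5, 1)
three_years = cumulative_margin(100, 95, 0.5, 3)
```

Under these assumptions the first year nets 5 while three years net over 130, which is the intuition behind preferring two-to-three-year ROI horizons over in-year returns.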
Stripe launched “claimable sandboxes” integrated into V0 (Vercel) and Replit, allowing developers to create complete payment systems—defining products, setting prices, running test charges—before even having a Stripe account. When satisfied, they click a button, authenticate with Stripe (or create an account), claim the sandbox, and immediately go live.
This has driven significant business creation, with Stripe’s integration becoming the third-highest used integration in V0 within two weeks of launch. Notably, a significant portion of these new businesses are from non-technical founders who might have struggled with traditional Stripe onboarding. Stripe also built a “future onboarding experience” that learns intent and preferences from tools like Vercel, dropping users into the flow already partially complete based on their sandbox work.
Stripe’s LLMOps journey demonstrates sophisticated production deployment of AI across multiple dimensions: domain-specific foundation models processing billions of transactions in real-time, protocol design for emerging agent-to-agent commerce, massive internal adoption of AI tooling, and infrastructure products enabling the broader AI economy. The company balances building custom solutions when necessary (foundation models, agentic commerce protocol) with adopting best-in-class third-party tools (Braintrust for evals, LibreChat for internal chat), while maintaining a clear focus on enabling their core mission of growing internet GDP. Key lessons include the importance of measuring long-term rather than in-year ROI for AI investments, the critical role of data quality and discovery for LLM success, and the need for humans to maintain deep reasoning capabilities even as LLMs handle increasing cognitive work.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Coinbase, a cryptocurrency exchange serving millions of users across 100+ countries, faced challenges scaling customer support amid volatile market conditions, managing complex compliance investigations, and improving developer productivity. They built a comprehensive Gen AI platform integrating multiple LLMs through standardized interfaces (OpenAI API, Model Context Protocol) on AWS Bedrock to address these challenges. Their solution includes AI-powered chatbots handling 65% of customer contacts automatically (saving ~5 million employee hours annually), compliance investigation tools that synthesize data from multiple sources to accelerate case resolution, and developer productivity tools where 40% of daily code is now AI-generated or influenced. The implementation uses a multi-layered agentic architecture with RAG, guardrails, memory systems, and human-in-the-loop workflows, resulting in significant cost savings, faster resolution times, and improved quality across all three domains.
Notion AI, serving over 100 million users with multiple AI features including meeting notes, enterprise search, and deep research tools, demonstrates how rigorous evaluation and observability practices are essential for scaling AI product development. The company uses Braintrust as their evaluation platform to manage the complexity of supporting multilingual workspaces, rapid model switching, and maintaining product polish while building at the speed of AI industry innovation. Their approach emphasizes that 90% of AI development time should be spent on evaluation and observability rather than prompting, with specialized data specialists creating targeted datasets and custom LLM-as-a-judge scoring functions to ensure consistent quality across their diverse AI product suite.