AI-Native Multi-Agent System for Customer Onboarding and KYC

Brex 2026

Brex, a financial services company, faced a significant challenge: customer onboarding took days because manual Know Your Customer (KYC) and underwriting processes relied on implicit heuristics and individual judgment. To solve this, they rebuilt their entire onboarding system as an AI-native, multi-agent architecture in which specialized agents collaborate through structured reasoning to handle verification, fraud detection, document processing, and underwriting decisions. The results were dramatic: auto-approval of card applications went from 0% to 40% in weeks, manual identity reviews fell by 70% thanks to a specialized fuzzy-matching agent, business address requests for information (RFIs) dropped by 85%, and most eligible businesses can now onboard in minutes rather than days, all while maintaining or improving accuracy and leaving a full audit trail for every decision.

Industry

Finance

Overview

Brex is a financial services company that provides corporate credit cards, business banking, and expense management services primarily to startups and growing businesses. This case study describes their comprehensive rebuild of customer onboarding as an AI-native, multi-agent system to address what they call the “onboarding trilemma” - the traditional trade-off between speed, accuracy, and scale in Know Your Customer (KYC) and underwriting processes.

The core problem Brex faced was that their onboarding process could take days, which, while fast by traditional banking standards, wasn’t meeting the expectations of their fast-moving customer base. The bottleneck wasn’t a lack of rigor but rather a reliance on manual judgment and implicit heuristics that created a ceiling on velocity and scalability. Their analysts were doing exceptional work across fragmented tools - verifying entities, tracing ownership, reviewing financials, and identifying fraud patterns - but the process was difficult to scale because much of the reasoning was rooted in individual experience rather than being encoded in a unified digital system.

Approach to Building the System

Brex’s approach began with a deep study of how their human analysts actually worked. They observed analysts making decisions and asked them to narrate their thought processes, documenting institutional knowledge that had never been formally written down: which signals were trusted, which were overvalued, and which were ignored. They mapped moments where human intuition succeeded and where it failed. This ethnographic approach to understanding the decision-making process was critical to their ability to encode that judgment into an automated system.

From this study, they built a massive dataset covering every type of customer they had ever encountered. Crucially, every case was re-reviewed and labeled under a stricter standard than had been applied originally. This dataset became the benchmark against which agentic performance would be measured. The internal bar they set was notably strict: catch every bad actor that humans catch plus more that humans historically missed, never reject a customer that a human would approve, and always explain why a decision was made. This represents a thoughtful approach to building AI systems that aims to capture the best of human decision-making while correcting for its weaknesses.

Multi-Agent Architecture Design

The system architecture is modular and distributed, with specialized agents owning distinct reasoning domains that can be improved independently. Several key principles guided the design. First, every approval or rejection must leave a clear reasoning trail so customers understand why decisions were made. Second, agents must be able to express uncertainty - just as humans naturally ask colleagues for second opinions, agents needed the same humility to defer when confidence was low. Third, the system had to be repeatable and deterministic: running the same application through the pipeline twice should produce the same outcome, which is how they distinguish reasoning from guessing.

Each agent is narrowly scoped, auditable, and explicit about both what it knows and what it doesn’t know. Agents exchange structured claims, supporting evidence, and confidence levels with each other, which enables deterministic execution, replayable pipelines, and full auditability - all critical requirements for financial services applications.
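
The case study doesn't publish the message format agents use, but the pattern it describes (structured claims, supporting evidence, and a confidence level) maps naturally onto a small typed schema. Below is a minimal sketch in Python; the Claim/AgentFinding names and fields are assumptions for illustration, not Brex's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """One assertion an agent makes about an application (illustrative schema)."""
    subject: str        # e.g. "business_address" or "beneficial_owner_identity"
    statement: str      # e.g. "address on bank statement matches application"
    confidence: float   # 0.0-1.0, calibrated by the emitting agent
    evidence: tuple = ()  # references to the documents or data sources backing the claim


@dataclass
class AgentFinding:
    """What a specialized agent hands to the orchestration layer."""
    agent_name: str
    claims: list[Claim]

    def weakest_confidence(self) -> float:
        # The least-certain claim bounds how far the decision agent can rely on this finding.
        return min((c.confidence for c in self.claims), default=0.0)
```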

Agent Specializations

The specific agent types reveal how Brex decomposed the onboarding problem into distinct reasoning domains:

Segmentation agents leverage external data sources like LinkedIn (via a tool called Clay), company websites, and application information to determine early in the application process whether a startup is already professionally invested or likely to receive professional investment. This early classification helps route applications appropriately.

Verification and OCR agents include specialized sub-agents that automatically process and validate high-friction documents in real time. An OCR and classification agent verifies proof of address, articles of incorporation, SAFEs (Simple Agreements for Future Equity), and bank statements with high precision, allowing for instant auto-approval of documents that previously required manual review. The specificity here is notable - they didn’t just build a generic document processor but tailored agents to the specific document types common in their domain.
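
The write-up doesn't name the OCR engine or the models involved, but the behavior it describes can be sketched as: classify the uploaded document, cross-check it against the application, and auto-approve only above a confidence bar. In the sketch below, `extract_text` and `classify_document` are stand-in placeholders for the real OCR and model calls, and the 0.95 threshold is an assumption.

```python
DOCUMENT_TYPES = {"proof_of_address", "articles_of_incorporation", "safe", "bank_statement"}
AUTO_APPROVE_THRESHOLD = 0.95  # illustrative; the real bar is not disclosed


def extract_text(file_bytes: bytes) -> str:
    """Placeholder for the OCR step; a real system would call an OCR engine here."""
    return file_bytes.decode("utf-8", errors="ignore")


def classify_document(text: str) -> tuple[str, float]:
    """Placeholder for the classification step; returns (document_type, confidence)."""
    hints = {
        "articles of incorporation": "articles_of_incorporation",
        "simple agreement for future equity": "safe",
        "statement period": "bank_statement",
    }
    for phrase, doc_type in hints.items():
        if phrase in text.lower():
            return doc_type, 0.97
    return "proof_of_address", 0.60


def review_document(file_bytes: bytes, application: dict) -> dict:
    """Classify and validate one uploaded document; defer to a human below threshold."""
    text = extract_text(file_bytes)
    doc_type, confidence = classify_document(text)

    if doc_type not in DOCUMENT_TYPES or confidence < AUTO_APPROVE_THRESHOLD:
        return {"status": "manual_review", "doc_type": doc_type, "confidence": confidence}

    # Cross-check key applicant-supplied fields against the extracted text.
    mismatches = [f for f in ("legal_name", "address")
                  if application.get(f) and application[f].lower() not in text.lower()]
    if mismatches:
        return {"status": "manual_review", "mismatched_fields": mismatches}

    return {"status": "auto_approved", "doc_type": doc_type, "confidence": confidence}
```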

Identity and fraud agents go beyond simple identity checks to evaluate behavioral signals and anomalies. A particularly interesting specialized component is the fuzzy-match agent that resolves name mismatches on IDs (such as “Johnny” versus “John”). This single specialized agent has reduced manual identity reviews by 70%, demonstrating the power of narrowly scoped agents addressing specific friction points rather than trying to build one general-purpose system.
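
Brex doesn't describe how the fuzzy-match agent works internally. As a rough illustration of the class of check it performs, a nickname table plus string similarity already resolves mismatches like "Johnny" versus "John"; the nickname entries and the 0.85 threshold below are assumptions, not Brex's actual logic.

```python
from difflib import SequenceMatcher

# Tiny illustrative nickname table; a production system would use a far larger one.
NICKNAMES = {
    "johnny": "john", "jon": "john",
    "liz": "elizabeth", "beth": "elizabeth",
    "bill": "william", "will": "william",
}

SIMILARITY_THRESHOLD = 0.85  # assumed cut-off, not from the case study


def normalize(name: str) -> str:
    token = name.strip().lower()
    return NICKNAMES.get(token, token)


def names_match(id_name: str, application_name: str) -> bool:
    """Return True when two name tokens plausibly refer to the same person."""
    a, b = normalize(id_name), normalize(application_name)
    if a == b:
        return True
    return SequenceMatcher(None, a, b).ratio() >= SIMILARITY_THRESHOLD


# Example: "Johnny" on the ID vs. "John" on the application resolves without a human.
assert names_match("Johnny", "John")
```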

Underwriting agents automatically reconstruct a company’s financial profile by qualifying and mapping applicants to underwriting segment policies, automating what had been complex manual analysis.

The decision agent serves as the orchestration layer, synthesizing evidence, confidence scores, and Brex policies into a single outcome. Importantly, when confidence is high, decisions are made instantly, but when confidence falls below a defined threshold - for example when data sources conflict - the case is escalated to a human analyst. Those human decisions are then fed back into the system as supervised signals, continuously improving calibration and accuracy over time. This human-in-the-loop design is positioned not as a fallback for system failures but as an integral part of how the system’s reliability compounds over time.
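
Put together with the Claim schema sketched earlier, the decision agent's behavior reduces to a confidence-gated synthesis step. The 0.90 threshold, the conflict heuristic, and the `policy_ok` argument below are illustrative assumptions rather than Brex's actual policy logic.

```python
ESCALATION_THRESHOLD = 0.90  # illustrative; the real threshold is calibrated, not fixed


def decide(claims: list[Claim], policy_ok: bool) -> dict:
    """Synthesize agent claims into one outcome, deferring to a human when unsure.

    Reuses the Claim type sketched above; `policy_ok` stands in for Brex's policy checks.
    """
    overall_confidence = min((c.confidence for c in claims), default=0.0)

    # Treat two different statements about the same subject as a conflict
    # (e.g. two data sources disagree on the business address).
    by_subject: dict[str, set[str]] = {}
    for c in claims:
        by_subject.setdefault(c.subject, set()).add(c.statement)
    conflicts = [s for s, statements in by_subject.items() if len(statements) > 1]

    if conflicts or overall_confidence < ESCALATION_THRESHOLD:
        # Escalated cases go to an analyst; their decision later becomes a supervised label.
        return {"outcome": "escalated", "conflicts": conflicts, "confidence": overall_confidence}

    outcome = "approved" if policy_ok else "rejected"
    # Persist the full reasoning trail so the decision can be replayed and audited.
    return {"outcome": outcome, "confidence": overall_confidence,
            "reasoning_trail": [vars(c) for c in claims]}
```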

Production Deployment and Operations

The case study emphasizes several aspects of production deployment that are particularly relevant to LLMOps. The focus on auditability and explainability isn’t just about technical capability but reflects regulatory requirements in financial services. Every agent’s reasoning must be inspectable, and this governance is built into the system architecture rather than being bolted on afterward. Each human intervention doesn’t just solve an individual case - it creates a labeled data point that sharpens future automation, creating a continuous improvement loop.
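
The case study doesn't say how these labels are captured. A minimal version of the pattern it implies is to persist every escalated case together with the analyst's final decision so it can feed later evaluation or retraining; the JSONL format and field names below are hypothetical.

```python
import json
import time


def log_analyst_decision(case_id: str, agent_outcome: dict, analyst_decision: str,
                         path: str = "analyst_labels.jsonl") -> None:
    """Append one human-reviewed case as a labeled example (hypothetical storage format)."""
    record = {
        "case_id": case_id,
        "timestamp": time.time(),
        "agent_outcome": agent_outcome,        # what the agents concluded, with confidences
        "analyst_decision": analyst_decision,  # the ground-truth label: "approved" / "rejected"
        "agreed": agent_outcome.get("outcome") == analyst_decision,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```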

The emphasis on repeatability and determinism is another production consideration. In financial services, being able to replay decisions and understand exactly why a particular outcome occurred is critical both for regulatory compliance and for debugging when things go wrong. The structured exchange of claims, evidence, and confidence levels between agents provides this capability.
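
Building on the earlier sketches, replayability amounts to persisting the structured claims a decision was based on and checking that re-running the same deterministic decision function reproduces the logged outcome; the record keys below are illustrative.

```python
def replay(case_record: dict) -> dict:
    """Re-run a stored case through the deterministic pipeline and verify the outcome.

    Assumes each decision was persisted with the structured claims it was based on
    (the "stored_claims", "policy_ok", and "original_outcome" keys are illustrative).
    """
    claims = [Claim(**c) for c in case_record["stored_claims"]]
    new_outcome = decide(claims, policy_ok=case_record["policy_ok"])
    if new_outcome["outcome"] != case_record["original_outcome"]:
        raise RuntimeError("Replay diverged from the audit log; the pipeline is not deterministic")
    return new_outcome
```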

The confidence scoring mechanism that determines when to escalate to humans represents a thoughtful approach to managing the accuracy-automation trade-off. Rather than trying to automate everything and accepting higher error rates, or being overly conservative and routing too much to humans, they’ve built a calibrated system that can assess its own uncertainty. This is a sophisticated LLMOps pattern that goes beyond simple threshold-based rules.

Results and Impact

The quantitative results Brex reports are substantial: auto-approval of card applications rose from 0% to 40% in a matter of weeks, business address RFIs fell by 85%, manual identity reviews dropped by 70% through the fuzzy-matching agent alone, and most eligible businesses now onboard in minutes rather than days. Importantly, they claim these speed and scale improvements came with maintained or improved accuracy, with the system reportedly catching every bad actor that humans would catch plus additional ones that humans historically missed.

While these are impressive metrics, it’s worth noting that they come from Brex’s own reporting and the case study is fundamentally a marketing piece. The specific claim that the system “never rejects a customer that a human would approve” is particularly strong and would be difficult to verify externally. There’s also limited discussion of failure modes, edge cases, or situations where the system has struggled, which would be expected in any real-world deployment of this complexity.

Key Learnings and Patterns

Brex identifies several critical success factors. The partnership between subject matter experts (SMEs) and AI engineers proved essential. Rather than the traditional “requirements document” handoff where analysts specify what they want and engineers build it, they worked as a single integrated unit to translate complex institutional logic into agentic workflows. This tight feedback loop is what enabled such rapid progress - moving from 0% to 40% auto-approval in weeks.

The finding that specialized agents outperform general ones aligns with broader trends in production AI systems. Narrow scope provides clearer accountability and easier iteration. When something goes wrong with the fuzzy name-matching, you know exactly which agent to investigate and improve. This modularity is a key architectural pattern for maintaining and evolving complex AI systems in production.

The positioning of deferral as a feature rather than a flaw represents mature thinking about AI system design. Teaching agents to say “I’m not sure” reduced errors on both sides - false positives and false negatives. This allowed human analysts to focus on genuinely high-judgment edge cases rather than repetitive data entry tasks that the system could handle reliably. The framing of human-in-the-loop as a strength rather than an admission of AI limitations is important for building systems that work reliably in high-stakes domains.

The emphasis on governance living inside the system rather than being added later reflects hard-won lessons in production AI deployment. Auditability, privacy boundaries, and policy enforcement need to be architectural considerations from the start, not afterthoughts. This is particularly critical in regulated industries like financial services.

Critical Assessment and Limitations

While this case study provides valuable insights into production deployment of multi-agent AI systems, several caveats should be considered. First, this is a first-party case study published by Brex on their own blog, which inherently limits how critically it will examine challenges or failures. The narrative arc moves smoothly from problem to solution to impressive results without dwelling on the difficulties, false starts, or ongoing challenges that surely existed in such a complex system rebuild.

Second, many of the specific technical details are omitted. We don’t learn what underlying LLMs or AI models power these agents, what prompting strategies were used, how exactly the confidence scoring works, what the specific thresholds are for escalation, or how the agents were evaluated during development. The case study operates at a relatively high level of abstraction, describing the architecture and approach without diving deep into implementation details.

Third, the metrics reported are selective. We learn about improvements in auto-approval rates and reductions in manual reviews, but we don’t see comprehensive statistics on false positive rates, false negative rates, customer satisfaction with the onboarding experience, or operational costs. The claim that accuracy has been maintained or improved is asserted but not demonstrated with detailed data.

Fourth, the timeline is notably compressed in the telling. Moving from 0% to 40% auto-approval “in weeks” is presented as straightforward, but this likely elides significant preparatory work in building the dataset, developing the initial agents, and establishing the infrastructure. The case study published in January 2026 likely represents work that began much earlier.

Finally, there’s limited discussion of ongoing operational challenges. How often do the agents need to be retrained or updated? How do they handle drift in customer populations or fraud patterns? What happens when external data sources like LinkedIn change their APIs or data formats? How do they manage version control and rollback when agent updates cause problems? These are all critical LLMOps concerns that aren’t addressed in the narrative.

Future Directions

Brex indicates three priorities going forward: expanding instant onboarding to more customers, continuing investment in risk controls and calibration as automation scales, and extending these agentic patterns to other workflows across Brex where judgment and scale intersect. This suggests they view this onboarding system as a template or pattern that can be applied more broadly, which is a common evolution in organizations that successfully deploy AI in one domain.

The emphasis on risk controls and calibration scaling with automation is important. As the system handles more volume and more complex cases, maintaining accuracy and compliance becomes both more critical and more challenging. The continuous learning loop where human interventions improve the system is key to this scaling strategy.

Relevance to LLMOps Practice

This case study illustrates several patterns highly relevant to LLMOps practitioners. The multi-agent architecture with specialized, narrowly-scoped agents provides modularity and maintainability. The structured exchange of claims, evidence, and confidence between agents enables orchestration while maintaining auditability. The human-in-the-loop design with confidence-based escalation manages the accuracy-automation trade-off. The emphasis on repeatability and determinism addresses production reliability concerns. The tight partnership between domain experts and AI engineers enables effective translation of implicit knowledge into automated systems. The focus on building governance and auditability into the architecture rather than adding it later reflects mature production thinking.

The use case itself - KYC and underwriting in financial services - represents a high-stakes, regulated domain where the consequences of errors are significant. The fact that Brex was able to deploy AI agents in this context successfully (by their account) suggests that with appropriate architecture, evaluation, and human oversight, agentic AI systems can handle complex, consequential decisions in production. However, the emphasis on auditability, explainability, and human escalation also underscores that full automation without oversight remains inappropriate for these high-stakes domains, at least with current AI capabilities.

More Like This

Running LLM Agents in Production for Accounting Automation

Digits 2025

Digits, a company providing automated accounting services for startups and small businesses, implemented production-scale LLM agents to handle complex workflows including vendor hydration, client onboarding, and natural language queries about financial books. The company evolved from a simple 200-line agent implementation to a sophisticated production system incorporating LLM proxies, memory services, guardrails, observability tooling (Phoenix from Arize), and API-based tool integration using Kotlin and Golang backends. Their agents achieve a 96% acceptance rate on classification tasks with only 3% requiring human review, handling approximately 90% of requests asynchronously and 10% synchronously through a chat interface.

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.

Building a Microservices-Based Multi-Agent Platform for Financial Advisors

Prudential 2025

Prudential Financial, in partnership with AWS GenAI Innovation Center, built a scalable multi-agent platform to support 100,000+ financial advisors across insurance and financial services. The system addresses fragmented workflows where advisors previously had to navigate dozens of disconnected IT systems for client engagement, underwriting, product information, and servicing. The solution features an orchestration agent that routes requests to specialized sub-agents (quick quote, forms, product, illustration, book of business) while maintaining context and enforcing governance. The platform-based microservices architecture reduced time-to-value from 6-8 weeks to 3-4 weeks for new agent deployments, enabled cross-business reusability, and provided standardized frameworks for authentication, LLM gateway access, knowledge management, and observability while handling the complexity of scaling multi-agent systems in a regulated financial services environment.
