Nubank, one of Brazil's largest banks serving 120 million users, implemented large-scale LLM systems to create an AI private banker for their customers. They deployed two main applications: a customer service chatbot handling 8.5 million monthly contacts with 60% first-contact resolution through LLMs, and an agentic money transfer system that reduced transaction time from 70 seconds across nine screens to under 30 seconds with over 90% accuracy and less than 0.5% error rate. The implementation leveraged LangChain, LangGraph, and LangSmith for development and evaluation, with a comprehensive four-layer ecosystem including core engines, testing tools, and developer experience platforms. Their evaluation strategy combined offline and online testing with LLM-as-a-judge systems that achieved 79% F1 score compared to 80% human accuracy through iterative prompt engineering and fine-tuning.
Nubank is a digital bank operating in Brazil, Mexico, and Colombia, serving nearly 120 million customers. The presentation, delivered by Tan, a four-year veteran at the company, describes their journey building what they call an “AI private banker” for all customers. The scale is remarkable: Nubank is the third-largest bank in Brazil, the fastest-growing bank in Mexico and Colombia, and has provided first-time credit card access to 21 million people in Brazil over the past five years. This massive user base and the critical nature of financial services create unique challenges for deploying LLMs in production.
The core problem Nubank aims to solve is that people are notoriously bad at making financial decisions—from deciding which subscriptions to cancel to making complex loan and investment decisions. The AI systems they’ve built aim to democratize access to sophisticated financial guidance that was previously available only to wealthy individuals with private bankers.
Nubank receives approximately 8.5 million customer contacts per month, with chat being the primary channel. Currently, 60% of these contacts are first handled by LLMs. The company emphasizes that results are continuously improving as they build specialized agents for different situations.
The customer service application faces challenges unique to financial services. Unlike typical chatbots, the system must handle sensitive financial inquiries with appropriate empathy and tone: as the speaker notes, if a customer contacts the bank about an unrecognized charge or a missing card, a robotic response loses their trust. Striking the balance between being helpful and being overly flattering is also critical; the speaker references OpenAI's experience rolling back a GPT-4o update because of overly agreeable, sycophantic responses.
The more technically interesting application is a multimodal agentic system that handles money transfers via voice, image, and chat through WhatsApp integration. Key metrics cited include:

- Transaction time reduced from 70 seconds across nine screens to under 30 seconds
- Over 90% accuracy
- Less than 0.5% error rate
The system includes robust security measures, requiring multiple password confirmations before enabling transactions. Users can give natural language instructions like “make a transfer to Jose for 100 reais,” and the system confirms the recipient before executing.
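The confirm-before-execute flow can be sketched as a small state machine: parse the instruction, confirm the recipient with the user, and only then execute. This is an illustrative plain-Python sketch, not Nubank's implementation; the toy parser and state names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Step(Enum):
    PARSE = auto()
    CONFIRM_RECIPIENT = auto()
    EXECUTE = auto()
    DONE = auto()

@dataclass
class TransferRequest:
    recipient: str
    amount: float

def parse_instruction(text: str) -> TransferRequest:
    # Toy parser for instructions like "make a transfer to Jose for 100 reais".
    words = text.split()
    recipient = words[words.index("to") + 1]
    amount = float(words[words.index("for") + 1])
    return TransferRequest(recipient, amount)

def run_transfer_flow(text: str, user_confirms) -> str:
    """Run the flow; `user_confirms` is a callback asking the user to confirm."""
    step, request = Step.PARSE, None
    while step is not Step.DONE:
        if step is Step.PARSE:
            request = parse_instruction(text)
            step = Step.CONFIRM_RECIPIENT
        elif step is Step.CONFIRM_RECIPIENT:
            # The recipient is confirmed before any money moves.
            if user_confirms(f"Transfer {request.amount:.2f} to {request.recipient}?"):
                step = Step.EXECUTE
            else:
                return "cancelled"
        elif step is Step.EXECUTE:
            step = Step.DONE
    return f"sent {request.amount:.2f} to {request.recipient}"
```

For example, `run_transfer_flow("make a transfer to Jose for 100 reais", lambda prompt: True)` returns `"sent 100.00 to Jose"`, while a declined confirmation returns `"cancelled"`.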
Nubank’s LLM ecosystem consists of four layers, including core engines, testing tools, and developer experience platforms.
The speaker emphasizes that without LangGraph, they couldn’t achieve fast iterations or establish canonical approaches for building agentic and RAG systems. A key insight is that graphs decrease the cognitive effort required to represent complex flows, making it easier for both machines and humans to understand the system architecture.
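The graph idea can be illustrated without any framework: nodes are functions over a shared state, and conditional edges pick the next node from that state. This sketch is not the LangGraph API, just the underlying concept it packages; all node names are made up.

```python
# Each node is a function over a shared state dict; edges map a node's
# output state to the next node (None means terminal).

def classify(state):
    state["intent"] = "transfer" if "transfer" in state["message"] else "faq"
    return state

def handle_transfer(state):
    state["reply"] = "Starting a transfer flow."
    return state

def handle_faq(state):
    state["reply"] = "Routing to the FAQ agent."
    return state

NODES = {"classify": classify, "transfer": handle_transfer, "faq": handle_faq}
EDGES = {
    "classify": lambda s: s["intent"],   # conditional edge on classified intent
    "transfer": lambda s: None,          # terminal
    "faq": lambda s: None,               # terminal
}

def run(entry, state):
    node = entry
    while node is not None:
        state = NODES[node](state)
        node = EDGES[node](state)
    return state

result = run("classify", {"message": "make a transfer to Jose"})
```

The whole flow is readable from the two dicts alone, which is the cognitive-effort point: the topology is data, not control flow buried in code.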
A critical observation from the talk is the importance of not building one-off solutions. In financial services, there are hundreds of operations involving money movement or micro-decisions. Building separate systems and agents for each would be unscalable. The architecture must be designed for reusability while still allowing rapid iteration.
The company operates 1,800 services with deployments every two minutes, all decisions driven by A/B testing. This CI/CD velocity requires robust evaluation and observability infrastructure.
The evaluation requirements differ significantly between the customer service and money transfer use cases, and each gets its own evaluation dimensions: the customer service assistant is judged on conversational qualities such as empathy and tone, while the money transfer agent is judged primarily on transactional accuracy and error rate.
After running experiments, results are fed to LLM applications for both individual and pairwise evaluation. The process primarily uses human labelers but also incorporates LLM evaluation and custom heuristics. Statistical tests determine the winning variant for launch.
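The "statistical tests determine the winning variant" step can be illustrated with a standard two-proportion z-test on resolution rates. The numbers below are invented for illustration; any equivalent A/B significance test would serve.

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-proportion z-test: is variant B's success rate different from A's?"""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical: variant B resolves 6,300 of 10,000 contacts vs. A's 6,000.
z = two_proportion_z(6000, 10000, 6300, 10000)
significant = abs(z) > 1.96  # two-sided, ~95% confidence
```

Here z is roughly 4.4, so the 3-point lift would be declared significant and variant B would launch.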
Online evaluation enables continuous improvement loops in controlled sandbox environments. The key insight is that relying solely on offline evaluation significantly slows decision-making for developers and analysts. Online evaluation with proper tracing, logging, and alerting dramatically increases development velocity. Nubank employs both approaches in parallel.
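The kind of tracing hook this depends on can be sketched as a decorator that logs inputs, latency, and failures per LLM call. This is illustrative only, not LangSmith's actual API; the function names are hypothetical.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm_trace")

def traced(fn):
    """Log outcome and latency of each call so online evaluation can sample them."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = fn(*args, **kwargs)
            log.info("%s ok in %.1f ms", fn.__name__, (time.perf_counter() - start) * 1e3)
            return result
        except Exception:
            log.exception("%s failed after %.1f ms", fn.__name__, (time.perf_counter() - start) * 1e3)
            raise
    return wrapper

@traced
def answer(question: str) -> str:
    # Stand-in for a real model call.
    return "stubbed model response"
```

With every call traced, alerting and filtering for evaluation datasets become queries over logs rather than bespoke instrumentation.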
The LLM-as-a-judge system was developed to address the scalability problem of human evaluation. With hundreds of thousands to millions of transactions daily, even sampling-based human labeling isn’t sufficient to maintain product quality. Training human labelers is also expensive and error-prone.
The goal was to achieve LLM judge quality comparable to human evaluators. The development process went through six iterations over approximately two weeks with a couple of developers:
Test 1: Simple prompt, GPT-4o mini (chosen for cost efficiency), no fine-tuning. Result: 51% F1 score (versus 80% human accuracy).
Test 2: Added fine-tuning to GPT-4o mini. Result: 59% F1 score (+8 points).
Test 3: Changed prompt (V2). Result: 70% F1 score (+11 points, the biggest jump).
Test 4: Upgraded from GPT-4o mini to GPT-4o. Incremental improvement.
Test 5: Additional prompt changes. Result: 80% F1 score.
Test 6: Additional fine-tuning adjustments. Result: 79% F1 score.
The team chose Test 6 over Test 5 despite the slightly lower overall F1 score because it better identified inaccurate information—a more critical metric for financial services where catching errors matters more than overall accuracy.
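The trade-off behind that choice can be made concrete: overall F1 can drop slightly while recall on the "inaccurate" class rises, and the latter is what matters for catching errors. A self-contained sketch with toy labels (not Nubank's data):

```python
def per_class_scores(human, judge):
    """Precision/recall/F1 per label, comparing judge output to human labels."""
    scores = {}
    for label in set(human):
        tp = sum(h == j == label for h, j in zip(human, judge))
        fp = sum(j == label != h for h, j in zip(human, judge))
        fn = sum(h == label != j for h, j in zip(human, judge))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"precision": p, "recall": r, "f1": f1}
    return scores

# Toy labels: this judge over-flags ("ok" mislabeled "inaccurate" once),
# but catches every truly inaccurate answer.
human = ["ok", "ok", "inaccurate", "inaccurate", "ok", "inaccurate"]
judge = ["ok", "inaccurate", "inaccurate", "inaccurate", "ok", "inaccurate"]
scores = per_class_scores(human, judge)
```

Here recall on "inaccurate" is 1.0 even though precision on that class (and hence overall F1) suffers, mirroring why a judge like Test 6 can be preferable in a setting where missed errors are costlier than false alarms.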
The speaker emphasizes that this rapid iteration was only possible because they had online tracing systems in place through LangSmith.
Operating in Brazil, Mexico, and Colombia means handling Portuguese and Spanish with various dialects and regional expressions. With 58% of the Brazilian population as customers, understanding diverse communication styles is essential. The brand’s prominence (more recognized than McDonald’s or Nike in Brazil) creates additional pressure to maintain high standards, particularly around jailbreak prevention and guardrails.
A notable aspect of Nubank’s approach is democratizing access to LLM system data beyond just developers. Business analysts, product managers, and operations teams can access the graphical interface provided by LangSmith to make faster decisions about prompts, inputs, and parameters. This centralized logging and graphical representation of LLM flows enables non-technical stakeholders to contribute to system improvement.
The speaker concludes with practical advice for LLMOps practitioners: avoid one-off agents, invest in evaluation and observability from the start, and open LLM system data to non-developers.
The flywheel model Nubank has established—observability to filtering to dataset definition to experimentation and back—represents a mature approach to LLMOps that enables continuous improvement at scale in a highly regulated industry where accuracy and trust are paramount.
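That flywheel can be sketched schematically as a loop: production traces are filtered into an evaluation dataset, experiments run against it, and the winner feeds the next pass. All names and numbers below are illustrative, not Nubank's pipeline.

```python
def flywheel_iteration(traces, is_interesting, run_experiment):
    """One turn of the flywheel: observability -> filtering -> dataset -> experiment."""
    # Observability produced `traces`; filtering defines the dataset.
    dataset = [t for t in traces if is_interesting(t)]
    # Experimentation scores each variant against the dataset.
    results = {variant: run_experiment(variant, dataset)
               for variant in ("baseline", "candidate")}
    # The winner ships, and its traces seed the next iteration.
    return max(results, key=results.get)

winner = flywheel_iteration(
    traces=[{"score": 0.2}, {"score": 0.9}],
    is_interesting=lambda t: t["score"] > 0.5,
    run_experiment=lambda variant, ds: 0.8 if variant == "candidate" else 0.6,
)
```

The point is structural: each stage's output is the next stage's input, so improvement becomes a repeatable loop rather than ad hoc tuning.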
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
This case study examines Cursor's implementation of reinforcement learning (RL) for training coding models and agents in production environments. The team discusses the unique challenges of applying RL to code generation compared to other domains like mathematics, including handling larger action spaces, multi-step tool calling processes, and developing reward signals that capture real-world usage patterns. They explore various technical approaches including test-based rewards, process reward models, and infrastructure optimizations for handling long context windows and high-throughput inference during RL training, while working toward more human-centric evaluation metrics beyond traditional test coverage.
Digits, a company providing automated accounting services for startups and small businesses, implemented production-scale LLM agents to handle complex workflows including vendor hydration, client onboarding, and natural language queries about financial books. The company evolved from a simple 200-line agent implementation to a sophisticated production system incorporating LLM proxies, memory services, guardrails, observability tooling (Phoenix from Arize), and API-based tool integration using Kotlin and Golang backends. Their agents achieve a 96% acceptance rate on classification tasks with only 3% requiring human review, handling approximately 90% of requests asynchronously and 10% synchronously through a chat interface.