Nubank, one of Brazil's largest banks serving 120 million users, implemented large-scale LLM systems to create an AI private banker for their customers. They deployed two main applications: a customer service chatbot handling 8.5 million monthly contacts with 60% first-contact resolution through LLMs, and an agentic money transfer system that reduced transaction time from 70 seconds across nine screens to under 30 seconds with over 90% accuracy and less than 0.5% error rate. The implementation leveraged LangChain, LangGraph, and LangSmith for development and evaluation, with a comprehensive four-layer ecosystem including core engines, testing tools, and developer experience platforms. Their evaluation strategy combined offline and online testing with LLM-as-a-judge systems that achieved 79% F1 score compared to 80% human accuracy through iterative prompt engineering and fine-tuning.
Nubank is a digital bank operating in Brazil, Mexico, and Colombia, serving nearly 120 million customers. The presentation, delivered by Tan, a four-year veteran at the company, describes their journey building what they call an “AI private banker” for all customers. The scale is remarkable: Nubank is the third-largest bank in Brazil, the fastest-growing bank in Mexico and Colombia, and has provided first-time credit card access to 21 million people in Brazil over the past five years. This massive user base and the critical nature of financial services create unique challenges for deploying LLMs in production.
The core problem Nubank aims to solve is that people are notoriously bad at making financial decisions—from deciding which subscriptions to cancel to making complex loan and investment decisions. The AI systems they’ve built aim to democratize access to sophisticated financial guidance that was previously available only to wealthy individuals with private bankers.
Nubank receives approximately 8.5 million customer contacts per month, with chat being the primary channel. Currently, 60% of these contacts are first handled by LLMs. The company emphasizes that results are continuously improving as they build specialized agents for different situations.
The customer service application faces challenges unique to financial services. Unlike typical chatbots, the system must handle sensitive financial inquiries with appropriate empathy and tone: as the speaker notes, if a customer contacts the bank about an unrecognized charge or a missing card, a robotic response loses their trust. Striking the balance between being helpful and being overly flattering is also critical; the speaker references OpenAI's experience rolling back a GPT-4o update because of overly agreeable, sycophantic responses.
The more technically interesting application is a multimodal agentic system that handles money transfers via voice, image, and chat through WhatsApp integration. Key metrics cited include:

- Transaction time reduced from 70 seconds across nine screens to under 30 seconds
- Over 90% accuracy
- Less than 0.5% error rate
The system includes robust security measures, requiring multiple password confirmations before enabling transactions. Users can give natural language instructions like “make a transfer to Jose for 100 reais,” and the system confirms the recipient before executing.
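The confirm-before-execute flow can be sketched as a small state machine: parse the instruction, confirm the recipient with the user, and only then execute. This is an illustrative plain-Python sketch, not Nubank's implementation; the toy parser and state names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Step(Enum):
    PARSE = auto()
    CONFIRM_RECIPIENT = auto()
    EXECUTE = auto()
    DONE = auto()

@dataclass
class TransferRequest:
    recipient: str
    amount: float

def parse_instruction(text: str) -> TransferRequest:
    # Toy parser for instructions like "make a transfer to Jose for 100 reais".
    words = text.split()
    recipient = words[words.index("to") + 1]
    amount = float(words[words.index("for") + 1])
    return TransferRequest(recipient, amount)

def run_transfer_flow(text: str, user_confirms) -> str:
    """Run the flow; `user_confirms` is a callback asking the user to confirm."""
    step, request = Step.PARSE, None
    while step is not Step.DONE:
        if step is Step.PARSE:
            request = parse_instruction(text)
            step = Step.CONFIRM_RECIPIENT
        elif step is Step.CONFIRM_RECIPIENT:
            # The recipient is confirmed before any money moves.
            if user_confirms(f"Transfer {request.amount:.2f} to {request.recipient}?"):
                step = Step.EXECUTE
            else:
                return "cancelled"
        elif step is Step.EXECUTE:
            step = Step.DONE
    return f"sent {request.amount:.2f} to {request.recipient}"
```

For example, `run_transfer_flow("make a transfer to Jose for 100 reais", lambda prompt: True)` returns `"sent 100.00 to Jose"`, while a declined confirmation returns `"cancelled"`.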
Nubank’s LLM ecosystem consists of four layers, including core engines, testing tools, and developer experience platforms.
The speaker emphasizes that without LangGraph, they couldn’t achieve fast iterations or establish canonical approaches for building agentic and RAG systems. A key insight is that graphs decrease the cognitive effort required to represent complex flows, making it easier for both machines and humans to understand the system architecture.
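The graph idea can be illustrated without any framework: nodes are functions over a shared state, and conditional edges pick the next node from that state. This sketch is not the LangGraph API, just the underlying concept it packages; all node names are made up.

```python
# Each node is a function over a shared state dict; edges map a node's
# output state to the next node (None means terminal).

def classify(state):
    state["intent"] = "transfer" if "transfer" in state["message"] else "faq"
    return state

def handle_transfer(state):
    state["reply"] = "Starting a transfer flow."
    return state

def handle_faq(state):
    state["reply"] = "Routing to the FAQ agent."
    return state

NODES = {"classify": classify, "transfer": handle_transfer, "faq": handle_faq}
EDGES = {
    "classify": lambda s: s["intent"],   # conditional edge on classified intent
    "transfer": lambda s: None,          # terminal
    "faq": lambda s: None,               # terminal
}

def run(entry, state):
    node = entry
    while node is not None:
        state = NODES[node](state)
        node = EDGES[node](state)
    return state

result = run("classify", {"message": "make a transfer to Jose"})
```

The whole flow is readable from the two dicts alone, which is the cognitive-effort point: the topology is data, not control flow buried in code.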
A critical observation from the talk is the importance of not building one-off solutions. In financial services, there are hundreds of operations involving money movement or micro-decisions. Building separate systems and agents for each would be unscalable. The architecture must be designed for reusability while still allowing rapid iteration.
The company operates 1,800 services with deployments every two minutes, all decisions driven by A/B testing. This CI/CD velocity requires robust evaluation and observability infrastructure.
The evaluation requirements differ significantly between the customer service and money transfer use cases, and each gets its own evaluation dimensions: the customer service assistant is judged on conversational qualities such as empathy and tone, while the money transfer agent is judged primarily on transactional accuracy and error rate.
After running experiments, results are fed to LLM applications for both individual and pairwise evaluation. The process primarily uses human labelers but also incorporates LLM evaluation and custom heuristics. Statistical tests determine the winning variant for launch.
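The "statistical tests determine the winning variant" step can be illustrated with a standard two-proportion z-test on resolution rates. The numbers below are invented for illustration; any equivalent A/B significance test would serve.

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-proportion z-test: is variant B's success rate different from A's?"""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical: variant B resolves 6,300 of 10,000 contacts vs. A's 6,000.
z = two_proportion_z(6000, 10000, 6300, 10000)
significant = abs(z) > 1.96  # two-sided, ~95% confidence
```

Here z is roughly 4.4, so the 3-point lift would be declared significant and variant B would launch.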
Online evaluation enables continuous improvement loops in controlled sandbox environments. The key insight is that relying solely on offline evaluation significantly slows decision-making for developers and analysts. Online evaluation with proper tracing, logging, and alerting dramatically increases development velocity. Nubank employs both approaches in parallel.
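The kind of tracing hook this depends on can be sketched as a decorator that logs inputs, latency, and failures per LLM call. This is illustrative only, not LangSmith's actual API; the function names are hypothetical.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm_trace")

def traced(fn):
    """Log outcome and latency of each call so online evaluation can sample them."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = fn(*args, **kwargs)
            log.info("%s ok in %.1f ms", fn.__name__, (time.perf_counter() - start) * 1e3)
            return result
        except Exception:
            log.exception("%s failed after %.1f ms", fn.__name__, (time.perf_counter() - start) * 1e3)
            raise
    return wrapper

@traced
def answer(question: str) -> str:
    # Stand-in for a real model call.
    return "stubbed model response"
```

With every call traced, alerting and filtering for evaluation datasets become queries over logs rather than bespoke instrumentation.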
The LLM-as-a-judge system was developed to address the scalability problem of human evaluation. With hundreds of thousands to millions of transactions daily, even sampling-based human labeling isn’t sufficient to maintain product quality. Training human labelers is also expensive and error-prone.
The goal was to achieve LLM judge quality comparable to human evaluators. The development process went through six iterations over approximately two weeks with a couple of developers:
Test 1: Simple prompt, GPT-4o mini (chosen for cost efficiency), no fine-tuning. Result: 51% F1 score (versus 80% human accuracy).
Test 2: Added fine-tuning to GPT-4o mini. Result: 59% F1 score (+8 points).
Test 3: Changed prompt (V2). Result: 70% F1 score (+11 points, the biggest jump).
Test 4: Upgraded from GPT-4o mini to GPT-4o. Incremental improvement.
Test 5: Additional prompt changes. Result: 80% F1 score.
Test 6: Additional fine-tuning adjustments. Result: 79% F1 score.
The team chose Test 6 over Test 5 despite the slightly lower overall F1 score because it better identified inaccurate information—a more critical metric for financial services where catching errors matters more than overall accuracy.
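The trade-off behind that choice can be made concrete: overall F1 can drop slightly while recall on the "inaccurate" class rises, and the latter is what matters for catching errors. A self-contained sketch with toy labels (not Nubank's data):

```python
def per_class_scores(human, judge):
    """Precision/recall/F1 per label, comparing judge output to human labels."""
    scores = {}
    for label in set(human):
        tp = sum(h == j == label for h, j in zip(human, judge))
        fp = sum(j == label != h for h, j in zip(human, judge))
        fn = sum(h == label != j for h, j in zip(human, judge))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"precision": p, "recall": r, "f1": f1}
    return scores

# Toy labels: this judge over-flags ("ok" mislabeled "inaccurate" once),
# but catches every truly inaccurate answer.
human = ["ok", "ok", "inaccurate", "inaccurate", "ok", "inaccurate"]
judge = ["ok", "inaccurate", "inaccurate", "inaccurate", "ok", "inaccurate"]
scores = per_class_scores(human, judge)
```

Here recall on "inaccurate" is 1.0 even though precision on that class (and hence overall F1) suffers, mirroring why a judge like Test 6 can be preferable in a setting where missed errors are costlier than false alarms.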
The speaker emphasizes that this rapid iteration was only possible because they had online tracing systems in place through LangSmith.
Operating in Brazil, Mexico, and Colombia means handling Portuguese and Spanish with various dialects and regional expressions. With 58% of the Brazilian population as customers, understanding diverse communication styles is essential. The brand’s prominence (more recognized than McDonald’s or Nike in Brazil) creates additional pressure to maintain high standards, particularly around jailbreak prevention and guardrails.
A notable aspect of Nubank’s approach is democratizing access to LLM system data beyond just developers. Business analysts, product managers, and operations teams can access the graphical interface provided by LangSmith to make faster decisions about prompts, inputs, and parameters. This centralized logging and graphical representation of LLM flows enables non-technical stakeholders to contribute to system improvement.
The speaker concludes with practical advice for LLMOps practitioners: avoid one-off agents, invest in evaluation and observability from the start, and open LLM system data to non-developers.
The flywheel model Nubank has established—observability to filtering to dataset definition to experimentation and back—represents a mature approach to LLMOps that enables continuous improvement at scale in a highly regulated industry where accuracy and trust are paramount.
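That flywheel can be sketched schematically as a loop: production traces are filtered into an evaluation dataset, experiments run against it, and the winner feeds the next pass. All names and numbers below are illustrative, not Nubank's pipeline.

```python
def flywheel_iteration(traces, is_interesting, run_experiment):
    """One turn of the flywheel: observability -> filtering -> dataset -> experiment."""
    # Observability produced `traces`; filtering defines the dataset.
    dataset = [t for t in traces if is_interesting(t)]
    # Experimentation scores each variant against the dataset.
    results = {variant: run_experiment(variant, dataset)
               for variant in ("baseline", "candidate")}
    # The winner ships, and its traces seed the next iteration.
    return max(results, key=results.get)

winner = flywheel_iteration(
    traces=[{"score": 0.2}, {"score": 0.9}],
    is_interesting=lambda t: t["score"] > 0.5,
    run_experiment=lambda variant, ds: 0.8 if variant == "candidate" else 0.6,
)
```

The point is structural: each stage's output is the next stage's input, so improvement becomes a repeatable loop rather than ad hoc tuning.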
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
This case study examines Cursor's implementation of reinforcement learning (RL) for training coding models and agents in production environments. The team discusses the unique challenges of applying RL to code generation compared to other domains like mathematics, including handling larger action spaces, multi-step tool calling processes, and developing reward signals that capture real-world usage patterns. They explore various technical approaches including test-based rewards, process reward models, and infrastructure optimizations for handling long context windows and high-throughput inference during RL training, while working toward more human-centric evaluation metrics beyond traditional test coverage.
Digits, a company providing automated accounting services for startups and small businesses, implemented production-scale LLM agents to handle complex workflows including vendor hydration, client onboarding, and natural language queries about financial books. The company evolved from a simple 200-line agent implementation to a sophisticated production system incorporating LLM proxies, memory services, guardrails, observability tooling (Phoenix from Arize), and API-based tool integration using Kotlin and Golang backends. Their agents achieve a 96% acceptance rate on classification tasks with only 3% requiring human review, handling approximately 90% of requests asynchronously and 10% synchronously through a chat interface.