Company: Northwestern Mutual
Title: Building a Gradual, Trust-Focused GenBI Agent for Enterprise Data Democratization
Industry: Finance
Year: 2025

## Summary
Northwestern Mutual, a 160-year-old financial services and life insurance company, developed a GenBI (Generative AI for Business Intelligence) agent to democratize data access and reduce dependency on BI teams. Faced with the challenge of balancing innovation with risk aversion in a highly regulated industry, they adopted an incremental, phased approach that used real, messy data, focused on building trust through a crawl-walk-run user rollout strategy, and delivered tangible business value at each stage. The system uses multiple specialized agents (metadata, RAG, SQL, and BI agents) to answer business questions, initially by retrieving certified reports rather than generating SQL from scratch. This approach allowed them to automate approximately 80% of the 20% of BI team capacity spent on finding and sharing reports, while proving the value of metadata enrichment through measurable improvements in LLM performance. The incremental delivery model enabled continuous leadership buy-in and risk management, with each six-week sprint producing productizable deliverables that could be evaluated independently.
## Overview and Context

Northwestern Mutual, a major financial services and life insurance company with 160 years of history, embarked on building a GenBI (Generative AI for Business Intelligence) system to democratize data access across their enterprise. The company manages substantial assets and serves clients through decades-long commitments, which creates a deeply risk-averse culture centered around "generational responsibility" and stability. This case study is particularly valuable for LLMOps practitioners because it demonstrates how to navigate the tension between innovation and risk management in a highly regulated, conservative enterprise environment.

The speaker, Assaf, leads this initiative and presents a remarkably honest and balanced view of the challenges involved. Unlike many vendor presentations, this case study acknowledges uncertainties, limitations, and the ongoing experimental nature of the work. The core problem being addressed is the bottleneck created by BI teams who spend significant time helping users find reports, understand data, and extract insights—work that could theoretically be automated through conversational AI interfaces.

## The Challenge: Four Major Obstacles

Northwestern Mutual faced four interconnected challenges when pursuing GenBI. First, no one had successfully built this type of system before at enterprise scale with messy, real-world data. Second, they deliberately chose to use actual production data rather than synthesized or cleaned datasets, understanding that this is where real complexities would emerge. Third, they had to overcome blind trust bias—building confidence not just with end users but with senior leadership who were well aware of LLM limitations around accuracy and hallucination. Fourth, and perhaps most critically, they needed to secure ongoing budget and demonstrate ROI in an environment where the DNA of the organization emphasizes risk aversion.

The decision to use real, messy data from a 160-year-old company proved strategically important for multiple reasons. It ensured that solutions developed in the lab would translate to production environments. It provided access to subject matter experts who work with the data daily, yielding realistic evaluation examples and ground truth for testing. Critically, it brought business stakeholders into the research process itself, creating organic buy-in rather than requiring later persuasion. By the time components matured enough for production, end users were already pulling for deployment rather than resisting it.

## Trust-Building Strategies and Crawl-Walk-Run Approach

Northwestern Mutual implemented several sophisticated strategies to build trust with both management and users. A key insight was recognizing that users' ability to verify outputs and provide useful feedback varies dramatically based on their data expertise. This led to a crawl-walk-run rollout strategy with three distinct user tiers. The first tier targets actual BI experts—people who could perform the analysis manually and recognize what "good" looks like. For them, the GenBI system functions like a GitHub Copilot, accelerating workflows rather than replacing judgment. The second tier comprises business managers closer to BI teams who regularly work with data and can identify mistakes when they occur. These users are less sensitive to occasional errors and more likely to provide constructive feedback. The third tier—executives who need clear, concise, trustworthy answers—remains aspirational. The speaker candidly acknowledges they may never achieve sufficient accuracy for executive-level deployment, showing a refreshing realism about system limitations.
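One way to picture this tiered rollout is as explicit capability gating in the application layer. The sketch below is purely illustrative: the tier names, the capability strings, and which capabilities each tier receives are assumptions for the example, not Northwestern Mutual's actual rollout rules.

```python
from enum import Enum

class UserTier(Enum):
    """Hypothetical crawl-walk-run rollout tiers, loosely mirroring the talk."""
    BI_EXPERT = "crawl"        # BI analysts who can validate outputs themselves
    BUSINESS_MANAGER = "walk"  # data-literate managers who can spot mistakes
    EXECUTIVE = "run"          # aspirational tier, not yet enabled

# Illustrative capability gates per tier; the real rules are not published.
TIER_CAPABILITIES = {
    UserTier.BI_EXPERT: {"report_retrieval", "data_pivoting", "sql_generation"},
    UserTier.BUSINESS_MANAGER: {"report_retrieval", "data_pivoting"},
    UserTier.EXECUTIVE: set(),  # stays off until accuracy targets are met
}

def is_allowed(tier: UserTier, capability: str) -> bool:
    """Gate a GenBI capability based on the requester's rollout tier."""
    return capability in TIER_CAPABILITIES.get(tier, set())

assert is_allowed(UserTier.BI_EXPERT, "sql_generation")
assert not is_allowed(UserTier.EXECUTIVE, "report_retrieval")
```

The design point the case study makes is that tiers follow data expertise, not seniority, so the gate expands only as verification capacity and trust expand.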
Another critical trust-building mechanism was the architectural decision to initially retrieve existing certified reports and dashboards rather than generate SQL queries from scratch. This approach leverages assets that have already been fine-tuned and validated, essentially delivering the same information users would receive through traditional channels but much faster and more interactively. BI teams confirmed that approximately 80% of their work involves directing people to the right report and helping them use it. By focusing on this information retrieval problem first, Northwestern Mutual built inherent trust into the system architecture while deferring the harder SQL generation problem.

## Incremental Delivery Model and Risk Management

The most impressive aspect of this case study from an LLMOps perspective is the incremental delivery model designed specifically to manage risk and secure ongoing investment. Rather than requesting a large upfront budget for an uncertain research project, the team structured the work as a series of six-week sprints, each producing tangible business deliverables that could be independently productized.

Phase one focused on pure research—understanding how to translate natural language to SQL, generate responses, and interpret incoming questions. Phase two investigated what constitutes "good metadata and good context" for a BI agent, which differs significantly from RAG systems working with unstructured documents. Importantly, this phase produced immediate business value by informing a parallel semantic layer initiative and establishing principles for metadata that apply to human users as well as LLMs. The next phase delivered a multi-context semantic search capability for finding relevant data and data owners—a standalone product addressing a pain point that typically takes two to four weeks to resolve manually in their enterprise (a sketch of this retrieval pattern follows below). Subsequent phases added light data pivoting capabilities on top of retrieved reports, role-based access controls and enterprise governance features, and eventually the full SQL generation capability for more complex, multi-source queries.

This staged approach provided several critical benefits for LLMOps practitioners to note. It delivered value early and often, with each sprint producing something tangible rather than waiting for a complete end-to-end system. It provided transparent progress that leadership could evaluate continuously. It created a learning feedback loop where each phase informed the next. Most importantly, it controlled risk by eliminating sunk cost bias—at any point, leadership could stop funding, evaluate whether to adopt emerging third-party solutions like Databricks Genie, or pivot based on market changes. Even if they ultimately decided to adopt a vendor solution, they would have developed benchmarks, evaluation frameworks, and deep understanding of what good performance looks like, enabling them to ask tough questions and avoid "fluffy demos."
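The multi-context semantic search deliverable mentioned above can be pictured as embedding several metadata "contexts" per certified asset (title, description, column notes, owning team) and ranking them against a business question. Everything in this sketch is an assumption for illustration: the embedding model, the asset fields, and the example records are not Northwestern Mutual's actual catalog or stack.

```python
# Minimal sketch of multi-context semantic search over certified BI assets.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any text-embedding model would do

# Each certified asset is indexed under several "contexts" so that different
# phrasings of the same business question can still match it. Example data only.
assets = [
    {"name": "policy_lapse_dashboard", "owner": "bi-insurance@example.com",
     "contexts": ["Policy lapse rates by product line", "monthly lapse ratio and surrenders"]},
    {"name": "advisor_productivity_report", "owner": "bi-field@example.com",
     "contexts": ["Advisor productivity and new business", "premium per advisor, recruiting"]},
]

# Pre-compute one embedding per context string, remembering its parent asset.
corpus, asset_by_row = [], []
for asset in assets:
    for ctx in asset["contexts"]:
        corpus.append(ctx)
        asset_by_row.append(asset)
corpus_emb = model.encode(corpus, normalize_embeddings=True)

def find_assets(question: str, top_k: int = 3):
    """Return the certified assets (and their owners) most similar to the question."""
    q_emb = model.encode([question], normalize_embeddings=True)[0]
    scores = corpus_emb @ q_emb  # cosine similarity, since embeddings are normalized
    best_rows = np.argsort(scores)[::-1][:top_k]
    return [(asset_by_row[i]["name"], asset_by_row[i]["owner"], float(scores[i]))
            for i in best_rows]

print(find_assets("How many life policies lapsed last quarter?"))
```

Pointing users (or the agent) to the right asset and its owner is exactly the manual two-to-four-week pain point this phase targeted.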
## Technical Architecture: Multi-Agent System

The GenBI system employs a multi-agent architecture orchestrated through a central controller. When a business question arrives, it flows through several specialized agents, each of which can be independently productized. The metadata agent works with the data catalog and documentation to understand context and identify relevant information sources. The RAG agent searches through certified reports and dashboards to find existing assets that address the question. The SQL agent generates queries when no existing report suffices, or extends queries from reports that provide a starting point (functioning as a form of few-shot learning where the example is very close to the desired output). Finally, the BI agent synthesizes all this information into a business-appropriate answer rather than simply dumping raw data back to the user. This architecture includes conversation state management so that follow-up questions within the same session don't require re-executing the entire pipeline.

Critically, governance and trust mechanisms are baked into the architecture rather than bolted on afterward—something the speaker notes would be much harder to achieve with external solutions like ChatGPT or even third-party enterprise tools. The speaker explicitly addresses why they couldn't simply use ChatGPT: schemas in real enterprises are extremely messy and lack clear context, making it difficult for general-purpose models to understand meaning and relationships. Moreover, governance requirements—controlling who can access what data, ensuring answers comply with regulatory requirements, maintaining audit trails—are far easier to implement when you control the entire stack rather than working through external APIs.
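A minimal sketch of how such an orchestrator might wire the four agents together, including the conversation state that lets follow-ups skip re-execution. The agent and report interfaces here (enrich, search, run, summarize, answers, sql, data) are hypothetical placeholders chosen for the example, not the actual implementation.

```python
# Illustrative orchestration loop for the four agents described above.
from dataclasses import dataclass, field

@dataclass
class Session:
    """Conversation state so follow-ups can reuse earlier retrievals."""
    history: list = field(default_factory=list)
    retrieved_reports: list = field(default_factory=list)

class GenBIOrchestrator:
    def __init__(self, metadata_agent, rag_agent, sql_agent, bi_agent):
        self.metadata_agent = metadata_agent
        self.rag_agent = rag_agent
        self.sql_agent = sql_agent
        self.bi_agent = bi_agent

    def answer(self, question: str, session: Session, user) -> str:
        # 1. Metadata agent: resolve business terms against the catalog and
        #    identify candidate information sources.
        context = self.metadata_agent.enrich(question)

        # 2. RAG agent: look for a certified report that already answers the
        #    question, enforcing the user's access rights before returning anything.
        reports = self.rag_agent.search(question, context, user=user)
        session.retrieved_reports.extend(reports)

        if reports and reports[0].answers(question):
            result = reports[0].data()
        else:
            # 3. SQL agent: generate or extend a query, using the closest
            #    report's SQL as a near-target few-shot example when available.
            example_sql = reports[0].sql if reports else None
            result = self.sql_agent.run(question, context, example=example_sql, user=user)

        # 4. BI agent: turn raw rows into a business-appropriate narrative answer.
        answer = self.bi_agent.summarize(question, result, history=session.history)
        session.history.append((question, answer))
        return answer
```

The design choice mirrored here is that the certified-report path runs first and SQL generation is only attempted when no existing asset suffices, which is how the case study describes deferring the harder problem while keeping governance checks inside the controlled stack.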
## Measurable Business Impact

The case study provides specific quantitative results, which is relatively rare in GenAI presentations and particularly valuable for LLMOps practitioners building business cases. The RAG agent alone automated approximately 80% of the 20% of BI team capacity devoted to finding and sharing reports (roughly 16% of total team capacity)—effectively eliminating two full-time positions' worth of work on a 10-person team. While the presentation doesn't claim these positions were eliminated (and likely they were redeployed to higher-value work), this represents concrete capacity recovery.

The metadata research phase enabled A/B testing that quantitatively proved the value of metadata enrichment. By running the same battery of questions against databases with good versus poor metadata, they demonstrated measurable improvements in LLM performance. This is significant because metadata enrichment is often seen as "fluffy" work that's hard to justify—here they created a compelling business case that secured executive buy-in for a rigorous catalog enrichment initiative. The data pivoting capabilities are still experimental, allowing users to change time horizons, views, segmentations, and groupings in retrieved reports without requiring human intervention. This addresses another major BI team bottleneck.

## Evaluation and Testing Approach

While not extensively detailed in the presentation, the case study reveals several evaluation practices worth noting. Working with subject matter experts who handle data daily provided "a lot of real-life examples of what people are actually asking" and "what people have answered to them"—essentially building a high-quality evaluation dataset from actual user interactions. This grounds evaluation in realistic use cases rather than synthetic benchmarks. The A/B testing comparing model performance with good versus poor metadata demonstrates rigorous experimental methodology. The fact that each phase required demonstrating tangible business value before proceeding suggests ongoing evaluation against business metrics, not just technical metrics like accuracy or F1 scores. The decision to benchmark their internal system creates valuable assets even if they ultimately adopt external solutions—they'll know how to evaluate vendor claims and where to probe for weaknesses. This is sophisticated LLMOps thinking that recognizes research investment value beyond immediate production deployment.
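The metadata A/B test described above boils down to running the same SME-sourced question battery against two configurations of the agent, one pointed at an enriched catalog and one at a sparse catalog, and comparing scores. A hedged sketch under stated assumptions: the agent interface (an answer(question) method), the simplistic exact-match scorer, and the example evaluation pairs are all placeholders, not the team's actual harness.

```python
# Sketch of the metadata A/B test: run the same question battery against the
# agent configured with enriched vs. sparse catalog metadata and compare accuracy.

def exact_match(predicted: str, expected: str) -> bool:
    """Naive scorer; in practice SQL results or SME judgments would be compared."""
    return predicted.strip().lower() == expected.strip().lower()

def run_battery(agent, eval_set) -> float:
    """Return accuracy of one agent configuration over the SME-sourced eval set."""
    correct = sum(exact_match(agent.answer(q), gold) for q, gold in eval_set)
    return correct / len(eval_set)

def compare_metadata_variants(agent_rich, agent_sparse, eval_set):
    """A/B comparison of the same questions with good vs. poor metadata."""
    rich = run_battery(agent_rich, eval_set)
    sparse = run_battery(agent_sparse, eval_set)
    print(f"enriched metadata: {rich:.1%}  sparse metadata: {sparse:.1%}  lift: {rich - sparse:+.1%}")
    return rich, sparse

# eval_set would hold (question, expected answer) pairs collected from BI subject
# matter experts, e.g. [("What was Q2 premium revenue?", "4.2B"), ...] (hypothetical).
```

Quantifying the lift from enrichment is what turned metadata work from "fluffy" into a fundable initiative, so the harness doubles as the business case.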
## Challenges, Limitations, and Honest Assessment

The presentation stands out for its honesty about limitations and challenges. The speaker acknowledges they don't know when (or if) the system will be accurate enough for executive-level deployment. He notes that even with all this work, a fully-fledged SQL generation capability is "still some ways to go ahead." He frames the crawl-walk-run approach partially as acknowledging current system limitations rather than purely as a rollout strategy.

The governance challenges in a risk-averse financial services company are acknowledged but not deeply explored—this is clearly an ongoing concern. The speaker notes that while they've built governance into their architecture, it remains "super important" and a differentiator versus external solutions. There's also candid discussion about the possibility that they might never build a complete end-to-end GenBI agent themselves, potentially adopting solutions like Databricks Genie instead. However, even in that scenario, the research investment pays off through deep understanding, benchmarks, and the ability to critically evaluate vendor solutions.

## Strategic Insights for LLMOps Practitioners

Several strategic insights emerge that are broadly applicable to enterprise LLMOps initiatives:

- The incremental delivery model with productizable outputs at each stage is a powerful pattern for managing uncertain research projects in risk-averse organizations. It transforms "give us a million dollars for pie-in-the-sky research" into "fund this six-week sprint with a guaranteed concrete deliverable."
- The crawl-walk-run user rollout based on data expertise rather than organizational hierarchy shows sophisticated thinking about trust, verification, and feedback quality. Starting with users who can validate outputs and provide useful feedback accelerates learning and builds credibility.
- Solving the easier retrieval problem before tackling SQL generation demonstrates architectural pragmatism—get something working and trusted before attempting the hardest parts. The insight that existing reports can serve as few-shot examples for SQL generation is clever, reducing the problem complexity significantly.
- Embedding governance and trust mechanisms into the architecture from the beginning rather than treating them as afterthoughts reflects mature LLMOps thinking appropriate for regulated industries.
- The recognition that research value extends beyond production deployment—creating benchmarks, evaluation frameworks, and deep understanding that inform build-versus-buy decisions—shows strategic sophistication often missing in GenAI initiatives.

## Future Considerations and Broader Implications

The presentation concludes with reflections on the broader future of GenAI in enterprise contexts. The speaker identifies data preparation as a huge emerging market area, along with task-specific models and applications. The co-pilot paradigm of meeting users where they are (rather than forcing them to new interfaces) aligns with their own crawl-walk-run approach. Interestingly, the speaker raises a thought-provoking question about SaaS pricing in the GenAI era: when individual workers become 10x more productive, should software be priced by seats, usage, or value delivered? He notes Salesforce is already experimenting with usage-based pricing for their Data Cloud product. This economic question has significant implications for LLMOps practitioners thinking about how to measure and capture the value of AI-powered tools.

The case study ultimately demonstrates that successful enterprise LLMOps in conservative, regulated industries requires more than technical excellence—it demands sophisticated risk management, incremental value delivery, deep user understanding, and architectural decisions that balance innovation with governance and trust. Northwestern Mutual's approach provides a valuable template for organizations facing similar challenges in deploying GenAI for business-critical applications.
