Northwestern Mutual implemented a GenAI-powered developer support system to address challenges with their internal developer support chat system, which suffered from long response times and repetitive basic queries. Using Amazon Bedrock Agents, they developed a multi-agent system that could automatically handle common developer support requests, documentation queries, and user management tasks. The system went from pilot to production in just three months and successfully reduced support engineer workload while maintaining strict compliance with internal security and risk management requirements.
This case study comes from a presentation at AWS re:Invent featuring Northwestern Mutual, a financial services company providing financial planning, wealth management, and insurance products (life and long-term care). The presentation was delivered by Heiko Zuerker, a principal architect at the company, alongside AWS team members Michael Liu (product lead for Bedrock Agents) and Mark Roy (principal architect with Amazon Bedrock). The session focused on multi-agent collaboration patterns and included Northwestern Mutual as a real-world customer example of deploying agents in production within a regulated industry.
Northwestern Mutual’s case study demonstrates a practical implementation of LLM-powered agents for internal developer support. When generative AI became widely available, the company immediately recognized its potential and fully embraced the technology, taking multiple use cases from experimentation into production. The specific use case discussed involves automating internal developer support that was previously handled through chat interfaces.
The existing internal developer support system suffered from long response times and a steady stream of repetitive basic queries, pain points that made it a suitable candidate for AI automation:
The company also had ambitious goals beyond this initial use case—they wanted to create a foundation for more complex business-facing applications, which presented additional challenges given their status as a regulated financial services company.
Northwestern Mutual chose to build on Amazon Bedrock rather than creating their own solution from scratch. As a coworker quoted in the presentation put it, there are now many excellent tools and libraries for building generative AI applications, whereas a year earlier there had been no choice but to build custom solutions; with managed services like Bedrock, simple RAG use cases can be spun up in minutes rather than hours or days. The company also benefited from its existing AWS infrastructure expertise, along with fine-grained access controls, built-in encryption, and HIPAA compliance—critical for a financial services company.
The solution follows a serverless, managed-services-first architecture with the explicit design goal of having no infrastructure to manage. The key components include:
Message Queuing and State Management:
Orchestration Layer: The orchestration layer, written in Python and running on Lambda, handles the complex flow of processing incoming chat messages: picking a message off the queue, invoking the appropriate agent, evaluating the response, and delivering the reply.
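Northwestern Mutual's actual orchestration code was not shown, so the following is only a minimal sketch of the serverless pattern described: a Lambda handler consuming chat messages from an SQS queue, routing each one, and returning replies. The `route_message` helper and the message fields (`text`, `channel`) are illustrative assumptions.

```python
import json

def route_message(text: str) -> str:
    """Placeholder for the routing/agent-invocation step (assumed name)."""
    return f"(agent reply to: {text})"

def handler(event, context):
    """AWS Lambda entry point for SQS-triggered invocations."""
    replies = []
    for record in event.get("Records", []):
        body = json.loads(record["body"])      # one chat message per record
        reply = route_message(body["text"])    # orchestration happens here
        replies.append({"channel": body["channel"], "reply": reply})
    # In production the replies would be posted back to the chat system;
    # returning them here keeps the sketch self-contained.
    return {"batchItemFailures": [], "replies": replies}
```

Returning `batchItemFailures` follows the standard SQS partial-batch-response convention, so a single bad message does not force the whole batch to be retried.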
Multi-Agent Architecture: The company implemented five distinct agents, each with a specific area of responsibility, covering tasks such as common developer support requests, documentation queries, user management, and response evaluation.
This multi-agent approach was adopted because, as Mark Roy emphasized in his presentation, keeping agents small and focused is a best practice. When too many actions are crammed into a single agent, the LLM becomes confused, prompts get longer, and performance degrades. The team found that spreading responsibilities across multiple specialized agents improved accuracy and reliability.
Guardrails and Security: The solution relies heavily on Amazon Bedrock Guardrails for protection, including:
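Bedrock exposes guardrails both inline on model calls and through the standalone `ApplyGuardrail` API, which lets a pipeline screen text independently of inference. The sketch below is an assumption about how such screening could be wired up, not Northwestern Mutual's code; the guardrail ID is hypothetical.

```python
def guardrail_blocked(action: str) -> bool:
    """Interpret the `action` field returned by ApplyGuardrail."""
    return action == "GUARDRAIL_INTERVENED"

def screen_output(text: str, guardrail_id: str, version: str = "DRAFT") -> bool:
    """Return True if the text passes the guardrail, False if it was blocked."""
    import boto3  # deferred so the helper above runs without AWS access
    client = boto3.client("bedrock-runtime")
    response = client.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=version,
        source="OUTPUT",                       # screening a model response
        content=[{"text": {"text": text}}],
    )
    return not guardrail_blocked(response["action"])
```

Setting `source="INPUT"` instead would screen user messages before they ever reach a model, so the same guardrail definition can protect both directions of the conversation.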
Cross-Region Inference: The team switched to cross-region inference to improve stability and performance. This feature allows the solution to leverage multiple AWS regions with minimal code changes (just updating the inference profile ID), adding significant reliability to the production system.
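The "minimal code change" here is literal: a cross-region inference profile ID is the regular model ID with a geography prefix (for example `us.`), so switching is a one-line swap. The sketch below shows this with the Converse API; the specific Claude model ID is an illustrative choice, not one confirmed by the presentation.

```python
SINGLE_REGION_MODEL = "anthropic.claude-3-5-sonnet-20240620-v1:0"
CROSS_REGION_PROFILE = "us." + SINGLE_REGION_MODEL  # US-geography inference profile

def build_request(prompt: str, cross_region: bool = True) -> dict:
    """Build Converse API kwargs; only the model identifier changes."""
    return {
        "modelId": CROSS_REGION_PROFILE if cross_region else SINGLE_REGION_MODEL,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
    }

def converse(prompt: str) -> str:
    import boto3  # deferred so build_request stays testable offline
    client = boto3.client("bedrock-runtime")
    response = client.converse(**build_request(prompt))
    return response["output"]["message"]["content"][0]["text"]
```

With the profile ID in place, Bedrock routes each request to whichever region in the geography has capacity, which is where the stability gain comes from.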
One of the most interesting aspects of this case study is how Northwestern Mutual addressed their strict internal rules about AI taking direct actions. In a regulated financial services environment, AI was not allowed to take any direct action on behalf of users—a significant constraint when building agents whose purpose is to automate tasks.
The solution they developed with their risk partners involves explicit confirmation before any action is executed. When an agent determines it needs to perform an action (like unlocking a user account), it responds with the exact action it plans to take and asks for explicit confirmation with a “yes” or “no” response. The team specifically limited acceptance to only “yes” or “no” rather than accepting variations like “yeah” or “sure” to eliminate ambiguity. This approach satisfied the compliance requirements while still enabling automation.
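The strict "yes"/"no" gate described above can be sketched as a small pure function; the function names and fallback messages are illustrative, but the core rule (only the exact strings "yes" and "no" count, everything else is re-prompted) is taken directly from the presentation.

```python
from typing import Callable, Optional

def parse_confirmation(reply: str) -> Optional[bool]:
    """Accept exactly 'yes' or 'no' (case-insensitive); all else is ambiguous."""
    normalized = reply.strip().lower()
    if normalized == "yes":
        return True
    if normalized == "no":
        return False
    return None  # "yeah", "sure", etc. are deliberately rejected

def confirm_then_execute(reply: str, action: Callable[[], str]) -> str:
    decision = parse_confirmation(reply)
    if decision is True:
        return action()                    # explicit approval: act
    if decision is False:
        return "Okay, I won't do that."
    return 'Please answer exactly "yes" or "no".'
```

Keeping the parser this strict trades a little user friction for zero ambiguity about whether consent was given, which is what the risk partners required.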
The evaluator agent was described as “a really good addition” to the solution. Because LLMs can be overly helpful (sometimes to a fault), hallucinate, or simply provide unhelpful answers, having a dedicated agent to evaluate responses before they reach users significantly improved the user experience. The presentation showed examples of both positive and negative evaluations where inappropriate responses were filtered out.
The team explicitly stated they “over-index on filtering out messages”—meaning they prefer a lower response rate from the bot if it means maintaining quality. As Heiko noted, once you lose users’ trust, it’s very hard to regain. A good user experience takes priority over maximizing automation coverage.
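The "over-index on filtering" policy amounts to a fail-closed gate between the evaluator agent and the user. The verdict labels and fallback text below are assumptions for illustration; the design point is that anything other than an unambiguous pass suppresses the draft.

```python
FALLBACK = (
    "I'm not confident I can answer that correctly. "
    "A support engineer will follow up."
)

def gate_response(draft: str, verdict: str) -> str:
    """Release the draft only on an unambiguous PASS verdict.

    Fail-closed: FAIL, UNSURE, or malformed evaluator output all
    suppress the draft rather than risk a bad answer.
    """
    if verdict.strip().upper() == "PASS":
        return draft
    return FALLBACK
```

Note that an empty or garbled verdict falls through to the fallback too, so an evaluator outage degrades to "answer fewer questions", never to "answer unchecked".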
The team started work in earnest in June and reached production by September, roughly three months from serious development effort to deployment. The results included a meaningful reduction in support engineer workload, achieved while meeting the company’s internal security and risk management requirements.
The presentation included several practical lessons learned from the production deployment, among them keeping agents small and focused, over-indexing on filtering to protect user trust, and adopting cross-region inference for stability.
The team expressed enthusiasm about Amazon Bedrock’s new multi-agent collaboration feature (launched at re:Invent 2024). Their current orchestration layer is custom-built in Python running on Lambda, but they plan to migrate to Bedrock’s native multi-agent collaboration capabilities. This migration is expected to simplify their codebase significantly by removing the need for custom orchestration logic.
They also mentioned “virtual agents”—special cases such as escalations and message ignoring that are currently handled in prompts and code rather than as Bedrock agents—which they plan to fold into Bedrock Agents as part of this migration.
The presentation also included demos from Mark Roy showing two primary patterns for multi-agent systems: (1) unified customer experience using router-based supervisors for intent classification and routing, and (2) automating complex processes using supervisor agents that plan and execute multi-step workflows across collaborators. While these demos were AWS examples (Octank Mortgage Assistant and Startup Advisor), they illustrated the patterns that Northwestern Mutual and similar organizations can leverage for their own implementations.
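Pattern (1), the router-based supervisor, can be illustrated with a toy intent classifier. In a real system the supervisor would be an LLM (or a Bedrock supervisor agent) doing the classification; the keyword table below is a self-contained stand-in, and the intent names are hypothetical.

```python
# Hypothetical keyword-based intent router illustrating the supervisor
# pattern: classify the message, then forward to one collaborator agent.
INTENT_KEYWORDS = {
    "documentation": ("docs", "documentation", "how do i", "guide"),
    "user_management": ("unlock", "account", "password", "access"),
}

def classify_intent(message: str) -> str:
    text = message.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return intent
    return "general_support"           # default collaborator

def route(message: str) -> str:
    """Supervisor step: pick the collaborator agent for this message."""
    return classify_intent(message)
```

The value of the pattern is that each collaborator only ever sees messages in its own domain, which keeps individual agents small and focused per the best practice above.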
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Toyota Motor North America (TMNA) and Toyota Connected built a generative AI platform to help dealership sales staff and customers access accurate vehicle information in real-time. The problem was that customers often arrived at dealerships highly informed from internet research, while sales staff lacked quick access to detailed vehicle specifications, trim options, and pricing. The solution evolved from a custom RAG-based system (v1) using Amazon Bedrock, SageMaker, and OpenSearch to retrieve information from official Toyota data sources, to a planned agentic platform (v2) using Amazon Bedrock AgentCore with Strands agents and MCP servers. The v1 system achieved over 7,000 interactions per month across Toyota's dealer network, with citation-backed responses and legal compliance built in, while v2 aims to enable more dynamic actions like checking local vehicle availability.
Two organizations operating in highly regulated industries—Sicoob, a Brazilian cooperative financial institution, and Holland Casino, a government-mandated Dutch gaming operator—share their approaches to deploying generative AI workloads while maintaining strict compliance requirements. Sicoob built a scalable infrastructure using Amazon EKS with GPU instances, leveraging open-source tools like Karpenter, KEDA, vLLM, and Open WebUI to run multiple open-source LLMs (Llama, Mistral, DeepSeek, Granite) for code generation, robotic process automation, investment advisory, and document interaction use cases, achieving cost efficiency through spot instances and auto-scaling. Holland Casino took a different path, using Anthropic's Claude models via Amazon Bedrock and developing lightweight AI agents with the Strands framework, later deploying them through Amazon Bedrock AgentCore to give management stakeholders self-service access to cost, security, and operational insights. Both organizations emphasized security, governance, compliance frameworks (including ISO 42001 for AI), and responsible AI practices, demonstrating that regulatory requirements need not inhibit AI adoption when proper architectural patterns and AWS services are employed.