Multiple banks, including Discover Financial Services, Scotia Bank, and others, share their experiences implementing generative AI in production. The case study focuses particularly on Discover's implementation of gen AI for customer service, where they achieved a 70% reduction in agent search time by using RAG and summarization for procedure documentation. The implementation included careful consideration of risk management, regulatory compliance, and human-in-the-loop validation, with technical writers and agents providing continuous feedback for model improvement.
This case study is drawn from a panel discussion at a Google Cloud event featuring executives from four major financial institutions: Discover Financial Services, Macquarie Bank (Australia), Scotia Bank (Canada), and Intesa Sanpaolo (Italy). The discussion covers their respective journeys in cloud transformation, data platform development, and AI/generative AI implementation. While each bank shares insights, the most detailed LLMOps case study comes from Discover Financial Services, which describes a production deployment of generative AI to assist customer service agents.
Discover identified a significant challenge in their customer service operations. As digital banking has evolved, customers increasingly self-serve for simple tasks like checking balances or payment due dates through mobile apps and web interfaces. This has fundamentally changed the nature of calls that reach human agents—only the difficult, complex questions now come through to the contact center.
This shift has made the job of customer service agents significantly more difficult. When handling complex inquiries, agents must navigate through extensive procedure documentation to find the correct steps and information. The traditional search process was measured in minutes, often requiring agents to put customers on hold with phrases like “can I put you on hold for a minute”—a poor customer experience that conflicts with Discover’s brand promise of award-winning customer service.
The core issue was that agents didn’t want to find documents; they wanted to find solutions. Traditional search returned multiple multi-page documents that agents had to manually parse through, making it difficult to quickly locate the relevant information.
Discover partnered with Google Cloud to implement a generative AI solution using Vertex AI and Gemini. The architecture leverages Retrieval Augmented Generation (RAG) to ground the AI’s responses in Discover’s actual procedure documentation, pairing retrieval over the procedure corpus with model-generated summaries so agents receive answers rather than raw documents.
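The retrieve-then-summarize pattern described above can be sketched in a few lines. This is a minimal, self-contained illustration, not Discover's implementation: a keyword-overlap retriever stands in for Vertex AI vector search, and the `summarize` stub stands in for a Gemini call that would be prompted with the query plus the retrieved passages.

```python
# Minimal RAG sketch: retrieve relevant procedure passages, then ground
# the generated answer in them. All components are illustrative stand-ins.

def tokenize(text):
    return set(text.lower().split())

def retrieve(query, documents, k=2):
    """Rank procedure documents by keyword overlap with the query
    (a real system would use embedding-based vector search)."""
    q = tokenize(query)
    scored = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return scored[:k]

def summarize(query, passages):
    """Stub for the generation step: a production system would prompt an
    LLM with the query and passages and return a grounded summary."""
    context = " ".join(passages)
    return f"Answer to '{query}' grounded in: {context[:80]}"

docs = [
    "To process a disputed charge, verify the transaction date and open a dispute case.",
    "Balance inquiries can be answered from the account summary screen.",
    "Card replacement requires identity verification and a mailing address check.",
]
top = retrieve("how do I handle a disputed charge", docs)
print(summarize("how do I handle a disputed charge", top))
```

Grounding the generation step in retrieved passages is what lets the system answer from Discover's own procedures rather than from the model's pretrained knowledge.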
Before any implementation work began, Discover established what they call a “Generative AI Council.” This organizational framework was crucial for a regulated financial services company and includes participants from legal, compliance, security, and business teams as well as technology.
The council serves multiple purposes. It establishes organizational comfort with AI boundaries, creates intake processes for ideas and experimentation, and builds a risk framework that helps prioritize use cases by risk level. One key policy example: employees cannot use open/public AI tools freely, but the organization can build enterprise capabilities with proper security controls.
The risk scaling approach allows Discover to start with lower-risk use cases and progressively tackle more complex scenarios. This methodical approach is particularly important in regulated industries where customer data and financial decisions are involved.
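The risk-tiered intake process described above can be sketched as a simple prioritization queue. The scoring fields and weights below are illustrative assumptions, not Discover's actual framework; the point is the ordering: lowest-risk use cases ship first so the organization builds confidence before tackling higher-risk deployments.

```python
# Hypothetical sketch of a risk-tiered use-case intake queue, in the
# spirit of the Generative AI Council's prioritization. Weights and
# flags are invented for illustration.

RISK_WEIGHTS = {"customer_facing": 3, "uses_pii": 3, "automated_decision": 4}

def risk_score(use_case):
    """Sum the weights of every risk flag the use case triggers."""
    return sum(w for flag, w in RISK_WEIGHTS.items() if use_case.get(flag))

def prioritize(use_cases):
    """Order the intake queue lowest-risk first."""
    return sorted(use_cases, key=risk_score)

queue = prioritize([
    {"name": "agent-facing search", "customer_facing": False, "uses_pii": True},
    {"name": "doc rewriting aid", "customer_facing": False, "uses_pii": False},
    {"name": "customer chatbot", "customer_facing": True, "uses_pii": True},
])
print([u["name"] for u in queue])
# → ['doc rewriting aid', 'agent-facing search', 'customer chatbot']
```

Note that the ordering mirrors Discover's actual sequence: the internal document-rewriting aid came before the agent-facing system, with fully customer-facing AI deferred.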
What makes this case study particularly interesting from an LLMOps perspective is the acknowledgment that the traditional engineering work was the easy part:
Engineering Phase: The actual coding and platform integration was “measured in weeks”—a very quick process enabled by the pre-built integrations in Vertex AI.
Model Tuning Phase: This became the heavy lifting. Tuning the models to accurately summarize information relevant to Discover’s specific procedures required extensive experimentation and iteration. The work proceeds skill group by skill group, taking procedures that are similar and tuning the system for that domain before moving to the next.
Key Personnel: The most important people in the tuning process are not engineers but “knowledge workers”—expert agents and the technical writers who create the procedure documents themselves. These domain experts run iterative cycles to evaluate and improve the quality of AI-generated answers.
The production system incorporates continuous human evaluation, with expert agents and technical writers reviewing, rating, and correcting AI-generated answers.
This feedback loop creates a continuous improvement cycle where the system gets progressively better at handling the specific types of questions that agents face.
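The feedback loop can be sketched as a rating-and-review queue. This is an assumed shape, not Discover's actual tooling: the rating scale, threshold, and field names are all illustrative. The essential mechanic is that low-rated answers are routed back to the people who own the source procedures.

```python
# Sketch of a human-in-the-loop feedback cycle: agents rate each AI
# answer, and low-rated answers are queued for technical-writer review.
# All field names and the 1-5 scale are illustrative assumptions.

from collections import deque

review_queue = deque()

def record_feedback(question, answer, rating, threshold=3):
    """Store an agent's rating; route anything below the threshold to
    the technical writers who maintain the underlying procedure."""
    item = {"question": question, "answer": answer, "rating": rating}
    if rating < threshold:
        review_queue.append(item)
    return item

record_feedback("dispute steps?", "Open a dispute case and verify dates.", rating=5)
record_feedback("card replacement?", "Unclear partial summary", rating=2)
print(len(review_queue))  # number of answers awaiting writer review
```

Because the writers can fix either the prompt/tuning or the source document itself, each pass through the queue improves future answers for the whole skill group.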
The primary metric Discover tracked was time to information, and the results were significant: agent search time dropped by roughly 70%, from a process previously measured in minutes.
Beyond the quantitative results, the solution provides qualitatively better search results—agents not only find information faster, but the RAG-based approach surfaces more relevant results than the historical keyword-based search.
Interestingly, Discover started with an even simpler use case before deploying the agent-facing system. They used generative AI to help technical writers rewrite and improve the procedure documents themselves, an approach that maintained full human oversight because the writers remained the final authors of every document.
This layered approach allowed Discover to build organizational confidence with AI in a lower-risk context before moving to the agent-facing production deployment.
Scotia Bank shared an important lesson about data quality and AI adoption. They deployed a generative AI system on top of their knowledge base for contact center agents. Initial answer quality was around 33%, but improved to approximately 96% after the contact center team took ownership of data quality.
The key insight was that the improvement came not from better AI, but from better data: the contact center team took ownership of cleaning and curating the knowledge base content that the system retrieved from.
This work was done organically by the business team, motivated by the visible connection between data quality and AI output quality. The speaker noted this would traditionally have been a “$10 million, 12-month consulting engagement” but was accomplished in months by motivated internal teams.
Scotia Bank also emphasized the productivity gains for data scientists when data is properly organized and accessible in the cloud, reducing the traditional 80% time spent on data wrangling.
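An answer-quality figure like Scotia Bank's 33% to 96% can be tracked with a small evaluation harness: score the system's answers against a reviewed evaluation set and report the pass rate. The grading function below is a naive substring check standing in for human review or an LLM judge; the evaluation set and answers are invented for illustration.

```python
# Sketch of tracking an answer-quality rate over a labeled evaluation
# set. `grade` is a toy stand-in for human or LLM-judge grading.

def grade(answer, expected):
    """Toy correctness check: does the answer mention the expected fact?"""
    return expected.lower() in answer.lower()

def answer_quality(eval_set, answer_fn):
    """Fraction of evaluation questions the system answers acceptably."""
    correct = sum(grade(answer_fn(q), expected) for q, expected in eval_set)
    return correct / len(eval_set)

eval_set = [
    ("When is my payment due?", "due date"),
    ("How do I dispute a charge?", "dispute case"),
]
rate = answer_quality(eval_set, lambda q: "Open a dispute case by the due date.")
print(f"{rate:.0%}")
```

Re-running the same harness before and after a knowledge-base cleanup is what makes the data-quality effect visible, and measurable, to the business team doing the work.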
Macquarie Bank (96.4% of workloads on public cloud) discussed its customer-facing AI implementations.
They emphasized keeping “human in the loop” for production AI systems to manage risk appropriately.
The Italian bank discussed their structured AI program targeting €500 million in bottom-line benefits by 2025, with use cases split across cost reduction (25%), risk reduction (25%), and revenue increase (50%). They established four governing principles that every AI use case must satisfy.
They are proceeding cautiously with generative AI specifically, ensuring they understand the full implications for skills, processes, and operations before scaling.
Several common themes emerged across all participants that are relevant to LLMOps practitioners:
Data Foundation is Critical: Every speaker emphasized that AI success depends on having clean, organized, accessible data. Cloud data platforms enable both the storage and the velocity needed for AI workloads.
Human in the Loop is Non-Negotiable: In regulated financial services, all participants stressed the importance of human oversight, whether in model evaluation, output review, or decision approval.
Cross-Functional Governance: AI initiatives require buy-in and participation from legal, compliance, security, and business teams—not just technology.
Start Small and Scale: The successful approaches all involved starting with lower-risk use cases, proving value, and progressively tackling more complex scenarios.
Domain Experts Drive Quality: The most impactful contributors to AI quality are often not engineers but domain experts who understand the business context and can evaluate whether outputs make sense.
Data Quality Ownership Shifts: Generative AI has an unexpected benefit of making data quality issues visible and motivating business teams to own and improve their data.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variation across frontier models, ranging from single-digit to roughly 80% accuracy, with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models, which hallucinated non-existent insurance products 15-45% of the time.
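Aggregating per-conversation error labels into rates like those the Snorkel benchmark reports (e.g., tool-use failures in 36% of conversations) is straightforward to sketch. The label names and data below are illustrative, not the benchmark's actual schema.

```python
# Sketch of tallying error modes across evaluated conversations.
# Labels and data are invented for illustration.

from collections import Counter

conversations = [
    {"errors": ["tool_use_failure"]},
    {"errors": []},
    {"errors": ["hallucinated_product", "tool_use_failure"]},
]

def error_rates(convos):
    """Fraction of conversations exhibiting each error mode at least once."""
    counts = Counter(e for c in convos for e in set(c["errors"]))
    n = len(convos)
    return {mode: count / n for mode, count in counts.items()}

print(error_rates(conversations))
```

Counting each mode at most once per conversation (via `set`) is what makes the result a per-conversation rate rather than a raw error count.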
Prudential Financial, in partnership with the AWS GenAI Innovation Center, built a scalable multi-agent platform to support 100,000+ financial advisors across insurance and financial services. The system addresses fragmented workflows in which advisors previously had to navigate dozens of disconnected IT systems for client engagement, underwriting, product information, and servicing. The solution features an orchestration agent that routes requests to specialized sub-agents (quick quote, forms, product, illustration, book of business) while maintaining context and enforcing governance. The platform-based microservices architecture reduced time-to-value for new agent deployments from 6-8 weeks to 3-4 weeks and enabled cross-business reusability. It also provided standardized frameworks for authentication, LLM gateway access, knowledge management, and observability while handling the complexity of scaling multi-agent systems in a regulated financial services environment.
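The orchestration pattern described above, one routing agent dispatching to specialized sub-agents, can be sketched as follows. The intent classifier here is keyword-based purely for illustration; a production router like Prudential's would use an LLM with maintained conversation context and governance checks, and the agent names are taken from the summary above while the routing logic is an assumption.

```python
# Sketch of an orchestration agent routing requests to specialized
# sub-agents. Keyword routing stands in for LLM-based intent detection.

SUB_AGENTS = {
    "quote": lambda req: f"quick-quote agent handling: {req}",
    "forms": lambda req: f"forms agent handling: {req}",
    "product": lambda req: f"product agent handling: {req}",
}

KEYWORDS = {"quote": "quote", "form": "forms", "product": "product"}

def route(request):
    """Dispatch to the first matching sub-agent; the orchestrator
    handles anything no specialist claims."""
    text = request.lower()
    for keyword, agent_name in KEYWORDS.items():
        if keyword in text:
            return SUB_AGENTS[agent_name](request)
    return f"orchestrator handling directly: {request}"

print(route("I need a quote for term life"))
```

Keeping the sub-agents behind a single dispatch interface is what enables the cross-business reusability the platform reports: a new specialist is one more entry in the registry, not a new integration.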
OpenAI's Forward Deployed Engineering (FDE) team, led by Colin Jarvis, embeds with enterprise customers to solve high-value problems using LLMs and deliver production-grade AI applications. The team focuses on problems worth tens of millions to billions of dollars in value, working with companies across industries including finance (Morgan Stanley), manufacturing (semiconductors, automotive), telecommunications (T-Mobile, Klarna), and others. By deeply understanding customer domains, building evaluation frameworks, implementing guardrails, and iterating with users over months, the FDE team achieves 20-50% efficiency improvements and high adoption rates (98% at Morgan Stanley). The approach emphasizes solving hard, novel problems from zero to one, extracting the learnings into reusable products and frameworks (such as Swarm and Agent Kit), and then scaling those solutions across the market while keeping the strategic focus on product development rather than services revenue.