ZenML

Enterprise-Scale Cloud Event Management with Generative AI for Operational Intelligence

Fidelity Investments 2025

Fidelity Investments faced the challenge of managing massive volumes of AWS health events and support case data across 2,000+ AWS accounts and 5 million resources in their multi-cloud environment. They built CENTS (Cloud Event Notification Transport Service), an event-driven data pipeline that ingests, enriches, routes, and acts on AWS health and support data at scale. Building upon this foundation, they developed and published the MAKI (Machine Augmented Key Insights) framework using Amazon Bedrock, which applies generative AI to analyze support cases and health events, identify trends, provide remediation guidance, and enable agentic workflows for vulnerability detection and automated code fixes. The solution reduced operational costs by 57%, improved stakeholder engagement through targeted notifications, and enabled proactive incident prevention by correlating patterns across their infrastructure.

Industry

Finance

Overview

Fidelity Investments, a major financial services institution with roots dating back to 1946, has undergone significant cloud transformation. By 2019, they had begun accelerating their cloud journey, and as of the presentation, approximately 80% of their 8,500+ applications run in the cloud across multiple cloud providers. This massive scale includes nearly 2,000 AWS accounts and approximately 5 million resources, generating over 150,000 AWS health events in just the last two years. The case study, presented by Rahul Singler (Senior Enterprise Support Manager), Jason Casamiro (VP of Cloud Engineering at Fidelity), and Joe from AWS, demonstrates how Fidelity built a sophisticated operational intelligence platform and then layered generative AI capabilities on top to evolve from reactive monitoring to proactive, AI-augmented operations.

The fundamental challenge they addressed was the growing complexity of enterprise cloud environments at scale. In financial services, every minute of downtime can mean millions of dollars in lost revenue and damaged customer trust. Research shows that organizations with comprehensive data monitoring detect incidents 3.5 times faster than those without. The key insight driving their approach was that different data domains—AWS health events (service issues and scheduled maintenance) and support cases (recurring problems and resolutions)—when combined together, build a holistic picture of operational health. Correlating a spike in support cases with a health event, or connecting a configuration change to performance degradation, enables true operational intelligence and what they call “holistic technical hygiene.”
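The correlation described here — a spike in support cases clustered around a health event — can be sketched as a simple windowed count. Everything below (the function names, the 6-hour window, the 3x threshold) is a hypothetical illustration of the idea, not Fidelity's implementation:

```python
from datetime import datetime, timedelta

def cases_near_event(event_time, case_times, window_hours=6):
    """Count support cases opened within +/- window_hours of a health event."""
    window = timedelta(hours=window_hours)
    return sum(1 for t in case_times if abs(t - event_time) <= window)

def spike_detected(event_time, case_times, baseline, factor=3.0, window_hours=6):
    """Flag a spike when the case count around the event exceeds
    `factor` times the typical count for a window of that length."""
    observed = cases_near_event(event_time, case_times, window_hours)
    return observed, observed > factor * baseline
```

In practice the baseline would come from historical case volumes per service and account, but even this crude version captures the "combine the data domains" insight: neither stream alone flags the problem.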

Foundation: The CENTS Platform

Before applying generative AI, Fidelity built CENTS (Cloud Event Notification Transport Service), an enterprise-scale data pipeline and notification platform. The evolution of this system reflects important lessons about building production systems at scale.

Initial Architecture (2019): Their first iteration was deliberately simple—a time-based pull mechanism using the AWS Health API to extract data from accounts, aggregate events and support cases into a central hub, and process them. They sent simple email notifications. As Jason Casamiro emphasized, “when you get to scale, simplicity matters because any complex architecture at scale is brittle.” This initial system was focused on rapid delivery, following Fidelity’s culture of getting things into developers’ hands quickly and iterating rather than succumbing to analysis paralysis. However, the email-based notification system quickly became overwhelming and ineffective, with engineers drowning in notifications that ended up ignored.

Evolved Architecture (Current): As Fidelity scaled to 2,000 accounts and millions of resources, they completely rearchitected CENTS into an event-driven system. Key enablers from AWS included the delegated administrator account feature (allowing a single account to aggregate all organizational events), EventBridge for native event-driven integration, and third-party integration capabilities. The redesigned system is built on several critical components.
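EventBridge routes events by matching them against declarative event patterns. As a rough illustration of how a CENTS-style rule might select AWS Health events for particular services, here is a simplified matcher; the `HEALTH_RULE` pattern and the matching logic are a sketch of EventBridge's semantics, not the full specification:

```python
def matches(pattern, event):
    """Minimal EventBridge-style pattern match: every key in the pattern
    must exist in the event, and the event's value must appear in the
    pattern's list of accepted values (nested dicts recurse)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True

# Hypothetical rule: route AWS Health events for RDS and EC2 only.
HEALTH_RULE = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {"service": ["RDS", "EC2"]},
}

sample = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "detail": {"service": "RDS", "eventTypeCategory": "scheduledChange"},
}
```

With the delegated administrator account aggregating organizational events onto one bus, rules like this let each downstream consumer subscribe to exactly the slice it cares about instead of polling every account.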

The results of this foundational platform were significant: 57% cost reduction (moving from poll-based to event-driven architecture while adding capabilities like the OpenSearch cluster), improved stakeholder engagement through targeted notifications rather than mass emails, and the ability to ingest and route data at true enterprise scale. A concrete example of impact: when AWS issued RDS certificate expiration notices with a hard deadline of August 22nd, the consistent delivery of targeted notifications to the right stakeholders enabled Fidelity to avoid any customer impact from certificate renewal issues.
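The shift from mass emails to targeted notifications comes down to matching each event against stakeholder preferences. A minimal sketch of such a preference lookup — the subscription shape and field names are assumptions for illustration, not Fidelity's schema:

```python
def route_notification(event, subscriptions):
    """Return the owners whose preferences match an event, so a notice
    like an RDS certificate-expiration warning reaches only the
    stakeholders who opted into that service or account."""
    matched = []
    for sub in subscriptions:
        # A missing filter key means "subscribe to all values of that field".
        if sub.get("service") not in (None, event["service"]):
            continue
        if sub.get("account") not in (None, event["account"]):
            continue
        matched.append(sub["owner"])
    return matched
```

The payoff described in the RDS certificate example is exactly this: instead of every engineer receiving (and ignoring) every notice, only the teams owning affected resources get the deadline.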

Generative AI Layer: The MAKI Framework

Building on the CENTS foundation, Fidelity and AWS developed MAKI (Machine Augmented Key Insights), a framework published in AWS samples on GitHub that applies generative AI to operational data using Amazon Bedrock. This represents the evolution from reactive and proactive monitoring toward AI-augmented operations.

Core Architecture and Approach:

The MAKI framework ingests AWS support case data and health event data and processes them through Amazon Bedrock, applying several important production-oriented patterns.
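Two of those patterns — routing each task to an appropriately sized model, and sending large backfills to batch inference to avoid throttling on-demand endpoints — reduce to plain selection logic. The model identifiers and the 1,000-item threshold below are placeholders, not values from the talk:

```python
LIGHT_MODEL = "light-summarizer"   # hypothetical id: fast model for per-event work
HEAVY_MODEL = "heavy-analyst"      # hypothetical id: stronger model for aggregates

def choose_model(task):
    """Per-event enrichment gets the lightweight model; cross-case
    aggregate analysis gets the stronger, slower one."""
    return HEAVY_MODEL if task == "aggregate" else LIGHT_MODEL

def choose_inference_mode(num_items, latency_sensitive):
    """Large non-interactive workloads go to batch inference to avoid
    throttling on-demand endpoints; interactive work stays on-demand."""
    if latency_sensitive:
        return "on-demand"
    return "batch" if num_items > 1000 else "on-demand"
```

The design choice mirrors the case study's cost discipline: at 150,000+ health events, paying frontier-model prices for every per-event summary would dominate the bill.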

Capabilities Enabled:

The MAKI framework provides several levels of operational intelligence, from event-level processing to aggregate analysis.

Agentic Workflows with MCP:

The most advanced capabilities demonstrated involve agentic workflows using the Model Context Protocol (MCP). In this architecture, events are fed into OpenSearch with embeddings created using Amazon Titan embedding models on Amazon Bedrock. Importantly, they maintain a hybrid approach: structured metadata from health events remains in structured form (queryable via traditional structured queries), while natural language fields like event descriptions are embedded for semantic search. This enables lexical search, semantic search, and hybrid search tools that are exposed to an MCP stack hosted in Kiro (using Claude).
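A hybrid query of the kind described combines an exact filter on structured metadata, a lexical clause, and a k-NN clause over the embedded description. The sketch below builds an OpenSearch query-DSL body; the index layout and field names (`description`, `description_vector`, `service`) are hypothetical:

```python
def hybrid_query(text, vector, service=None, k=5):
    """Build an OpenSearch body combining a lexical match on the event
    description with a k-NN clause on its embedding; structured fields
    like `service` stay as exact term filters rather than embeddings."""
    clauses = [
        {"match": {"description": text}},                          # lexical
        {"knn": {"description_vector": {"vector": vector, "k": k}}},  # semantic
    ]
    body = {"size": k, "query": {"bool": {"should": clauses}}}
    if service:
        # Structured metadata is filtered exactly, never fuzzily.
        body["query"]["bool"]["filter"] = [{"term": {"service": service}}]
    return body
```

Keeping `service` as a filter rather than embedding it is the "hybrid approach" in miniature: a term filter is cheap, exact, and never hallucinates a match.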

The live demonstration showed several compelling agentic use cases, including vulnerability detection and automated code fixes.

Production Considerations and Balanced Assessment:

While the demonstrations are impressive, several important caveats and production considerations emerge.

Future Vision and Data Integration:

The presentation concluded with an expansive vision for enriching the operational worldview by continuously adding data domains.

This vision is ambitious but grounded in the principle that each additional data domain enriches the worldview and enables more meaningful actions.

LLMOps Lessons and Production Insights

Several important LLMOps principles emerge from this case study:

Foundation First: The most critical lesson is that generative AI capabilities require solid data infrastructure. Fidelity spent years building CENTS before layering on AI capabilities. The data ingestion, enrichment, storage, and routing infrastructure proved essential, and shortcuts here would undermine AI effectiveness.

Progressive Maturity: The journey from reactive monitoring (2019: polling and emails) to proactive monitoring (event-driven CENTS) to AI-augmented operations (MAKI framework) demonstrates realistic maturity progression. Each stage built on the previous, and they were transparent about still working toward full automation goals.

Right Model for Right Task: The pattern of using fast, lightweight models for event-level processing and sophisticated models for aggregate analysis demonstrates production-oriented cost and performance optimization. Similarly, routing to batch vs. on-demand inference based on workload characteristics shows mature thinking about LLM infrastructure.

Hybrid Approaches: Maintaining structured data in structured form while embedding natural language fields demonstrates understanding that pre-generative-AI techniques remain valuable. Not everything needs to be embedded or processed by LLMs.

Action-Oriented Design: The entire system was designed around enabling action, not just generating reports. The preference management system, integration with work management tools, and contextual enrichment all focus on ensuring insights lead to concrete actions.

Scale Considerations: Jason’s repeated emphasis that “nobody understands a scale problem until you have a scale problem” and that “any complex architecture at scale is brittle” reflects hard-won wisdom about building production systems.

Resilience and Failure Modes: The conscious architectural choices around API routing (for multi-cloud consistency and failover capability), batch/on-demand routing (for throttling prevention), and regional backup (AWS Health’s new capability they plan to adopt) all demonstrate thinking about LLM system resilience.
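The regional-failover idea generalizes to a simple, reusable pattern: try endpoints in priority order and fall through on failure. A minimal generic sketch, not Fidelity's implementation:

```python
def call_with_failover(regions, invoke):
    """Try each regional endpoint in order, returning the first success;
    re-raise the last error if every region fails. `invoke` is any
    callable (e.g. an API client bound to a region) that raises on a
    regional failure."""
    last_err = None
    for region in regions:
        try:
            return invoke(region)
        except Exception as err:  # broad catch is deliberate in a failover path
            last_err = err
    raise last_err
```

The same shape covers the batch/on-demand throttling fallback: the "regions" list just becomes an ordered list of inference endpoints.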

The case study represents a sophisticated, production-grade application of LLMs for operational intelligence at massive enterprise scale, with both the impressive capabilities and the realistic challenges and prerequisites clearly presented.
