Automating Root Cause Analysis Using Amazon Bedrock Agents

BMW 2025
BMW implemented a generative AI solution using Amazon Bedrock Agents to automate and accelerate root cause analysis (RCA) for cloud incidents in their connected vehicle services. The solution combines architecture analysis, log inspection, metrics monitoring, and infrastructure evaluation tools with a ReAct (Reasoning and Action) framework to identify the causes of service disruptions. In test cases, the automated RCA agent identified the correct root cause 85% of the time, cutting diagnosis time from hours to minutes and enabling less experienced engineers to troubleshoot complex issues.

Industry

Automotive

BMW’s Generative AI Solution for Cloud Incident Analysis

Overview

BMW Group, the Munich-based premium automobile manufacturer with over 154,000 employees and a connected fleet of more than 23 million vehicles worldwide, developed a generative AI-powered solution for automating root cause analysis (RCA) of cloud service incidents. The solution was built in collaboration with AWS and centers on Amazon Bedrock Agents to streamline the traditionally cumbersome process of diagnosing partial or full service outages affecting BMW’s digital connected vehicle services.

BMW Connected Company, a division within BMW, operates digital services that millions of vehicle owners use daily—from remotely locking/unlocking doors via mobile apps, to starting window defrost, purchasing navigation map updates, and streaming music. These services are implemented by chaining multiple software components together, often built and operated by different geographically dispersed teams across various AWS regions and other cloud environments. When service disruptions occur, determining the root cause becomes a complex investigation requiring engineers to check multiple systems, form hypotheses, trace component dependencies, and often backtrack when initial theories prove incorrect.

The LLMOps Production Architecture

The production system is built around an Amazon Bedrock agent that follows the ReAct (Reasoning and Action) framework, which interleaves reasoning steps with tool calls. Unlike a predefined workflow, the ReAct agent decides each next step from the results gathered so far, so the investigation adapts to the circumstances of the specific incident.

Agent Workflow in Production

When an incident occurs, an on-call engineer provides a textual description of the issue to the Amazon Bedrock agent. The agent then initiates its investigation using a structured but adaptive workflow:

The agent first interprets the incident description to identify which parts of the system are most likely impacted. It then gathers evidence by calling specialized tools, leveraging data centrally aggregated in a cross-account observability setup. The agent continuously reevaluates results from each tool invocation, progressively narrowing down potential causes until it isolates the root cause. The agent can identify issues ranging from expired certificates and revoked firewall rules to traffic spikes and misconfigured components.
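The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not BMW's implementation: in the real system the Amazon Bedrock agent's model chooses each step, whereas here a hypothetical scripted_policy stands in for the model, and the three tools are stubs that return canned strings.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str]  # (thought, action, observation)
Policy = Callable[[str, List[Step]], Tuple[str, str, str]]


def run_react_loop(incident: str, choose_action: Policy,
                   tools: Dict[str, Callable[[str], str]],
                   max_steps: int = 8) -> Tuple[str, List[Step]]:
    """Alternate reasoning and tool calls until the policy says 'finish'."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, action, arg = choose_action(incident, history)
        if action == "finish":
            return arg, history  # arg carries the final root-cause statement
        observation = tools[action](arg)
        history.append((thought, f"{action}({arg})", observation))
    return "inconclusive", history


def scripted_policy(incident: str, history: List[Step]) -> Tuple[str, str, str]:
    """Hypothetical stand-in for the LLM: replays a fixed investigation plan."""
    plan = [
        ("Find the components in the unlock flow", "architecture", "remote unlock"),
        ("Check logs for those components", "logs", "remote-vehicle-management-api"),
        ("Look for recent control-plane changes", "infrastructure", "security groups"),
        ("Evidence points to a network change", "finish",
         "security group change blocks traffic"),
    ]
    return plan[len(history)]


# Stub tools with canned observations (assumptions for illustration only).
TOOLS = {
    "architecture": lambda q: "ios-app -> backend-for-frontend -> remote-vehicle-management-api",
    "logs": lambda q: "network timeouts increased",
    "infrastructure": lambda q: "security group ingress rule revoked",
}

root_cause, trace = run_react_loop(
    "Users of the iOS app cannot unlock car doors remotely.", scripted_policy, TOOLS
)
```

The max_steps bound is a design choice in this sketch: it keeps a policy that never reaches a conclusion from calling tools indefinitely.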

Custom Tools Implementation

The Amazon Bedrock agent’s effectiveness relies on its integration with four specialized tools, each implemented as AWS Lambda functions:

Architecture Tool: This tool uses C4 diagrams enhanced through Structurizr to provide a hierarchical understanding of component relationships, dependencies, and workflows. When an issue affects a specific service, the Architecture Tool identifies upstream and downstream dependencies and suggests hypotheses focused on those systems. This contextual architectural reasoning prevents blind searching through logs and metrics, instead enabling targeted investigation based on how different systems interact.
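As a rough illustration of the dependency lookup, the sketch below walks a small call graph and wraps the result in a Lambda handler. Everything here is assumed for the example: the graph is hand-written rather than derived from a Structurizr model, the component names map-update-api and mqtt-broker are hypothetical, and the event and response shapes follow the format documented for Bedrock action groups defined with function details, which should be checked against the current documentation.

```python
import json
from collections import deque
from typing import Dict, List, Set

# Hypothetical call graph: each key calls the components in its list.
CALLS: Dict[str, List[str]] = {
    "ios-app": ["backend-for-frontend"],
    "backend-for-frontend": ["remote-vehicle-management-api", "map-update-api"],
    "remote-vehicle-management-api": ["mqtt-broker"],
    "map-update-api": [],
    "mqtt-broker": [],
}


def reachable(graph: Dict[str, List[str]], start: str) -> Set[str]:
    """Breadth-first search: every node reachable from start, excluding start."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def invert(graph: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Reverse every edge so that callers can be found from callees."""
    inverted: Dict[str, List[str]] = {node: [] for node in graph}
    for caller, callees in graph.items():
        for callee in callees:
            inverted.setdefault(callee, []).append(caller)
    return inverted


def dependencies(component: str) -> Dict[str, object]:
    """Downstream = what the component calls; upstream = what calls it."""
    return {
        "component": component,
        "downstream": sorted(reachable(CALLS, component)),
        "upstream": sorted(reachable(invert(CALLS), component)),
    }


def lambda_handler(event, context):
    """Return the lookup result in the assumed action-group response shape."""
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    body = json.dumps(dependencies(params["component"]))
    return {
        "messageVersion": event.get("messageVersion", "1.0"),
        "response": {
            "actionGroup": event["actionGroup"],
            "function": event["function"],
            "functionResponse": {"responseBody": {"TEXT": {"body": body}}},
        },
    }
```

Returning both directions lets the agent form hypotheses about a failing dependency (downstream) as well as about which user-facing features are affected (upstream).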

Logs Tool: Leveraging CloudWatch Logs Insights, this tool analyzes log data in real time by searching for patterns, errors, and anomalies while comparing trends to previous periods. For example, it can identify spikes in error messages like “FATAL: password authentication failed” compared to the previous hour, allowing the agent to quickly associate failures with potential root causes such as improperly rotated database passwords.
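One way to implement the hour-over-hour comparison is to run the same Logs Insights query over two adjacent windows and compare the counts. The sketch below assumes boto3 and its start_query and get_query_results calls; the query text, the log group argument, and the spike thresholds (factor and min_hits) are illustrative choices, not values from the case study.

```python
import time

# Logs Insights query counting lines that match the error pattern.
QUERY = (
    "filter @message like /password authentication failed/ "
    "| stats count(*) as hits"
)


def count_matches(log_group: str, start: int, end: int, client=None) -> int:
    """Run QUERY over [start, end] (epoch seconds) and return the hit count."""
    if client is None:
        import boto3  # imported lazily so the helpers below work without AWS
        client = boto3.client("logs")
    query_id = client.start_query(
        logGroupName=log_group, startTime=start, endTime=end, queryString=QUERY
    )["queryId"]
    while True:  # queries are asynchronous, so poll until a terminal status
        response = client.get_query_results(queryId=query_id)
        if response["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
            break
        time.sleep(1)
    if response["status"] != "Complete" or not response["results"]:
        return 0
    row = {cell["field"]: cell["value"] for cell in response["results"][0]}
    return int(row.get("hits", 0))


def is_spike(current: int, previous: int,
             factor: float = 3.0, min_hits: int = 10) -> bool:
    """Flag a spike when the current window is large in absolute terms
    and at least `factor` times the previous window."""
    if current < min_hits:
        return False
    return current >= factor * max(previous, 1)
```

A caller would invoke count_matches once for the last hour and once for the hour before it, then pass both counts to is_spike. The min_hits floor avoids flagging a jump from zero to a handful of errors as an anomaly.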

Metrics Tool: This tool provides real-time insights into system health by monitoring key metrics through CloudWatch, including latency, error rates, resource utilization, and usage patterns. It identifies statistical anomalies by surfacing CloudWatch metric alarms, enabling the agent to prioritize hypotheses related to resource mismanagement, misconfigured thresholds, or unexpected system load.
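Surfacing alarms could look roughly like the following. The sketch assumes boto3's CloudWatch describe_alarms paginator filtered to alarms in the ALARM state; the summary format, the namespace filter, and the sample alarm names are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set


def fetch_firing_alarms(client=None) -> List[Dict]:
    """Return all metric alarms currently in the ALARM state."""
    if client is None:
        import boto3  # imported lazily so summarize_alarms works without AWS
        client = boto3.client("cloudwatch")
    alarms: List[Dict] = []
    paginator = client.get_paginator("describe_alarms")
    for page in paginator.paginate(StateValue="ALARM"):
        alarms.extend(page.get("MetricAlarms", []))
    return alarms


def summarize_alarms(metric_alarms: Iterable[Dict], namespaces: Set[str]) -> List[str]:
    """Keep alarms in the namespaces under investigation, as short text lines."""
    lines = []
    for alarm in metric_alarms:
        if alarm.get("Namespace") in namespaces:
            reason = alarm.get("StateReason", "no reason given")
            lines.append(f'{alarm["AlarmName"]}: {alarm.get("MetricName")} ({reason})')
    return sorted(lines)


# Hypothetical alarm records in the shape describe_alarms returns.
SAMPLE_ALARMS = [
    {"AlarmName": "rvm-api-5xx", "Namespace": "AWS/ApiGateway",
     "MetricName": "5XXError", "StateReason": "Threshold Crossed"},
    {"AlarmName": "batch-queue-depth", "Namespace": "AWS/SQS",
     "MetricName": "ApproximateNumberOfMessagesVisible",
     "StateReason": "Threshold Crossed"},
]

summary = summarize_alarms(SAMPLE_ALARMS, {"AWS/ApiGateway"})
```

Filtering by namespace is one simple way to restrict the output to the components the agent is currently examining; filtering by alarm dimensions would be a finer-grained alternative.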

Infrastructure Tool: Using CloudTrail data, this tool analyzes critical control-plane events such as configuration changes, security group updates, and API calls. It is particularly effective at identifying misconfigurations or breaking changes that trigger cascading failures, such as inadvertently removed security group ingress rules causing service connectivity issues.
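The control-plane check can be approximated by pulling recent CloudTrail events and keeping those that change network access. This is a sketch under assumptions: it uses boto3's lookup_events paginator, the set of event names treated as risky is an illustrative selection of EC2 security group actions, and the sample events and user names are invented.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List

# EC2 API actions that alter security group rules (illustrative selection).
RISKY_EVENTS = {
    "RevokeSecurityGroupIngress",
    "RevokeSecurityGroupEgress",
    "AuthorizeSecurityGroupIngress",
    "ModifySecurityGroupRules",
    "DeleteSecurityGroup",
}


def fetch_events(event_name: str, start: datetime, end: datetime, client=None) -> List[Dict]:
    """Look up CloudTrail events with the given name in [start, end]."""
    if client is None:
        import boto3  # imported lazily so the filter below works without AWS
        client = boto3.client("cloudtrail")
    paginator = client.get_paginator("lookup_events")
    pages = paginator.paginate(
        LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": event_name}],
        StartTime=start,
        EndTime=end,
    )
    return [event for page in pages for event in page["Events"]]


def recent_network_changes(events: Iterable[Dict], now: datetime,
                           window: timedelta = timedelta(hours=1)) -> List[Dict]:
    """Keep risky events inside the window, newest first."""
    cutoff = now - window
    hits = [e for e in events
            if e["EventName"] in RISKY_EVENTS and e["EventTime"] >= cutoff]
    return sorted(hits, key=lambda e: e["EventTime"], reverse=True)


# Hypothetical events in the shape lookup_events returns.
NOW = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
SAMPLE_EVENTS = [
    {"EventName": "RevokeSecurityGroupIngress",
     "EventTime": NOW - timedelta(minutes=20), "Username": "network-admin"},
    {"EventName": "DescribeInstances",
     "EventTime": NOW - timedelta(minutes=5), "Username": "monitoring"},
    {"EventName": "ModifySecurityGroupRules",
     "EventTime": NOW - timedelta(hours=3), "Username": "network-admin"},
]

changes = recent_network_changes(SAMPLE_EVENTS, NOW)
```

Limiting the window to the period just before the incident keeps routine changes from earlier in the day out of the agent's context.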

The ReAct Framework Benefits in Production

The ReAct framework provides several operational advantages for BMW’s production environment. The agent tailors its approach to each incident rather than following a one-size-fits-all methodology, which is critical for BMW’s multi-regional, multi-service architecture. Because the agent reasons about which tools to invoke and when, it avoids redundant queries and reaches a diagnosis faster without overloading AWS services such as CloudWatch or CloudTrail. It also mimics the thought process of a seasoned engineer, iteratively exploring hypotheses until it identifies the root cause, which bridges automation and human expertise.

Real-World Production Example

The case study describes a staged but realistic scenario in which a security group was deliberately changed in a test environment, blocking requests from the fleet and preventing remote vehicle lock/unlock functionality. When engineers received a report that remote lock/unlock wasn’t working via the iOS app, the agent was engaged with the problem description: “Users of the iOS app cannot unlock car doors remotely.”

The agent’s investigation proceeded through several phases. First, it called the Architecture Tool to understand the overall system architecture, revealing that the iOS app connects to a backend-for-frontend API, which connects to internal APIs including the Remote Vehicle Management API responsible for sending commands to cars via MQTT messaging. The agent then used its other tools in a targeted manner, scanning logs, metrics, and control plane activities only for components involved in the remote unlock flow.

The investigation uncovered multiple clues: anomalous logs indicating network timeouts, a sharp decrease in successful invocations of the Remote Vehicle Management API, and control plane activities showing security group changes in the central networking account. By correlating these logs, metrics, and control plane changes, the agent formulated hypotheses ordered by likelihood and correctly identified the cause: a security group change in the central networking account was blocking network traffic between the backend-for-frontend and the Remote Vehicle Management API.

Performance and Results

BMW reports that the fully automated RCA agent correctly identifies the right root cause in 85% of test cases. In the remaining cases, the agent provides detailed insights that significantly expedite engineers’ investigations. The solution reduced diagnosis time from hours to minutes for validated test cases.

These performance metrics appear to come from the proof-of-concept phase and representative use cases rather than from extensive production deployment data. While the results are promising, real-world production performance may vary as the system encounters novel failure modes not represented in testing.

Operational Benefits

Beyond raw accuracy, the solution provides several operational advantages. It helps engineers with system understanding and real-time insights even when it doesn’t identify the exact root cause. The solution has lowered the barrier to entry for junior engineers, enabling less-experienced team members to diagnose issues effectively while maintaining reliability and scalability across BMW’s operations. Engineers can ask follow-up questions or instruct the agent to investigate elsewhere, maintaining human oversight over the diagnostic process.

Cross-Account Observability Architecture

A key enabling factor for the solution is BMW’s cross-account observability setup, which centrally aggregates data from multiple AWS accounts and regions. This infrastructure allows the agent to access logs, metrics, and control plane events across BMW’s distributed architecture without requiring manual log aggregation by engineers during incidents.

Critical Assessment

While the solution demonstrates impressive capabilities, several considerations merit attention. The 85% accuracy figure comes from test cases rather than production incidents, and complex real-world failures may present scenarios that the test cases did not cover and that the agent’s tools cannot observe. The solution is deeply integrated with AWS services (CloudWatch, CloudTrail, Bedrock), which may limit applicability for organizations with multi-cloud or hybrid infrastructure. The case study is co-authored with AWS representatives and published on the AWS blog, suggesting some promotional intent that readers should consider when evaluating the claimed benefits.

The solution represents a practical application of agentic AI for IT operations, combining LLM reasoning capabilities with structured tool access to solve a genuine operational challenge at scale for a major automotive manufacturer’s connected vehicle services.
