Agentic AI Manufacturing Reasoner for Automated Root Cause Analysis

Apollo Tyres 2025

Apollo Tyres developed a Manufacturing Reasoner powered by Amazon Bedrock Agents to automate root cause analysis for their tire curing processes. The solution replaced manual analysis that took 7 hours per issue with an AI-powered system that delivers insights in under 10 minutes, achieving an 88% reduction in manual effort. The multi-agent system analyzes real-time IoT data from over 250 automated curing presses to identify bottlenecks across 25+ subelements, enabling data-driven decision-making and targeting annual savings of approximately 15 million Indian rupees in their passenger car radial division.

Industry: Automotive

Company Overview and Business Context

Apollo Tyres is a prominent international tire manufacturer headquartered in Gurgaon, India, with production facilities across India and Europe. The company operates under two global brands - Apollo and Vredestein - and distributes products in over 100 countries through an extensive network of outlets. Their product portfolio spans the complete spectrum of tire manufacturing, including passenger car, SUV, truck-bus, two-wheeler, agriculture, industrial, and specialty tires.

As part of an ambitious digital transformation initiative, Apollo Tyres collaborated with Amazon Web Services to implement a centralized data lake architecture. The company’s strategic focus centers on streamlining their entire business value process, with particular emphasis on manufacturing optimization. This digital transformation journey led to the development of their Manufacturing Reasoner solution, which represents a sophisticated application of generative AI in industrial settings.

Problem Statement and Business Challenge

The core challenge faced by Apollo Tyres centered on the manual and time-intensive process of analyzing dry cycle time (DCT) for their highly automated curing presses. Plant engineers were required to conduct extensive manual analysis to identify bottlenecks and focus areas using industrial IoT descriptive dashboards. This analysis needed to cover millions of parameters across multiple dimensions including all machines, stock-keeping units (SKUs), cure mediums, suppliers, machine types, subelements, and sub-subelements.

The existing process had four critical limitations. First, the analysis consumed about 7 hours per issue on average, with some cases requiring up to 2 elapsed hours per issue for the initial assessment alone. Second, subelement-level analysis, particularly bottleneck analysis of subelement and sub-subelement activities, was not feasible using traditional root cause analysis tools. Third, the process required coordination between subject matter experts from several departments, including manufacturing, technology, and industrial engineering. Finally, because insights were not generated in real time, corrective actions were consistently delayed, which hurt operational efficiency.

Technical Architecture and LLMOps Implementation

The Manufacturing Reasoner solution represents a sophisticated multi-agent architecture built on Amazon Bedrock. The system demonstrates advanced LLMOps practices through its comprehensive agent orchestration, real-time data processing, and natural language interface capabilities.

Multi-Agent Architecture Design

The solution employs a primary AI agent that serves as the orchestration layer, classifying question complexity and routing requests to specialized agents. This primary agent coordinates with several specialized agents, each designed for specific analytical functions. The complex transformation engine agent functions as an on-demand transformation engine for context and specific questions. The root cause analysis agent constructs multistep, multi-LLM workflows to perform detailed automated RCA, particularly valuable for complex diagnostic scenarios.
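The classify-then-route pattern described above can be sketched in a few lines. This is a minimal illustration, not the Apollo Tyres implementation: the cue words, the three labels, and the stub agent functions are all hypothetical, and a simple keyword rule stands in for whatever classification the primary agent actually performs.

```python
from typing import Callable, Dict, Tuple

# Hypothetical cue words; a production system would more likely ask an LLM
# to classify the question instead of matching keywords.
RCA_CUES = ("why", "root cause", "bottleneck")
TRANSFORM_CUES = ("compare", "trend", "across")


def classify(question: str) -> str:
    """Label a question as 'rca', 'transform', or 'simple'."""
    q = question.lower()
    if any(cue in q for cue in RCA_CUES):
        return "rca"
    if any(cue in q for cue in TRANSFORM_CUES):
        return "transform"
    return "simple"


# Stub agents (assumptions): each one only tags the question it receives.
def rca_agent(question: str) -> str:
    return f"[root cause analysis] {question}"


def transformation_agent(question: str) -> str:
    return f"[transformation engine] {question}"


def direct_answer(question: str) -> str:
    return f"[direct lookup] {question}"


AGENTS: Dict[str, Callable[[str], str]] = {
    "rca": rca_agent,
    "transform": transformation_agent,
    "simple": direct_answer,
}


def route(question: str) -> Tuple[str, str]:
    """Classify the question and hand it to the matching specialized agent."""
    label = classify(question)
    return label, AGENTS[label](question)
```

Keeping the label-to-agent mapping in a dictionary means a new specialized agent can be registered without touching the routing function.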

The system also includes an explainer agent that uses Anthropic’s Claude Haiku model to generate two-part explanations: evidence providing step-by-step logical explanations of executed queries, and conclusions offering brief answers referencing Amazon Redshift records. A visualization agent generates Plotly chart code for creating visual charts using Anthropic’s Claude Sonnet model. This multi-agent approach demonstrates sophisticated LLMOps practices in agent coordination and specialization.

Data Integration and Real-Time Processing

The technical infrastructure connects curing machine data flows to AWS Cloud through industrial Internet of Things (IoT) integration. Machines continuously transmit real-time sensor data, process information, operational metrics, events, and condition monitoring data to the cloud infrastructure. This real-time data streaming capability is essential for the solution’s effectiveness in providing immediate insights and enabling rapid corrective actions.

The system leverages Amazon Redshift as its primary data warehouse, providing reliable access to actionable data for the AI agents. Amazon Bedrock Knowledge Bases integration with Amazon OpenSearch Service vector database capabilities enables efficient context extraction for incoming requests. This architecture demonstrates mature LLMOps practices in data pipeline management and real-time processing.
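Context extraction from a knowledge base can be sketched as follows. The request shape follows the `retrieve` operation of the boto3 `bedrock-agent-runtime` client; the knowledge base ID, the default number of results, and the helper names are assumptions made for illustration, and the client is passed in so the sketch does not depend on AWS credentials.

```python
from typing import Any, Dict, List


def build_retrieve_request(knowledge_base_id: str, question: str,
                           top_k: int = 3) -> Dict[str, Any]:
    """Build keyword arguments for a knowledge base retrieve call."""
    return {
        "knowledgeBaseId": knowledge_base_id,
        "retrievalQuery": {"text": question},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    }


def retrieve_context(client: Any, knowledge_base_id: str, question: str,
                     top_k: int = 3) -> List[str]:
    """Return the text of each retrieved chunk, in the order the service returns them."""
    request = build_retrieve_request(knowledge_base_id, question, top_k)
    response = client.retrieve(**request)
    return [item["content"]["text"]
            for item in response.get("retrievalResults", [])]
```

In a deployed setting the client would typically be created with `boto3.client("bedrock-agent-runtime")`; the returned chunks can then be placed into the agent prompt as context.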

Natural Language Interface and User Experience

The user interface is implemented as a Chainlit application hosted on Amazon EC2, enabling plant engineers to interact with the system using natural language queries in English. This interface represents a significant advancement in manufacturing analytics, allowing domain experts to access complex industrial IoT data without requiring technical expertise in query languages or data manipulation.

The system processes user questions through the primary AI agent, which classifies complexity and routes requests appropriately. The primary agent calls explainer and visualization agents concurrently using multiple threads, demonstrating efficient parallel processing capabilities. Results are streamed back to the application, which dynamically displays statistical plots and formats records in tables, providing comprehensive visual and textual insights.
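The concurrent call to the explainer and visualization agents can be sketched with a thread pool. The two agent functions below are stand-ins (assumptions) that return canned values; in the real system each would make a network call to a model, which is the kind of I/O-bound work where threads help.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Record = Dict[str, float]


def explainer_agent(records: List[Record]) -> Dict[str, str]:
    """Stub: return a two-part explanation (evidence and conclusion)."""
    return {
        "evidence": f"Examined {len(records)} records.",
        "conclusion": "See the records for the slowest press.",
    }


def visualization_agent(records: List[Record]) -> str:
    """Stub: return chart code as text, as a code-generating agent would."""
    return "fig = px.bar(df, x='press', y='dct')"


def explain_and_visualize(records: List[Record]) -> Tuple[Dict[str, str], str]:
    """Run both agents in parallel threads and wait for both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        explanation = pool.submit(explainer_agent, records)
        chart_code = pool.submit(visualization_agent, records)
        return explanation.result(), chart_code.result()
```

With this layout the total latency is roughly that of the slower agent rather than the sum of both.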

Performance Optimization and LLMOps Best Practices

The development team encountered and addressed several critical performance challenges that highlight important LLMOps considerations for production deployments. Initially, the solution faced significant response time delays when using Amazon Bedrock, particularly with multiple agent involvement. Response times exceeded 1 minute for data retrieval and processing across all three agents, which was unacceptable for operational use.

Through systematic optimization efforts, the team reduced response times to approximately 30-40 seconds by carefully selecting appropriate large language models and small language models, and disabling unused workflows within agents. This optimization process demonstrates the importance of model selection and workflow efficiency in production LLMOps environments.

The team also addressed challenges related to LLM-generated code for data visualization. Initially, generated code often contained inaccuracies or failed to handle large datasets correctly. Through continuous refinement and iterative development, they developed a dynamic approach capable of accurately generating chart code for efficiently managing data within data frames, regardless of record volume. This iterative improvement process exemplifies mature LLMOps practices in code generation and validation.
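One common safeguard for LLM-generated chart code is to check it statically before running it. The sketch below is an illustration of that idea, not the team's documented approach: the allow-list of import roots and the function name are assumptions. It parses the generated source without executing it and reports syntax errors and imports outside the allow-list.

```python
import ast
from typing import FrozenSet, List

# Assumed allow-list: the only top-level packages chart code may import.
ALLOWED_IMPORT_ROOTS: FrozenSet[str] = frozenset({"pandas", "plotly"})


def check_generated_code(
    source: str,
    allowed: FrozenSet[str] = ALLOWED_IMPORT_ROOTS,
) -> List[str]:
    """Return a list of problems found in the code; an empty list means it passed."""
    try:
        tree = ast.parse(source)  # parses only, never executes the code
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]

    problems: List[str] = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            continue
        for module in modules:
            if module.split(".")[0] not in allowed:
                problems.append(f"import not allowed: {module}")
    return problems
```

Code that fails the check can be sent back to the model with the problem list attached, which gives the regeneration step something concrete to fix.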

Data Quality and Consistency Management

Consistency issues were resolved by ensuring correct data format ingestion into the Amazon data lake for the knowledge base. The team established a structured format including questions in natural language, complex transformation engine scripts, and associated metadata. This structured approach to data preparation demonstrates important LLMOps practices in data quality management and knowledge base maintenance.
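A knowledge base entry with the three parts named above (natural language question, transformation script, metadata) could be modeled as follows. The field names, the validation rules, and the JSON serialization are assumptions for illustration; the source does not specify the actual schema or the script language.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class KnowledgeBaseEntry:
    """One ingestion record: a question, the script that answers it, and metadata."""
    question: str
    transformation_script: str
    metadata: Dict[str, str] = field(default_factory=dict)

    def validate(self) -> None:
        """Reject entries that would be ingested with empty required fields."""
        if not self.question.strip():
            raise ValueError("question must not be empty")
        if not self.transformation_script.strip():
            raise ValueError("transformation_script must not be empty")

    def to_json(self) -> str:
        """Serialize with sorted keys so every entry has the same layout."""
        self.validate()
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical example; the table and column names are invented.
entry = KnowledgeBaseEntry(
    question="Which subelement has the highest average duration on press P-101?",
    transformation_script=(
        "SELECT subelement, AVG(duration_s) AS avg_s FROM curing_events "
        "WHERE press_id = 'P-101' GROUP BY subelement ORDER BY avg_s DESC LIMIT 1"
    ),
    metadata={"domain": "curing", "metric": "dry cycle time"},
)
```

Validating before serialization keeps malformed entries out of the data lake, which is where the consistency problems described above would otherwise originate.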

Governance and Safety Implementation

The solution uses Amazon Bedrock Guardrails to apply tailored filters and response limits, so that interactions with machine data remain secure, relevant, and compliant with operational guidelines. The guardrails help prevent errors and inaccuracies by automatically checking the validity of information in responses, which is essential for accurate root cause identification in manufacturing environments.

This governance approach demonstrates mature LLMOps practices in production safety and compliance management. The guardrails help maintain system reliability while enabling natural language interaction with sensitive operational data.

Operational Impact and Business Results

The Manufacturing Reasoner solution delivers significant operational improvements across multiple dimensions. The system analyzes data from over 250 automated curing presses, more than 140 SKUs, three types of curing mediums, and two types of machine suppliers across 25+ automated subelements. This comprehensive coverage enables detailed bottleneck identification and targeted improvement recommendations.

The solution achieved an 88% reduction in manual effort for root cause analysis, reducing analysis time from up to 7 hours per issue to less than 10 minutes per issue. This dramatic improvement enables plant engineers to focus on implementing corrective actions rather than data analysis. The system provides real-time triggers to highlight continuous anomalous shifts in DCT for mistake-proofing and error prevention, aligning with Poka-yoke methodologies.

Additional benefits include observability of elemental-wise cycle time with graphs and statistical process control charts, press-to-press direct comparison on real-time streaming data, and on-demand RCA capabilities with daily alerts to manufacturing subject matter experts. The targeted annual savings of approximately 15 million Indian rupees in the passenger car radial division alone demonstrates substantial business value from the LLMOps implementation.
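The idea behind a statistical process control chart and a trigger for a sustained shift in DCT can be sketched as below. This is a generic illustration, not the production logic: the 3-sigma limits, the run rule (a number of consecutive points above the center line), the default run length of 8, and the function names are all assumptions.

```python
from statistics import mean, stdev
from typing import Optional, Sequence, Tuple


def control_limits(baseline: Sequence[float]) -> Tuple[float, float, float]:
    """Return (center, lower limit, upper limit) using mean +/- 3 standard deviations."""
    center = mean(baseline)
    spread = 3 * stdev(baseline)
    return center, center - spread, center + spread


def first_sustained_shift(values: Sequence[float], center: float,
                          run_length: int = 8) -> Optional[int]:
    """Return the index at which `run_length` consecutive values have been
    above the center line, or None if no such run occurs."""
    run = 0
    for index, value in enumerate(values):
        run = run + 1 if value > center else 0
        if run >= run_length:
            return index
    return None
```

A run rule of this kind flags a slow upward drift in cycle time even when no single reading crosses the upper control limit, which matches the goal of catching continuous anomalous shifts rather than isolated spikes.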

Lessons Learned and LLMOps Best Practices

The Apollo Tyres implementation provides several valuable insights for LLMOps practitioners working with industrial IoT and real-time data. The team learned that applying generative AI to streaming real-time industrial IoT data requires extensive research due to the unique nature of each use case. The journey from prototype to proof-of-concept involved exploring multiple strategies to develop an effective manufacturing reasoner for automated RCA scenarios.

Performance optimization emerged as a critical consideration, requiring careful model selection and workflow optimization to achieve acceptable response times. The iterative approach to improving code generation capabilities demonstrates the importance of continuous refinement in production LLMOps environments.

Data quality and consistency management proved essential for reliable system operation. The structured approach to knowledge base preparation and maintenance ensures consistent system performance and accurate insights.

Future Scaling and Development Plans

The Apollo Tyres team is scaling the solution from tire curing to other process areas across different locations, advancing toward its Industry 5.0 goals. Amazon Bedrock will play a pivotal role in extending the multi-agent Retrieval Augmented Generation (RAG) solution through specialized agents, each with a distinct role.

The team continues focusing on benchmarking and optimizing response times for queries, streamlining decision-making and problem-solving capabilities across the extended solution. Apollo Tyres is also exploring additional generative AI applications using Amazon Bedrock for other manufacturing and non-manufacturing processes.

This expansion strategy demonstrates mature LLMOps thinking in scaling successful solutions across broader organizational contexts while maintaining performance and reliability standards. The focus on specialized agents for different domains shows sophisticated understanding of multi-agent system design and deployment strategies.

The Manufacturing Reasoner case study represents a comprehensive example of production LLMOps implementation in industrial settings, demonstrating successful integration of multiple AI agents, real-time data processing, natural language interfaces, and robust governance frameworks to deliver substantial business value through manufacturing optimization.
