Toyota Motor North America (TMNA) and Toyota Connected built a generative AI platform to help dealership sales staff and customers access accurate vehicle information in real time. The problem was that customers often arrived at dealerships highly informed from internet research, while sales staff lacked quick access to detailed vehicle specifications, trim options, and pricing. The solution evolved from a custom RAG-based system (v1) using Amazon Bedrock, SageMaker, and OpenSearch to retrieve information from official Toyota data sources, to a planned agentic platform (v2) using Amazon Bedrock AgentCore with Strands agents and MCP servers. The v1 system achieved over 7,000 interactions per month across Toyota's dealer network, with citation-backed responses and legal compliance built in, while v2 aims to enable more dynamic actions like checking local vehicle availability.
This case study documents Toyota Motor North America (TMNA) and Toyota Connected’s development of an enterprise-scale generative AI platform designed to provide vehicle information to dealership sales staff and customers. The presentation, delivered at AWS re:Invent, provides a detailed technical walkthrough of both their production RAG-based system (version 1) and their planned transition to an agentic platform (version 2) using Amazon Bedrock AgentCore.
The collaboration between TMNA and Toyota Connected represents a mature approach to enterprise LLMOps, with Bryan Landes (AWS Solutions Architect supporting Toyota for 7.5 years), Stephen Ellis (TMNA Enterprise AI team), and Stephen Short (Toyota Connected Senior Engineer) presenting different perspectives on the platform engineering, business strategy, and technical implementation.
The core business problem emerged from a shift in customer behavior. Modern car buyers arrive at dealerships well researched, having consumed YouTube reviews, online forums, and detailed vehicle comparisons. When these informed customers asked specific questions about vehicle features, trim differences, or technical specifications (such as the differences between Supra models), sales staff often couldn’t provide immediate, authoritative answers. Customers ended up pulling out their phones to search Google during sales conversations, which created an awkward dynamic and potentially cost sales.
TMNA’s Enterprise AI team was formed as a center of excellence with a unique structure where almost all team members are engineers rather than traditional business analysts. This engineering-heavy composition enabled them to build AI accelerators and what they call “AI teammates” - systems designed to augment human capabilities rather than replace them, in line with Toyota’s policy of keeping teammates at the center of all work.
Version 1 represents a sophisticated RAG implementation currently serving Toyota’s entire dealer network with over 7,000 interactions per month. The architecture spans multiple AWS accounts with careful separation of concerns.
When a front-end client initiates a request, it routes through the TMNA Enterprise AI account, passing through Route 53 with an attached Web Application Firewall (WAF). Lambda@Edge handles authentication and authorization using Entra ID (formerly Azure Active Directory). Once authenticated, requests flow to an “intent router” deployed on Amazon ECS. This intent router’s primary responsibility is identifying which vehicle the user is asking about to determine which data to retrieve.
Before any LLM inference occurs, all requests immediately route through “Prompt Guard,” an in-house solution built by Toyota’s cybersecurity team to identify and block malicious activities such as prompt injection attacks. This security-first approach demonstrates the mature governance applied to production LLM systems.
The intent router establishes a WebSocket connection with the front end and initializes conversation tracking using DynamoDB. After vehicle identification (which does use an external LLM call), the request transfers to Toyota Connected’s main account via the internet through CloudFlare (with another WAF) and hits an API Gateway.
The RAG application code runs within an Amazon EKS (Elastic Kubernetes Service) cluster in Toyota Connected’s Shared Services account, maintained by their cloud engineering team to handle scaling and traffic management. All logs forward to Datadog for observability.
The RAG inference process involves several sophisticated steps:
Embedding Generation with Conversational Context: The system uses Amazon SageMaker to generate embeddings not just for the current query but also for the previous five turns of conversation. A weighted average gives more weight to recent conversation turns while still carrying earlier context. This approach addresses the challenge of maintaining conversation continuity without overwhelming the context window.
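As a rough illustration of the weighted-average idea, the sketch below combines the current query embedding with up to five earlier turns. The geometric `decay` weighting and the function name are assumptions made for this example; the presentation does not state the actual weights Toyota uses.

```python
from typing import List, Sequence


def weighted_context_embedding(
    turn_embeddings: Sequence[Sequence[float]],
    decay: float = 0.5,   # assumed weighting scheme, not from the source
    max_turns: int = 6,   # current query plus the previous five turns
) -> List[float]:
    """Combine turn embeddings (ordered oldest -> newest) into one vector."""
    if not turn_embeddings:
        raise ValueError("need at least one embedding")
    recent = list(turn_embeddings)[-max_turns:]
    dim = len(recent[0])
    # The newest turn gets weight 1, the one before it `decay`, then decay**2, ...
    weights = [decay ** (len(recent) - 1 - i) for i in range(len(recent))]
    total = sum(weights)
    combined = [0.0] * dim
    for weight, embedding in zip(weights, recent):
        if len(embedding) != dim:
            raise ValueError("all embeddings must have the same dimension")
        for j, value in enumerate(embedding):
            combined[j] += weight * value
    # Normalize by the total weight so the result is a true weighted average.
    return [value / total for value in combined]
```

With `decay=0.5`, the current query contributes twice as much as the previous turn, so a follow-up like "what about the hybrid?" still retrieves documents for the vehicle discussed earlier.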
Semantic Search: Generated embeddings perform semantic search against an OpenSearch Serverless vector database, retrieving 30 documents per vehicle queried. These documents serve as the primary source of truth, ensuring responses rely on official Toyota data rather than LLM world knowledge.
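A per-vehicle retrieval request of this kind could look like the following sketch, which only builds the request body using OpenSearch's k-NN query syntax with a filter. The field names `embedding` and `vehicle_id` are hypothetical; the real index mapping is not described in the presentation.

```python
from typing import Any, Dict, List


def build_knn_query(
    query_vector: List[float],
    vehicle_id: str,
    k: int = 30,                       # 30 documents per vehicle, per the talk
    vector_field: str = "embedding",   # hypothetical field name
    vehicle_field: str = "vehicle_id", # hypothetical field name
) -> Dict[str, Any]:
    """Build an OpenSearch k-NN search body restricted to one vehicle."""
    return {
        "size": k,
        "query": {
            "knn": {
                vector_field: {
                    "vector": query_vector,
                    "k": k,
                    # Filtering inside the k-NN clause keeps the search
                    # scoped to the requested vehicle's documents.
                    "filter": {"term": {vehicle_field: vehicle_id}},
                }
            }
        },
    }
```

When a question spans several vehicles, one such body would be issued per vehicle so that each contributes its own 30 documents.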
LLM Inference with Streaming: Amazon Bedrock hosts the Anthropic models used for inference. The system sends the assistant prompt along with retrieved documents to generate responses. Critically, the system performs post-processing on the streaming output to meet business requirements around legal disclaimers and image handling.
Compliance and Logging: After inference completes, messages push to an SQS queue, which triggers a Lambda function to export logs to MongoDB for compliance reporting requirements. The response then buffers back to TMNA via webhook, updates the DynamoDB conversation history, and streams to the front-end client.
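The SQS-to-Lambda export step might be shaped roughly like this. It is a sketch under assumptions: the message body is taken to be JSON, and the database write is passed in as a callable (in production it could be something like a MongoDB bulk insert) so the example stays self-contained. The `batchItemFailures` return value follows Lambda's partial batch response format for SQS.

```python
import json
from typing import Any, Callable, Dict, List


def make_handler(write_logs: Callable[[List[Dict[str, Any]]], None]):
    """Create a Lambda handler that exports SQS messages via `write_logs`."""

    def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
        documents: List[Dict[str, Any]] = []
        failures: List[Dict[str, Any]] = []
        for record in event.get("Records", []):
            try:
                documents.append(json.loads(record["body"]))
            except (KeyError, json.JSONDecodeError):
                # Report only the bad message so SQS retries it alone.
                failures.append({"itemIdentifier": record.get("messageId")})
        if documents:
            write_logs(documents)
        return {"batchItemFailures": failures}

    return handler
```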
An important architectural principle is that the RAG application is completely stateless from Toyota Connected’s perspective. All conversation management happens in the Enterprise AI account, which allows for cleaner separation of concerns and easier scaling.
One of the most complex aspects of the system is transforming raw vehicle data into a format suitable for RAG. The raw data consists of large JSON objects with internal mappings for trim codes, MSRP information, descriptions, titles, and other fields - plus critical disclaimer codes that must be preserved exactly as written for legal compliance.
The ETL pipeline utilizes AWS Step Functions to orchestrate a series of AWS Glue scripts across a dedicated data account:
Extract Phase: Scripts pull all supported vehicle data from Toyota API servers (covering model years 2023-2026) and push to S3.
Transform Phase: This is the heaviest portion, processing up to 30 vehicles concurrently for maximum throughput. The scripts chunk the JSON data and then use Amazon Bedrock to generate natural language summarizations of each chunk. For example, a JSON object representing a single vehicle feature gets translated into readable prose that includes trim availability, pricing, and descriptions.
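The concurrency pattern can be sketched with a bounded worker pool. Everything here is illustrative: the real pipeline runs as AWS Glue scripts, the one-chunk-per-feature rule and the `model` / `features` field names are assumptions, and `summarize` stands in for the Amazon Bedrock call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

Chunk = Dict[str, Any]


def chunk_features(vehicle: Dict[str, Any]) -> List[Chunk]:
    """Assumed chunking rule: one chunk per feature, tagged with the model."""
    return [{"vehicle": vehicle["model"], **feature} for feature in vehicle["features"]]


def transform_vehicle(vehicle: Dict[str, Any], summarize: Callable[[Chunk], str]) -> List[Dict[str, Any]]:
    # Keep the raw chunk next to its summary so it can be cited later.
    return [{"raw": chunk, "summary": summarize(chunk)} for chunk in chunk_features(vehicle)]


def transform_all(
    vehicles: List[Dict[str, Any]],
    summarize: Callable[[Chunk], str],
    max_workers: int = 30,  # up to 30 vehicles in flight, per the talk
) -> List[List[Dict[str, Any]]]:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with `vehicles`.
        return list(pool.map(lambda v: transform_vehicle(v, summarize), vehicles))
```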
Because LLM output is non-deterministic, the team implemented data quality checks to verify the accuracy of these summarizations, particularly for critical information like pricing details and trim availabilities. This validation step is crucial for maintaining trust in a production system.
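The presentation does not spell out how these checks work. One plausible form is a deterministic comparison between the raw JSON and the generated summary, as in the sketch below. The `msrp` and `trims` field names and the specific rules are assumptions for illustration.

```python
import re
from typing import Any, Dict, List, Set


def extract_prices(text: str) -> Set[int]:
    """Collect every dollar amount mentioned in the text as an integer."""
    return {int(match.replace(",", "")) for match in re.findall(r"\$(\d[\d,]*)", text)}


def check_summary(raw: Dict[str, Any], summary: str) -> List[str]:
    """Return a list of problems; an empty list means the summary passed."""
    problems: List[str] = []
    expected_price = int(raw["msrp"])  # hypothetical field name
    prices = extract_prices(summary)
    if expected_price not in prices:
        problems.append(f"MSRP {expected_price} missing from summary")
    unexpected = prices - {expected_price}
    if unexpected:
        problems.append(f"unexpected prices in summary: {sorted(unexpected)}")
    lowered = summary.lower()
    for trim in raw["trims"]:  # hypothetical field name
        if trim.lower() not in lowered:
            problems.append(f"trim {trim!r} missing from summary")
    return problems
```

A summary that transposes digits in a price or silently drops a trim is flagged before it is ever embedded.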
Embedding and Publishing: Another script generates embeddings of the natural language summarizations, ties them to the raw data (which is preserved for citation purposes), and publishes to S3.
Load Phase: An Amazon EventBridge event triggers Lambda functions on dev, stage, and prod accounts. These Lambdas retrieve configuration from AWS Systems Manager Parameter Store, create a new timestamped index in OpenSearch, configure an OpenSearch ingest pipeline to read from the transform output, and ingest the data.
Before any newly ingested data becomes active, Toyota runs it through a comprehensive evaluation pipeline orchestrated via GitLab Runners. TMNA counterparts provided a “golden set” of question-answer pairs for one vehicle, validated by subject matter experts. This golden data set serves as the foundation for generating synthetic test sets for all vehicles.
The evaluation system invokes the deployed RAG application with test questions, then uses a “council of LLMs” approach to assess responses against defined metrics measuring system performance and data quality. Only after this evaluation passes does an index alias switch to point to the newly created index, enabling zero-downtime data updates.
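The gated cut-over can be sketched as two small helpers: one names the timestamped index, the other builds a standard OpenSearch `_aliases` request body that moves the alias in a single atomic call. The function names and the index-name format are assumptions; the source only says that an alias is switched after evaluation passes.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


def timestamped_index_name(prefix: str, now: Optional[datetime] = None) -> str:
    """Name a new index, e.g. 'vehicles-20250102030405' (format is assumed)."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix}-{now:%Y%m%d%H%M%S}"


def alias_swap_body(
    alias: str,
    new_index: str,
    old_index: Optional[str],
    evaluation_passed: bool,
) -> Optional[Dict[str, Any]]:
    """Build the `_aliases` body, or None if the new index must stay dark."""
    if not evaluation_passed:
        return None
    actions: List[Dict[str, Any]] = []
    if old_index:
        actions.append({"remove": {"index": old_index, "alias": alias}})
    actions.append({"add": {"index": new_index, "alias": alias}})
    # Both actions apply atomically, so readers never see a missing alias.
    return {"actions": actions}
```

Because the application only ever queries the alias, the old index keeps serving traffic until the moment the swap is applied, and it remains available for rollback afterwards.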
The council of LLMs approach for evaluation is noteworthy - rather than relying on a single model’s judgment, multiple LLMs assess response quality, likely providing more robust and less biased evaluations.
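How the council combines its judgments was not described. One simple option is to take the median score, which keeps a single outlier judge from deciding the outcome. In this sketch each judge is a plain callable standing in for a call to a different model, and the 0.7 threshold is an arbitrary assumption.

```python
from statistics import median
from typing import Any, Callable, Dict, Sequence

# (question, expected_answer, actual_answer) -> score in [0, 1]
Judge = Callable[[str, str, str], float]


def council_verdict(
    judges: Sequence[Judge],
    question: str,
    expected: str,
    actual: str,
    threshold: float = 0.7,  # assumed pass mark
) -> Dict[str, Any]:
    """Score one response with every judge and aggregate by median."""
    if not judges:
        raise ValueError("the council needs at least one judge")
    scores = [judge(question, expected, actual) for judge in judges]
    middle = median(scores)
    return {"scores": scores, "median": middle, "passed": middle >= threshold}
```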
Legal requirements posed significant technical challenges. Every response must include contextually relevant disclaimers taken from a controlled vocabulary - the disclaimer text is immutable and cannot be altered by the LLM. Similarly, vehicle image URLs and metadata must remain unchanged.
Toyota’s engineers solved this with an innovative “stream splitting” approach. The system prompt includes extensive examples (in-context learning) that teach the model to split its output into three distinct streams: the answer text, the disclaimer codes, and the image IDs.
The implementation uses specific delimiters in the streaming output, with code that monitors the invoke_model response stream and switches state based on delimiter detection. After the LLM completes inference, Toyota maps the disclaimer codes and image IDs to their immutable legal text and image URLs without the LLM ever touching this content. This elegant solution maintains legal compliance while leveraging LLM reasoning about relevance.
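A minimal sketch of such a delimiter-driven state machine follows. The delimiter strings, class name, and output keys are hypothetical, and the chunks are plain strings rather than an actual Bedrock response stream. The part worth noting is the held-back tail: a delimiter can arrive split across two chunks, so text that might be the start of a delimiter is not released until the next chunk settles the question.

```python
from typing import Dict, List

# Hypothetical delimiters; the real ones were not disclosed.
DELIMITERS = {"<<DISCLAIMERS>>": "disclaimers", "<<IMAGES>>": "images"}


class StreamSplitter:
    def __init__(self) -> None:
        self.state = "answer"
        self.parts: Dict[str, List[str]] = {"answer": [], "disclaimers": [], "images": []}
        self._buf = ""

    def feed(self, chunk: str) -> None:
        self._buf += chunk
        while True:
            # Switch state at the earliest complete delimiter in the buffer.
            hits = [(self._buf.find(d), d) for d in DELIMITERS if d in self._buf]
            if not hits:
                break
            pos, delim = min(hits)
            self.parts[self.state].append(self._buf[:pos])
            self._buf = self._buf[pos + len(delim):]
            self.state = DELIMITERS[delim]
        # Hold back any tail that could be the beginning of a delimiter.
        keep = 0
        for delim in DELIMITERS:
            for n in range(min(len(delim) - 1, len(self._buf)), 0, -1):
                if self._buf.endswith(delim[:n]):
                    keep = max(keep, n)
                    break
        cut = len(self._buf) - keep
        self.parts[self.state].append(self._buf[:cut])
        self._buf = self._buf[cut:]

    def finish(self) -> Dict[str, object]:
        self.parts[self.state].append(self._buf)
        self._buf = ""
        return {
            "answer": "".join(self.parts["answer"]).strip(),
            "disclaimer_codes": "".join(self.parts["disclaimers"]).split(),
            "image_ids": "".join(self.parts["images"]).split(),
        }


def resolve(result, disclaimer_text: Dict[str, str], image_urls: Dict[str, str]):
    """Map codes to immutable content; unknown codes raise KeyError."""
    return {
        "answer": result["answer"],
        "disclaimers": [disclaimer_text[c] for c in result["disclaimer_codes"]],
        "images": [image_urls[i] for i in result["image_ids"]],
    }
```

Failing loudly on an unknown code is a deliberate choice in this sketch: it is safer to drop a response than to show legal text the lookup table does not contain.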
Toyota built a compliance analysis system that categorizes incoming questions and measures how well responses adhere to legal guidelines about how the assistant should behave. Results feed into MongoDB, backing a compliance reporting dashboard shared with legal teams to monitor production performance.
The production system uses Datadog extensively for observability, with logs forwarded from the EKS cluster. The team tracks conversation histories in DynamoDB for the Enterprise AI side, while compliance data lives in MongoDB for reporting. This multi-database approach reflects different data access patterns and compliance requirements.
Stephen Ellis from TMNA provided valuable context on their enterprise AI strategy. The Enterprise AI team follows a unique organizational structure that’s “diagonal” across the organization rather than a traditional horizontal center of excellence. Their workflow moves through distinct phases:
Exploration: Novel use cases with brand new technology that hasn’t been done before.
Experimentation and Education: Working with IT teams to bring capabilities into business as usual. An example given was contract analysis - analysts were manually reviewing 300,000 contracts at a rate of 30,000 per year. A gen AI solution reduced time by 15-17 hours per user while discovering contract compliance issues and expiring clauses the company was missing, leading to significant savings.
Enablement: Once capabilities are proven, they democratize them across different groups. Ellis categorizes most use cases into three types: taking data and doing analysis, taking content and generating new content, or distilling disparate content into unified sources for different audiences.
Adoption: Engaging with business users while emphasizing that there’s no perfect version - the key is getting something done quickly and improving based on learnings. This philosophy contrasts with traditional manufacturing approaches that want to de-risk and plan perfectly before execution. In gen AI, perfect planning means falling behind daily.
The team follows a “build, configure, buy” approach. Because they started with engineers and research scientists, they built capabilities from day one (starting as soon as ChatGPT API became available). Once they’ve built and defined working requirements, they look for products that can be configured to fit existing platforms. Finally, if those products mature into SaaS platforms or are delivered by trusted partners, they buy rather than maintain in-house solutions where Toyota isn’t the expert.
For new AI/ML projects, teams submit ideas through an AI/ML governance board that evaluates whether solutions, vendors, or technologies comply with existing standards. When standards don’t exist, they help shape new ones. After governance approval, the Enterprise AI team builds prototypes, sets up productionalization plans, and supports authorization. For teams with existing full stacks, they hand off prototypes and enable new technology rather than maintaining ongoing operations.
Stephen Short detailed the planned evolution to an agentic platform, driven by several factors:
Data Staleness Problem: Every time upstream vehicle data changes (which happens frequently during new model year rollouts), the entire ETL pipeline must run, followed by evaluation. This creates lag between data updates and system availability.
Limited Capabilities: Version 1 can answer questions but cannot perform actions like checking local dealership inventory for specific vehicles.
Scalability and Maintenance: The complex ETL pipeline creates significant infrastructure overhead.
Early experiments with the Strands SDK and MCP (Model Context Protocol) servers revealed that modern LLMs can connect directly to data sources, potentially eliminating the traditional RAG pipeline entirely while enabling actions and advanced reasoning. However, moving from proof-of-concept demos to production presents challenges around authentication, authorization, auto-scaling, and guaranteeing context and session isolation in multi-agent systems.
Toyota selected Amazon Bedrock AgentCore as their platform for version 2, specifically because it addresses production concerns:
AgentCore Runtime: Firecracker VM-based solution providing isolation by default, serverless scaling, and low infrastructure overhead.
AgentCore Identity: Tackles the complexities of inbound and outbound authentication in multi-agent, multi-MCP systems.
AgentCore Memory: Simplifies conversation management and enables novel use cases.
AgentCore Gateway: Managed service for deploying MCP servers.
AgentCore Observability: Supports OpenTelemetry by default, integrating with Toyota’s existing Datadog infrastructure.
The planned architecture involves an “orchestrator” (replacing the intent router) that integrates with an “agent registry” - a mapping of authenticated clients to available agents. When front-end requests arrive, the orchestrator consults the registry to route to appropriate agents, with Bedrock handling what previously required external LLM calls.
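At its simplest, an agent registry of this kind is a lookup from authenticated client to permitted agents. The sketch below is illustrative only: the client IDs and the routing function are invented for the example, and the agent names mirror the two agents Toyota Connected plans to deploy.

```python
from typing import Dict, Set

# Hypothetical registry contents: authenticated client -> agents it may use.
AGENT_REGISTRY: Dict[str, Set[str]] = {
    "dealer-portal": {"product_expert"},
    "owner-app": {"product_expert", "product_support"},
}


def route(client_id: str, requested_agent: str) -> str:
    """Return the agent to invoke, or raise if the client may not use it."""
    allowed = AGENT_REGISTRY.get(client_id)
    if allowed is None:
        raise PermissionError(f"unknown client: {client_id}")
    if requested_agent not in allowed:
        raise PermissionError(f"{client_id} may not call {requested_agent}")
    return requested_agent
```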
Toyota Connected plans to deploy multiple Strands agents in AgentCore Runtime:
Product Expert Agent: Essentially agentifies version 1 capabilities, answering questions about vehicle specifications, pricing, trim options, and accessories.
Product Support Agent: Services customer inquiries about their specific vehicles, expanding beyond the information-only capabilities of version 1.
Each agent couples with MCP servers providing necessary tools. The Product Support MCP Server will use AgentCore Gateway, which Toyota believes is a perfect fit. However, the Product Expert MCP Server requires response caching to be a responsible consumer of Toyota’s APIs - a hard requirement.
Stephen Short demonstrated particularly creative LLMOps engineering by using AgentCore Memory as a distributed cache. The approach involves:
The implementation required using the low-level client rather than the high-level client, as only the low-level client supports filtering based on event metadata. The code invokes the GMDP client’s list_events function with metadata filters checking if the cache key matches.
For memory to act as a shared cache across different MCP server sessions, specific configuration is needed (actor ID and session ID or agent ID must be statically coded). This enables memory to function as a distributed cache accessible by any agent, solving the response caching requirement while leveraging managed infrastructure.
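The caching pattern can be illustrated without the real service. In the sketch below, `InMemoryEventStore` is a hypothetical stand-in whose two methods only imitate the idea of writing an event and listing events by metadata; it is not the AgentCore Memory API. The cache-key scheme and the TTL are likewise assumptions. The fixed actor and session IDs reflect the static configuration described above.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, List

# Static IDs so every MCP server session reads and writes the same "cache".
CACHE_ACTOR_ID = "shared-cache"
CACHE_SESSION_ID = "shared-cache"


class InMemoryEventStore:
    """Hypothetical stand-in for the memory service (NOT the AgentCore API)."""

    def __init__(self) -> None:
        self._events: List[Dict[str, Any]] = []

    def create_event(self, actor_id: str, session_id: str, payload: Any, metadata: Dict[str, str]) -> None:
        self._events.append({"actor_id": actor_id, "session_id": session_id,
                             "payload": payload, "metadata": metadata,
                             "created_at": time.time()})

    def list_events(self, actor_id: str, session_id: str, metadata_filter: Dict[str, str]) -> List[Dict[str, Any]]:
        # Newest first, keeping only events whose metadata matches the filter.
        return [e for e in reversed(self._events)
                if e["actor_id"] == actor_id and e["session_id"] == session_id
                and all(e["metadata"].get(k) == v for k, v in metadata_filter.items())]


def cache_key(tool: str, args: Dict[str, Any]) -> str:
    """Stable key: same tool and arguments always hash to the same value."""
    blob = json.dumps({"tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def cached_call(store: InMemoryEventStore, tool: str, args: Dict[str, Any],
                fetch: Callable[..., Any], ttl_seconds: float = 3600.0) -> Any:
    key = cache_key(tool, args)
    hits = store.list_events(CACHE_ACTOR_ID, CACHE_SESSION_ID, {"cache_key": key})
    if hits and time.time() - hits[0]["created_at"] < ttl_seconds:
        return hits[0]["payload"]          # cache hit: skip the upstream API
    result = fetch(**args)                 # cache miss: call the upstream API
    store.create_event(CACHE_ACTOR_ID, CACHE_SESSION_ID, result, {"cache_key": key})
    return result
```

Repeated tool calls with identical arguments then reach Toyota's APIs only once per TTL window, regardless of which agent or session issued them.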
This creative repurposing of AgentCore Memory demonstrates sophisticated LLMOps thinking - identifying capabilities in platform services that can solve problems in non-obvious ways.
Toyota Connected targets a Q1 2026 launch for version 2. By eliminating the ETL pipeline and connecting agents directly to data sources via MCP servers, they expect to solve the data staleness issues plaguing version 1 while enabling new action-oriented capabilities. The move to AgentCore substantially reduces infrastructure overhead compared to maintaining custom agent orchestration, authentication, and scaling systems.
Bryan Landes provided context on the seven-year AWS-Toyota partnership. When he joined the Toyota account in 2018, their AWS footprint was “very small.” His team works not just with North America but also Toyota Motor Corporation in Japan, Toyota Connected in Japan, Woven by Toyota, and Toyota Racing Development (which uses SageMaker to predict NASCAR race outcomes).
Landes emphasized the importance of deeply embedded customer relationships where account teams are constantly engaging, learning, and building together. Toyota pushes AWS services daily, discovering new workload types continuously. There are approximately 47 different AI/ML use cases across Toyota entities.
The presentation referenced Toyota’s adoption of platform engineering principles with internal development platforms (IDPs) that democratize AI tooling across the organization. The concept is that one centralized platform enables DevOps at scale, building features and capabilities for developers, data scientists, and business users. Toyota has four or five such platforms depending on organizational structure.
The IDP approach allows deployment of agents at scale with confined governance, identity, and guardrails, preventing security teams from “freaking out” while enabling self-service across different organizational units (legal, HR, etc.). Landes mentioned Cisco and Spotify as other companies following similar patterns.
This case study demonstrates exceptionally mature LLMOps practices:
Strengths:
Areas of Concern:
Notable LLMOps Practices:
The fact that Toyota is presenting this at re:Invent and plans to use AgentCore for v2 suggests strong AWS partnership, but the presentation maintains credibility by openly discussing challenges like the data staleness problem and infrastructure overhead rather than only highlighting successes.
The case study touches on several production scaling aspects:
The relatively modest interaction numbers (7,000 per month is roughly 10 interactions per hour on average, since 7,000 / (30 × 24) ≈ 9.7) suggest this isn’t yet a massively scaled system, though dealer usage patterns likely show significant peaks. The infrastructure complexity may be more about governance, compliance, and multi-account organization than pure scale requirements.
The transition from RAG to agentic approaches reflects broader industry trends. Toyota’s experience suggests that:
The automotive industry’s specific needs - frequent data updates during model year rollouts, strict legal disclaimer requirements, multi-stakeholder audiences (dealers, customers, internal staff) - make this a particularly challenging domain for generative AI deployment. Toyota’s solutions offer lessons for other heavily regulated industries with similar constraints.
The emphasis on “AI teammates” rather than automation aligns with Toyota’s manufacturing philosophy and may offer a more sustainable approach to AI adoption than pure replacement narratives. The 15-17 hour per user time savings in contract analysis, combined with discovery of compliance issues, exemplifies how augmentation can provide value beyond simple efficiency gains.