ZenML

Automated LLM Evaluation Framework for Customer Support Chatbots

Instacart 2025
View original source

Instacart developed the LLM-Assisted Chatbot Evaluation (LACE) framework to systematically evaluate their AI-powered customer support chatbot performance at scale. The company faced challenges in measuring chatbot effectiveness beyond traditional metrics, needing a system that could assess nuanced aspects like query understanding, answer correctness, and customer satisfaction. LACE employs three LLM-based evaluation methods (direct prompting, agentic reflection, and agentic debate) across five key dimensions with binary scoring criteria, validated against human judgment through iterative refinement. The framework enables continuous monitoring and improvement of chatbot interactions, successfully identifying issues like context maintenance failures and inefficient responses that directly impact customer experience.

Industry

E-commerce

Technologies

Company and Use Case Overview

Instacart operates a grocery delivery platform serving millions of customers, where the majority of orders proceed smoothly but various issues can arise including missing items, delivery delays, refund questions, and membership inquiries. Traditionally, these customer service issues required human agents, creating wait times and scalability challenges. To address this, Instacart developed an AI-powered customer support chatbot to handle requests faster and at scale. However, the company recognized that building the chatbot was only part of the challenge - the more critical aspect was developing robust evaluation methods to ensure the chatbot actually helped customers in real conversations.

The core problem Instacart faced was that traditional chatbot metrics were insufficient for assessing the nuanced aspects of customer support quality. Simple metrics like response time or completion rates didn’t capture whether the chatbot truly understood customer queries, provided accurate information, maintained conversational context, or delivered satisfactory experiences. This led to the development of their LLM-Assisted Chatbot Evaluation (LACE) framework, which represents a sophisticated approach to automated evaluation of conversational AI systems in production.

Technical Architecture and Implementation

The LACE framework employs a multi-dimensional evaluation approach that assesses chatbot interactions across five key dimensions: Query Understanding, Answer Correctness, Chat Efficiency, Client Satisfaction, and Compliance. Each dimension contains granular binary criteria that address specific aspects of the user experience. For instance, Answer Correctness evaluates contextual relevancy, factual correctness, consistency, and usefulness. This comprehensive structure provides a holistic quality assessment by aggregating individual criterion scores into a total session score.

The system evaluates complete chat sessions, defined as full multi-turn conversations between customers and the support chatbot, rather than individual responses. This session-level evaluation captures the flow and context maintenance that are crucial for effective customer support interactions. The framework uses binary True/False scoring rather than granular scales, which the team found to be more effective through experimentation. Binary evaluations provide greater consistency, simplicity, and alignment with human judgment while requiring less extensive prompt engineering.

LLM-Based Evaluation Methods

LACE implements three distinct LLM-based evaluation approaches, each offering different trade-offs in terms of complexity and evaluation depth:

Direct Prompting serves as the baseline approach where an LLM evaluates chatbot interactions based on predefined criteria in a single pass. While efficient and straightforward, this method lacks the flexibility for the model to refine its assessment, potentially limiting evaluation depth.

Agentic Evaluation via Reflection introduces a sophisticated two-step process where the LLM first performs an initial evaluation, then reflects on its assessment to identify potential biases or gaps before adjusting and refining its judgment. This self-reflection mechanism helps deliver more nuanced and accurate evaluations by allowing the model to catch and correct its own limitations.

Agentic Evaluation via Debate creates a structured multi-agent system where three distinct LLM-simulated agents evaluate the chatbot’s performance from different perspectives. A Customer Agent adopts a critical customer mindset to scrutinize the chatbot’s response, a Support Agent independently defends the chatbot’s behavior and rationale, and a Judge Agent reviews the original chat along with both agents’ assessments to deliver an impartial evaluation. The Customer and Support agents operate independently and in parallel without access to each other’s assessments, while the Judge synthesizes multiple viewpoints to provide balanced evaluation.

Implementation Considerations and Best Practices

The team made several important architectural decisions to optimize the framework’s performance. They decoupled evaluation from structured output generation, allowing the model to first produce free-form text before converting it to structured JSON in a separate step. This approach enables the use of powerful reasoning models like o1-preview for generating high-quality evaluations without being constrained by JSON formatting requirements, which can negatively affect performance due to restricted decoding.

Prompt formatting received special attention, with the team adopting industry best practices by structuring prompts in Markdown format with clearly organized sections. This approach aligns with research showing that prompt formatting can significantly impact LLM output quality. The structured format helps delineate content and convey instructions more effectively to the evaluation models.

The system generates multiple structured fields for each evaluation, including True/False ratings for each criterion along with brief explanations. This structured output format facilitates post-processing and integration with monitoring dashboards while maintaining the flexibility to capture nuanced reasoning behind each evaluation decision.

Human-LLM Alignment and Validation

To ensure that LLM-based evaluations accurately reflect human judgment, Instacart implemented an iterative validation process. Human evaluators rated carefully selected customer conversations using the same criteria as the LACE system, and their ratings were compared to those generated by LACE. When misalignments were identified, the team refined the evaluation framework through two primary mechanisms: refining existing criteria by improving definitions and prompts, and redesigning the criteria structure when simpler refinements proved insufficient.

This human-LACE comparison and refinement cycle was repeated multiple times until strong alignment was achieved between human judgment and automated evaluation. The process served to bootstrap the initial system and continues to be used for validating updates and catching regressions. This validation approach demonstrates a critical aspect of responsible LLMOps - ensuring that automated systems maintain alignment with human expectations and values.

Evaluation Criteria Complexity and Lessons Learned

Through iterative refinement, the team discovered that evaluation criteria naturally grouped into three categories based on complexity:

Simple Criteria rely on universal standards that don’t require deep contextual understanding, such as evaluating professionalism through tone and politeness assessments. These proved straightforward to measure and improve, with the agentic debate-style approach achieving near-perfect accuracy.

Context-Dependent Criteria present greater challenges, requiring interpretation of responses based on Instacart-specific behaviors and operational nuances. For example, assessing contextual relevancy means evaluating whether the chatbot correctly integrates conversation history while demonstrating understanding of Instacart’s business model and processes. These criteria demand close attention to conversational details and interpretation of information that might contradict surface meaning. The team embedded specialized knowledge into the chatbot’s prompts through static templates, achieving over 90% accuracy with the agentic debate approach.

Subjective Criteria, such as answer conciseness, proved most challenging due to varying human preferences and the tendency for LLMs to apply stricter standards than humans. Rather than investing significant effort in refining these ambiguous criteria, the team uses them as high-level directional checks to flag unexpected changes while focusing improvement efforts on direct chatbot behavior modification through prompt refinement and model fine-tuning.

Production Monitoring and Continuous Improvement

LACE powers Instacart’s continuous evaluation pipeline, enabling analysis of chat logs and driving targeted improvements across their AI-powered customer support system. Using stratified sampling based on topic distribution, LACE feeds into dashboards that allow monitoring of performance trends over time, analysis of specific interaction details to pinpoint issues, and integration of feedback directly into experimentation platforms for real-time improvements.

The dashboard provides multi-layered insights into chatbot performance, from top-level score trends to detailed breakdowns of criterion-level failures across conversation topics. A dedicated “Failed Chat” section highlights interactions that didn’t meet standards, categorizing them by failure reason and linking them to corresponding chat segments for deeper investigation.

The system successfully identifies practical issues such as failures to maintain context when users request to continue conversations, and inefficient responses where the assistant repeats the same message without adapting or advancing the conversation. These dimension-level failure identifications enable targeted improvements that make AI interactions more context-aware, efficient, and helpful.

Scalability and Knowledge Management Challenges

Managing Instacart-specific knowledge poses ongoing challenges, as overloaded prompts can dilute effectiveness by reducing the signal-to-noise ratio. The team is exploring dynamic prompt construction paired with real-time knowledge retrieval, similar to Retrieval-Augmented Generation (RAG) approaches. This strategy would ensure the chatbot retrieves relevant knowledge on demand while keeping prompts concise and focused, maintaining strong evaluation performance without overwhelming the system with static information.

Broader Implications for LLMOps

The LACE framework demonstrates several important principles for effective LLM evaluation and deployment in production environments. The approach of balancing precision with pragmatism, leveraging agentic LLM evaluation techniques, and focusing efforts where they yield the greatest impact extends beyond chatbot development to illuminate broader principles about AI evaluation in real-world settings where user experience remains paramount.

The iterative refinement process, human-AI alignment validation, and the recognition that different types of criteria require different evaluation approaches provide valuable insights for organizations implementing similar systems. The framework’s emphasis on continuous monitoring and improvement, combined with its integration into experimentation platforms, exemplifies mature LLMOps practices that enable rapid iteration and systematic enhancement of AI systems in production.

The case study also highlights the importance of operationalizing evaluation frameworks at scale, moving beyond simple metrics to capture the nuanced aspects of AI system performance that directly impact user experience. This comprehensive approach to LLM evaluation represents a significant advancement in the field of conversational AI operations and provides a template for other organizations seeking to implement robust evaluation systems for their production LLM applications.

More Like This

Building Production-Ready AI Assistant with Agentic Architecture

Shopify 2025

Shopify developed Sidekick, an AI-powered assistant that helps merchants manage their stores through natural language interactions, evolving from a simple tool-calling system into a sophisticated agentic platform. The team faced scaling challenges with tool complexity and system maintainability, which they addressed through Just-in-Time instructions, robust LLM evaluation systems using Ground Truth Sets, and Group Relative Policy Optimization (GRPO) training. Their approach resulted in improved system performance and maintainability, though they encountered and had to address reward hacking issues during reinforcement learning training.

customer_support chatbot data_analysis +27

Company-Wide GenAI Transformation Through Hackathon-Driven Culture and Centralized Infrastructure

Agoda 2025

Agoda transformed from GenAI experiments to company-wide adoption through a strategic approach that began with a 2023 hackathon, grew into a grassroots culture of exploration, and was supported by robust infrastructure including a centralized GenAI proxy and internal chat platform. Starting with over 200 developers prototyping 40+ ideas, the initiative evolved into 200+ applications serving both internal productivity (73% employee adoption, 45% of tech support tickets automated) and customer-facing features, demonstrating how systematic enablement and community-driven innovation can scale GenAI across an entire organization.

customer_support code_generation document_processing +44

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.

healthcare fraud_detection customer_support +90