Echo AI: Automated LLM Evaluation and Quality Monitoring in Customer Support Analytics

LLMOps Database

Company

Echo AI

Title

Automated LLM Evaluation and Quality Monitoring in Customer Support Analytics

Industry

Tech

Link

https://www.youtube.com/watch?v=42q8OmAF_Gw

Year

Summary (short)

Echo AI, leveraging Log10's platform, developed a system for analyzing customer support interactions at scale using LLMs. They faced the challenge of maintaining accuracy and trust while processing high volumes of customer conversations. The solution combined Echo AI's conversation analysis capabilities with Log10's automated feedback and evaluation system, resulting in a 20-point F1 score improvement in accuracy and the ability to automatically evaluate LLM outputs across various customer-specific use cases.

## Overview

This case study, presented as a joint talk between Echo AI and Log10, demonstrates a real-world production LLM deployment focused on customer conversation analytics at enterprise scale. Echo AI is a platform that connects to various customer communication channels (support tickets, chat, phone calls) and uses generative AI to extract insights, categorize conversations, and surface actionable information for customer-facing teams. The partnership with Log10 addresses one of the most critical challenges in LLMOps: maintaining accuracy and trust when deploying LLMs at scale.

## The Business Problem

Echo AI serves enterprises dealing with exceptionally high volumes of customer interactions. The core insight motivating their platform is that most companies only manually review a small sample (around 5%) of customer conversations, leaving the vast majority unanalyzed. Traditional approaches involve:

- Manual reviews with small sample sizes that leave stakeholders unhappy due to inaccuracy
- Engineering scripts for retroactive analysis that pull data from multiple systems
- Building software to detect specific, pre-known patterns in conversations

All of these approaches are fundamentally reactive: they happen after problems have already occurred. As one speaker noted, "everything is after fires had formed, you have no sense of where the smoke is."

The promise of generative AI is 100% coverage: analyzing every conversation rather than sampling, and surfacing insights that the system was not explicitly programmed to look for. However, this introduces significant LLMOps challenges around accuracy, trust, and ongoing quality management.

## Technical Architecture

Echo AI's system follows a pipeline architecture that is common in production LLM applications:

**Data Ingestion and Normalization**: The platform connects to various contact systems and ticket systems, pulling in customer conversations.
This data is normalized, cleaned, and compressed to be passed efficiently into LLM prompts. While described as "non-AI boring stuff," this ETL layer is critical infrastructure for any production LLM system.

**Multiple Analysis Pipelines**: Echo AI runs dozens of pipelines that assess conversations in different ways. These are configurable by users/customers, who can specify what they care about and what they're looking for. Notably, customers work with Echo AI to write prompts, and eventually take ownership of prompt management over time. This represents a mature approach to prompt engineering in production: treating prompts as configurable, customer-specific assets rather than fixed system components.

**Extracted Insights**: From a single customer message, the system extracts multiple dimensions:

- Intent (why is this customer reaching out)
- Business aspects at root of issue (e.g., routers, broken deliveries, supply chain)
- Sentiment of customers and representatives
- Other domain-specific classifications

**Self-Hosted Models**: Due to the immense volume of prompts and throughput requirements, Echo AI does "quite a bit of self-hosting" and is "constantly training new models to better handle different domains of our customer base." This highlights a key production consideration: when scaling LLM applications, self-hosting and fine-tuning can become necessary for cost and latency reasons.

## The Accuracy Challenge

The presentation emphasizes that enterprise customers are primarily concerned with accuracy. There is significant hesitation in the market around whether generative AI insights can be trusted, and whether they're better than what human business analysts, CX leaders, or sales VPs could produce.

Echo AI's approach to building trust involves:

- Delivering initial insights within seven days of customer onboarding
- Working towards approximately 95% accuracy (though acknowledged as involving "a lot of sampling and figuring out")
- Establishing trust early because "that's ultimately what's going to get them to keep renewing"

This commercial pressure around accuracy is what makes the Log10 partnership critical.

## Log10's Auto Feedback System

Log10 provides what they describe as an "infrastructure layer to improve LLM accuracy." Their vision is building self-improving systems where LLM applications can improve prompts and models themselves. While acknowledging they're "not there as a field yet," they've made progress with their Auto Feedback system.

**The Problem with LLM-as-Judge**: The presentation cites several well-known issues with using LLMs to evaluate LLM outputs:

- Models prefer their own output
- Positional bias (preferring first-presented options)
- Verbosity bias
- Bias towards diversity of tokens
- Often just predicting a single score regardless of ground truth

**Auto Feedback Research**: Log10 conducted research on building auto feedback models using three approaches:

- Few-shot learning
- Fine-tuning
- Bootstrap synthetic data generation followed by fine-tuning

Key research findings include:

- 45% improvement in evaluation accuracy by moving from aggregate to annotator-specific models
- Upgrading from GPT-3.5 to GPT-4 as base model improved results
- Fine-tuned models outperformed few-shot approaches
- Sample efficiency: Using bootstrapping, they achieved accuracy equivalent to 1,000 ground truth examples with only 50 examples
- Open-source model compatibility: Matched GPT-4 and GPT-3.5 fine-tuned evaluation accuracy using Mistral 7B and Llama 70B Chat

## Integration and Workflow

The Log10 platform integrates via a "seamless one-line integration" that sits between the LLM application and the LLM SDK.
It supports OpenAI, Anthropic, Gemini, and open-source models, plus framework integrations. For Echo AI specifically, the integration enables:

**Engineer Debugging**: When outputs are problematic (the demo showed a summarization failure that just reiterated system prompt instructions), engineers can quickly investigate by examining the generated prompts and understanding failure modes.

**Solution Engineer Workflow**: Solution engineers working directly with customers can view auto-generated feedback scores and provide human overrides when needed. The interface allows changing point values and accepting corrections. This creates an "effortless" way to collect high-fidelity human feedback at scale.

**Monitoring Use Cases**: The auto feedback system enables:

- Tracking hallucinations via automated scoring
- Detecting model drift in a "data-driven way" that was previously impossible
- Triaging for optimal use of limited human review resources
- Curating high-quality datasets for automated prompt improvement and fine-tuning

## Production Considerations Highlighted

Several production-specific challenges are addressed in this case study:

**Prompt Diversity**: Because every customer is different and each brings different requirements, there's an "immense number of prompts" that must be managed. This creates unique challenges for quality assurance: you can't just evaluate a single system prompt, you need tooling that scales across customer-specific configurations.

**Summarization as Critical Infrastructure**: Echo AI relies on summarization not just as a user-facing feature but as input for "a variety of different downstream analysis." This cascade dependency makes summarization accuracy particularly important, because errors propagate through the system.

**Trust and Maintenance**: The system requires ongoing maintenance to "achieve the utmost trust" with customers. This isn't a deploy-and-forget situation; there's continuous work to monitor quality and improve models.
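
The review loop described under Integration and Workflow (auto-generated feedback scores, human overrides, triage of limited reviewer time, and dataset curation) can be sketched roughly as follows. This is a minimal illustration, not Log10's or Echo AI's implementation: the `Feedback` record, the 1-5 score scale, the thresholds, and the function names `triage` and `curate` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feedback:
    """One LLM output with an automated score and an optional human override.

    Hypothetical record: field names and the 1-5 scale are assumptions.
    """
    completion_id: str
    auto_score: int                     # score predicted by an auto-feedback model
    human_score: Optional[int] = None   # set only when a reviewer overrides

    @property
    def final_score(self) -> int:
        # A human override, when present, takes precedence over the auto score.
        if self.human_score is not None:
            return self.human_score
        return self.auto_score


def triage(items: list[Feedback], review_threshold: int = 2, budget: int = 10) -> list[Feedback]:
    """Return the lowest-scoring unreviewed items, capped at the review budget."""
    candidates = [
        f for f in items
        if f.human_score is None and f.auto_score <= review_threshold
    ]
    return sorted(candidates, key=lambda f: f.auto_score)[:budget]


def curate(items: list[Feedback], min_score: int = 4) -> list[Feedback]:
    """Keep human-verified, high-scoring items as candidate fine-tuning data."""
    return [
        f for f in items
        if f.human_score is not None and f.final_score >= min_score
    ]


if __name__ == "__main__":
    batch = [
        Feedback("c1", auto_score=5),
        Feedback("c2", auto_score=1),
        Feedback("c3", auto_score=2),
        Feedback("c4", auto_score=3, human_score=5),
    ]
    # Send the single worst-scoring unreviewed output to a human first.
    print([f.completion_id for f in triage(batch, budget=1)])
    # A reviewer disagrees with the auto score and overrides it.
    batch[1].human_score = 4
    print([f.completion_id for f in curate(batch)])
```

Keeping the automated score and the human override as separate fields means disagreements between the two stay visible, which is the kind of signal the talk describes using to improve the evaluation models over time.
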

## Results

The partnership claims a 20 F1-point improvement in accuracy for specific use cases. While the exact baseline and methodology aren't detailed in the presentation, this represents a significant claimed improvement in production accuracy.

A concrete customer example mentioned was Wine Enthusiasts, a company selling high-end wine refrigerators. Echo AI's real-time analysis surfaced a manufacturing defect that could have "gone on for weeks and weeks and weeks" before detection through traditional methods.

## Critical Assessment

This case study represents a joint presentation from a vendor (Log10) and customer (Echo AI), so claims should be considered in that context. The accuracy improvement metrics (20 F1 points, 45% evaluation accuracy improvement) are presented without detailed methodology. The "95% accuracy" target for Echo AI is acknowledged as involving "a lot of sampling and figuring out," suggesting the measurement itself is challenging.

That said, the case study presents a realistic picture of LLMOps challenges at scale: the need for customer-specific prompt configuration, the importance of automated evaluation to supplement limited human review capacity, the challenge of maintaining quality across model updates, and the commercial pressure to build and maintain customer trust in AI-generated insights. The emphasis on human-in-the-loop feedback collection and the acknowledgment that the field isn't yet at truly "self-improving systems" reflects a mature understanding of current LLM limitations.

The technical approaches discussed (fine-tuning evaluation models, bootstrap data generation, open-source model deployment) represent practical production strategies rather than theoretical frameworks, making this a useful reference for teams building similar systems.
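
For readers less familiar with the metric behind the "20 F1 points" figure: F1 is the harmonic mean of precision and recall, and a gain of 20 points corresponds to a 0.20 increase on the 0-1 scale. The sketch below computes binary F1 from paired labels. The sample labels are invented purely for illustration and are not data from the talk, which did not disclose its baseline, its methodology, or whether the score was binary or averaged across classes.

```python
def f1_score(y_true: list[int], y_pred: list[int]) -> float:
    """Binary F1: harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative labels only, e.g. 1 = "conversation mentions a product defect".
    truth = [1, 1, 1, 1, 0, 0, 0, 0]
    before = [1, 1, 0, 0, 1, 0, 0, 0]   # hypothetical predictions before a change
    after = [1, 1, 1, 0, 0, 0, 0, 0]    # hypothetical predictions after a change
    gain = f1_score(truth, after) - f1_score(truth, before)
    print(f"F1 before: {f1_score(truth, before):.2f}")
    print(f"F1 after:  {f1_score(truth, after):.2f}")
    print(f"Gain in F1 points: {100 * gain:.0f}")
```
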
