
Building a Production-Grade GenAI Customer Support Assistant with Comprehensive Observability

Elastic 2024

Elastic developed a customer support chatbot using generative AI and RAG, focusing heavily on production-grade observability practices. They implemented a comprehensive observability strategy using Elastic's own stack, including APM traces, custom dashboards, alerting systems, and detailed monitoring of LLM interactions. The system successfully launched with features like streaming responses, rate limiting, and abuse prevention, while maintaining high reliability through careful monitoring of latency, errors, and usage patterns.

Industry

Tech

Summary

Elastic’s Field Engineering team developed a comprehensive observability infrastructure for their Support Assistant, a generative AI-powered customer support chatbot. This case study is part of a broader series documenting how Elastic built their customer support chatbot, with this particular entry focusing on the observability aspects critical to running LLMs in production. The Support Assistant uses RAG (Retrieval-Augmented Generation) with GPT-4 and GPT-4o models through Azure OpenAI to help customers get technical support for Elastic products.

The key insight from this case study is that observability serves dual purposes: demonstrating value during good times (tracking metrics like chat completions and unique users) and enabling rapid troubleshooting during incidents. The team discovered a bug causing repeated data loading (100+ transactions per minute instead of the expected 1 TPM) through their APM data, and tracked their launch success (100th chat completion just 21 hours post-launch).

Observability Architecture

Data Collection Infrastructure

The team maintains a dedicated Elastic Cloud cluster for monitoring purposes, separate from their production and staging data clusters. This separation is a best practice that ensures monitoring workloads don’t impact production performance and vice versa.

Their observability stack consists of several components, with APM instrumentation at its core.

The APM configuration emphasizes using environment variables for configuration (ELASTIC_APM_ENVIRONMENT, ELASTIC_APM_SERVER_URL, ELASTIC_APM_SECRET_TOKEN) and the importance of setting the environment field to distinguish between production, staging, and development traffic. The team recommends integrating APM early in development rather than waiting until the end, as it helps with debugging during development and establishes baseline metrics for alerting thresholds.
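The setup the team describes can be sketched as follows. This is a minimal, hypothetical example of starting the elastic-apm-node agent; the service name is an assumption, while the three ELASTIC_APM_* variables are the ones the article names. Note that the agent must be started before any other module is loaded so it can instrument them.

```javascript
// Hypothetical agent bootstrap; not Elastic's actual code.
// Setting `environment` separates production, staging, and dev traffic in APM.
const apm = require('elastic-apm-node').start({
  serviceName: 'support-assistant-api',              // assumed name
  environment: process.env.ELASTIC_APM_ENVIRONMENT,  // e.g. production | staging | dev
  serverUrl: process.env.ELASTIC_APM_SERVER_URL,
  secretToken: process.env.ELASTIC_APM_SECRET_TOKEN,
});
```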

Dashboard Strategy

Status Dashboard

The status dashboard serves as the primary operational view, organized around four questions: usage, latency, errors, and capacity.

The team uses Kibana’s markdown visualization for convenient links to related resources in the dashboard’s prime real estate. They save default time ranges with dashboards—initially pinned from launch time to “now” to show the feature’s entire life, with plans to update to rolling windows like “last 30 days” as the feature matures.

APM Traces for GenAI

The APM traces capture the full journey of a request through the frontend server and API service, with custom spans monitoring GenAI-specific performance characteristics.

These custom spans provide the data needed to chart average durations on the dashboard and understand the user experience beyond raw HTTP latency.
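A custom span of this kind can be sketched with a small wrapper helper. This is an assumed pattern, not Elastic's implementation: `withSpan` is a hypothetical helper around the elastic-apm-node `apm.startSpan(name, type)` / `span.end()` API, used to time one GenAI step (for example, the LLM generation phase) inside the active transaction.

```javascript
// Hypothetical helper: time one async GenAI step as a custom APM span so its
// average duration can be charted on the dashboard. `apm` is the
// elastic-apm-node agent; any object with the same shape works for testing.
async function withSpan(apm, name, type, fn) {
  const span = apm.startSpan(name, type); // null when no transaction is active
  try {
    return await fn();
  } finally {
    if (span) span.end();
  }
}
```

A caller would wrap, say, the completion call: `await withSpan(apm, 'llm-generation', 'genai', () => callLLM(prompt))`.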

ES|QL for Complex Analytics

The team used ES|QL to solve a challenging analytics problem: visualizing returning users. Computing the overlapping windows needed to distinguish first-time from returning users per day isn't possible with standard histogram visualizations. ES|QL let them process the dataset into unique combinations of user email and request date, from which the number of unique visit days per user can be counted.
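The shape of such a query might look like the following sketch. The index and field names (`support-assistant-logs`, `user.email`, `@timestamp`) are assumptions, not taken from the article:

```
FROM support-assistant-logs
| EVAL day = DATE_TRUNC(1 day, @timestamp)
| STATS visit_days = COUNT_DISTINCT(day) BY user.email
```

Each resulting row gives one user and the count of distinct days on which they used the assistant, which can then feed a first-time vs. returning breakdown.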

Milestones Dashboard

A separate dashboard highlights achievements and growth, featuring horizontal bullet visualizations as gauges with target goals. Metrics are displayed across multiple time windows (all time, last 7 days, last 30 days) with bar charts aggregating by day to visualize growth trends.

Alerting Strategy

The team applies a thoughtful framework for alert configuration. Thresholds can be based on either defined quality of service targets or observed baselines from production data with tolerance for deviation. A key best practice mentioned: only page on-call staff for alerts with well-defined investigation/resolution steps—otherwise, send less demanding notifications like emails to avoid training the team to ignore alerts.

Alert severity determines notification channel—email for warnings, Slack messages tagging the team for critical issues. The team recommends testing alert formatting by temporarily configuring triggers that fire immediately.

GenAI-Specific Observability Challenges

First Generation Timeouts

The Support Assistant uses streaming responses to avoid waiting for full LLM generation before showing output. A 10-second timeout on receiving the first chunk of generated text helps maintain a responsive experience. Initially configured in client-side code, this created observability challenges because the server never became aware when the client aborted the request.

The solution was moving the timeout and AbortController to the API layer that talks directly to the LLM. When timeouts occur, the server can now send errors to APM via captureError and properly close the connection. An additional refinement was added: before closing the connection, the server sends an Error event through the streaming protocol (which uses Started, Generation, End, and Error event types) so the client can properly update the UI state.
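The server-side pattern can be sketched as a small streaming wrapper. This is a simplified, hypothetical reconstruction (the 10-second value comes from the article; the helper and its signature do not): if the first chunk has not arrived within the timeout, the upstream request is aborted and a callback fires where the real code would call captureError and emit the Error event.

```javascript
// Hypothetical sketch: abort the upstream LLM call if the first streamed
// chunk doesn't arrive within `timeoutMs`; chunks pass through otherwise.
async function* withFirstChunkTimeout(stream, controller, timeoutMs, onTimeout) {
  let first = true;
  const timer = setTimeout(() => {
    controller.abort();                                  // cancel the LLM request
    onTimeout(new Error('first-generation timeout'));    // report via captureError here
  }, timeoutMs);
  try {
    for await (const chunk of stream) {
      if (first) { clearTimeout(timer); first = false; } // first token arrived in time
      yield chunk;
    }
  } finally {
    clearTimeout(timer);
  }
}
```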

For error grouping, parameterized message objects are passed to the error handler so APM groups all errors from the same handler together despite varying error messages. Parameters include error message, error code, and which LLM was used.
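The parameterized-message pattern is part of the elastic-apm-node API; the wrapper below is a hypothetical sketch of how it might be applied here (the message text and field choices are assumptions, though the parameters listed match the article: error message, error code, and which LLM was used).

```javascript
// Hypothetical wrapper: a parameterized message object keeps APM's error
// grouping stable even though the interpolated details vary per request.
function reportGenerationError(apm, err, modelName) {
  apm.captureError({
    message: 'Chat generation failed: %s (code %s, model %s)',
    params: [err.message, err.code ?? 'unknown', modelName],
  });
}
```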

Declined Request Monitoring

The Support Assistant is designed to decline two categories of requests: off-topic queries (email drafting, song lyrics, etc.) and topics it cannot answer well (like billing questions where inaccurate answers are worse than none).

The team uses prompt engineering rather than a separate moderation service, though they may evolve this approach. A key observability insight: by using a standardized decline response in the prompt, they can compare the LLM’s output to this predefined message and use captureError when there’s a match. This enables monitoring for spikes in rejections that might indicate users attempting to bypass restrictions.

An optimization suggested by a colleague: instead of buffering the entire response for comparison, track an index into the expected decline message and compare tokens as they stream—if any don’t match, it’s not a declined request. This avoids memory overhead, though the team hasn’t implemented it since they haven’t observed performance issues.
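The suggested streaming comparison can be sketched as follows. The decline message here is a placeholder, not Elastic's actual canned response; the detector tracks an index into the expected text and bails out on the first mismatching token.

```javascript
// Placeholder for the standardized decline response baked into the prompt.
const DECLINE_MESSAGE = 'I can only help with questions about Elastic products.';

// Hypothetical sketch of the token-by-token comparison: no buffering of the
// full response; the first divergent token rules out a decline.
function makeDeclineDetector(expected = DECLINE_MESSAGE) {
  let index = 0;
  let mismatch = false;
  return {
    push(token) {
      if (mismatch) return;
      if (expected.startsWith(token, index)) index += token.length;
      else mismatch = true;
    },
    // A declined request matches the canned message exactly, end to end.
    isDecline() { return !mismatch && index === expected.length; },
  };
}
```

On a match, the server would call captureError so rejection spikes show up in monitoring.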

Rate Limiting

The chat completion endpoint has a dedicated rate limit separate from the general API limit. Using internal usage data (heaviest users sent 10-20 messages/day, top user sent 70 in one day) and latency metrics (20-second average completion time, implying a theoretical max of ~180 chats/hour for a single tab), the team set a limit of 20 chat completions per one-hour window. This matches heavy internal usage while limiting malicious users to ~11% of theoretical maximum throughput.

The monitoring includes alerts for HTTP 429 responses and a dashboard table listing users who triggered the limit, frequency, and recency.

Ban Flags

The feature flag system was enhanced to support features that are on by default and flags that block access. This enables customer organizations to block employee access to the Support Assistant if desired, and allows Elastic to cut off access to users consistently violating usage policies while outreach occurs.
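The semantics described above can be sketched in a few lines; the flag names here are hypothetical, since the article doesn't reveal them. The feature is on by default, and either blocking flag revokes access.

```javascript
// Hypothetical flag check: on by default, denied if any blocking flag is set.
const BLOCKING_FLAGS = [
  'support-assistant-org-blocked', // customer org opted employees out
  'support-assistant-user-banned', // Elastic cut off a policy-violating user
];

function canUseAssistant(activeFlags) {
  return !BLOCKING_FLAGS.some((flag) => activeFlags.includes(flag));
}
```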

Large Context Payloads

Observability caught an issue where HTTP 413 status codes indicated payloads exceeding server limits. The cause was RAG search context combined with user input exceeding size limits. The short-term fix increased accepted payload size for the chat completion endpoint. The planned long-term solution involves refactoring to send only RAG result metadata (ID, title) to the client, with the completion endpoint fetching full content by ID server-side.
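The planned refactor reduces client payloads to the two metadata fields the article names. A minimal sketch (field names beyond `id` and `title` are assumptions):

```javascript
// Hypothetical sketch of the long-term fix: the client only ever receives
// RAG result metadata; the completion endpoint re-fetches full passages by
// ID server-side, keeping large context out of the request payload.
function toClientRagMetadata(ragResults) {
  return ragResults.map(({ id, title }) => ({ id, title }));
}
```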

Model Upgrades

The team upgraded from GPT-4 to GPT-4o during the observation period, which is visible in the latency metrics on their dashboard—a testament to the value of continuous observability in understanding system behavior changes.

Key Takeaways

This case study demonstrates mature LLMOps practices: separating monitoring infrastructure from production, instrumenting early in development, building dashboards around questions that matter, configuring actionable alerts with appropriate escalation paths, and implementing GenAI-specific observability for streaming responses, content moderation, rate limiting, and abuse prevention. The Elastic stack (APM, Kibana, Elasticsearch, ES|QL) provides the foundation, but the practices are largely transferable to other observability platforms.
