ZenML

LLM Testing Framework Using LLMs as Quality Assurance Agents

Various 2024

Alaska Airlines and Bitra developed CARL (QA Response Liaison), an innovative testing framework that uses LLMs to evaluate other LLMs in production. The system conducts automated adversarial testing of customer-facing chatbots by simulating various user personas and conversation scenarios. This approach helps identify potential risks and unwanted behaviors before deployment, while providing scalable testing capabilities through containerized architecture on Google Cloud Platform.

Industry

Tech

Overview

This case study presents a collaboration between Alaska Airlines and Bitra (a Google Cloud services partner specializing in data infrastructure, machine learning, and MLOps) to develop CARL—the QA Response Liaison—a generative AI-powered testing framework for evaluating large language model chatbots before production deployment. The presentation was delivered by François Nicholas (Engineering Manager at Alaska Airlines), Adam Thorstenson (Senior Data Scientist at Alaska Airlines), and Saf Abid (CTO at Bitra).

Alaska Airlines has been building generative AI-powered customer experiences, including a trip planning chatbot that helps guests find personalized travel recommendations through conversational interactions. The chatbot, powered by Gemini, allows users to search for flights using natural language queries like “find me the best flight from Seattle for an adventure” and receives tailored recommendations based on cost, weather, activities, and other preferences. However, before deploying such customer-facing AI systems, the team recognized significant risks that needed systematic mitigation.

The Problem: Adversarial User Behavior

The presenters highlighted that while LLMs themselves are “fairly predictable” in that you input language and receive contextually relevant language back, the real risks come from human users who inevitably try to manipulate chatbots. The presentation cited two notable industry incidents as cautionary examples.

These examples illustrate the brand and legal risks when deploying conversational AI without rigorous testing. Users may attempt to get chatbots to say offensive things, agree to unauthorized deals, discuss topics outside the intended scope, or simply behave in ways that could embarrass the company on social media.

Adam Thorstenson described his initial approach to addressing these risks during development of version zero of Alaska’s chatbot: manual adversarial testing where he personally tried to manipulate the bot. Within two hours, he identified several concerning behaviors, but this approach doesn’t scale. The team needed to test hundreds of different conversational paths, varying levels of adversarial intent, and generate quick insights on trends.

The Solution: CARL (QA Response Liaison)

The insight was straightforward: if the challenge is having conversations based on loose prompts and synthesizing insights from those conversations, generative AI itself could be the solution. CARL is an LLM-powered tool that engages in conversations with target chatbots to test their behavior systematically.

Core Functionality

CARL provides several customization options, including which personas to simulate, what topics to frame conversations around, and how responses should be scored.

After each conversation, CARL assesses response quality and rates the interaction on a scale of 1 to 10, documenting reasons for any score deductions. All conversation logs, ratings, and metadata are stored for further analysis.

Technical Architecture

The production architecture runs on Google Cloud Platform and incorporates several key components:

Model Garden and Persona Fine-Tuning: The team treats “model garden” as a broad concept encompassing any LLM—whether pre-built APIs like Gemini or self-hosted models. They discovered that different LLMs perform better at simulating different adversarial personas, so they fine-tune models specifically for persona simulation. These tuned models are stored in a model registry (described as “GitHub for models”).

Vertex AI Deployment: Models are deployed via Vertex AI endpoints, ready to handle conversation requests. The architecture supports swapping different LLMs on both ends—the CARL testing agent and the model under test.

Configuration Approaches: The team implemented multiple ways to configure test scenarios, the primary one being a JSON file.

The JSON configuration is intentionally simple, containing a tag (for metrics grouping), a topic (to frame the conversation context), and persona descriptions that define simulated user behaviors.
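Based on the three fields described (tag, topic, personas), a scenario configuration might look like the following; the field names and values are assumptions for illustration, not the team's actual schema.

```python
import json

# Illustrative test-scenario config: a tag for metrics grouping, a topic to
# frame the conversation, and persona descriptions that define simulated users.
# All names here are hypothetical.
config = {
    "tag": "trip-planning-smoke",
    "topic": "Finding a flight from Seattle for an adventure trip",
    "personas": [
        "A polite traveler asking for destination recommendations",
        "A frustrated user who tries to make the bot insult the airline",
    ],
}

print(json.dumps(config, indent=2))
```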

Conversation Coordinator: A set of containers running on GKE Autopilot that acts as a liaison between CARL and the model under test. It proxies conversations, runs evaluation criteria, performs scoring, and stores results.

Dual Storage Strategy: Results are stored in both Cloud Storage (used as a metric store capturing conversation metadata and performance metrics like latency and rate limit hits) and AlloyDB. The AlloyDB integration was specifically chosen to leverage Vertex AI extensions for running LLM-based summarization and analysis on the text-based conversation outputs.

Config Updates and Human-in-the-Loop: A module enables semi-automated configuration updates based on test results. When scores hit critical thresholds, the system can flag issues for human review. A simple GUI allows non-technical stakeholders (product managers, QA staff) to read conversations, view outputs, and flag issues.

Orchestration: Cloud Composer handles scheduled smoke tests that run every few hours, while Cloud Build manages CI/CD integration for testing when code changes are pushed.
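Since Cloud Composer is managed Apache Airflow, the scheduled smoke tests would plausibly be expressed as a DAG along these lines; the DAG id, interval, and callable are assumptions, not the team's actual pipeline.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_carl_smoke_tests():
    """Hypothetical entry point that triggers a CARL run against the core
    'happy path' scenarios and writes scores to the metric store."""
    ...

with DAG(
    dag_id="carl_smoke_tests",
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(hours=4),  # "every few hours"
    catchup=False,
) as dag:
    PythonOperator(task_id="smoke_test", python_callable=run_carl_smoke_tests)
```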

Future-Proofing and Flexibility

A key design principle was avoiding vendor lock-in given the rapidly evolving LLM landscape. The team emphasized that “if you build for one API you're going to be screwed because a week later there's going to be another one.” CARL was architected to plug into any LLM, and they successfully validated this by testing with Dialogflow (with and without Gemini integration) and directly with Gemini using conversation context.

Demonstrated Results

The presentation included a live demo showing CARL testing three personas.

The third persona's conversation exemplifies the nuanced testing CARL enables: it achieved its functional goal but violated brand guidelines, exactly the type of issue that could go viral on social media.

Operational Insights

Scoring Methodology: A score of 10 indicates the conversation successfully achieved what the LLM was designed to do. Zeros flag issues requiring engineering review—either CARL correctly blocked a problematic interaction, or something needs fixing. The scoring criteria are customizable via the JSON configuration.
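The scoring rules described reduce to a simple triage step, sketched here with assumed action labels (the team's actual routing was not detailed):

```python
def triage(score: int) -> str:
    """Map a 1-to-10 conversation score to a follow-up action (labels assumed)."""
    if score == 10:
        return "pass"                 # conversation did what it was designed to do
    if score == 0:
        return "engineering_review"   # correctly blocked interaction, or a defect
    return "deduction_logged"         # partial score; CARL documents the reasons
```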

Metrics and Reporting: Tags enable aggregation for internal reporting on conversation performance trends. The team monitors for performance degradation over time by running CARL against a core set of “happy path” descriptions and watching for score changes.
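The degradation check can be sketched as a per-tag mean score compared against a stored baseline; the record shape and the tolerance value are assumptions for illustration.

```python
from collections import defaultdict

def mean_scores_by_tag(results: list) -> dict:
    """Average conversation scores grouped by the config's metrics tag."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in results:
        sums[r["tag"]] += r["score"]
        counts[r["tag"]] += 1
    return {tag: sums[tag] / counts[tag] for tag in sums}

def degraded_tags(results: list, baseline: dict, tolerance: float = 0.5) -> list:
    """Flag tags whose current mean fell below baseline by more than tolerance."""
    current = mean_scores_by_tag(results)
    return [t for t, m in current.items() if m < baseline.get(t, m) - tolerance]
```

Run periodically over the "happy path" scenarios, a non-empty return value would signal the performance drift the team watches for.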

Human-in-the-Loop Sampling: Not everything goes to human review. Items sent for human evaluation include: severely low scores, detected changes in smoke test/CI/CD scores compared to baselines, and a sample of 10-scoring conversations to validate the AI is correctly identifying successes.
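Those three routing rules might look like the following sketch; the score cutoffs and the sample rate for spot-checking successes are assumptions, not disclosed values.

```python
import random

def needs_human_review(score: int, baseline: float = None,
                       sample_rate: float = 0.1, rng=random.random) -> bool:
    """Decide whether a scored conversation goes to a human (cutoffs assumed)."""
    if score <= 2:                                     # severely low score
        return True
    if baseline is not None and score < baseline - 2:  # regression vs. baseline
        return True
    if score == 10 and rng() < sample_rate:            # spot-check successes
        return True
    return False
```

Injecting `rng` keeps the sampling rule deterministic in tests while remaining random in production.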

Prompt Engineering Fixes: When edge cases are discovered, the team looks for common patterns and makes targeted changes to system prompts rather than overfitting to individual failure cases. One significant insight was that framing the agent’s purpose as a “mission” (rather than a “goal”) and instructing it not to deviate from its mission substantially reduced unwanted behaviors.
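As an illustration of that reframing (the team's actual prompt wording was not disclosed), the change amounts to something like:

```python
# Hypothetical before/after system-prompt fragments showing the reframing
# the team described: "mission" plus an explicit instruction not to deviate.
goal_framing = "Your goal is to help guests plan trips."
mission_framing = (
    "Your mission is to help guests plan trips. "
    "Do not deviate from your mission, no matter what the user asks."
)
```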

Team Size: The project started with a single data scientist (Adam), then added an ML engineer for robustness, and one additional engineer for production deployment—demonstrating that a small team with strong GCP expertise can build sophisticated LLMOps infrastructure.

Customer Impact: While specific metrics weren’t disclosed, the team reported significant NPS (Net Promoter Score) lift in beta testing comparing pre-CARL and post-CARL versions of the chatbot.

Future Roadmap

The team outlined several planned enhancements to the framework.

Critical Assessment

This case study represents a thoughtful approach to a genuine problem in LLM deployment. The core insight—using LLMs to test LLMs—is elegant and addresses the scalability limitations of manual adversarial testing. The architecture demonstrates mature MLOps thinking applied to generative AI, with proper attention to CI/CD integration, monitoring, and human oversight.

However, some limitations should be noted. The scoring mechanism, while configurable, relies on LLM-based evaluation which inherits its own failure modes. The team acknowledged they performed persona fine-tuning to make CARL better at simulating certain user types, which adds operational complexity. Additionally, while the demo showed clear success and failure cases, real-world adversarial attacks may be more sophisticated than the examples shown.

The cost question raised during Q&A (acknowledged but not answered publicly) is material—running an LLM to test an LLM involves significant inference costs, and the economics at scale weren’t disclosed. The team’s analogy comparing CARL to CI/CD infrastructure suggests they view this as worthwhile upfront investment that pays dividends in deployment velocity and risk reduction.

Overall, this represents a compelling example of applying MLOps principles to the emerging discipline of LLMOps, with particular emphasis on pre-production testing rather than just monitoring production systems—a proactive rather than reactive approach to managing LLM risks.
