## Overview
Arize AI developed "Alyx," an AI agent integrated into their observability and evaluation platform and designed to help users debug, optimize, and improve their machine learning and generative AI applications. The case study provides a detailed look at how a company specializing in LLMOps tooling approached building their own production AI agent, including the evolution from early prototypes to a sophisticated agentic system, all while using their own platform to build and monitor the product itself.
The Arize AI team, led by Sally (Director of Product with a data science background) and Jack (Staff Engineer), began the Alyx project in November 2023 with a planned launch at their conference in July 2024. The genesis of the project came from observing that customers who spent time with their solutions architects and customer success engineers could effectively use the platform's advanced features to debug issues in minutes, but new users struggled without this guidance. The team aimed to codify the expertise of their solutions architects into an AI agent that could provide self-service support while also educating users on best practices for using the platform.
## Problem Context and Initial Scope
Arize AI's platform focuses on three core capabilities: tracing (capturing all steps an application takes from input to output), observability (understanding what's happening in those traces), and evaluations (metrics to assess performance beyond traditional measures like latency or error rates). The platform initially focused on traditional ML use cases like classification, ranking, and regression models, where users needed to identify whether they were experiencing performance degradation, data drift, or data quality issues.
The team recognized that their solutions architects had developed repeatable playbooks for debugging different types of issues. For traditional ML, there were clear categories: performance problems, drift problems, or data quality problems. When a customer experienced an issue, the solutions architect would systematically examine multiple charts, correlate feature behavior with performance drops, and dive deep into specific time ranges to identify root causes. This systematic approach was perfect for codification into an agent's workflow.
The transition to supporting LLM applications represented a natural evolution and, as the team discovered, a much better fit for their agent approach. Traditional ML involves heavy statistical analysis and numerical data, which early GPT-3.5 handled poorly. In contrast, LLM applications generate primarily text-based data, and modern frameworks often dump entire Python objects as text into logs. For users trying to manually parse these verbose traces, the task is nearly impossible without automated assistance.
## Technical Architecture Evolution
The initial architecture of Alyx was highly structured and "on rails," using OpenAI tool calling extensively. The system functioned essentially as a decision tree, where at each level the LLM would choose from a limited set of available tools. For example, at the top level, a user might ask for help debugging their model, and Alyx would select from approximately four high-level skills: debugging model performance, answering documentation questions, setting up monitoring dashboards, or other core functions. Once a skill was selected, it would expose another set of sub-skills specific to that domain.
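In concrete terms, an "on rails" router of this kind can be expressed as a single forced tool call where only a handful of skills are exposed. The sketch below assumes the current OpenAI Python SDK; the model name, skill names, and schemas are illustrative stand-ins rather than Alyx's actual implementation (the original system was built on GPT-3.5).

```python
from openai import OpenAI

client = OpenAI()

# Top-level "skills" the router may pick from; names and schemas are illustrative.
TOP_LEVEL_SKILLS = [
    {
        "type": "function",
        "function": {
            "name": "debug_model_performance",
            "description": "Investigate a drop in model or application performance.",
            "parameters": {
                "type": "object",
                "properties": {"model_id": {"type": "string"}},
                "required": ["model_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "answer_docs_question",
            "description": "Answer a question about the platform documentation.",
            "parameters": {
                "type": "object",
                "properties": {"question": {"type": "string"}},
                "required": ["question"],
            },
        },
    },
]

def route(user_message: str) -> tuple[str, str]:
    """Force the model to choose exactly one top-level skill; sub-skills are exposed only after this choice."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; the original system used GPT-3.5
        messages=[{"role": "user", "content": user_message}],
        tools=TOP_LEVEL_SKILLS,
        tool_choice="required",  # no free-form answers at this level: the agent stays on rails
    )
    call = response.choices[0].message.tool_calls[0]
    return call.function.name, call.function.arguments
```

Each selected skill can then expose its own, equally constrained set of sub-skills, reproducing the decision-tree structure described above.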
This constrained approach was necessary given the limitations of GPT-3.5, which the team characterized as producing "50% hallucinations, 50% something useful." By putting the agent on rails, they could ensure more reliable and predictable behavior. The team essentially took the battle-tested playbooks from their solutions architects—documented in internal channels and customer win reports—sorted these by similar issue types, and converted them into structured workflows the LLM could follow.
A critical architectural decision was handling mathematics and statistical analysis. Early experiments revealed that GPT-3.5 was "terrible at math, absolutely horrible." Rather than trying to force the LLM to perform calculations, the team adopted a pragmatic approach: perform all mathematical and statistical computations in traditional code, then feed only the results to the LLM for interpretation and decision-making. This pattern of recognizing what LLMs are and aren't good at, then engineering around those limitations, proved essential to the system's success.
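The case study doesn't specify which statistics were involved, but the division of labor is easy to illustrate: compute a drift score deterministically (here, a population stability index as a stand-in), then hand only the summarized result to the model for interpretation. A minimal sketch assuming NumPy:

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Compute PSI in ordinary code; the LLM never performs the arithmetic itself."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual) + 1e-6
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Toy baseline vs. production distributions standing in for a real feature.
baseline = np.random.normal(600, 50, 10_000)
production = np.random.normal(575, 60, 10_000)

psi = population_stability_index(baseline, production)

# Only the computed number reaches the model, which interprets it and decides what to do next.
interpretation_prompt = (
    f"The feature 'credit_score' has a PSI of {psi:.3f} between the baseline and the "
    "current window. Does this indicate meaningful drift, and what should the user check next?"
)
```

The model reasons over a single number and a feature name rather than raw distributions it cannot reliably do arithmetic on.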
## Development Process and Tooling
The development process was notably scrappy and iterative. The team started with Jupyter notebooks to prototype individual skills and test different approaches to data formatting. A key early challenge was determining the optimal way to pass data to the LLM—whether as JSON, comma-separated values, or other formats. They experimented extensively to find what formats allowed the LLM to most reliably understand and work with the data.
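That kind of formatting experiment can be as simple as rendering the same slice of data in several serializations and checking which one the model answers questions about most reliably. A minimal sketch with made-up field names:

```python
import csv
import io
import json

rows = [
    {"day": "2024-05-01", "feature": "credit_score", "drift_score": 0.42},
    {"day": "2024-05-02", "feature": "credit_score", "drift_score": 0.07},
]

def as_json(data) -> str:
    return json.dumps(data, indent=2)

def as_csv(data) -> str:
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=data[0].keys())
    writer.writeheader()
    writer.writerows(data)
    return buffer.getvalue()

# Prototype loop: render the same data in each candidate format and compare how reliably
# the model answers a simple question about it (manually at first, later via evals).
for name, render in [("JSON", as_json), ("CSV", as_csv)]:
    prompt = f"Here is drift data as {name}:\n{render(rows)}\nWhich day shows the most drift?"
    print(prompt)
```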
The team built a local web application for testing that team members could run on their laptops, though this proved cumbersome. They eventually developed a testing framework that allowed them to run local scripts and notebooks without needing to spin up the full UI, making iteration much faster. This lightweight testing approach enabled rapid experimentation with new skills and prompts.
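The case study doesn't describe the harness's internals, but a plausible minimal shape, assuming skills are plain Python functions and real traces have been exported to local JSON files (all names here are hypothetical):

```python
import json
from pathlib import Path

def load_traces(directory: str) -> list[dict]:
    """Load previously exported (real) traces from disk instead of spinning up the full UI."""
    return [json.loads(path.read_text()) for path in Path(directory).glob("*.json")]

def run_skill_over_traces(skill, traces: list[dict]) -> None:
    """Run one skill function against each trace and print the output for manual review."""
    for trace in traces:
        result = skill(trace)
        print(f"--- trace {trace.get('trace_id', '?')} ---")
        print(result)

# Example usage, where `summarize_errors` is a hypothetical skill under development:
# traces = load_traces("exported_traces/")
# run_skill_over_traces(summarize_errors, traces)
```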
A crucial principle was using real customer data rather than synthetic or demo data. As Jack noted, "when you use fake data, demo data, manufactured data, everything works amazingly," but real data is messy, incomplete, or enormous, often blowing through context limits or causing the model to focus on the wrong details. By connecting to real data from the start and having technically proficient team members across functions (including the CEO) able to run the prototype, they could probe edge cases and smooth out issues before production release.
## Skill Selection and Prioritization
The team identified initial skills by focusing on the most common and high-value customer pain points. For traditional ML applications, they created skills for the three main problem categories: performance degradation, data drift, and data quality issues. For generative AI applications, they prioritized skills based on what they knew was most difficult for customers:
- **Prompt optimization**: At the time, prompts were where most AI engineers spent their time, and customers needed help improving them systematically rather than through trial and error
- **Evaluation template generation**: Converting abstract criteria like "I want fewer hallucinations" into concrete evaluation prompts was a consistent stumbling block
- **AI search**: Allowing semantic search through traces to find issues like "frustrated users" that couldn't be captured through simple query languages
The team emphasized starting small and iterating—they launched with six skills and have since expanded to 30. Sally noted that they maintain a backlog document with "probably a thousand skills" they could add, but they focus on those that provide immediate value and that LLMs demonstrably excel at performing.
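Of the generative AI skills above, AI search is the most straightforward to sketch: embed the text of each trace, embed the natural-language query (for example, "frustrated users"), and rank traces by similarity. The version below assumes the OpenAI embeddings API and is only a rough approximation of however Arize actually implements the skill:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

def ai_search(query: str, trace_texts: list[str], top_k: int = 5) -> list[tuple[float, str]]:
    """Rank traces by cosine similarity to a natural-language query like 'frustrated users'."""
    query_vec = embed([query])[0]
    trace_vecs = embed(trace_texts)
    # Cosine similarity between the query and each trace.
    sims = trace_vecs @ query_vec / (
        np.linalg.norm(trace_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return sorted(zip(sims, trace_texts), reverse=True)[:top_k]
```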
## Evaluation Strategy and Framework
The evaluation strategy for Alyx evolved significantly throughout the project and represents one of the most valuable aspects of the case study for understanding production LLMOps. The team admits that, initially, they weren't particularly sophisticated about evaluation. Early efforts involved simple QA correctness checks, often run in notebooks before the Arize platform had robust online evaluation capabilities. Sally and Jack would maintain a Google Doc with test examples and manually compare outputs, a tedious process they acknowledged was "really painful."
The evolution toward comprehensive evaluation came from recognizing that they needed to take a systems-level approach, evaluating at multiple layers of the agent's operation (a minimal sketch of such layered judges follows the list):
- **Tool selection level**: Did the agent choose the appropriate tool given the user's request? This evaluation happens at "router" decision points where the agent selects which path to take.
- **Task execution level**: Did each individual tool accomplish its intended task correctly? This includes checking whether tools were called with the right arguments and whether they produced meaningful results.
- **Trace level**: Were all the right tools called in the right order? This evaluates the overall workflow coherence.
- **Session level**: Is the user experience degrading over time? Are users becoming frustrated after repeated interactions? Is the agent maintaining context appropriately?
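Here is the minimal sketch of layered judges referenced above, using a generic LLM-as-judge pattern with plain chat-completion calls rather than Arize's own evaluation library; the templates and labels are illustrative:

```python
from openai import OpenAI

client = OpenAI()

TOOL_SELECTION_TEMPLATE = """You are evaluating an AI agent's routing decision.
User request: {user_request}
Available tools: {available_tools}
Tool chosen: {chosen_tool}
Was this the appropriate tool for the request? Answer with exactly one word: correct or incorrect."""

SESSION_FRUSTRATION_TEMPLATE = """Here is a full user session with an AI assistant:
{session_transcript}
Does the user show signs of growing frustration (repeating themselves, complaining,
abandoning tasks)? Answer with exactly one word: frustrated or not_frustrated."""

def run_judge(template: str, **fields) -> str:
    """Fill an evaluation template and ask a model to return a single categorical label."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": template.format(**fields)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

# The tool-selection judge runs on each router span; the session judge runs once per session.
label = run_judge(
    TOOL_SELECTION_TEMPLATE,
    user_request="Why did my model's accuracy drop last Tuesday?",
    available_tools="debug_model_performance, answer_docs_question, setup_dashboard",
    chosen_tool="answer_docs_question",
)
```

Each judge would run against the logged spans or sessions at its level, producing categorical labels that can be aggregated into the metrics the team monitors.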
The framework for developing evaluations involved bringing together all stakeholders who care about different aspects of the product: product managers focused on user experience, engineers concerned with technical correctness, security teams watching for jailbreaking attempts, and others. Each stakeholder identified their concerns, which were then translated into specific evaluation templates.
Sally emphasized that while Arize offers out-of-the-box evaluation templates, these are meant only as starting points. The team doesn't use these templates themselves without customization. Instead, they create custom evaluations tailored to their specific tasks, data, and model behavior. This might mean having six different "task correctness" evaluations for six different types of tasks that Alyx performs.
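One simple way to organize that kind of per-task customization is a registry keyed by task type, so that each task gets its own correctness template rather than sharing a generic one. A hypothetical sketch (the templates are placeholders, not Arize's):

```python
# Hypothetical registry: each task type the agent performs gets its own tailored
# correctness template, rather than one generic "QA correctness" prompt shared by all.
TASK_CORRECTNESS_TEMPLATES = {
    "prompt_optimization": (
        "Given the original prompt, the evaluation results, and the revised prompt below, "
        "did the revision plausibly address the identified failure modes? ...\n{payload}"
    ),
    "eval_template_generation": (
        "Given the user's stated criterion (e.g. 'fewer hallucinations') and the generated "
        "evaluation template below, does the template actually measure that criterion? ...\n{payload}"
    ),
    "ai_search": (
        "Given the search query and the traces returned below, are the results relevant "
        "to the query? ...\n{payload}"
    ),
}

def correctness_template_for(task_type: str) -> str:
    return TASK_CORRECTNESS_TEMPLATES[task_type]
```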
The team's approach to identifying evaluation needs involves looking at traces in their platform and asking questions at each important decision point: What question do I need answered here? Did the right thing happen? They map out decision points rather than trying to create evaluations for every single span in a trace, which would be overwhelming.
An interesting aspect of their evaluation philosophy is the recognition that LLM evaluation differs fundamentally from traditional ML evaluation. In traditional ML, there's often a clear right or wrong answer. With LLMs, you might get a technically correct response that's nevertheless useless—like four paragraphs when users won't read that much text. Evaluating across multiple dimensions (correctness, conciseness, tone, helpfulness, etc.) while acknowledging inherent ambiguity proved essential.
## Dogfooding and Internal Testing
The customer success and solutions architect teams played a crucial role in testing and refining Alyx. Sally held weekly dogfooding sessions where team members would test new skills and changes. Once UI components were developed, features were put behind feature flags so the CSE team could use them in their daily work. For example, when creating insight reports for customers, CSEs were instructed to use Alyx rather than creating reports from scratch, then to evaluate how close Alyx got to what they would have produced manually.
This intensive internal usage served multiple purposes: it provided rapid feedback on functionality, it helped identify edge cases and failure modes, and it allowed the team to gather diverse perspectives from users with deep domain knowledge but varying technical expertise. The cross-functional nature of the core team—including former CSEs, solutions architects, engineers who had worked in product management, and the CEO—meant that team members could effectively evaluate Alyx from multiple user personas.
## Challenges and Learnings
Several key challenges emerged throughout the development process:
**Model limitations**: GPT-3.5's poor mathematical capabilities required significant architectural workarounds. The team had to learn early what the model was and wasn't good at, then engineer solutions accordingly.
**Data formatting**: Finding the optimal way to present data to the LLM required extensive experimentation. The wrong format could cause the model to miss important information or focus on irrelevant details.
**Context management**: With real production data, especially for LLM applications where frameworks dump verbose logs, context window limitations became a real concern. The team had to carefully manage what information to include and how to structure it; a minimal truncation sketch appears at the end of this section.
**Evaluation development**: Creating meaningful evaluations proved more difficult than anticipated, particularly given the ambiguity inherent in evaluating LLM outputs. The tedious process of manually reviewing outputs and building up test cases was necessary but time-consuming.
**Security concerns**: Early in production, someone attempted to jailbreak Alyx, leading to the development of dedicated jailbreak detection evaluations. This highlighted the importance of security evaluations for production agents.
**Model upgrades**: When GPT-4 was released, the team experienced both opportunities and challenges as the new model behaved differently than GPT-3.5, requiring adjustments to their prompts and workflows.
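To make the context-management challenge above concrete, here is a minimal sketch of the kind of trimming that becomes necessary when frameworks dump entire Python objects into span attributes; the field names and size budgets are arbitrary, not Arize's:

```python
MAX_FIELD_CHARS = 2_000   # arbitrary per-field budget
MAX_TOTAL_CHARS = 20_000  # arbitrary budget for the whole span payload

def trim_span_for_llm(span: dict) -> dict:
    """Keep only the fields an LLM actually needs and truncate verbose dumped objects."""
    keep = {"name", "status", "latency_ms", "input", "output", "exception"}
    trimmed: dict[str, str] = {}
    total = 0
    for key, value in span.items():
        if key not in keep:
            continue
        text = str(value)
        if len(text) > MAX_FIELD_CHARS:
            text = text[:MAX_FIELD_CHARS] + " …[truncated]"
        if total + len(text) > MAX_TOTAL_CHARS:
            break
        trimmed[key] = text
        total += len(text)
    return trimmed
```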
## The Value of Intuition Building
A recurring theme in the conversation is the importance of "intuition building" for anyone working with LLM applications. Sally emphasized that there's no way around the need to develop intuition about how your specific application behaves, what failure modes it exhibits, and what quality looks like in your domain. Alyx is designed not just to provide answers but to help users build this intuition faster by showing them how to investigate issues and pointing them to relevant parts of the platform.
This educational component is central to Alyx's design. When Alyx identifies an issue, it doesn't just report the finding—it provides links to the specific charts and analyses in the Arize platform that support the conclusion. This approach serves dual purposes: it gives users transparency into how conclusions were reached, and it teaches them how to use the platform's advanced features themselves for future investigations.
## Future Direction: Toward Greater Autonomy
The team is currently evolving Alyx toward a more autonomous, planning-based architecture inspired by modern AI coding assistants like Cursor. Rather than the constrained decision tree approach, the new architecture (a minimal version of its loop is sketched after this list) will:
- **Generate explicit plans**: Before taking action, Alyx will outline a multi-step plan for accomplishing the user's goal
- **Use broader tool sets**: Rather than exposing only specific tools at each decision point, Alyx will have access to a larger toolkit and reason about which tools to use for each step
- **Employ reflection**: At each step, Alyx will evaluate whether the step was completed successfully and whether the overall plan still makes sense, allowing it to adapt and take tangents as needed
- **Handle more complex workflows**: The prompt optimization example illustrates this well—Alyx will check for existing evaluations, create evaluations if needed, run baseline assessments, modify prompts, re-run evaluations, and iterate, all while making autonomous decisions about when to revisit steps or try different strategies
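The minimal plan-and-reflect loop referenced above might look like the following, with the planning, execution, and reflection steps stubbed out as hypothetical helpers; in a real system each would be an LLM call with access to the full toolkit:

```python
from dataclasses import dataclass, field

# Stubbed, hypothetical helpers; in the real system each would be an LLM call with tools.
def generate_plan(goal: str, context=None) -> list[str]:
    return ["check for existing evals", "run baseline evals", "modify prompt", "re-run evals"]

def execute_step(step: str, state) -> str:
    return f"executed: {step}"

def reflect(step: str, result: str, state) -> str:
    return "done"  # real version: an LLM judges success and whether the plan still makes sense

@dataclass
class AgentState:
    goal: str
    plan: list[str] = field(default_factory=list)
    completed: list[str] = field(default_factory=list)

def run_agent(goal: str, max_steps: int = 20) -> AgentState:
    state = AgentState(goal=goal)
    state.plan = generate_plan(goal)              # explicit multi-step plan drafted up front
    for _ in range(max_steps):
        if not state.plan:
            break
        step = state.plan.pop(0)
        result = execute_step(step, state)        # pick tools from the broader toolkit for this step
        verdict = reflect(step, result, state)    # did the step succeed? is the plan still sound?
        if verdict == "retry":
            state.plan.insert(0, step)            # take another pass at the same step
        elif verdict == "replan":
            state.plan = generate_plan(goal, context=state)  # revise the remaining plan
        else:
            state.completed.append(step)
    return state
```

The essential difference from the on-rails design is that the plan itself, the tool choices within each step, and the decision to retry or replan are all delegated to the model rather than hard-coded into a tree.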
This evolution takes advantage of more capable models (those from the last 9-12 months) that can handle more flexible, open-ended tasks while still maintaining reliability through structured planning and reflection.
## Meta-Level Insights on LLMOps
The case study provides several meta-level insights about building production LLM applications:
**Use your own tools**: Arize used their own platform to build and monitor Alyx, providing valuable dogfooding experience and driving product improvements based on their own needs as builders.
**Multidisciplinary teams are essential**: The success of Alyx relied on having team members who had worked across multiple roles (data science, product management, customer success, engineering) and could understand the problem from multiple perspectives.
**Start simple, iterate**: Despite the sophistication of the final product, the team started with basic prototypes in notebooks and a simple web app, gradually adding capabilities based on real user feedback.
**Real data is crucial**: Using actual customer data rather than synthetic examples exposed problems and edge cases that wouldn't have appeared in testing with clean, manufactured data.
**Evaluation is hard but necessary**: Even for a company building evaluation tools, developing good evaluations was challenging and required iteration, stakeholder involvement, and continuous refinement.
**Timeline realism**: The eight-month timeline from November 2023 to July 2024 felt aggressive at the time, but the team acknowledges that with today's frameworks and tools, they could move faster. This highlights how rapidly the LLMOps ecosystem has matured.
**Scrappy is okay**: The messy, manual process of early evaluation—Google Docs with test cases, manual output comparison, tedious data review—is normal and valuable. Even teams building sophisticated evaluation platforms go through this phase.
The Arize AI case study demonstrates that building production AI agents requires careful architectural decisions, comprehensive evaluation strategies, deep engagement with real user data, and a willingness to iterate based on what the technology can and cannot do reliably. The team's transparency about their challenges and evolution provides valuable lessons for others building similar systems.