This case study presents a methodology for understanding and improving LLM applications at scale, once manual review of conversations becomes infeasible. The core problem is that traditional logging misses critical issues in AI applications, and teams face data paralysis when dealing with millions of complex, multi-turn agent conversations across multiple languages. The solution is to use LLMs themselves to automatically summarize, cluster, and analyze user conversations at scale, following a framework inspired by Anthropic's Clio (Claude insights and observations) system. The presenter demonstrates this through Kura, an open-source library that summarizes conversations, generates embeddings, performs hierarchical clustering, and creates classifiers for ongoing monitoring. The approach enabled identification of high-leverage fixes (such as a two-line prompt change that added upselling and yielded an estimated 20-30% revenue increase) and helped Anthropic launch their educational product by analyzing patterns in one million student conversations. Results show that this systematic approach lets teams prioritize fixes by volume and impact, track improvements quantitatively, and scale their analysis beyond the limits of manual review.
This case study presents a comprehensive methodology for operating LLM applications at production scale, with a particular focus on observability, conversation analysis, and systematic improvement. The presenter works at Manus and discusses the challenges that emerge when operating LLM applications with millions of users, where individual conversation review becomes impossible. The presentation centers on how to use LLMs themselves as analytical tools to understand user behavior patterns, identify issues, and prioritize fixes systematically.
The core thesis is that traditional software logging (tools like Sentry, Elasticsearch) remains important for infrastructure concerns like authentication, pagination, and security, but fundamentally misses what matters for AI applications. The actual user experience—the quality of LLM responses, the effectiveness of agent actions, the relevance of retrieved information—requires a different approach to observability and improvement.
The presenter articulates a critical challenge in production LLM systems: the explosion of complexity that makes manual review intractable. With simple RAG applications, the pattern was straightforward—user sends message, system performs retrieval, system sends response. Modern agentic applications involve dramatically more complexity: a single user message might trigger 40 different tool calls before generating a response, and conversations might involve 10-20 such exchanges, creating extremely long traces that are difficult to parse manually.
This complexity is compounded by several factors. First, the sheer volume of users—moving from 10 or 100 users to millions means manual review is simply impossible. Second, multi-language support means traces may be in languages the development team doesn’t speak fluently. Third, agentic applications introduce non-deterministic behavior and state management challenges where the same request at different times may produce different results. Fourth, the various moving parts—backends, durable execution layers, sandboxes—create numerous potential failure points.
The result is what the presenter calls “data paralysis”—teams are overwhelmed by data but unable to extract actionable insights. They see angry user messages but lack systematic ways to prioritize which issues to address first, making all fixes seem equally important when they’re demonstrably not.
The presentation introduces a framework borrowed from Jason Liu (who teaches a course on systematically improving RAG) that categorizes user issues into two fundamental types: missing capabilities and missing inventory.
Missing capabilities refer to actions the system cannot perform regardless of how well it’s prompted. If users ask “Why can’t I send emails?” but the agent lacks Gmail integration, or “Why can’t I book meetings?” when there’s no Google Calendar integration, no amount of prompt engineering will solve the problem. These require new integrations, tools, or features.
Missing inventory refers to data gaps. If users ask “How many contracts are signed?” but the system doesn’t track contract status, or “When was this contract last modified?” without maintaining modification timestamps, retrieval systems cannot surface information that doesn’t exist. These require data pipeline changes, new indices, or metadata enrichment.
The presenter acknowledges these categories aren’t always cleanly separable—real issues may span both—but having this systematic framework helps teams diagnose root causes rather than assuming every problem requires a “grand overhaul” like migrating to new agent frameworks or upgrading to more expensive models.
A critical insight throughout the presentation is that teams often assume they need massive changes when targeted, small interventions would be more effective. The presenter provides several illustrative examples:
DoorDash’s merchant selection problem: DoorDash faced high search volumes but low conversion rates—every search without conversion is lost revenue. Teams might assume the solution is better recommendation algorithms or upgraded embeddings. However, the actual fix was improving merchant selection—identifying which merchants to have available in which markets at which times. The problem wasn’t the algorithm recommending from available options; it was ensuring the right options were available to recommend.
Uber’s early morning cancellations: High cancellation rates between 5-7am represented lost revenue and customer dissatisfaction. The solution wasn’t algorithmic—it was providing incentives for drivers to work early shifts, ensuring supply met demand.
Voice agent upselling: A customer service voice bot had issues with answering questions about reservations, parking, and holiday hours. The client initially wanted better RAG capabilities. However, analysis revealed a bigger problem: the agent never attempted to upsell customers with standard questions like “Would you like fries with that?” or “Would you like to upsize?” Adding just two lines to the prompt to encourage upselling generated an estimated 20-30% revenue increase—a massive impact from minimal engineering effort.
These examples illustrate the value of systematic analysis over assumptions. Without understanding the actual patterns in user conversations, teams risk spending months on low-impact changes while missing high-leverage opportunities.
The presenter recommends three complementary approaches to understanding production LLM systems:
Traditional error logging and monitoring: This remains the baseline—tools like Sentry for tracking errors, monitoring tool call failures, checking for consistently low cosine similarity in retrieval queries. This catches technical failures and infrastructure issues.
User feedback mechanisms: Simple UI elements allowing users to provide feedback create direct channels for identifying issues. When customers say “this isn’t working, help me please,” this is a shortcut to discovering problems. Having a dedicated channel where these issues are aggregated creates a queue of potential improvements.
LLM-powered clustering and pattern analysis: This is the focus of the presentation—using language models themselves to analyze conversation patterns at scale, identify clusters of similar issues, and prioritize fixes based on frequency and impact.
The presentation uses Anthropic’s Clio (Claude insights and observations) system as a case study in production-scale conversation analysis. When Anthropic wanted to launch an educational product, they needed to understand how students were actually using Claude.
They collected one million user conversations from accounts registered with .edu email domains over 18 days. The analysis revealed distinct usage patterns across disciplines (computer science students, for example, were heavily overrepresented relative to their share of enrollment).
More importantly, the analysis revealed four distinct usage styles: direct problem solving, direct output creation, collaborative problem solving, and collaborative output creation.
Anthropic then mapped these patterns against Bloom’s taxonomy of learning, discovering significant usage around higher-order cognitive functions like creating, analyzing, and evaluating, not just lower-order remembering and understanding.
These insights directly informed product development. Anthropic launched “Socratic questioning mode” for Claude in education, where instead of directly answering “How do I write hello world in Python?”, Claude guides students through the learning process: “Let’s start by setting up your IDE. Do you understand what the print statement does?” This product decision was shaped by understanding actual user behavior patterns at scale.
The Clio system (and its open-source analog Kura, demonstrated in the presentation) follows a multi-stage pipeline:
Stage 1: Conversation summarization: Each conversation is processed by an LLM to generate a summary. This is crucial because raw conversations may be extremely long, multi-turn, and include extensive tool calls. The summarization step extracts the essential elements—what the user was trying to accomplish, what challenges they faced, what the system did or didn’t do.
Stage 2: Facet extraction: Along with summarization, the system extracts metadata facets including language used, number of turns, and other LLM-generated metadata. This can be enriched with traditional metrics—customer satisfaction ratings, actions taken (document downloaded, shared), session duration, conversion events, etc.
Stage 3: Initial clustering: Conversations are embedded (converted to vector representations) and clustered using techniques like HDBSCAN. This creates initial, fine-grained clusters. For example, “how to tie shoes” and “how to tie bows in my daughter’s hair” might cluster together as “tying various knots.”
Stage 4: Meta-clustering: The system iteratively merges clusters in a bottom-up approach. “Tying various knots” might merge with other clusters to become “daily life skills.” Very sparse clusters (like “information about rare genetic conditions”) may be discarded as too infrequent to inform product decisions.
Stage 5: Dimensionality reduction and visualization: High-dimensional embeddings (typically 1536 dimensions for OpenAI embeddings) are reduced to 2D using techniques like UMAP, enabling visual exploration of the conversation space.
Stage 6: Classifier development: Once stable clusters are identified through multiple runs, teams develop explicit classifiers—typically prompt-based LLM judges—that can categorize new conversations into these discovered categories in real-time or batch processes.
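The stages above can be sketched end to end in a few dozen lines. This is a minimal illustration rather than Kura's actual API: `summarize` and `embed` are stand-ins for LLM and embedding-model calls, and a greedy cosine-similarity threshold stands in for HDBSCAN.

```python
import math

def summarize(conversation: str) -> str:
    # Stand-in for the Stage 1 LLM call that extracts the user's
    # intent; a real system would prompt a model here.
    return conversation.split(".")[0]

def embed(text: str) -> list[float]:
    # Stand-in for an embedding model (Stage 3); a crude
    # bag-of-letters vector just to make the sketch runnable.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster(embeddings: list[list[float]], threshold: float = 0.8):
    # Greedy cosine-threshold grouping as a stand-in for HDBSCAN:
    # each item joins the first cluster whose seed is similar enough,
    # otherwise it starts a new cluster.
    clusters: list[list[int]] = []
    for i, e in enumerate(embeddings):
        for members in clusters:
            if cosine(embeddings[members[0]], e) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

conversations = [
    "How do I log metrics? I tried wandb.log but nothing shows.",
    "How can I log hyperparameters for my runs?",
    "My API key is rejected. Where do I rotate API keys?",
]
summaries = [summarize(c) for c in conversations]
groups = cluster([embed(s) for s in summaries])
```

Stages 4-6 would then merge sparse clusters, project the vectors to 2D for visualization, and replace the clustering with an explicit classifier once the categories stabilize.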
An important technical detail the presenter emphasizes is that topic modeling and clustering are fundamentally non-deterministic. Running the same pipeline multiple times on the same data produces different clusters because of randomness in the dimensionality reduction (going from 1536 dimensions to 2D for visualization) and the clustering algorithms themselves.
The approach to handling this instability is to run clustering multiple times and look for consistent patterns—clusters that appear across multiple runs are more likely to represent real, stable user behavior patterns rather than artifacts of the algorithm. Once these stable clusters are identified, the team develops explicit classifiers (LLM-as-judge prompts) that provide consistent, repeatable categorization going forward.
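One way to operationalize "look for consistent patterns across runs" is to count how often each pair of conversations lands in the same cluster across repeated runs and keep only pairs that co-occur most of the time. The sketch below is a toy: `noisy_cluster` is a stand-in for the real stochastic UMAP-plus-HDBSCAN pipeline, and the 60% threshold is an arbitrary illustrative choice.

```python
import random
from collections import Counter
from itertools import combinations

def noisy_cluster(items: list[str], seed: int) -> list[set[str]]:
    # Stand-in for one stochastic clustering run: shuffle the items,
    # then pair up adjacent ones.
    rng = random.Random(seed)
    order = items[:]
    rng.shuffle(order)
    return [set(order[i:i + 2]) for i in range(0, len(order), 2)]

def stable_pairs(items: list[str], runs: int = 20,
                 min_fraction: float = 0.6) -> set[tuple[str, str]]:
    # Count how often each pair of items lands in the same cluster
    # across runs; keep only pairs that co-occur most of the time.
    counts: Counter = Counter()
    for seed in range(runs):
        for group in noisy_cluster(items, seed):
            for pair in combinations(sorted(group), 2):
                counts[pair] += 1
    return {pair for pair, n in counts.items() if n / runs >= min_fraction}

items = ["logging", "metrics", "api-keys", "sweeps"]
stable = stable_pairs(items)
```

Pairs that survive this filter are candidates for real, stable behavior patterns worth writing an explicit classifier for.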
The presenter demonstrates Kura, an open-source library implementing these concepts. The workflow involves:
Data preparation: Starting with conversation data (in the demo, around 500 user queries from Weights & Biases documentation, though production use cases would involve tens of thousands to millions of conversations).
Summarization pipeline: Using an LLM to generate concise summaries of each conversation. The presenter emphasizes that generic summarization often isn’t sufficient—teams need to iterate on the summarization prompt to extract information relevant to their specific use case. For example, a generic summary might say “Bayesian optimization is a hyperparameter tuning technique that uses surrogate functions,” but a feature-focused summary would identify “User needs help with experiment tracking, specifically logging hyperparameters and metrics using the weights and biases logging function.”
Clustering execution: Running the clustering algorithm to group similar conversations. The demo shows clusters around topics like “analyzing weights and biases sweep results,” “optimizing hyperparameters,” and “API key management.”
Visualization: Generating an interactive web interface where teams can explore the 2D projection of conversation space, click on clusters to see representative examples, and drill down into individual conversations.
Classifier development: Once clusters are identified, developing prompt-based classifiers that can categorize new conversations. The presenter walks through an example where an initial classifier for Weights & Biases queries achieved only 66% accuracy, but through systematic prompt engineering—adding clear system prompts and few-shot examples—accuracy increased to 89%.
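The earlier point about iterating on the summarization prompt can be made concrete. The prompt below is hypothetical (the field names `task`, `feature`, and `blocker` are illustrative, not Kura's schema), but it shows how to steer summaries toward product-relevant facets rather than generic topic descriptions.

```python
# Hypothetical task-specific summarization prompt; a generic
# "summarize this conversation" prompt would instead describe the
# subject matter, which is less useful for product analysis.
SUMMARY_PROMPT = """\
You are analyzing a support conversation for a developer tool.
Do NOT summarize the subject matter itself. Instead extract:
- task: what the user was trying to accomplish
- feature: which product feature or API was involved
- blocker: what prevented the user from succeeding, if anything
Respond in exactly three lines: task:, feature:, blocker:.

Conversation:
{conversation}
"""

def build_summary_prompt(conversation: str) -> str:
    return SUMMARY_PROMPT.format(conversation=conversation)

prompt = build_summary_prompt(
    "How do I log hyperparameters? wandb.log only shows metrics."
)
```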
The classifier development process demonstrates practical LLMOps principles for using language models as judges:
Initial baseline: A simple classifier with categories (artifacts, integrations, visualizations, other) achieves 66% accuracy on a labeled test set—clearly insufficient for production use.
System prompt engineering: Adding a clear system prompt (“You are provided a query and corpus. Look carefully and understand what the query and document are about”) improves accuracy to 81%—a 43.1% improvement over baseline.
Few-shot examples: Adding positive and negative examples for each category (e.g., “How do I run hyperparameter sweeps?” for visualizations, “How do I use weights and biases with langchain?” for integrations) brings accuracy to 89%—a 64.4% improvement over baseline.
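Put together, the classifier prompt described above (system instruction plus few-shot examples) might be assembled like this. The wording is a sketch, not the presenter's exact prompt, though the categories and example queries come from the talk.

```python
CATEGORIES = ["artifacts", "integrations", "visualizations", "other"]

# Few-shot examples as described in the talk; more would be added
# per category in practice, including negative examples.
FEW_SHOT = [
    ("How do I run hyperparameter sweeps?", "visualizations"),
    ("How do I use weights and biases with langchain?", "integrations"),
]

def build_classifier_prompt(query: str) -> str:
    lines = [
        "You are provided a query and corpus. Look carefully and "
        "understand what the query and document are about.",
        f"Classify the query into one of: {', '.join(CATEGORIES)}.",
        "",
        "Examples:",
    ]
    for example, label in FEW_SHOT:
        lines.append(f"Query: {example}\nLabel: {label}")
    lines += ["", f"Query: {query}", "Label:"]
    return "\n".join(lines)

prompt = build_classifier_prompt("How do I version model artifacts?")
```

The returned string would be sent to the LLM judge, whose single-word completion is the predicted category.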
The presenter emphasizes the importance of maintaining clean train/validation/test splits even with LLM-based classifiers: iterate on the prompt against the validation set, and measure the held-out test set only once, at the end.
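A minimal sketch of that split-and-evaluate discipline, with a deterministic stub standing in for the LLM judge: tune the prompt against the validation split, and report the held-out test split once.

```python
import random

def split(examples: list, seed: int = 0,
          val_frac: float = 0.2, test_frac: float = 0.2):
    # Shuffle once with a fixed seed so the test set stays held out
    # across prompt iterations.
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    return (shuffled[n_val + n_test:],      # train
            shuffled[:n_val],               # validation
            shuffled[n_val:n_val + n_test]) # test

def accuracy(classify, labeled) -> float:
    return sum(classify(q) == y for q, y in labeled) / len(labeled)

# Toy labeled set; in practice these are human-labeled conversations.
labeled = [(f"query {i}", "integrations" if i % 2 else "artifacts")
           for i in range(20)]
train, val, test = split(labeled)

def classify(query: str) -> str:
    # Deterministic stub; a real judge would call an LLM with the
    # classification prompt and parse its answer.
    return "integrations" if int(query.split()[1]) % 2 else "artifacts"

val_acc = accuracy(classify, val)    # tune the prompt against this
test_acc = accuracy(classify, test)  # report this once at the end
```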
Once classifiers are developed and validated, they can be integrated into production pipelines and business intelligence tools. The presenter describes connecting these classifiers to tools like Metabase or other BI platforms to create dashboards that track query categories, volumes, and satisfaction over time.
This transforms the one-time clustering analysis into an ongoing monitoring system that provides continuous insight into user behavior and product performance.
The presentation emphasizes several practical considerations for deploying these systems in production:
Data privacy: Always respect user privacy when analyzing conversations. The presenter notes that while they discuss reading traces for analysis, this should always be done with appropriate user consent and privacy safeguards.
Scale considerations: The demo works with 500 conversations, but production systems should expect to analyze tens of thousands to millions. The presenter notes that Anthropic’s Clio analyzed one million conversations—a scale where patterns become much more reliable and rare edge cases can still appear frequently enough to matter.
Cost and efficiency: The presenter emphasizes that LLM-based analysis is “cheap and efficient”—the cost of running summarization and clustering on even large conversation datasets is manageable compared to the value of the insights gained. This is particularly true when compared to the alternative of hiring large teams of human annotators or analysts.
Your own time as the limiting factor: A recurring theme is that developer and analyst time is the true constraint. You can’t scale yourself beyond 10-14 hours of work per day, but LLM rate limits are easily increased. Therefore, using LLMs to augment human analysis is about scaling the most precious resource—human attention and decision-making capability.
While the demo focuses on single-label classification, the presenter notes that production systems often need multi-label classifiers—a single conversation might involve both “API authentication issues” and “integration with external tools”—and the framework supports assigning multiple labels per conversation.
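A multi-label judge can simply be asked to return a JSON array of labels; the parsing side then needs to be defensive about malformed or invented labels. The label names below are hypothetical, echoing the example in the text.

```python
import json

# Hypothetical label set; in practice these come from the stable
# clusters discovered earlier.
LABELS = {"api-authentication", "external-integrations",
          "experiment-tracking", "other"}

def parse_multilabel(raw: str) -> set[str]:
    # Parse an LLM judge's JSON-array response, keeping only known
    # labels so one malformed answer can't pollute the dashboard.
    try:
        predicted = json.loads(raw)
    except json.JSONDecodeError:
        return {"other"}
    if not isinstance(predicted, list):
        return {"other"}
    labels = {label for label in predicted if label in LABELS}
    return labels or {"other"}

# A conversation touching both themes from the example above:
raw_response = '["api-authentication", "external-integrations"]'
labels = parse_multilabel(raw_response)
```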
A key outcome of this systematic approach is the ability to quantify problems and prioritize fixes based on data rather than intuition:
Volume quantification: “42% of users can’t trigger the web development tool” vs “1% complain about lack of Outlook integration”—the choice is clear.
Impact measurement: “60% of low satisfaction queries are about our new responses documentation” identifies both the problem and its severity.
Before/after metrics: “With our new changes we’re seeing a 40% increase in average satisfaction ratings and increased user retention” provides clear evidence of impact.
This data-driven approach transforms product development from guessing which features to build to systematically addressing the highest-impact issues.
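The volume numbers above fall directly out of classifier outputs. A minimal sketch, with hypothetical label names chosen to match the example percentages:

```python
from collections import Counter

# Hypothetical classifier outputs for a batch of conversations; in
# production these would come from the LLM judge running over logs.
predictions = (["web-dev-tool-failure"] * 42 +
               ["outlook-integration-request"] * 1 +
               ["other"] * 57)

counts = Counter(predictions)
total = sum(counts.values())
# Percentage share per category, largest first: the prioritization
# table a BI dashboard would display.
volume = {label: round(100 * n / total, 1)
          for label, n in counts.most_common()}
```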
While the presentation is enthusiastic about this approach, several important caveats and limitations should be noted:
Clustering instability: The presenter acknowledges that topic modeling is “fundamentally a bit random” and requires multiple runs to identify stable patterns. This means teams can’t rely on a single clustering run for critical decisions.
Classifier accuracy limitations: Even after optimization, the demo classifier achieved 89% accuracy—good but not perfect. Teams must accept some level of misclassification and design systems that are robust to classification errors.
Prompt engineering overhead: Achieving good summarization and classification results requires significant iteration on prompts. The presenter shows how generic summaries aren’t useful and task-specific prompts are essential, which represents non-trivial engineering work.
Labeled data requirements: Despite using LLMs to reduce annotation burden, the system still requires human-labeled validation and test sets to evaluate classifier performance. This creates a chicken-and-egg problem when first deploying the system.
Scale requirements: The methodology works best with large conversation volumes where patterns become statistically significant. Smaller applications with fewer users may not benefit as much from automated clustering versus manual review.
Cold start problem: The presentation doesn’t deeply address how to get started when you have relatively little data—the examples involve hundreds of thousands or millions of conversations.
This case study demonstrates several core LLMOps principles in action:
Observability as foundational: Just as DevOps requires comprehensive logging and monitoring, LLMOps requires visibility into model behavior, but adapted to the unique characteristics of language model applications.
Evaluation as continuous: Rather than one-time evaluation before deployment, the system enables continuous evaluation of production conversations, identifying degradation or new issues as they emerge.
Human-in-the-loop at scale: The approach doesn’t eliminate human judgment but scales it—instead of reading every conversation, humans review cluster summaries and validate classifier decisions on samples.
Prompt engineering as iterative: Both the summarization and classification stages require multiple iterations to achieve good performance, exemplifying prompt engineering as an empirical discipline.
Model-assisted workflows: Using LLMs to analyze LLM applications creates a meta-layer where the technology helps improve itself—a hallmark of mature MLOps and LLMOps practices.
The presenter emphasizes that Kura is open source and encourages teams to either use it directly or “take the source code, pass it to something like Claude Code or Copilot and reimplement it on your own infrastructure.” The documentation includes Colab notebooks that can be run immediately, lowering the barrier to experimentation.
This open approach addresses a common LLMOps challenge: teams need solutions tailored to their specific data sources, infrastructure, and business logic. By providing both a working implementation and clear documentation of the approach, the presenter enables teams to adapt the methodology to their needs rather than forcing a one-size-fits-all solution.
The presentation concludes with several high-level takeaways for production LLM systems.
This represents a mature approach to LLMOps that goes beyond initial deployment concerns (model selection, API integration, basic monitoring) to address the challenges of operating LLM applications at scale with millions of users and complex, multi-turn interactions.