Company
Raindrop
Title
Production Monitoring and Issue Discovery for AI Agents
Industry
Tech
Year
2026
Summary (short)
Raindrop's CTO Ben presents a comprehensive framework for building reliable AI agents in production, addressing the challenge that traditional offline evaluations cannot capture the full complexity of real-world user behavior. The core problem is that AI agents fail in subtle ways without concrete errors, making issues difficult to detect and fix. Raindrop's solution centers on a "discover, track, and fix" loop that combines explicit signals like thumbs up/down with implicit signals detected semantically in conversations, such as user frustration, task failures, and agent forgetfulness. By clustering these signals with user intents and tracking them over time, teams can identify the most impactful issues and systematically improve their agents. The approach emphasizes experimentation and production monitoring over purely offline testing, drawing parallels to how traditional software engineering shifted from extensive QA to tools like Sentry for error monitoring.
## Overview

Raindrop is building what they describe as "Sentry for AI products" - a monitoring and observability platform specifically designed for conversational AI applications and agents. The presentation, delivered by Ben, the company's CTO who previously worked on avionics at SpaceX and human interface design at Apple, focuses on the fundamental challenge of operating LLMs in production: agents fail in subtle, unpredictable ways that traditional testing cannot anticipate, and these failures manifest without concrete errors that would trigger conventional monitoring systems.

The core thesis is that while offline evaluations and unit tests have their place, production monitoring becomes increasingly critical as agents grow more capable and are used in more diverse ways. The presentation argues that real-world user behavior cannot be fully anticipated during development, and that teams must develop systematic approaches to discovering, tracking, and fixing issues as they emerge in live deployments.

## The Fundamental Challenge: Unpredictable User Behavior and Silent Failures

Ben opens by establishing a key insight about AI product development: users employ these systems in unexpected ways that cannot be fully designed for upfront. He provides examples like users suddenly caring about time filters or needing to organize data by contact information - requirements that only surface after deployment. This unpredictability is compounded by the fact that AI agents typically fail without throwing exceptions or errors.

He cites several real-world examples of silent failures that had serious consequences. Virgin Money's chatbot threatened to end conversations when users typed the word "virgin" in its interface. An airline chatbot hallucinated and promised a customer a refund, which the company initially refused to honor until a lawsuit established that brands are legally responsible for what their AI agents promise. The Character AI lawsuit established that companies are liable for the behavior of their AI systems. These cases demonstrate that as AI products become mainstream, legal and reputational stakes are rising even as the technology remains unreliable in subtle ways.

The presentation highlights a paradox: current AI models are remarkably capable in some contexts while failing absurdly in others. Examples include Deep Research producing genuinely valuable 15-20 minute analyses, while GitHub Copilot might generate unit tests that simply check if a hardcoded hash matches a specific string rather than testing actual hashing functionality. Google Cloud's chatbot asks users troubleshooting Google credits whether they meant Azure credits or Roblox credits. This uneven performance makes it difficult to predict where failures will occur.

## The Limitations of Traditional Evaluation Approaches

Ben critiques the standard approach to AI evaluations, which he describes as treating them like traditional unit tests with specific inputs and expected outputs. While acknowledging that this approach has value, he argues it becomes increasingly inadequate as agents grow more sophisticated. The problem is that longer-running agents accumulate context over extended conversations that can span days, weeks, or months. They maintain memory systems that compress and retain information across sessions. They have access to various tools that users can enable, disable, or customize. All of this creates an impossibly large state space that cannot be fully captured in predefined test cases.
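To make the contrast concrete, here is a minimal sketch of what the "evals as unit tests" style looks like in practice. The helper `run_agent`, the test cases, and the assertion are all illustrative placeholders, not anything shown in the talk; the point is simply that each case fixes a single stateless input and a single expected output.

```python
# A minimal sketch of the "evals as unit tests" style the talk critiques:
# fixed inputs, fixed expected outputs, no conversation history, memory,
# or per-user tool configuration. All names here are placeholders.
import pytest

def run_agent(prompt: str) -> str:
    """Placeholder for the application under test (a single, stateless turn)."""
    raise NotImplementedError

CASES = [
    ("What is our refund window?", "30 days"),       # fixed input ...
    ("Summarize yesterday's error logs", "error"),   # ... fixed expectation
]

@pytest.mark.parametrize("prompt,expected", CASES)
def test_agent_answer_contains_expected(prompt, expected):
    assert expected in run_agent(prompt).lower()
```

Nothing in such a test exercises weeks of accumulated context, compressed memories, or user-customized tool sets, which is exactly the state space the presentation argues cannot be enumerated upfront.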
OpenAI's own postmortem of the "sycophant" incident reinforces this point. Despite the model performing well on evaluations, issues emerged in production because, as they stated, "real world use is what helps us spot problems and understand what matters the most to users. We can't predict every issue." This acknowledgment from one of the leading AI labs validates the need for production-focused approaches.

Ben draws an analogy to traditional software engineering, which has gradually shifted from extensive upfront QA and testing toward production monitoring with tools like Sentry and Datadog. He argues AI products are following a similar trajectory, though they require fundamentally different monitoring approaches because they fail without concrete errors.

## The Raindrop Framework: Discover, Track, and Fix

The core of Raindrop's approach is a three-phase loop: discovering issues through signals and clustering, tracking their frequency and impact over time, and fixing them through targeted interventions. This framework assumes that teams cannot anticipate all issues upfront and must instead build systems to continuously surface and prioritize problems as they emerge.

### Phase 1: Discovering Issues Through Signals

The discovery phase centers on defining and detecting signals - indicators that something is going well or poorly in the application. Raindrop distinguishes between explicit signals and implicit signals, both of which provide valuable information about system performance.

Explicit signals are direct user actions or feedback mechanisms. The most common examples are thumbs up/down buttons and regenerate buttons, but Ben reveals that ChatGPT tracks far more signals than most users realize. They record when users copy text from responses, including exactly which portions were copied, treating this as a feedback signal. They track whether users regenerate responses, which indicates dissatisfaction with the initial output. They log upgrade actions, which can indicate that users found enough value to convert. The presentation encourages AI product builders to instrument similar signals throughout their applications and, crucially, to analyze patterns in these signals. For example, examining all messages that users regenerated multiple times can reveal systematic problems that might otherwise go unnoticed.

Implicit signals are far more novel and represent the unique challenges of monitoring AI systems. These are semantic patterns detected in either user inputs or agent outputs that indicate problems without any explicit user feedback. Raindrop has developed a taxonomy of common implicit signals that apply across many conversational AI products. On the agent output side, they detect refusals when the agent says it's not allowed to do something, task failures when it indicates it cannot complete a request, and laziness when it suggests shortcuts or incomplete solutions. They also identify memory failures when the agent forgets information the user previously provided. On the user input side, they detect frustration through language patterns indicating anger or dissatisfaction, and they identify wins when users express gratitude or satisfaction. The key insight is that these implicit signals can be detected reliably enough to be useful without requiring perfect accuracy.
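To illustrate what such an implicit-signal detector can look like, here is a minimal sketch of a frustration check implemented as a yes/no classification over a single user message, using the OpenAI Python SDK. The model choice, prompt wording, and `is_frustrated` function name are illustrative assumptions, not Raindrop's implementation.

```python
# Minimal sketch: detect one implicit signal ("user frustration") as a
# binary yes/no classification rather than an open-ended quality score.
# Assumes OPENAI_API_KEY is set; the model name is an illustrative default.
from openai import OpenAI

client = OpenAI()

FRUSTRATION_PROMPT = (
    "You are labeling production chat logs. Answer with exactly 'yes' or 'no': "
    "does the following user message express frustration, anger, or "
    "dissatisfaction with the assistant?\n\nUser message:\n{message}"
)

def is_frustrated(message: str) -> bool:
    """Return True if the message appears to express user frustration."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any small, cheap model can play this role
        temperature=0,
        messages=[{"role": "user", "content": FRUSTRATION_PROMPT.format(message=message)}],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")

if __name__ == "__main__":
    print(is_frustrated("This is the third time you've ignored the dates I gave you."))
```

The same pattern extends to refusals, task failures, laziness, or memory failures by swapping the question; keeping each check binary and narrowly defined is what makes it practical to run over production traffic.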
Unlike LLM-as-a-judge approaches that try to score outputs on subjective dimensions like humor or writing quality, implicit signal detection focuses on binary classification of whether a specific, well-defined issue is present. This makes the detection more reliable and practical to deploy at scale.

Once initial signals are defined, the next step is clustering to find patterns. Ben emphasizes that simply having a high-level category like "user frustration" is not actionable. Teams need to understand the subclusters - the specific types of problems causing frustration. He walks through an example where clustering user frustration might reveal subclusters like the agent not handling math well, file uploads getting stuck, the agent using incorrect dates from years past, claiming it cannot access contacts when that feature exists, tone issues, and the agent forgetting previously provided information. Some of these subclusters might be known issues, some might not matter for the product, and others might be completely new discoveries that warrant immediate attention.

The discovery process involves multiple techniques. Manual examination of flagged data remains valuable and is often underutilized - simply reading through conversations where issues occurred frequently reveals patterns. Semantic clustering using embeddings can automatically group similar issues together, making patterns visible at scale. Text search for keywords like "sorry," "I hate that," or profanity can surface problems quickly. The goal is to transform the massive volume of production conversations into a manageable set of distinct, describable issues.

Intent discovery follows a similar clustering process but focuses on what users are trying to accomplish rather than what went wrong. For a coding agent like Cursor, intents might include presenting a problem, adding a new feature, creating an entire app from scratch, or debugging an issue. These intents can exist at multiple levels - individual turn-by-turn intents within a conversation and higher-level conversation intents that describe the overall goal. Understanding intents becomes crucial when combined with signals because an issue that occurs during one type of intent might be far more serious than the same issue during another intent.

### Phase 2: Tracking Issues Over Time

Discovery identifies issues, but tracking determines which ones actually matter. Ben draws a direct parallel to Sentry's approach to error monitoring. When Sentry shows an error that affected one user one time, developers typically ignore it. When the same error affects eighty percent of users, it becomes critical. The frequency and user impact metrics transform a list of possible problems into a prioritized roadmap.

For tracking to be effective, issue definitions need to be precise and stable. This is where accuracy becomes more important than it was during discovery. During discovery, a clustering might include some false positives, but that's acceptable because humans are reviewing the clusters. During tracking, the system needs to reliably count how many users encountered a specific issue each day, which requires higher precision. Raindrop addresses this through their Deep Search feature, which combines semantic search with LLM-based reranking and binary classification. Users describe an issue, and the system finds candidate matches through semantic search, then uses an LLM to classify each candidate as a true match or not, as sketched below.
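As a rough illustration of this search-then-classify pattern (not Raindrop's actual implementation), the sketch below embeds an issue description, retrieves nearest-neighbor candidate messages, and has an LLM confirm each candidate as a true match. The embedding model, similarity threshold, and function names are assumptions made for the example.

```python
# Sketch of a "describe an issue, find true matches" flow:
# 1) embed the issue description, 2) retrieve nearest-neighbor candidate
# messages by cosine similarity, 3) confirm each candidate with a binary
# LLM classification. Model names and parameters are illustrative choices.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

def is_true_match(issue: str, message: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{
            "role": "user",
            "content": f"Issue definition: {issue}\n\nMessage: {message}\n\n"
                       "Answer exactly 'yes' or 'no': is this message an instance of the issue?",
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def deep_search_style(issue: str, messages: list[str], top_k: int = 50) -> list[str]:
    corpus = embed(messages)                      # in practice: a precomputed vector index
    query = embed([issue])[0]
    scores = corpus @ query / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
    candidates = [messages[i] for i in np.argsort(-scores)[:top_k]]
    return [m for m in candidates if is_true_match(issue, m)]  # LLM filters false positives
```

The semantic search keeps the LLM bill bounded by only classifying a short candidate list, while the binary classification step supplies the precision that tracking metrics require.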
Crucially, Raindrop builds refinement into the tracking workflow. As users mark examples as correctly or incorrectly classified, the system generates clarifying questions and refines the issue definition. For example, when tracking memory failures, the system might discover through labeled examples that some apparent memory failures are actually hallucinations - the agent claimed the user said something they never said, which is a different type of problem requiring different solutions. By capturing these distinctions in the issue definition, tracking becomes more precise over time.

Combining signals with intents creates much more actionable issue categories. Instead of just tracking "user frustration" as a broad category, teams can track user frustration specifically during math homework help, or during pricing inquiries, or when trying to generate an application. Each combination of signal and intent becomes a distinct issue that can be understood, prioritized, and addressed independently.

Metadata and dimensions become critical for debugging. Just as Sentry shows which browsers, devices, and environments are affected by an error, Raindrop tracks which models are most commonly involved in an issue, which tools were invoked, whether those tools had errors, how many times tools were called on average, whether conversations used voice versus text input, and any other contextual information available. This dimensional data provides the breadcrumbs needed to understand root causes.

The tracking phase also involves continuous refinement of issue definitions. Ben emphasizes that the process of looking at production data, clustering, and creating new issues never truly ends. Teams need workflows that make it easy to iterate on issue definitions as their understanding evolves. This continuous refinement is essential because the subtlety and context-dependence of AI issues means that initial definitions are rarely perfect.

### Phase 3: Fixing Issues

While prompt engineering remains an important tool for fixing issues, Ben focuses on architectural approaches that go beyond prompt changes. The most significant is what he calls the Trellis framework, developed in collaboration with Leaves, a studio with over six million users across multiple AI applications.

The Trellis framework addresses a critical challenge in AI product development: how to grow and improve an agent without breaking things that already work. Teams frequently experience the problem where they get prompts working well enough to launch, gain initial users, and then become afraid to make changes because any prompt modification might degrade performance on existing use cases. The framework proposes systematically decomposing application functionality into independent modules that can be improved separately with minimal cross-talk.

The key insight is that tools are actually sub-agents. When ChatGPT generates an image, it makes a tool call with a description, and an entirely separate model generates the actual image. When it performs web search, a specialized web search agent interprets the request and executes it. This pattern of delegating specific capabilities to specialized sub-agents can be applied systematically throughout an application, as the sketch below illustrates.
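The following minimal sketch shows the generic shape of this pattern: a main agent exposes a tool whose handler is simply a second, narrowly prompted model. It is an illustration only, loosely modeled on the kind of delegation described next; the tool name `translate_to_query`, both prompts, and the model choices are invented for the example.

```python
# Sketch of "tools are sub-agents": the main agent only sees a tool named
# translate_to_query; the handler delegates to a separate, narrowly prompted
# model. Tool name, prompts, and model choices are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

SUB_AGENT_PROMPT = (
    "Translate the following natural-language request into a query in our "
    "internal query language. Return only the query."
)

def translate_to_query(description: str) -> str:
    """Specialized sub-agent: natural language -> internal query string."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # a small model dedicated to this one task
        messages=[{"role": "system", "content": SUB_AGENT_PROMPT},
                  {"role": "user", "content": description}],
    )
    return resp.choices[0].message.content.strip()

TOOLS = [{
    "type": "function",
    "function": {
        "name": "translate_to_query",
        "description": "Convert a natural-language description of the data the user wants into a query.",
        "parameters": {
            "type": "object",
            "properties": {"description": {"type": "string"}},
            "required": ["description"],
        },
    },
}]

def main_agent(user_message: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_message}],
        tools=TOOLS,
    )
    msg = resp.choices[0].message
    if msg.tool_calls:  # the main agent only describes intent; the sub-agent translates
        args = json.loads(msg.tool_calls[0].function.arguments)
        return translate_to_query(args["description"])
    return msg.content
```

In this arrangement, changes to the downstream capability only touch the sub-agent's prompt and model, not the main agent's prompt.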
Ben provides an example from Raindrop's own development: their system allows users to explore data and ask questions that get translated into queries in Raindrop's non-standard query language. Rather than forcing the main agent to understand all the details of this query language - which slowed it down, caused confusion, and created maintenance problems - they built a separate specialized model (using GPT-5 Nano for efficiency) that focuses solely on translating natural language descriptions into valid queries. The main agent describes what the query should accomplish in natural language, and the sub-agent handles the translation. This modular approach means changes to the query language only require updating the specialized sub-agent, not the entire main prompt.

The benefits of this sub-agent architecture are substantial. Teams can make targeted changes to specific functionality without affecting other parts of the system. They can experiment with different models for different sub-agents based on the requirements of each task. They can add new tools and capabilities without modifying the core agent. The approach maximizes modularity in a domain where true independence is difficult to achieve.

The Trellis framework specifically describes an iterative process: launch a working general agent, observe production usage to identify distinct intents and issues, discretize functionality by converting high-frequency intents into specialized sub-agents or workflows, and repeat this process recursively. This allows teams to start with a general-purpose agent and gradually introduce structure based on observed real-world usage patterns.

Tool monitoring becomes essential in this architecture. Teams need visibility into which tools are being called, whether those calls are succeeding or failing, and how many times tools are invoked per conversation. If a tool fails five times in a row within a single conversation, that's a critical issue that needs immediate attention. Ben emphasizes that even tool naming matters - Anthropic and OpenAI perform reinforcement learning on specific tool names, meaning the choice of name can significantly impact when and how tools are invoked.

The ultimate goal is to make AI behavior "engineered, repeatable, testable, and attributable" rather than accidental or magical. If improvements cannot be attributed to specific architectural decisions or prompt changes, they are likely to disappear with the next model update. Building systematic approaches to understanding and controlling agent behavior is the only path to reliable production systems.

## Technical Implementation Details

Throughout the presentation, Ben shares specific technical approaches Raindrop uses and recommends. For detecting implicit signals, they use LLMs not as general-purpose judges scoring subjective qualities, but as binary classifiers answering specific questions about whether particular issues are present. This is more reliable than trying to score outputs on continuous dimensions. They emphasize that accuracy requirements differ between discovery and tracking - discovery tolerates false positives because humans review clusters, while tracking needs precision for reliable metrics.

For clustering and search, Raindrop uses embeddings-based semantic search combined with LLM reranking. Their Deep Search feature starts with semantic search to find candidates, then uses an LLM to classify each candidate as a match or not. To make this economically viable at scale, they train lightweight models under the hood so they can process millions of events per day without prohibitive costs. This training happens automatically as users label examples as correct or incorrect matches.
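To show how classified events turn into the Sentry-style prioritization metrics discussed above, here is a minimal rollup sketch that counts distinct affected users per day for each signal-and-intent combination. The event schema and field names are assumptions for illustration, not Raindrop's data model.

```python
# Sketch: roll classified signal events up into daily issue metrics.
# Each event records a detected signal, the conversation's intent, and the
# user; the rollup counts distinct affected users per (day, signal, intent).
# The schema and field names are illustrative, not Raindrop's.
from collections import defaultdict
from dataclasses import dataclass
from datetime import date

@dataclass
class SignalEvent:
    day: date
    signal: str      # e.g. "frustration", "memory_failure"
    intent: str      # e.g. "math_homework_help", "pricing_inquiry"
    user_id: str

def daily_affected_users(events: list[SignalEvent]) -> dict[tuple[date, str, str], int]:
    buckets: dict[tuple[date, str, str], set[str]] = defaultdict(set)
    for event in events:
        buckets[(event.day, event.signal, event.intent)].add(event.user_id)
    return {key: len(users) for key, users in buckets.items()}

# Usage: sort the rollup by count to get a prioritized list, e.g.
# "frustration during pricing_inquiry affected 412 users yesterday".
```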
The presentation mentions several model capabilities that aid production deployments. Context-free grammars in GPT-4 allow constraining outputs to specific language grammars or DSLs, though Ben notes they can be slow in practice. Structured outputs allow defining schemas that outputs must conform to, which Raindrop uses extensively. Log probabilities, though not widely adopted, can be valuable for debugging specific prompts. Citation APIs from Cohere and Anthropic help with factual accuracy. These capabilities exist but are often buried in documentation and underutilized.

Ben emphasizes that a tool's name matters for tool calling because labs perform RL on specific tool names. This means the choice of what to name a tool or function is not merely a documentation concern but actually affects model behavior.

## Practical Guidance and Philosophy

Throughout the presentation, Ben offers pragmatic advice about building AI products. He repeatedly emphasizes applying traditional software engineering wisdom rather than assuming everything must be reinvented for AI. The parallels between unit tests and evals, between production monitoring and issue tracking, and between traditional debugging and agent debugging all hold value. Teams should use both testing and monitoring, just as they would with conventional software, adjusting the balance based on product criticality and maturity.

On the question of whether to wait for better models, Ben acknowledges models are improving but argues they already fail in silly ways and communication is inherently hard. As agents become more capable, they actually encounter more undefined behavior because the space of possible actions grows. Waiting for perfect models is not a viable strategy.

He encourages teams to look at their data more than they probably think necessary. Manual examination of conversations, especially those flagged by signals, reveals patterns that automated approaches might miss. Text search for keywords remains underutilized despite being simple and effective. Teams should instrument explicit signals throughout their applications and actually analyze the patterns, not just collect the data.

The presentation stresses that issue definitions need ongoing refinement. Initial definitions are rarely perfect, and as teams gain experience with production usage, they develop more nuanced understanding of what specific issues actually mean. Building workflows that make it easy to iterate on definitions is as important as the initial detection.

Finally, Ben emphasizes experimentation as the core principle. If there's one word to remember from the talk, it's "experimentation." The ability to test hypotheses, measure impact, and iterate based on production signals is what enables teams to build reliable agents. This requires infrastructure for monitoring, tooling for analysis, and workflows for continuous improvement.

## Business Context and Ecosystem

Raindrop positions itself as complementary to existing tools in the LLMOps ecosystem. Most customers use Raindrop alongside platforms like LangSmith or BrainTrust, which focus more on tracing, latency, token counts, and costs. Raindrop differentiates by focusing specifically on whether the app is performing well functionally - whether users are getting good responses - rather than operational metrics. They focus specifically on conversational and multi-turn applications rather than trying to be a general-purpose platform for all prompt usage.
The company has worked with customers like Leaves, a studio with over six million users, and has developed their framework and features based on real production challenges faced by teams operating AI agents at scale. The Deep Search feature, which is central to their value proposition, emerged from recognizing that teams need to test hypotheses against production data. For example, if token costs are rising, a team might hypothesize this is due to giving too many discounts. Deep Search allows them to query production conversations for examples of discount discussions, label a set of examples to refine the definition, and then track the prevalence of this pattern over time.

## Critical Assessment

The presentation offers a thoughtful and grounded perspective on production AI challenges. The focus on production monitoring over purely offline evaluation is well-justified given the examples provided, and the discover-track-fix framework provides a concrete methodology rather than vague aspirations. The distinction between explicit and implicit signals is valuable, and the emphasis on combining signals with intents to create actionable issue categories shows sophisticated thinking about the problem.

However, the presentation is naturally from Raindrop's perspective and positions their tooling as essential. Teams should evaluate whether the problems described are significant in their specific context. For applications with limited user bases or where issues have clear business metrics, the need for specialized AI monitoring tooling may be less urgent. The comparison to Sentry is apt but also somewhat aspirational - Sentry became essential because exceptions and errors are unambiguous and universal, while AI issues remain more context-dependent and product-specific.

The emphasis on LLM-based classification and semantic search for issue detection is reasonable, but teams should be aware this introduces its own complexities and potential failure modes. The need to train models for each issue definition and continuously refine them based on labeled examples means the monitoring system itself requires ongoing attention and maintenance. This may be the right tradeoff, but it's not zero-cost.

The architectural guidance around sub-agents and modularity is sound and likely applies well beyond Raindrop's platform. The Trellis framework in particular offers a practical pattern for managing complexity in agent systems. The emphasis on making AI behavior attributable and engineered rather than accidental is exactly the right mentality for production systems.

Overall, the presentation makes a compelling case that production monitoring deserves more attention in the AI product development lifecycle and offers a concrete framework for implementing it, while remaining honest about the challenges and the fact that perfect solutions do not yet exist.
