ZenML

Production Agent Observability and Monitoring Platform

Raindrop 2026

Raindrop addresses the challenge of monitoring and debugging AI agents in production environments where traditional testing and evaluation approaches fall short. As agents become more complex with multiple tools, memory sources, and sub-agents, the combinatorial explosion of possible behaviors makes comprehensive testing impractical. Raindrop provides a monitoring platform that combines explicit signals like error rates and latency with implicit signals detected through trained classifiers and regex patterns to identify issues like user frustration, task failures, refusals, and jailbreaking. The platform enables teams to set up alerts, run experiments comparing different agent versions in production, and use an automated triage agent to investigate spikes in problematic behaviors, helping AI engineering teams ship improvements faster while maintaining reliability.

Industry

Tech

Raindrop provides an observability and monitoring platform specifically designed for AI agents in production. The company was founded by Zuben (CEO) and includes Danny as a backend engineer working on SDKs. The platform addresses a fundamental shift in how AI systems need to be monitored as agents become increasingly complex and autonomous.

The Problem Space

The fundamental challenge Raindrop addresses is that agent failures are categorically different from traditional software failures. Agents are non-deterministic, operate with unbounded input and output spaces, and can use tools to affect other systems in arbitrary ways. This problem is intensifying for several reasons: agents are becoming more complex with exponentially growing tool sets, sessions can run for hours without user input, and agents are being deployed in high-stakes domains including healthcare, finance, and military applications where failures can be catastrophic.

Traditional evaluation paradigms, where test inputs are run against golden datasets to verify outputs, are insufficient for modern agents. As agents gain capabilities to call from diverse tool sets, access multiple memory sources, and recursively invoke sub-agents with their own tools and memory, the combinatorial explosion of possible behaviors makes comprehensive testing impossible. No test suite can cover all edge cases in this environment. This necessitates a paradigm shift from testing-centric to monitoring-centric approaches, similar to how traditional software engineering has always valued production monitoring over unit tests for catching long-tail issues.

Raindrop frames this as potentially “humanity’s last problem” - when humans can no longer effectively monitor agents and identify issues, agents will have surpassed human capabilities in fundamental ways. This positions production agent monitoring as one of the most critical challenges of our time.

Signal Architecture

Raindrop’s approach centers on establishing comprehensive signals for agent health. These signals fall into two categories: explicit and implicit.

Explicit signals deal with objective, verifiable metrics including tool error rates, overall error rates, latency, user regeneration frequency, and cost. These are straightforward to measure and can indicate problems when they spike or deviate from baseline patterns. For example, a spike in error rates typically signals something has gone wrong, while unusual flatness in certain metrics can also be meaningful.
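Baseline-deviation alerting on an explicit signal can be sketched in a few lines. This is a minimal illustration of the idea (rolling baseline plus a spike threshold), not Raindrop's actual detection logic; the window size and multiplier are arbitrary choices.

```python
from collections import deque

class SpikeDetector:
    """Flag an explicit signal (e.g. a daily tool error rate) that
    deviates sharply from its rolling baseline. Window and multiplier
    are illustrative defaults, not Raindrop's."""

    def __init__(self, window=7, multiplier=2.0):
        self.history = deque(maxlen=window)
        self.multiplier = multiplier

    def observe(self, daily_rate):
        # Compare today's rate against the mean of the trailing window.
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            spiked = daily_rate > baseline * self.multiplier
        else:
            spiked = False  # not enough history to judge yet
        self.history.append(daily_rate)
        return spiked

detector = SpikeDetector()
rates = [0.02, 0.03, 0.02, 0.02, 0.03, 0.02, 0.02, 0.09]
alerts = [detector.observe(r) for r in rates]
# only the final jump to 9% trips the alert
```

The same structure works for latency, cost, or regeneration frequency; unusual flatness can be caught symmetrically with a lower bound.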

Implicit signals are more nuanced and technically challenging. These include regex-based pattern matching, trained classifiers, and self-diagnostics. The platform emphasizes that the most effective implicit signals focus on detecting specific issues rather than general quality judgments. Instead of asking an LLM to rate a response on a scale of one to ten or judge overall quality, Raindrop advocates for binary classifiers targeting specific, actionable problems.

The platform provides out-of-the-box detection for several common failure modes across agent products: refusals (when the assistant says it cannot perform a task), task failures (when the agent cannot complete what was requested), user frustration, content moderation issues, NSFW content, jailbreaking attempts, and even positive signals indicating wins. These classifiers are implemented using trained models rather than running full LLMs on every output, which would effectively double AI spending and become untenable at scale. The trained models can detect issues like user frustration across multiple languages without requiring language-specific implementations.

Regex-based signals provide a cheap but effective detection mechanism. The presentation highlighted how Claude Code’s leaked source code revealed a user prompt keywords file containing regex patterns that search for indicators like “WTF”, “this sucks”, and “horrible” - all signals of user frustration. Tracking the resulting boolean flag over time and after each product release provides an inexpensive way to monitor product health.
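A frustration flag of this kind fits in a few lines of Python. The patterns below are modeled on the Claude Code example; the real keyword file is presumably far more extensive.

```python
import re

# Illustrative patterns only, modeled on the leaked keyword file
# described above; a production list would be much longer.
FRUSTRATION_PATTERNS = [
    r"\bwtf\b",
    r"\bthis sucks\b",
    r"\bhorrible\b",
]
FRUSTRATION_RE = re.compile("|".join(FRUSTRATION_PATTERNS), re.IGNORECASE)

def is_frustrated(user_message: str) -> bool:
    """Boolean flag: does the message match any frustration pattern?"""
    return bool(FRUSTRATION_RE.search(user_message))

is_frustrated("WTF, the deploy broke again")  # True
is_frustrated("thanks, that worked nicely")   # False
```

Aggregating this flag per day or per release gives the cheap health metric the presentation describes, with zero model inference cost.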

Self-Diagnostics Implementation

A particularly interesting capability is self-diagnostics, where agents introspect their own behavior and report issues. This approach was inspired by OpenAI’s December research on training models to self-confess misalignment issues, including dishonesty, scheming, hallucinations, and unintended shortcuts. Modern large models trained on reasoning have become effective at self-introspection when prompted appropriately.

Self-diagnostics can catch a broad range of issues spanning both explicit and implicit categories. Agents become aware when tools fail repeatedly and will often “rant” about tool failures in their reasoning traces. They recognize when users are frustrated and begin responding diplomatically. They can identify capability gaps where users request features the agent lacks, essentially providing built-in feature request detection. They also exhibit self-correction behaviors, such as attempting to bypass sandbox restrictions to complete tasks, which can be beneficial for task completion but problematic from a security perspective.

Implementing self-diagnostics is remarkably straightforward - it requires only adding a single tool the agent can call and a line in the system prompt encouraging its use. During the workshop portion, the implementation was demonstrated with a basic coding agent that had four tools: read, write, bash, and edit. When the write tool was artificially broken to return permission errors, the agent would use bash commands with heredoc syntax to work around the failure and would report this behavior through the self-diagnostic tool.

The effectiveness of self-diagnostics depends significantly on framing and naming. Models are trained to produce polished outputs and are reluctant to self-incriminate. Naming a tool something like “unsafe bash tool use” will suppress reporting because the agent believes it got the job done appropriately. Instead, framing it as a generic “report tool” that sends feedback to the agent’s creators proves more effective. The tool description encourages the agent to send short reports to its creator, and the system prompt asks the agent to use the report tool before giving final answers to surface anything notable.
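A minimal sketch of such a tool, written as an OpenAI-style function-calling schema, might look like the following. The schema shape, names, and prompt wording are illustrative assumptions, not Raindrop's actual implementation.

```python
# A generic, neutrally framed "report" tool. The neutral name and
# "send feedback to your creators" framing avoid triggering the
# model's reluctance to self-incriminate. Schema and wording are
# illustrative, not Raindrop's actual definitions.
REPORT_TOOL = {
    "type": "function",
    "function": {
        "name": "report",
        "description": (
            "Send a short report to your creators about anything "
            "notable in this session: tool failures, user frustration, "
            "missing capabilities, or workarounds you attempted."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "summary": {
                    "type": "string",
                    "description": "One or two sentences.",
                },
            },
            "required": ["summary"],
        },
    },
}

SYSTEM_PROMPT_ADDITION = (
    "Before giving your final answer, use the report tool to surface "
    "anything notable from this session."
)

def handle_report(summary: str) -> None:
    # The sink is up to you: Raindrop's SDK, a Slack webhook, or logs.
    print(f"[self-diagnostic] {summary}")
```

Wiring `handle_report` to Slack reproduces the workshop's platform-free setup; Raindrop's SDK injects an equivalent tool automatically.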

The workshop demonstrated that self-diagnostics can be set up with minimal effort - just a tool definition and system prompt modification - and doesn’t even require using Raindrop’s platform. The tool could simply send messages to Slack for basic notification functionality. However, Raindrop’s SDK includes built-in self-diagnostic support, automatically injecting the tool without requiring manual setup.

Production Monitoring and Experimentation

The platform provides day-by-day visualization of different signal rates, allowing teams to spot spikes in user frustration, task failures, laziness, or refusals. When issues are identified, teams can drill down to see specific examples with full context. Alerts can be configured on any signal, enabling proactive notification when problem rates exceed thresholds.

Beyond reactive monitoring, Raindrop enables experimentation workflows. Teams can ship changes - whether model updates, prompt modifications, agent harness alterations, or new tools - to a percentage of users while maintaining a control group. The platform then tracks how implicit and explicit signals change between groups. For example, an experiment might show that a new prompt version reduced user frustration from 37% to 9%, decreased complaints about aesthetics and deployment issues, but increased average tool usage. This provides production-validated evidence of whether changes improve or degrade the user experience.
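Comparing a signal between arms reduces to computing flagged-event rates per group. The event shape below is hypothetical; the rates mirror the 37% versus 9% frustration example.

```python
def frustration_rate(events):
    """Share of events flagged with a user_frustration signal.
    The event dict shape here is illustrative, not Raindrop's schema."""
    flagged = sum(1 for e in events if e["signals"].get("user_frustration"))
    return flagged / len(events)

# Synthetic events matching the 37% -> 9% example from the text.
control = [{"signals": {"user_frustration": i < 37}} for i in range(100)]
treatment = [{"signals": {"user_frustration": i < 9}} for i in range(100)]

frustration_rate(control)    # 0.37
frustration_rate(treatment)  # 0.09
```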

While traditional evaluation datasets remain useful, nothing replaces observing actual production behavior. The statistical significance threshold is pragmatic - the platform becomes useful as soon as there are enough events (a few hundred) that manually reading all inputs and outputs becomes impractical. Even without rigorous statistical significance, seeing user frustration spike provides an actionable signal, especially when patterns emerge showing that all issues relate to a specific failing tool.

Teams can track different feature launches by sending metadata with events, such as new tool names, experiment flags, or version identifiers. Raindrop automatically creates experiments based on this metadata and can also pipe data to external platforms like StatSig for more sophisticated statistical analysis.
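As a sketch of what such a metadata-enriched event might look like before it is sent over OpenTelemetry or a direct integration (all field names here are illustrative assumptions, not Raindrop's actual schema):

```python
import datetime
import json

def build_event(user_id, transcript, **metadata):
    """Shape of an agent event enriched with experiment metadata.
    Field names are illustrative, not Raindrop's actual schema."""
    return {
        "user_id": user_id,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "transcript": transcript,
        # e.g. experiment flags, version identifiers, new tool names --
        # the keys Raindrop groups on when auto-creating experiments
        "metadata": metadata,
    }

event = build_event(
    "user-123",
    [{"role": "user", "content": "deploy my app"}],
    experiment="new-planner-prompt",
    version="2026.02.1",
    tool="deploy_v2",
)
json.dumps(event)  # JSON-serializable for an HTTP body or OTel attribute
```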

Advanced Capabilities

Raindrop includes a triage agent that automatically investigates signal spikes. Every day, it examines all configured signals including user frustration rates and regex patterns. When it detects unusual spikes, it uses a suite of investigative tools to examine traces and automatically surface issues that weren’t previously known. This autonomous investigation capability helps teams scale their monitoring beyond what human operators could manually review.

The platform offers a “deep search” feature using natural language queries to find specific patterns across the product, such as all instances where the agent made a particular type of error. This can automatically create new binary classifiers deployed cheaply at scale, expanding the signal set dynamically based on discovered issues.

Trajectory visualization provides a unique way to understand tool usage patterns. Teams can describe the type of trace they want to examine using natural language rather than configuring complex filters. The visualization shows which tools were called in what order, which ones errored, and the specific inputs and outputs for each tool. This topological view helps identify similar failure patterns across different sessions.

The platform supports cluster analysis to understand how users interact with agents. Teams can identify different user intents and use cases, then measure issue rates and user frustration for each cluster. This reveals which workflows are problematic versus which work reliably.

Integration and Deployment

Raindrop integrates with existing observability stacks through standard telemetry protocols. Many customers already have OpenTelemetry streams and simply add Raindrop as another target. The platform can ingest historical data and backfill signal classifications for past events when new signals are created, enabling immediate analysis of recent behavior even for newly defined signals.
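A backfill pass over stored events can be sketched in a few lines; the storage format and classifier interface here are assumptions, and a real classifier would be a trained model or regex signal rather than a lambda.

```python
def backfill_signal(stored_events, signal_name, classifier):
    """Apply a newly defined signal classifier to historical events,
    so past behavior can be analyzed immediately. `classifier` is any
    callable mapping an event's text to a bool; event shape is
    illustrative."""
    for event in stored_events:
        event.setdefault("signals", {})[signal_name] = classifier(event["text"])
    return stored_events

events = [
    {"text": "I cannot help with that request."},
    {"text": "Here is the refactored function."},
]
# Toy refusal classifier standing in for a trained model or regex signal.
backfill_signal(events, "refusal", lambda t: "cannot help" in t.lower())
# events[0] is now flagged as a refusal; events[1] is not
```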

For customers with complex experimental designs running multiple concurrent experiments, Raindrop provides a query API that exports signal-tagged data to BigQuery, Snowflake, or other analytics platforms where more sophisticated analysis can be performed. This supports teams that need to analyze compound effects across multiple experimental variables.

SDK support spans multiple languages, with JavaScript/TypeScript having the most mature implementation, including built-in self-diagnostics injection. Python support existed but was acknowledged as weaker at the time of the presentation, though it was being actively improved: a dedicated engineer had released SDK implementations across ten languages in the preceding month.

Production Usage Patterns

Real-world usage typically involves teams sending complete agent trajectories including transcripts, tool usage, and all context through OpenTelemetry or direct integration. Teams configure signals in Raindrop for issues they care about, which vary significantly by use case. A coding agent monitors different signals than a companion bot or legal application.

The feedback loop involves using signals to improve prompting, models, and agent architecture, then validating whether changes reduced specific issue rates in production. Beyond quality improvement, many teams use clustering to understand user behavior patterns and identify which intents have high versus low issue rates.

Daily alerting provides breakdowns of current issues, deltas from the previous day, and root cause analysis showing which tools or prompts correlate with problems. This creates a Sentry-like experience for semantic agent failures rather than just technical exceptions.

One customer scenario involved an unexpected database provider failure being automatically detected through a spike in user frustration. The triage agent investigated the pattern and identified that users working with a specific PostgreSQL provider were experiencing issues, providing rapid root cause identification without manual investigation.

Practical Considerations and Tradeoffs

The platform acknowledges several practical challenges. Running full LLMs on every output would double AI costs, which is why Raindrop trains specialized smaller models for classification tasks. However, for teams just starting out or operating at smaller scale, running LLMs on every output can be acceptable and easier to implement initially.

Models are generally trained to appear polished and are hesitant to admit faults. This affects both self-diagnostics implementation and classifier design. The reluctance to self-incriminate means tool naming and framing significantly impacts effectiveness - generic reporting tools work better than tools that explicitly call out negative behaviors.

Statistical significance in experiments depends on sample size and traffic volume. For high-traffic applications, differences become apparent within minutes. For lower-traffic scenarios, longer observation periods may be necessary. The platform takes a pragmatic approach where even directional insights from modest sample sizes can guide decision-making, with teams deciding their own confidence thresholds.
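The sample-size tradeoff can be made concrete with a standard two-proportion z-test, shown here in stdlib Python. This is generic statistics rather than a Raindrop API; the numbers match the 37% versus 9% frustration example discussed earlier.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in rates between two
    experiment arms (textbook formula, not a Raindrop API)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return z, p_value

# 37% vs 9% frustration with 100 events per arm is already decisive.
z, p = two_proportion_z(37, 100, 9, 100)
```

With only a few dozen events per arm, the same rates may fall short of conventional thresholds, which is why directional reads and team-chosen confidence levels are the pragmatic default.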

For teams experiencing rapid iteration with many feature flags and constant changes, the platform supports comparing data across different configurations by tracking metadata and enabling filtered analysis. However, the fundamental challenge of needing production deployment before gathering meaningful data remains - there’s no way to fully validate agent behavior without real user interactions.

The platform is actively hiring to expand capabilities and offers a free trial period to evaluate fit. Pricing and exact trial length details were not discussed in depth, though the team indicated willingness to extend trial periods for interested parties who contact them directly.
