Company: Dropbox
Title: A Practical Blueprint for Evaluating Conversational AI at Scale
Industry: Tech
Year: 2025

Summary (short): Dropbox shares their comprehensive approach to building and evaluating Dropbox Dash, their conversational AI product. The company faced challenges with ad-hoc testing leading to unpredictable regressions, where changes to any part of their LLM pipeline (intent classification, retrieval, ranking, prompt construction, or inference) could cause previously correct answers to fail. They developed a systematic, evaluation-first methodology that treats every experimental change like production code, requiring rigorous testing before merging. Their solution involved curating diverse datasets (both public and internal), defining actionable metrics using LLM-as-judge approaches that outperformed traditional metrics like BLEU and ROUGE, adopting the Braintrust evaluation platform, and automating evaluation throughout the development-to-production pipeline. The result is a robust system with layered gates that catch regressions early, continuous live-traffic scoring for production monitoring, and a feedback loop for continuous improvement that significantly improved reliability and deployment safety.
## Overview

Dropbox's case study on building Dropbox Dash provides a detailed blueprint for implementing rigorous evaluation of conversational AI systems at scale. The company emphasizes that while LLM applications present a deceptively simple interface (a single text box), they conceal complex multi-stage probabilistic pipelines involving intent classification, document retrieval, ranking, prompt construction, model inference, and safety filtering. The central insight driving their work is that a modification to any link in this chain can ripple unpredictably through the system, turning previously accurate answers into hallucinations. This made evaluation not merely important but essential, as critical as model training itself.

The journey began with relatively unstructured, ad-hoc testing. As Dropbox experimented with the product, however, they found that genuine progress came less from model selection and more from refining the surrounding processes: optimizing information retrieval, iterating on prompts, and balancing consistency with variety in generated responses. This realization led to a fundamental shift in which evaluation became a first-class concern rather than an afterthought. The guiding principle became treating every change with the same rigor as shipping production code, requiring all updates to pass comprehensive testing before merging.

## Dataset Curation Strategy

Dropbox's evaluation approach begins with carefully curated datasets combining public and internal sources. For initial baselines, they leveraged publicly available datasets: Google's Natural Questions (testing retrieval from large documents), MS MARCO, the Microsoft Machine Reading Comprehension dataset (emphasizing multiple document hits per query), and MuSiQue (challenging multi-hop question answering). Each dataset stressed the system differently, providing early signals about how parameter choices and system configurations would perform.

Public datasets alone, however, proved insufficient for capturing the long tail of real-world user behavior. To close this gap, the team collected production logs from Dropbox employees dogfooding Dash internally and built two distinct types of evaluation sets from them. Representative query datasets mirrored actual user behavior by anonymizing and ranking top internal queries, with annotations provided through proxy labels or internal annotators. Representative content datasets focused on the materials users relied on most heavily: widely shared files, documentation, and connected data sources. From this content, they used LLMs to generate synthetic questions and answers spanning diverse scenarios, including tables, images, tutorials, and factual lookups (a sketch of this generation step appears at the end of this section).

This combination of public benchmarks and internal, real-world data created a comprehensive evaluation foundation that reflected genuine usage patterns and edge cases. The approach reflects a pragmatic understanding that while public datasets provide standardization and comparability, production-quality systems must also be evaluated on domain-specific, organization-specific data that captures how users actually behave.
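As referenced above, the content-to-question generation step is described only in prose, so the following is a minimal Python sketch of what it might look like. The `llm_complete` stub, the prompt wording, and the `SyntheticExample` fields are illustrative assumptions, not Dropbox's actual implementation.

```python
import json
from dataclasses import dataclass

# Hypothetical stand-in for whatever LLM client is actually used;
# wire this to your provider of choice.
def llm_complete(prompt: str) -> str:
    raise NotImplementedError

GENERATION_PROMPT = """\
You are generating evaluation data for an enterprise assistant.
From the document excerpt below, write {n} question/answer pairs that a
real employee might ask, covering varied scenarios (factual lookup,
how-to, table reading). Every answer must be fully supported by the excerpt.
Return a JSON list of objects with keys "question", "answer", "scenario".

Document excerpt:
{chunk}
"""

@dataclass
class SyntheticExample:
    question: str
    answer: str
    scenario: str
    source_doc: str

def generate_examples(doc_id: str, chunk: str, n: int = 3) -> list[SyntheticExample]:
    """Ask the LLM for n grounded Q&A pairs over a single content chunk."""
    raw = llm_complete(GENERATION_PROMPT.format(n=n, chunk=chunk))
    return [
        SyntheticExample(p["question"], p["answer"], p["scenario"], doc_id)
        for p in json.loads(raw)
    ]
```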
## Metrics Evolution and LLM-as-Judge

One of the most significant insights from Dropbox's experience concerns the limitations of traditional NLP metrics for evaluating production conversational AI. The team initially considered standard metrics such as BLEU, ROUGE, METEOR, BERTScore, and embedding cosine similarity. While these proved useful for quick sanity checks and catching egregious drift, they could not enforce deployment-ready correctness for real-world tasks like source-cited answer retrieval, internal wiki summarization, or tabular data parsing.

The specific failures were instructive. BLEU handled exact word overlap but failed on paraphrasing, fluency, and factuality. ROUGE performed well on recall-heavy matching but missed source attribution issues and hallucinations. BERTScore captured semantic similarity but could not assess granular errors or citation gaps. Embedding similarity measured vector-space proximity but said nothing about faithfulness, formatting, or tone. Most critically, these metrics would frequently assign high scores to responses that lacked proper citations, contained hallucinated file names, or buried factual errors within otherwise fluent text.

This led Dropbox to embrace LLM-as-judge approaches, in which one LLM evaluates another's outputs. While this may seem recursive, it unlocks significant flexibility: evaluating factual correctness against ground truth or context, assessing proper citation, enforcing formatting and tone requirements, and scaling across dimensions that traditional metrics ignore. The key insight is that LLMs excel at scoring natural language when the evaluation problem is clearly framed.

Dropbox approached LLM judges as software modules requiring design, calibration, testing, and versioning. Their evaluation template takes the query, the model answer, the source context (when available), and occasionally a hidden reference answer as inputs. The judge prompt guides the model through structured questions: does the answer address the query, are factual claims supported by the context, and does the answer maintain clarity, proper formatting, and consistent voice? Judges respond with both justifications and scores (scalar or categorical, depending on the metric). A minimal sketch of such a judge appears at the end of this section.

Importantly, the team recognized that judge models and rubrics themselves require evaluation and iteration. They ran spot-checks every few weeks on sampled outputs with manual labels, creating calibration sets for tuning judge prompts, benchmarking human-model agreement rates, and tracking drift over time. When judge behavior diverged from gold standards, they updated prompts or underlying models. While LLM judges automated most coverage, human spot-audits remained essential, with engineers manually reviewing 5-10% of regression suites for each release. Discrepancies were logged and traced to prompt bugs or model hallucinations, and recurring issues triggered prompt rewrites or more granular scoring approaches.
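The article does not publish Dropbox's judge prompt, so the following is a minimal sketch of how the judge template described above might be wired up, assuming the same hypothetical `llm_complete` helper as before. The inputs (query, context, answer) and the idea of returning a justification plus scalar and categorical scores come from the write-up; the rubric wording, JSON fields, and score scale are assumptions.

```python
import json

# Hypothetical LLM call, as in the earlier sketch; wire it to any provider.
def llm_complete(prompt: str) -> str:
    raise NotImplementedError

JUDGE_PROMPT = """\
You are grading an AI assistant's answer to an internal search query.

Question: {query}
Retrieved context: {context}
Assistant answer: {answer}

Work through these checks, then give a verdict:
1. Does the answer actually address the question?
2. Is every factual claim supported by the retrieved context?
3. Are sources cited where the answer relies on a document?
4. Are formatting, tone, and clarity acceptable?

Respond only with JSON:
{{"justification": "<short paragraph>", "correctness": <0.0-1.0>,
  "citations_present": <true|false>}}
"""

def judge_answer(query: str, context: str, answer: str) -> dict:
    """Score one (query, context, answer) triple with an LLM judge."""
    verdict = json.loads(
        llm_complete(JUDGE_PROMPT.format(query=query, context=context, answer=answer))
    )
    # A downstream gate can treat citations_present as a boolean gate and
    # correctness as a scalar budget, as described in the next section.
    return verdict
```

In Dropbox's framing, a prompt like this is itself a versioned artifact that gets recalibrated against human-labeled samples when its scores drift.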
## Structured Metric Enforcement

To make evaluation actionable and enforceable, Dropbox defined three types of metrics with distinct roles in the development pipeline. Boolean gates check conditions like "Citations present?" or "Source present?" and impose hard failures that prevent a change from moving forward. Scalar budgets establish thresholds such as source F1 ≥ 0.85 or p95 latency ≤ 5 seconds and block deployment of changes that violate them. Rubric scores assess dimensions like tone, formatting, and narrative quality, and are logged to dashboards for monitoring over time rather than imposing hard blocks.

Every model version change, retriever setting adjustment, or prompt modification was checked against these dimensions; if performance dropped below thresholds, the change was blocked. Critically, metrics were integrated into every development stage rather than bolted on afterwards. Fast regression tests ran automatically on every pull request, full curated dataset suites ran in staging, and live traffic was continuously sampled and scored in production. Dashboards consolidated results, making it easy to visualize key metrics, pass/fail rates, and trends over time.

## Evaluation Platform Implementation

As experimentation cycles accumulated, Dropbox recognized the need for more structured artifact and experiment management and adopted Braintrust as their evaluation platform. The platform provided four key capabilities that transformed their workflows. First, a central store offering a unified, versioned repository for datasets and experiment outputs. Second, an experiment API where each run was defined by its dataset, endpoint, parameters, and scorers, producing an immutable run ID (with lightweight wrappers simplifying run management). Third, dashboards with side-by-side comparisons that highlighted regressions instantly and quantified trade-offs across latency, quality, and cost. Fourth, trace-level debugging enabling single-click access to retrieval hits, prompt payloads, generated answers, and judge critiques.

The shift from spreadsheets to a dedicated platform addressed several critical pain points. Spreadsheets worked for quick demos but broke down during real experimentation, scattering results across files, making reproduction difficult, and preventing reliable side-by-side comparisons. When two people ran the same test with slightly different prompts or model versions, tracking what changed and why became nearly impossible. The evaluation platform provided versioning, reproducibility, automatic regression surfacing, and shared visibility across teams, enabling collaboration without slowing anyone down.

## Automated Pipeline Integration

Dropbox treated prompts, context selection settings, and model choices with the same rigor as application code, subjecting them to identical automated checks. Every pull request triggered approximately 150 canonical queries, automatically judged, with results returning in under ten minutes. After merge, the system reran the full suite alongside quick smoke checks for latency and cost, halting the change if any red lines were crossed. The canonical queries, though small in number, were carefully selected to cover critical scenarios including multiple document connectors, "no-answer" cases, and non-English queries. Each test recorded the exact retriever version, prompt hash, and model choice to guarantee reproducibility. If scores dropped below thresholds, for instance if too many answers lacked citations, the build stopped. This setup caught regressions at the pull-request level that previously slipped into staging.
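A merge gate of this kind can be expressed as a small script that fails the build whenever a boolean gate or scalar budget is violated. The sketch below assumes a hypothetical `eval_harness` module for loading the canonical queries, running the pipeline, and invoking the judge; the 0.85 source-F1 and five-second p95 budgets are the figures quoted in the article, and everything else is illustrative.

```python
import sys
from statistics import mean

# Hypothetical harness module: these helpers stand in for the real pipeline,
# the LLM judge from the earlier sketch, and the curated canonical queries.
from eval_harness import load_canonical_queries, run_pipeline, judge_answer

SOURCE_F1_BUDGET = 0.85      # scalar budget cited in the article
P95_LATENCY_BUDGET_S = 5.0   # scalar budget cited in the article

def source_f1(retrieved: set[str], expected: set[str]) -> float:
    """Precision/recall balance over the source documents a response cites."""
    tp = len(retrieved & expected)
    if tp == 0:
        return 0.0
    precision = tp / len(retrieved)
    recall = tp / len(expected)
    return 2 * precision * recall / (precision + recall)

def main() -> int:
    examples = load_canonical_queries()              # ~150 curated cases
    results = [run_pipeline(e.query) for e in examples]
    verdicts = [judge_answer(e.query, r.context, r.answer)
                for e, r in zip(examples, results)]

    latencies = sorted(r.latency_s for r in results)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    mean_f1 = mean(source_f1(set(r.sources), set(e.expected_sources))
                   for e, r in zip(examples, results))

    failures = []
    if not all(v["citations_present"] for v in verdicts):
        failures.append("boolean gate: some answers are missing citations")
    if mean_f1 < SOURCE_F1_BUDGET:
        failures.append(f"scalar budget: source F1 {mean_f1:.2f} < {SOURCE_F1_BUDGET}")
    if p95 > P95_LATENCY_BUDGET_S:
        failures.append(f"scalar budget: p95 latency {p95:.1f}s > {P95_LATENCY_BUDGET_S}s")

    for msg in failures:
        print("FAIL:", msg)
    return 1 if failures else 0   # a nonzero exit blocks the pull request

if __name__ == "__main__":
    sys.exit(main())
```

The nonzero exit status is what actually blocks the pull request; the same suite can be rerun after merge alongside the latency and cost smoke checks.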
For large refactors or engine updates that might hide subtle regressions, Dropbox implemented on-demand synthetic sweeps for comprehensive end-to-end evaluation. These sweeps began with golden datasets and could be dispatched as Kubeflow DAGs (directed acyclic graphs in Kubeflow Pipelines), running hundreds of requests in parallel. Each run was logged under a unique run_id, facilitating comparison against the last accepted baseline. The sweeps focused on RAG-specific metrics including binary answer correctness, completeness, source F1 (the precision-recall balance for retrieved sources), and source recall. Drift beyond predefined thresholds triggered automatic flags, and LLMOps tooling let engineers slice traces by retrieval quality, prompt version, or model settings to pinpoint problematic stages before changes reached staging.

## Production Monitoring and Live-Traffic Scoring

While offline evaluation proved critical, Dropbox recognized that real user queries are the ultimate test. To catch silent degradations immediately, they continuously sampled live production traffic and scored it using the same metrics and logic as the offline suites, with all of this work guided by Dropbox's AI principles. Each response, along with its context and retrieval traces, was logged and routed through automated judgment measuring accuracy, completeness, citation fidelity, and latency in near real time. Dashboards visible to both engineering and product teams tracked rolling quality and performance medians over one-hour, six-hour, and 24-hour intervals. When metrics drifted beyond set thresholds, such as a sudden drop in source F1 or a latency spike, alerts fired immediately, enabling a response before end users felt the impact. Because scoring ran asynchronously, in parallel with user requests, production traffic experienced no added latency. This real-time feedback loop enabled quick issue detection, closed the gap between code and user experience, and maintained reliability as the system evolved.

## Layered Gate Architecture

To control risk as changes moved through the pipeline, Dropbox implemented layered gates that gradually tightened requirements and brought evaluation environments closer to real-world usage. The merge gate ran curated regression tests on every change, passing only those that met baseline quality and performance. The stage gate expanded coverage to larger, more diverse datasets with stricter thresholds, checking for rare edge cases. The production gate continuously sampled and scored real traffic to catch issues that emerge only at scale, with automated alerts and potential rollbacks when metrics dipped below thresholds. By progressively scaling dataset size and realism at each gate, they blocked regressions early while keeping staging and production evaluations closely aligned with real-world behavior. This layered approach represents a risk-management strategy that balances development velocity with production stability.

## Continuous Improvement Loop

Dropbox emphasizes that evaluation is not a phase but a feedback loop: systems that learn from their mistakes evolve faster than any roadmap allows. While gates and live-traffic scoring provide safeguards, building resilient AI systems requires evaluation to drive continuous learning. Every low-scoring output, flaky regression, or drifted metric is not just a red flag but an opportunity for end-to-end improvement. By mining low-rated traces from live traffic, the team uncovered failure patterns that synthetic datasets often missed: retrieval gaps on rare file formats, prompts truncated by context window limits, inconsistent tone in multilingual inputs, and hallucinations triggered by underspecified queries.
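A sketch of how low-rated production traces might be turned into grouped hard negatives is shown below; the `ScoredTrace` fields, the failure tags, and the 0.5 threshold are assumptions for illustration, not details from the article.

```python
from collections import defaultdict
from dataclasses import dataclass, asdict

@dataclass
class ScoredTrace:
    """A sampled production interaction plus its automated judge scores."""
    query: str
    answer: str
    sources: list[str]
    correctness: float      # judge score in [0, 1]
    failure_tag: str        # e.g. "retrieval_gap", "truncated_prompt", "tone"

def mine_hard_negatives(traces: list[ScoredTrace],
                        threshold: float = 0.5) -> dict[str, list[dict]]:
    """Group low-rated traces by failure pattern so each pattern can seed
    labeled regression examples or a new synthetic sweep variant."""
    buckets: dict[str, list[dict]] = defaultdict(list)
    for trace in traces:
        if trace.correctness >= threshold:
            continue                      # keep only the failures
        example = asdict(trace)
        example["needs_label"] = True     # route to annotators or proxy labels
        buckets[trace.failure_tag].append(example)
    return dict(buckets)
```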
Such hard negatives flowed directly into the next dataset iteration, with some becoming labeled regression-suite examples and others spawning new synthetic sweep variants. Over time this created a virtuous cycle in which the system was stress-tested on exactly the edge cases where it had previously failed.

For riskier experiments, such as new chunking policies, reranking models, or tool-calling approaches not yet ready for the production gates, the team built a structured A/B playground for controlled experiments against consistent baselines. Inputs included golden datasets, user cohorts, or synthetic clusters; variants covered different retrieval methods, prompt styles, or model configurations; outputs spanned trace comparisons, judge scores, and latency and cost budgets. This safe space let tweaks prove their value or fail fast without consuming production bandwidth.

To accelerate debugging of multi-stage pipeline failures, they developed playbooks guiding engineers to the likely cause: check retrieval logs if documents were never retrieved, review prompt structure and truncation risk if context was included but ignored, or re-run against calibration sets and human labels if judges mis-scored answers. These playbooks became part of triage, ensuring regressions were traced systematically rather than debated.

## Cultural and Organizational Aspects

Critically, Dropbox emphasizes the cultural dimension of their evaluation approach. Evaluation wasn't owned by a single team but embedded into everyday engineering practice. Every feature pull request linked to evaluation runs. Every on-call rotation included dashboards and alert thresholds. Every piece of negative feedback was triaged and reviewed. Every engineer owned the impact of their changes on quality, not just correctness. While speed mattered for shipping new products, the cost of mistakes could be high, and predictability came from guardrails built on evaluation.

## Key Lessons and Limitations

One of the team's biggest surprises was that many regressions originated not from swapping models but from editing prompts: a single word change in an instruction could severely impact citation accuracy or formatting quality. Formal gates, rather than human review, became the only reliable safety net. They also learned that judge models and rubrics are not set-and-forget assets; they require their own versioning, testing, and recalibration. For evaluating responses in other languages or niche technical domains, specialized judges proved the only way to maintain fair and accurate scoring.

Looking critically at this case study, several considerations warrant attention. While Dropbox presents a comprehensive and sophisticated approach, the resource investment required for this level of evaluation rigor is substantial. Not every organization has the engineering capacity to implement such extensive automated testing infrastructure, maintain multiple evaluation datasets, develop and calibrate LLM judges, and operate continuous monitoring systems. The reliance on Braintrust as a platform introduces vendor dependency, though the underlying principles could presumably be implemented with other tools or custom infrastructure. The LLM-as-judge approach, while pragmatic, introduces its own complexities and potential failure modes: judge models can themselves hallucinate or exhibit biases, and the calibration process requires ongoing human oversight.
The claim that LLM judges outperform traditional metrics should also be read as context-specific rather than universal: it holds for certain tasks and evaluation criteria, not necessarily all. The spot-audit rate of 5-10% provides some human oversight but still leaves substantial room for undetected judge errors. Finally, the case study focuses primarily on text-based conversational AI with RAG; while Dropbox mentions extending to images, video, and audio as future work, the approach's applicability to other LLM use cases (code generation, creative writing, structured data extraction) remains unclear, and different use cases may require fundamentally different evaluation strategies.

## Future Directions

Dropbox articulates a vision for making evaluation proactive rather than purely protective. This involves moving beyond accuracy to measure user delight, task success, and confidence in answers. They envision self-healing pipelines that suggest fixes when metrics drop, shortening debug loops, and they plan to extend coverage beyond text to images, audio, and low-resource languages so that evaluation reflects how people actually work.

The overarching takeaway is that treating evaluation as a first-class discipline, anchored in rigorous datasets, actionable metrics, and automated gates, can transform probabilistic LLMs into dependable products. While the specific implementation details reflect Dropbox's resources and context, the principles of systematic dataset curation, actionable metrics, automated pipeline integration, continuous monitoring, and feedback-driven improvement offer valuable guidance for any organization building production LLM systems. The emphasis on treating evaluation infrastructure with the same rigor as production code represents a mature perspective on LLMOps, one that acknowledges the unique challenges of deploying probabilistic systems at scale.
