Multi-Path AI Evaluation Framework for Marketplace Trust and Safety

Thumbtack 2026

Thumbtack, a marketplace connecting customers with local service professionals, developed a comprehensive evaluation framework to ensure reliability, safety, and quality in their expanding GenAI features. Facing challenges inherent to probabilistic AI outputs—including tone inconsistencies, inaccuracies, and potential for harmful content—the company implemented a hybrid evaluation approach combining rule-based checks, AI-as-a-judge scoring, safety reviews, and crowdsourced human validation. The solution features three parallel evaluation paths supporting different team workflows: MLflow-based rapid experimentation, Databricks nightly jobs for deep analysis, and a multi-layer pipeline for high-volume content generation. This flexible architecture has enabled Thumbtack to deploy AI across search, project summaries, pro listings, and marketing content while maintaining trust standards critical to their marketplace model.

Industry

Tech

Overview

Thumbtack operates a marketplace platform connecting customers seeking services with local professionals. The company has been integrating generative AI across multiple customer and professional touchpoints, including helping customers articulate their needs, generating project summaries, providing pro recommendations, and creating marketing content at scale. The case study describes their comprehensive approach to evaluating AI systems in production, addressing the unique challenges of probabilistic outputs in a trust-sensitive marketplace environment.

The evaluation framework reflects a pragmatic recognition that the GenAI evaluation landscape remains rapidly evolving. Rather than committing to a single rigid framework, Thumbtack adopted what they describe as a “learning-driven, exploratory approach” that balances speed of deployment with requirements for safety, responsibility, and quality. This is particularly critical in a marketplace context where inaccurate guidance could damage customer trust or unfairly represent service professionals.

Organizational Evolution and Team Structure

The case study reveals an interesting organizational journey. Thumbtack initially took a decentralized approach with individual product teams running their own evaluations. While this worked during early AI adoption, the proliferation of AI features across the company led to duplicated effort and siloed learnings. The company subsequently consolidated evaluation responsibilities into a dedicated cross-functional Evals team comprising applied science, ML infrastructure, content design, and trust & safety expertise. Importantly, product teams retained ownership of their iteration cycles while the central team provided shared tooling, content guidelines, and human oversight infrastructure.

This organizational model represents a thoughtful balance between centralized standards and distributed execution. The dedicated Evals team provides continuity and shared infrastructure, preventing the reinvention of evaluation patterns across teams, while product teams maintain the autonomy needed for rapid experimentation within their specific domains.

Core Evaluation Methodology

Thumbtack’s evaluation approach rests on three foundational pillars: representative data, operationalized quality rubrics, and hybrid evaluation methods.

For representative data, the company curates datasets reflecting realistic customer and professional interactions, both with each other and with the product itself. These datasets are continuously expanded as new use cases emerge, ensuring evaluation coverage evolves alongside product capabilities.

Quality rubrics operationalize design, linguistic, and safety guidelines into concrete evaluation criteria. These rubrics define expectations across multiple dimensions including clarity, tone, groundedness, usefulness, and safety. The conversion of often-subjective content guidelines into structured rubrics represents a critical step in making evaluation scalable and consistent. The case study emphasizes the importance of rubric clarity, noting that “good rubrics make every part of evaluation easier.”
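
To make the idea concrete, here is a minimal sketch, not drawn from the case study, of how a guideline might be operationalized as a weighted, bounded rubric in Python; the dimension names, criteria text, and weights are illustrative assumptions.

```python
# Illustrative only: encoding a content guideline as a structured, weighted rubric.
from dataclasses import dataclass

@dataclass
class RubricDimension:
    name: str               # e.g. "clarity", "tone", "groundedness"
    criteria: str           # the guideline slice, phrased as an instruction to a judge
    weight: float           # relative importance when aggregating scores
    scale: tuple = (1, 5)   # score bounds handed to the judge

PROJECT_SUMMARY_RUBRIC = [
    RubricDimension("clarity", "The summary is easy to read and free of jargon.", 0.3),
    RubricDimension("groundedness", "Every claim is supported by the customer's input.", 0.4),
    RubricDimension("tone", "The voice is warm, direct, and consistent with brand guidelines.", 0.2),
    RubricDimension("safety", "The text contains no harmful, biased, or policy-violating content.", 0.1),
]

def weighted_score(scores: dict) -> float:
    """Aggregate per-dimension judge scores into a single weighted value."""
    return sum(d.weight * scores[d.name] for d in PROJECT_SUMMARY_RUBRIC)
```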

The hybrid evaluation methodology combines four distinct approaches. Rule-based checks validate structure, formatting, length, and schema compliance—deterministic properties that don’t require sophisticated semantic understanding. AI-as-a-judge evaluations use large language models to assess semantic properties like clarity, tone, accuracy, relevance, and groundedness, effectively scaling human judgment capabilities. Safety and trust evaluation involves both expert review from dedicated trust and safety personnel and simple input/output-based safety checks, with plans to expand to full-journey evaluation. Finally, crowdsourced human review via Amazon Mechanical Turk validates representative samples of AI-approved content during both pre-production and production phases.
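
The rule-based layer is the easiest to picture. The following sketch is our illustration rather than Thumbtack's code: the kind of deterministic length, formatting, and schema checks that can run before any LLM judge is invoked.

```python
# Hypothetical rule-based checks: deterministic validation that needs no LLM.
import json

def rule_based_checks(output: str, max_chars: int = 400,
                      required_keys: tuple = ("title", "body")) -> list:
    """Return a list of failure reasons; an empty list means the output passes."""
    failures = []
    if len(output) > max_chars:
        failures.append(f"too long: {len(output)} > {max_chars} characters")
    try:
        payload = json.loads(output)  # schema compliance check
        for key in required_keys:
            if key not in payload:
                failures.append(f"missing required field: {key}")
    except json.JSONDecodeError:
        failures.append("output is not valid JSON")
    return failures
```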

This multi-method approach reflects a mature understanding that no single evaluation technique suffices for production GenAI systems. Different quality dimensions require different evaluation strategies, and combining methods provides defense in depth against various failure modes.

Three Parallel Evaluation Paths

A distinctive aspect of Thumbtack’s framework is the provision of three parallel evaluation paths, each optimized for different workflows and use cases. This architectural decision acknowledges that different teams have varying needs regarding development speed, data complexity, and quality requirements.

Path A leverages the MLflow GenAI stack for fast-track evaluation with minimal setup. Teams use MLflow templates for evaluator registration, rubric-based scoring, trace logging, model and prompt comparison, and lightweight monitoring. The implementation wraps DeepEval’s GEval scorers with MLflow tracking capabilities. Teams define a criteria set in which each criterion represents a slice of a content guideline, optionally weighted by importance and bounded by a scoring rubric. The framework supports free-form criteria definition when needed, providing flexibility for novel use cases. MLflow captures traces, scores, judge models, and run metadata for reproducibility and downstream analysis. The same pattern extends to prompt comparison via a PromptRefiner component for iterative refinement workflows.

The code examples show a straightforward Python API where teams codify content guidelines as reusable rubrics, register evaluators with swappable judge models (enabling model comparison experiments), and execute scoring with automatic trace logging. This path prioritizes developer velocity and ease of adoption, making evaluation accessible to teams without deep MLOps infrastructure expertise.
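
The post's code isn't reproduced here, but the pattern it describes might look roughly like the sketch below, which wraps a DeepEval GEval criterion with MLflow logging; the criterion text, judge model, and metric names are assumptions made for illustration.

```python
# Sketch of the Path A pattern: a DeepEval GEval criterion tracked via MLflow.
import mlflow
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Hypothetical criterion: one slice of a content guideline, scored by an LLM judge.
tone_criterion = GEval(
    name="voice-and-tone",
    criteria="The response is friendly, plain-spoken, and avoids jargon.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o",  # swappable judge model, enabling model comparison experiments
)

def score_and_log(prompt: str, output: str) -> float:
    """Score one sample with the judge and log trace, score, and metadata to MLflow."""
    test_case = LLMTestCase(input=prompt, actual_output=output)
    tone_criterion.measure(test_case)
    with mlflow.start_run(run_name="voice-and-tone-eval"):
        mlflow.log_param("judge_model", "gpt-4o")
        mlflow.log_metric("voice_and_tone_score", tone_criterion.score)
        mlflow.log_dict(
            {"input": prompt, "output": output, "reason": tone_criterion.reason},
            "trace.json",
        )
    return tone_criterion.score
```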

Path B uses Databricks nightly jobs combined with warehouse queries and DeepEval for deeper analysis or for features built on richer datasets. The orchestration flow involves nightly Databricks jobs fetching fresh samples from the data warehouse, DeepEval orchestrating evaluators, and spreadsheet-based labeling enabling rapid calibration and human review. This path supports more complex conversational test cases, as illustrated by the code example that pulls samples from BigQuery via the Databricks scheduler, constructs multi-turn conversation test cases, applies registered scorers (organized by categories like voice-and-tone, style, design, relevancy, accuracy, and ethics), logs results to MLflow, writes failing samples to Google Sheets for human review, and sends Slack alerts to relevant teams.
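
A schematic of that nightly flow might look like the sketch below; the warehouse table, the scoring callable, and the Sheets and Slack integrations are placeholder assumptions rather than Thumbtack's actual code.

```python
# Schematic nightly evaluation job; table names and helper callables are hypothetical.
from typing import Callable
import requests

def nightly_eval_job(spark, score_sample: Callable[[list], dict],
                     append_to_sheet: Callable[[list], None],
                     slack_webhook_url: str) -> None:
    # 1. Pull fresh conversation samples from the warehouse (hypothetical table).
    samples = spark.sql(
        "SELECT conversation_id, turns FROM genai.conversations "
        "WHERE event_date = current_date() - 1 LIMIT 500"
    ).collect()

    failures = []
    for row in samples:
        # 2. Score each multi-turn conversation against the registered scorer
        #    categories (voice-and-tone, style, design, relevancy, accuracy, ethics).
        result = score_sample(row.turns)
        if not result.get("passed", False):
            failures.append({"conversation_id": row.conversation_id, **result})

    if failures:
        # 3. Route failing samples to a spreadsheet for human review.
        append_to_sheet(failures)
        # 4. Alert the owning team in Slack.
        requests.post(slack_webhook_url,
                      json={"text": f"{len(failures)} samples failed nightly evals"})
```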

This architecture demonstrates thoughtful integration of evaluation into existing data infrastructure. By running evaluations as scheduled jobs against production data samples, Thumbtack ensures continuous validation against real-world distributions. The integration with collaborative tools like Google Sheets and Slack ensures evaluation results flow naturally into human review and team awareness workflows.

Path C implements a multi-layer evaluation pipeline suited for high-volume content generation requiring both automated and human quality gates. Thumbtack uses this path specifically for AI-generated marketing channel content at scale. The pipeline combines multiple evaluation layers: LLM Judge scoring across 11 dimensions with actionable feedback; MLflow GenAI Scorers applying binary pass/fail checks from configured guidelines; programmatic validation for character limits, keyword presence, and formatting; automated retry loops where evaluation feedback drives content regeneration; and crowdsourced human review via MTurk validating a 5% sample before deployment.

The 11 LLM Judge dimensions span core quality attributes (prompt adherence, topic alignment, clarity, completeness, keyword usage), guideline-based checks (brand compliance, voice/style/structure, SEO compliance, legal/trust/safety, include/exclude rules), and conditional dimensions such as dimension alignment, which is scored only when thematic variants like convenience, cost, expertise, or reliability are enabled. Notably, MLflow GenAI Guidelines Scorers provide a second layer of LLM-based evaluation on many of the same dimensions, producing binary pass/fail judgments as an independent verification mechanism alongside the LLM Judge scores.

This dual-layer LLM evaluation represents a form of ensemble evaluation, increasing confidence in quality gates by requiring multiple independent assessments to agree. The automated retry loop with evaluation-driven regeneration is particularly interesting from an LLMOps perspective—it creates a closed-loop system where quality feedback directly improves outputs without human intervention, though human review still validates the final results.
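
A condensed sketch of such a gate-and-retry loop is shown below; the generator, judge, thresholds, and retry budget are placeholder assumptions rather than Thumbtack's actual configuration.

```python
# Sketch of an evaluation-gated generation loop with feedback-driven retries.
from typing import Callable, Optional

def generate_with_quality_gate(brief: dict,
                               generate_content: Callable[[dict, str], str],
                               llm_judge: Callable[[str], dict],
                               guidelines_pass: Callable[[str], bool],
                               max_retries: int = 3) -> Optional[str]:
    feedback = ""
    for _ in range(max_retries + 1):
        candidate = generate_content(brief, feedback)

        # Layer 1: programmatic validation (character limits, keyword presence).
        if len(candidate) > brief["max_chars"] or brief["keyword"] not in candidate:
            feedback = "Respect the character limit and include the required keyword."
            continue

        # Layer 2: LLM judge scores quality dimensions and returns actionable feedback.
        verdict = llm_judge(candidate)
        if verdict["score"] < 0.8:          # illustrative threshold
            feedback = verdict["feedback"]  # drives the next regeneration attempt
            continue

        # Layer 3: independent binary pass/fail guidelines check.
        if not guidelines_pass(candidate):
            feedback = "Revise to comply with brand, voice, and safety guidelines."
            continue

        return candidate  # passed all automated gates; sampled human review follows
    return None  # retries exhausted; escalate for manual handling
```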

For marketing experiments specifically, Thumbtack employs a contextual multi-armed bandit (CMAB) in conjunction with predictive models to test and optimize marketing interventions. AI-generated content flows through the same CMAB loop, testing variants and accelerating experiment velocity. The evaluation framework becomes a critical component of the generation and auto-refinement loop in what Thumbtack internally calls “Project Gutenberg.”
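
The case study does not detail the bandit implementation. For reference only, a minimal LinUCB-style contextual bandit for choosing among generated content variants, one common way to implement a CMAB, could look like the following sketch.

```python
# Minimal LinUCB sketch (our illustration, not Thumbtack's implementation).
import numpy as np

class LinUCBArm:
    """One content variant (arm) with a linear reward model over context features."""
    def __init__(self, dim: int, alpha: float = 1.0):
        self.A = np.eye(dim)        # accumulated feature covariance
        self.b = np.zeros(dim)      # accumulated reward-weighted features
        self.alpha = alpha          # exploration strength

    def ucb(self, x: np.ndarray) -> float:
        """Upper confidence bound on expected reward for context x."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        return float(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))

    def update(self, x: np.ndarray, reward: float) -> None:
        """Incorporate an observed reward (e.g. click or conversion) for context x."""
        self.A += np.outer(x, x)
        self.b += reward * x

def choose_variant(arms: list, context: np.ndarray) -> int:
    """Pick the content variant with the highest upper confidence bound."""
    return int(np.argmax([arm.ucb(context) for arm in arms]))
```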

Trace Logging and Continuous Monitoring

All evaluation activity is comprehensively logged, capturing inputs and outputs, metadata, evaluator scores, timing information, and version details. This trace logging serves multiple purposes: debugging when evaluations fail or produce unexpected results, reproducibility of evaluation runs for compliance or analysis, and trend analysis for detecting drift or degradation over time.
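
The exact trace schema is not published; an illustrative record covering the fields mentioned above might look like this.

```python
# Illustrative trace record; the schema is an assumption, not Thumbtack's format.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class EvalTrace:
    run_id: str
    input_text: str
    output_text: str
    evaluator_scores: dict        # e.g. {"clarity": 4.0, "groundedness": 5.0}
    judge_model: str              # which LLM acted as judge
    prompt_version: str           # version details for reproducibility
    latency_ms: float             # timing information
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())
```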

Automated checks run regularly to detect regressions, drift, tone/style changes, and safety issues. Teams receive quality alerts when metrics shift unexpectedly, enabling rapid response to potential issues. This monitoring layer transforms evaluation from a pre-deployment gate into an ongoing production concern, acknowledging that model updates, prompt changes, or data distribution shifts can degrade performance even after initial validation.

Human Review and Labeling Workflows

Despite extensive automation, Thumbtack maintains that human judgment remains essential. The framework incorporates efficient spreadsheet-based workflows enabling rapid annotation, clear reviewer collaboration, rubric refinement, early safety scans, and calibration for AI-as-judge scorers. The crowdsourced review component via MTurk validates AI-approved content at scale, providing cost-effective human oversight without requiring manual review of every output.

The emphasis on human-AI calibration is particularly noteworthy. The case study mentions expecting and enabling “calibration mode” early so that human and AI judges can define success in meaningful ways, consistent with each other and with product-specific requirements. This suggests an iterative process where human reviewers and AI evaluators are aligned on quality standards before full deployment, reducing the risk of automated evaluation drifting from actual quality requirements.
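
The post does not specify how calibration is measured. One concrete option, shown here purely as an illustration, is to compute agreement between human and AI-judge labels on a shared calibration sample, for example with Cohen's kappa.

```python
# Hypothetical calibration check: human vs. AI-judge agreement on shared samples.
from sklearn.metrics import cohen_kappa_score

def calibration_report(human_labels: list, judge_labels: list,
                       min_kappa: float = 0.6) -> dict:
    """Compare pass/fail labels from humans and the AI judge on identical items."""
    kappa = cohen_kappa_score(human_labels, judge_labels)
    return {
        "kappa": kappa,
        "calibrated": kappa >= min_kappa,  # threshold is an illustrative choice
        "disagreements": [i for i, (h, j) in enumerate(zip(human_labels, judge_labels))
                          if h != j],
    }

# Example: humans and the judge label the same six outputs as "pass"/"fail".
report = calibration_report(
    ["pass", "pass", "fail", "pass", "fail", "pass"],
    ["pass", "fail", "fail", "pass", "fail", "pass"],
)
```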

The acknowledgment that “future versions will incorporate more advanced review tooling” indicates current limitations in the human review experience, suggesting this remains an area for improvement in their evaluation infrastructure.

Use Cases Across the Product Suite

The case study describes five illustrative use cases demonstrating evaluation application across Thumbtack’s product surface:

AI-native search evaluations ensure AI-generated guidance helps users articulate their needs while maintaining relevance and a helpful tone. Project summaries condense user input, with evaluations checking groundedness, clarity, and safety. Pro listings leverage AI to articulate why particular professionals might fit customer needs, with evaluations checking fairness, accuracy, and grounding in available context. Learning pro preferences involves AI inferring high-level service or preference cues from professional information, with evaluations ensuring claims remain reasonable without overstating or inventing details. SEO and marketing content generation at scale includes evaluations for clarity, brand voice, safety, tone alignment, and accuracy, with judge feedback driving iterative regeneration until content meets quality thresholds.

These use cases span the spectrum from customer-facing search and discovery to professional representation to marketing content, demonstrating the breadth of GenAI deployment across Thumbtack’s platform. Each use case requires somewhat different evaluation criteria, justifying the flexible, multi-path evaluation architecture.

Technical Stack and Tooling

The technical implementation leverages several key components. MLflow serves as the central experiment tracking and model registry platform, with specific use of MLflow GenAI capabilities for Guidelines Scorers and evaluation workflows. DeepEval provides the underlying evaluation framework, particularly the ConversationalGEval and GEval scorers that assess conversational AI outputs. Databricks provides the data platform for scheduled evaluation jobs and integration with the data warehouse. The stack includes integration with BigQuery/Spark SQL for data sampling, Google Sheets for human review workflows, Slack for team notifications and alerts, and Amazon Mechanical Turk for crowdsourced validation.

The choice of established platforms like MLflow and Databricks suggests a preference for leveraging mature tooling rather than building entirely custom infrastructure. DeepEval represents a specialized evaluation framework that complements rather than replaces general-purpose ML infrastructure.

Critical Assessment and Balanced Perspective

While the case study presents an impressive evaluation framework, several aspects warrant critical consideration. The document reads as an engineering blog post from Thumbtack’s team, naturally emphasizing successes while providing limited detail on failures, abandoned approaches, or persistent challenges. The claim that evaluation “ensures” AI interactions strengthen trust rather than undermine it may overstate certainty—evaluation can detect problems and reduce risk, but cannot provide absolute guarantees given the probabilistic nature of LLM outputs.

The reliance on LLM-as-judge evaluation introduces potential circularity where one language model evaluates another’s outputs. While the dual-layer approach with independent evaluators provides some mitigation, both layers still depend on LLM judgment, which itself can be inconsistent, biased, or fail on edge cases. The case study doesn’t discuss inter-annotator agreement metrics, calibration metrics between human and AI judges, or specific thresholds for acceptable disagreement.

The crowdsourced review sample of 5% for high-volume content (in Path C) represents a tradeoff between cost and coverage. While statistically reasonable for detecting systematic issues, it leaves 95% of content unreviewed by humans, accepting some risk of edge-case failures reaching production. The case study doesn’t discuss how this sampling rate was determined or whether it varies based on confidence scores from automated evaluation.
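
Some back-of-the-envelope arithmetic, ours rather than the case study's, illustrates that tradeoff: a 5% random sample reliably surfaces systematic defects but usually misses rare edge cases.

```python
# Illustrative arithmetic (not from the case study): probability that a 5% random
# sample contains at least one instance of a defect occurring at rate `defect_rate`.
def detection_probability(total_items: int, defect_rate: float,
                          sample_fraction: float = 0.05) -> float:
    sampled = int(total_items * sample_fraction)
    return 1 - (1 - defect_rate) ** sampled

print(detection_probability(10_000, 0.01))    # ≈ 0.99: systematic issues are likely caught
print(detection_probability(10_000, 0.0001))  # ≈ 0.05: rare edge cases usually slip through
```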

The organizational evolution from decentralized to centralized evaluation suggests initial approaches proved insufficient, though the specifics of what failed or what costs were incurred aren’t detailed. The mention that “future versions will incorporate more advanced review tooling” and the roadmap items suggest current capabilities have gaps, though the nature of these limitations isn’t explicitly discussed.

The parallel paths architecture provides flexibility but also introduces complexity. Teams must understand which path suits their needs, and maintaining three different evaluation orchestration patterns requires ongoing engineering investment. The case study doesn’t discuss governance challenges around ensuring teams actually use evaluation appropriately or how to prevent teams from circumventing evaluation when under time pressure.

Future Roadmap and Evolution

Thumbtack’s evaluation roadmap focuses on several advancement areas as AI moves toward more agent-like workflows. Planned capabilities include a centralized LLM judge and scorer registry ensuring consistent scoring across teams; automated judge-writing workflows inspired by patterns like AutoEval for rubric-based judge generation; evaluation of full task journeys beyond single input/output pairs; tools for authoring, maintaining, and evolving rubrics; AI-assisted labeling tools accelerating human review; streamlined reviewer dashboards and triage tools; cost-aware evaluation orchestration optimizing compute usage; and more advanced red-teaming using third-party and open-source agents.

The shift toward full-journey evaluation is particularly significant. Current approaches largely evaluate individual AI outputs in isolation, but production AI features often involve multi-step workflows where errors can compound or where later steps depend on earlier ones. Journey-level evaluation would better capture real user experience but introduces significant complexity in defining success criteria and attributing failures.

The aspiration that “every AI workflow at Thumbtack ship with evaluation gates covering accuracy, trust, safety, latency, and in-product effectiveness” represents a mature vision for evaluation as a standard requirement rather than optional addition. The inclusion of latency and in-product effectiveness alongside quality dimensions acknowledges that evaluation must consider user experience holistically, not just content quality.

Key Learnings and Recommendations

Thumbtack’s documented learnings provide valuable insights for other organizations building evaluation capabilities. Starting small but intentional with MVP evaluation workflows can deliver significant value without requiring comprehensive infrastructure upfront. Pairing automated methods with human judgment remains essential, with explicit “calibration mode” helping human and AI judges align on success criteria consistent with product requirements. Crowdsourced review on samples can effectively scale human oversight without reviewing every output, though sampling strategies require careful consideration.

The emphasis on rubric clarity as foundational to evaluation effectiveness suggests investing in content design and guideline articulation pays dividends across all evaluation methods. Allowing parallel paths acknowledges different teams have different needs, though this must be balanced against infrastructure complexity. Expecting the landscape to change positions evaluation as an evolving discipline rather than a one-time implementation. Early investment in safety review and red-teaming, even if lightweight initially, helps build these capabilities before they become critical. Finally, openly sharing learnings across teams accelerates collective progress, though the case study doesn’t discuss specific knowledge-sharing mechanisms beyond the centralized Evals team.

The characterization of evaluation as “a living discipline” rather than a fixed framework reflects appropriate humility about the challenges of evaluating generative AI systems. As model capabilities grow, evaluation must grow alongside them, requiring ongoing investment and adaptation rather than set-it-and-forget-it infrastructure.

Production LLMOps Maturity

This case study demonstrates relatively advanced LLMOps maturity. The multi-path architecture supporting different workflow needs, comprehensive trace logging and monitoring, integration with existing data and ML infrastructure, dedicated cross-functional team ownership, human-in-the-loop validation at multiple stages, closed-loop refinement with automated retry mechanisms, and systematic approach to safety and trust evaluation all indicate sophisticated production deployment practices.

However, areas for continued maturation include more advanced journey-level evaluation beyond single outputs, enhanced tooling for human reviewers and annotators, automated mechanisms for rubric generation and maintenance, cost optimization for evaluation compute, and more sophisticated red-teaming and adversarial testing. The roadmap items suggest Thumbtack recognizes these gaps and is actively working to address them.

Overall, Thumbtack’s evaluation framework represents a thoughtful, pragmatic approach to a genuinely difficult problem. The willingness to run multiple parallel systems, invest in dedicated team capacity, and continuously evolve practices as the landscape changes demonstrates organizational commitment to responsible AI deployment in a production marketplace environment where trust is fundamental to business success.
