Company
Thumbtack
Title
Fine-tuned LLM for Message Content Moderation and Trust & Safety
Industry
Tech
Year
2024
Summary (short)
Thumbtack implemented a fine-tuned LLM solution to enhance its message review system for detecting policy violations in customer-professional communications. After experimenting with prompt engineering and finding it insufficient (AUC 0.56), the team fine-tuned an LLM and achieved an AUC of 0.93. The production system uses a cost-effective two-tier approach: a CNN model pre-filters messages, and only the roughly 20% judged potentially suspicious are processed by the LLM. Using LangChain for deployment, the system has processed tens of millions of messages, improving precision by 3.7x and recall by 1.5x compared to the previous system.
## Overview

Thumbtack is a technology company operating a home services marketplace that connects customers with local service professionals such as plumbers, electricians, handymen, and cleaners. Because the platform carries a significant volume of messages between customers and service providers, maintaining trust and safety is critical. The company needed to review messages exchanged on the platform to detect and act on behavior that violates its policies, including abusive language, job seeking (requests for employment rather than services), and partnership solicitations. This case study documents how Thumbtack productionized a fine-tuned LLM to dramatically improve its message review capabilities, achieving a 3.7x improvement in precision while processing tens of millions of messages.

## The Problem

Thumbtack's existing message review pipeline consisted of two primary components: a rule-based engine for detecting obvious policy violations through flagged words or phrases, and a machine learning model (specifically a convolutional neural network, or CNN) for identifying more complex issues through contextual analysis. While the rule-based system could catch straightforward violations such as explicit offensive language, the CNN struggled with more nuanced content, including sarcasm, implied threats, and subtle policy violations that required a deeper understanding of context and intent.

The challenge was particularly acute because most communications on the platform are legitimate; only a very small portion violate policies. This class imbalance, combined with the subtlety of many violations, made it difficult for traditional ML approaches to achieve the accuracy needed for production use. False positives would frustrate legitimate users, while false negatives would expose service professionals to inappropriate content.

## Experimentation and Model Development

The team took a methodical approach to integrating LLM technology, conducting structured experiments before committing to a production architecture.

### Prompt Engineering Approach

The first experiment tested whether off-the-shelf LLMs could solve the problem through prompt engineering alone. The team validated two popular LLMs against a dataset of 1,000 sample messages (90% legitimate, 10% suspicious). They crafted prompts that positioned the LLM as a professional reviewing job requests against specific criteria and asked it to determine message legitimacy.

Despite testing multiple prompt patterns, the results were disappointing. The best prompts achieved an Area Under the Curve (AUC) of only 0.56, barely better than random chance and far below what would be acceptable for production use. This is an important finding that should give pause to organizations assuming that prompt engineering alone can solve specialized classification tasks: domain-specific problems often require domain-specific training, not just clever prompting.

### Fine-tuning Approach

Given the failure of prompt engineering, the team pivoted to fine-tuning, and this approach proved far more successful. Even with just a few thousand training samples, the model showed considerable improvement. When the dataset was expanded to tens of thousands of labeled samples, the AUC jumped to 0.93, a production-ready level of performance.

This stark contrast between prompt engineering (0.56 AUC) and fine-tuning (0.93 AUC) underscores a key lesson: for specialized classification tasks with nuanced domain requirements, fine-tuning remains essential. Off-the-shelf models, even sophisticated ones, may lack the specific pattern recognition needed for domain-specific violation detection.
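The article does not disclose which base model or training stack was used, so any concrete code is necessarily an assumption. As a rough sketch of what supervised fine-tuning for this kind of binary message classification can look like, here is a minimal example using a Hugging Face sequence-classification setup, with a stand-in base model, placeholder labeled messages, and illustrative hyperparameters:

```python
# Minimal fine-tuning sketch: train a binary "policy violation" classifier
# and score it with AUC. The base model, data, and hyperparameters are
# illustrative assumptions, not details disclosed by Thumbtack.
import numpy as np
from datasets import Dataset
from sklearn.metrics import roc_auc_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

BASE_MODEL = "distilbert-base-uncased"  # stand-in; the actual model is not named

# Placeholder data: in the real system this would be tens of thousands of
# reviewed messages (label 0 = legitimate, 1 = policy violation).
train_messages = [
    {"text": "Can you repair a leaking kitchen faucet this week?", "label": 0},
    {"text": "Forget the project, are you hiring? I need a full-time job.", "label": 1},
]
eval_messages = [
    {"text": "What would you charge to repaint a two-bedroom apartment?", "label": 0},
    {"text": "Let's partner up: send me your customer list and we split profits.", "label": 1},
]

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(BASE_MODEL, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

train_ds = Dataset.from_list(train_messages).map(tokenize, batched=True)
eval_ds = Dataset.from_list(eval_messages).map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Convert logits to P(violation) and score with AUC, the metric the
    # post reports (0.56 for prompting vs. 0.93 after fine-tuning).
    probs = np.exp(logits)[:, 1] / np.exp(logits).sum(axis=1)
    return {"auc": roc_auc_score(labels, probs)}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="message-review-ft",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())  # eval loss and AUC on the held-out messages
```

The held-out AUC computed here mirrors the metric Thumbtack used to compare the two approaches; everything else in the snippet is a sketch under the stated assumptions.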
## Production Architecture and Deployment Considerations

The transition from successful experiments to production deployment required addressing two critical challenges: infrastructure integration and cost management.

### Centralized LLM Service with LangChain

To manage LLM usage efficiently and maintain consistency across teams, Thumbtack's ML infrastructure team adopted the LangChain framework to create a centralized service bridging internal systems and the LLM. This architectural decision had several benefits: it enabled easy integration with various LLMs, provided a single point of management for LLM-related services, and positioned the organization to scale LLM adoption across multiple teams in the future.

The choice of LangChain as an integration framework suggests a pragmatic approach to LLMOps: using established open-source tooling rather than building custom integration layers. This accelerated deployment and reduced the engineering burden on the team.

### Cost Optimization Through Cascading Architecture

One of the most interesting aspects of this case study is how Thumbtack addressed the cost implications of LLM inference at scale. LLMs require GPU resources and are significantly more expensive to run than traditional ML models. Given that Thumbtack processes tens of millions of messages, running every message through the LLM would be prohibitively expensive, especially since the vast majority of messages are legitimate and do not require sophisticated analysis.

The solution was elegantly practical: rather than discarding the legacy CNN, the team repurposed it as a pre-filter. By adjusting the CNN's decision threshold, they configured it to identify and pass through most obviously legitimate messages without LLM review. Only the more ambiguous or potentially suspicious messages (approximately 20% of total volume) are forwarded to the LLM for detailed analysis. This cascading architecture represents a common and effective pattern in LLMOps: using cheaper, faster models to handle the bulk of straightforward cases while reserving expensive LLM inference for the challenging edge cases where the added capability is most valuable. The approach achieved a dramatic cost reduction without compromising review quality.
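The post describes this routing only at a high level, so the following is an illustrative sketch rather than Thumbtack's implementation: a cheap tier-1 score gates which messages are escalated to the fine-tuned LLM, with stand-in functions in place of the real CNN and the LangChain-backed LLM service, and an assumed threshold value:

```python
# Illustrative two-tier review flow (not Thumbtack's actual code): the legacy
# CNN cheaply scores every message, and only messages above an escalation
# threshold reach the fine-tuned LLM. Threshold, names, and stand-in
# components below are assumptions.
from typing import Callable

# Chosen so that roughly 20% of traffic is escalated to the LLM tier.
CNN_ESCALATION_THRESHOLD = 0.15


def review_message(
    text: str,
    cnn_score_fn: Callable[[str], float],   # tier 1: cheap P(suspicious) score
    llm_classify_fn: Callable[[str], str],  # tier 2: fine-tuned LLM verdict
) -> dict:
    """Return a verdict plus which tier produced it."""
    score = cnn_score_fn(text)
    if score < CNN_ESCALATION_THRESHOLD:
        # Clearly legitimate: skip the expensive LLM call entirely.
        return {"verdict": "legitimate", "reviewed_by": "cnn", "cnn_score": score}
    # Ambiguous or potentially suspicious: escalate to the fine-tuned LLM.
    return {"verdict": llm_classify_fn(text), "reviewed_by": "llm", "cnn_score": score}


if __name__ == "__main__":
    # Stand-ins: a keyword heuristic in place of the real CNN, and a canned
    # response in place of the LangChain-served fine-tuned LLM.
    fake_cnn = lambda text: 0.9 if "hiring" in text.lower() else 0.05
    fake_llm = lambda text: "suspicious"

    print(review_message("Can you repaint my fence next week?", fake_cnn, fake_llm))
    print(review_message("Are you hiring? I'm looking for a full-time job.", fake_cnn, fake_llm))
```

In production, the `llm_classify_fn` hook would presumably be backed by the centralized LangChain service described above, and the threshold would be tuned on labeled traffic so that only around 20% of messages reach the LLM.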
## Production Results and Performance

Since deployment, the system has processed tens of millions of messages. The quantitative improvements are substantial:

- Precision improved 3.7x compared to the previous system, meaning significantly fewer false positives and more accurate identification of genuinely suspicious messages.
- Recall improved 1.5x, enabling detection of more policy violations that would previously have slipped through.

These metrics represent meaningful improvements in both user experience (fewer legitimate messages incorrectly flagged) and platform safety (more violations detected).

## Key LLMOps Lessons

Several aspects of this case study offer valuable insights for organizations considering similar LLM deployments:

- **Fine-tuning vs. Prompt Engineering**: For specialized classification tasks, prompt engineering alone may be insufficient. The dramatic performance gap (0.56 vs. 0.93 AUC) demonstrates that domain-specific fine-tuning can be essential for production-quality results.
- **Hybrid Architectures**: Legacy models don't necessarily become obsolete when LLMs are introduced. Repurposing existing models as pre-filters can dramatically reduce costs while maintaining quality; this cascading approach is a practical pattern for cost-effective LLM deployment at scale.
- **Centralized LLM Infrastructure**: Building a centralized service for LLM integration (using frameworks like LangChain) enables consistent management, scaling across teams, and faster future deployments.
- **Cost-Conscious Deployment**: LLM inference costs can be significant at scale. Thoughtful architecture that routes only the necessary traffic to the LLM is essential for sustainable production deployments.
- **Rigorous Experimentation**: The structured approach of testing prompt engineering first, then pivoting to fine-tuning when results were unsatisfactory, demonstrates good experimental discipline before committing to production investments.

## Limitations and Considerations

While the case study presents strong results, some caveats are worth noting. The article does not specify which LLM was fine-tuned, making it difficult to assess reproducibility or compare approaches. Details about the fine-tuning process itself (data preparation, training infrastructure, iteration cycles) are limited. Additionally, while the precision and recall improvements are impressive, the absolute values are not disclosed, making it difficult to assess the system's overall performance in context. The long-term maintenance requirements for the fine-tuned model, including retraining frequency and data drift, are also not addressed.

Despite these gaps, the case study provides a credible account of a practical LLMOps implementation that balances performance, cost, and operational considerations in a real production environment.
