Musubi: AI-Powered Trust and Safety Toolkit with Custom Model Training and Adaptive Moderation

Overview

Musubi is an AI-native company, two to three years old, that provides a trust and safety toolkit for platforms with user-generated content, including social networks, dating apps, marketplaces, and AI companies offering inference endpoints. The founding team includes former executives from Grindr who experienced trust and safety challenges firsthand. The company’s core mission is to help trust and safety teams move from reactive firefighting to proactive platform health management by automating content moderation decisions while keeping human oversight for nuanced cases.

The fundamental problem Musubi addresses is the economic and psychological asymmetry in content moderation: it has become cheaper to generate harmful content or spam using AI than it is to review it, and human moderators suffer significant mental health impacts from reviewing thousands of pieces of potentially traumatic content daily. Traditional content moderation relies heavily on human moderators who face monotonous or traumatizing work, creating both a cost center for businesses and a humanitarian concern.

Technical Architecture and Model Strategy

Musubi combines traditional machine learning with large language models, selecting the tool for each task based on accuracy, latency, and cost requirements. Rather than a one-size-fits-all solution, the company offers what it describes as a menu of models optimized for different purposes.

For tabular data and user profile features, Musubi relies on traditional ensemble machine learning models, which have proven highly effective at classification tasks. These models excel at processing structured data with multiple features or columns describing user profiles and behavior patterns. The advantages of traditional ML in this context are performance, cost-efficiency, and the ability to handle high-throughput scenarios where platforms need to process millions of pieces of content.

LLMs are strategically deployed in areas requiring reasoning, nuance, and contextual understanding. One key application is providing explanatory reasoning for moderation decisions. When a piece of content is flagged, moderators need to categorize the specific reason, which can involve fine-grained subcategories. For example, rather than simply flagging content as sexually explicit, the system needs to identify specific subcategories of policy violations. LLMs excel at this task by reviewing content and attributing specific policy violations with appropriate reasoning.

Another critical use of LLMs is in feature engineering. The team uses LLMs to extract specific signals from images or message histories that are tailored to each customer’s unique policies. While traditional Python libraries could perform some of these extractions, LLMs offer the flexibility to be highly specific to customer policies and adapt to nuanced requirements that vary across platforms.

Custom Model Training and Deployment Pipeline

A distinguishing feature of Musubi’s approach is custom model training for each customer rather than deploying generic off-the-shelf models. This strategy emerged from the recognition that every platform has unique policies, content types, and threat landscapes. A dating app’s moderation needs differ fundamentally from a news site with comments or a marketplace, not just in strictness levels but in the specific types of violations and edge cases that matter.

The model training pipeline begins with customers providing a snapshot of historical activity on their platform, including examples of past moderation decisions. From this data, Musubi trains machine learning models and fine-tunes large language models to match the customer’s specific decision-making patterns and policy requirements. The team has built this as a repeatable platform using reusable building blocks rather than bespoke Jupyter notebook analyses for each customer.

The pipeline architecture includes components for data ingestion at regular cadence, sampling, data cleaning, setting training parameters, model stacking, evaluation, and deployment. This modular approach allows the small engineering team to scale across multiple customers while still delivering custom performance. While most LLM work involves fine-tuning existing models, the team has also explored full end-to-end training of smaller models for specific tasks like creating instruction-aware embeddings that can focus on whether content falls into specific categories.
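One way to picture the reusable building blocks described above is as a list of small stages applied in order. The following is a minimal sketch under stated assumptions: the `Stage` type, the `clean` and `sample_every` stages, and the record fields are hypothetical illustrations, not Musubi's actual components.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Stage = Callable[[List[Record]], List[Record]]


def clean(records: List[Record]) -> List[Record]:
    """Drop records that lack text or a past moderation label."""
    return [r for r in records if r.get("text") and r.get("label") is not None]


def sample_every(n: int) -> Stage:
    """Build a stage that keeps every n-th record (deterministic sampling)."""
    def stage(records: List[Record]) -> List[Record]:
        return records[::n]
    return stage


def run_pipeline(records: List[Record], stages: List[Stage]) -> List[Record]:
    """Apply each stage in order; a customer is described by its stage list."""
    for stage in stages:
        records = stage(records)
    return records


# Hypothetical snapshot of historical moderation decisions.
snapshot = [
    {"text": "buy followers now", "label": "spam"},
    {"text": "", "label": "ok"},
    {"text": "nice photo", "label": "ok"},
    {"text": "hello", "label": None},
    {"text": "free crypto", "label": "spam"},
]
prepared = run_pipeline(snapshot, [clean, sample_every(2)])
```

Because each stage has the same signature, a new customer can reuse existing stages in a different order or with different parameters instead of starting from a one-off notebook.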

The deployment strategy emphasizes starting with high-impact but constrained problem areas rather than unleashing AI across an entire platform at once. Initial deployments might focus on clear-cut areas like spam detection or illegal activity before gradually expanding scope as confidence and performance are validated in production.

Evaluation and Quality Assurance

Evaluation presents unique challenges in content moderation because there is no objective ground truth for many decisions. Musubi has developed a sophisticated multi-layered evaluation approach that acknowledges the inherent subjectivity while still maintaining quality standards.

The primary evaluation method involves comparing AI decisions against human moderator decisions using holdout sets. A percentage of content is withheld from the AI’s automated action and routed to human moderators, allowing direct comparison of how the AI and humans would have handled the same content. Interestingly, this comparison revealed that the AI sometimes outperforms human moderators by catching global trends or patterns that individual moderators might miss, transforming evaluation from a simple report card into a learning tool for moderation teams.
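The holdout comparison can be sketched as two small pieces: a rule for deciding which items are withheld from automated action, and an agreement metric over the withheld items. This is an illustrative sketch only; the hash-based assignment and the function names `in_holdout` and `agreement_rate` are assumptions, not a description of Musubi's implementation.

```python
import hashlib
from typing import List, Tuple


def in_holdout(content_id: str, holdout_pct: float) -> bool:
    """Deterministically assign a content item to the holdout group.

    Hashing the ID means the same item always lands in the same group.
    """
    digest = hashlib.sha256(content_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < holdout_pct


def agreement_rate(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of holdout items where the AI and human decisions match."""
    if not pairs:
        return 0.0
    return sum(ai == human for ai, human in pairs) / len(pairs)


# Hypothetical (ai_decision, human_decision) pairs from a holdout set.
holdout_pairs = [
    ("remove", "remove"),
    ("allow", "allow"),
    ("remove", "allow"),
    ("allow", "allow"),
]
rate = agreement_rate(holdout_pairs)
```

The disagreeing pairs are the interesting ones: per the account above, some of them turn out to be cases where the AI caught a pattern the individual moderator missed.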

For LLM-based content moderation, customers create golden datasets consisting of labeled examples showing how content should be moderated according to their policies. These datasets serve as benchmarks against which policy configurations are tested. However, the team acknowledges that they are often more accurately described as silver datasets because they frequently contain inconsistencies or gaps.

To address golden dataset quality, Musubi has built tooling that evaluates the evaluation datasets themselves, checking whether they are representative of production data and aligned with stated policies. Metrics help identify when large portions of production content types are missing from the test set, which would create blind spots in policy testing.
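A representativeness check of this kind could look roughly like the sketch below, which reports content types that are common in production but missing from the golden set. The function name `coverage_gaps`, the `min_share` threshold, and the content type labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def coverage_gaps(
    production: List[str], golden: List[str], min_share: float = 0.05
) -> Dict[str, float]:
    """Return content types that make up at least `min_share` of production
    traffic but do not appear in the golden dataset, with their share."""
    total = len(production)
    if total == 0:
        return {}
    golden_types = set(golden)
    return {
        content_type: count / total
        for content_type, count in Counter(production).items()
        if count / total >= min_share and content_type not in golden_types
    }


# Hypothetical content type labels for production items and golden examples.
production_types = ["chat"] * 6 + ["profile_photo"] * 3 + ["bio"] * 1
golden_types = ["chat", "bio", "chat"]
gaps = coverage_gaps(production_types, golden_types)
```

Here 30% of production content is profile photos, yet the golden set contains none, so any policy tested only against this set has a blind spot for that content type.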

An innovative approach Musubi employs is using LLMs as judges to referee between human and AI decisions. When there is disagreement, a reasoning model reviews both the human moderator’s decision and the AI’s decision alongside the customer’s policy document, serving as a tiebreaker. This helps identify cases where either the AI or the human may have been correct despite disagreeing.
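The tiebreaker flow can be sketched as follows. The judge is passed in as a plain callable so the example stays self-contained; `keyword_judge` is a placeholder standing in for a call to a reasoning model and is not a real judge. All names here are assumptions made for illustration.

```python
from typing import Callable, Dict, Optional

# (policy, content, human_decision, ai_decision) -> judged decision
Judge = Callable[[str, str, str, str], str]


def resolve(
    policy: str, content: str, human_decision: str, ai_decision: str, judge: Judge
) -> Dict[str, Optional[str]]:
    """Consult the judge only when the human and the AI disagree."""
    if human_decision == ai_decision:
        return {"final": human_decision, "sided_with": None}
    verdict = judge(policy, content, human_decision, ai_decision)
    if verdict == human_decision:
        return {"final": verdict, "sided_with": "human"}
    if verdict == ai_decision:
        return {"final": verdict, "sided_with": "ai"}
    # The judge returned something unexpected: leave the case unresolved.
    return {"final": None, "sided_with": None}


def keyword_judge(policy: str, content: str, human: str, ai: str) -> str:
    """Placeholder for a reasoning-model call (hypothetical, for the demo)."""
    return "remove" if "wire money" in content.lower() else "allow"


result = resolve(
    policy="No payment scams.",
    content="Please WIRE MONEY today",
    human_decision="allow",
    ai_decision="remove",
    judge=keyword_judge,
)
```

Recording which side the judge agreed with makes it possible to count how often disagreements were resolved in favor of the human versus the AI.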

The evaluation strategy also includes automated policy optimization through an agentic workflow. This system performs error analysis on golden dataset results, suggests changes to correct mislabeled examples in the dataset itself, and recommends policy refinements to improve accuracy. The system operates with human oversight, presenting recommendations that users can accept or reject rather than automatically implementing changes.

Production Deployment and Monitoring

Musubi has developed sophisticated production monitoring tailored to the unique challenges of adaptive trust and safety systems. The company provides visibility into platform activity, moderation decisions, and model performance through multiple channels.

One approach involves visualizing platform network activity and moderation decisions across visual maps, helping teams identify emerging trends. To complement this detailed investigative tool, the system includes alerting for new emerging trends, scam attacks, or shifts in the threat landscape that might require attention. This pairing of in-depth investigation tools with outcome-based alerts exemplifies the balance between keeping humans informed and not overwhelming them with data.

For traditional machine learning models, the team has robust observability including holdout set analysis, agreement metrics with human moderators, and real-time awareness of model performance in production. They are working to extend similar monitoring capabilities to LLM-based models in production, including detecting when data distributions shift or when content clustering patterns change in ways that might indicate the model is no longer performing optimally.

The system learns continuously from human moderator decisions through a feedback loop. Rather than requiring massive upfront training datasets, the models can improve with relatively small daily feedback volumes. When moderators review just 10 to 100 decisions per day that the AI held out for comparison, the model can become more confident and expand to make thousands of additional automated decisions based on that feedback. Some customers have reorganized their moderation workflows to prioritize these holdout decisions specifically to accelerate model improvement.

Latency, Cost, and Performance Optimization

One of the significant technical challenges the team has tackled is meeting diverse latency requirements across different customer use cases. While some applications can tolerate response times of a few seconds, others require sub-second inference for real-time moderation scenarios.

For latency-sensitive clients, the team has conducted extensive benchmarking of model architectures and developed techniques including short-circuiting approaches that trade small amounts of accuracy for significant throughput gains. For customers processing 10 to 100 million content items monthly, GPU budgets can quickly reach hundreds of thousands of dollars annually if not optimized carefully.
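The source does not specify how the short-circuiting works. One common interpretation is a model cascade with early exit: a cheap model scores every item, and the expensive model runs only when the cheap score is inconclusive. The sketch below illustrates that interpretation; the thresholds and both scoring functions are hypothetical stand-ins.

```python
from typing import Callable, Tuple

Scorer = Callable[[str], float]  # returns a violation score in [0, 1]


def cascade(
    text: str, cheap: Scorer, expensive: Scorer, low: float = 0.1, high: float = 0.9
) -> Tuple[float, str]:
    """Return (score, model_used), skipping the expensive model when the
    cheap model is already confident in either direction."""
    score = cheap(text)
    if score <= low or score >= high:
        return score, "cheap"  # short-circuit: no expensive call made
    return expensive(text), "expensive"


def cheap_score(text: str) -> float:
    """Hypothetical fast heuristic standing in for a small model."""
    if "http" in text:
        return 0.95
    if len(text) < 10:
        return 0.05
    return 0.5


def expensive_score(text: str) -> float:
    """Hypothetical stand-in for a slower, more accurate model."""
    return 0.7


confident = cascade("visit http://spam.example", cheap_score, expensive_score)
uncertain = cascade("this message is ambiguous", cheap_score, expensive_score)
```

Widening the gap between `low` and `high` sends more traffic to the expensive model, which is where the accuracy-for-throughput tradeoff is tuned.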

The approach to cost management involves presenting customers with realistic estimates based on their volume and helping them understand the accuracy-latency-cost tradeoffs. Different parts of a platform may have different requirements, with some areas justifying higher costs for maximum accuracy while others benefit from faster, cheaper models that still perform well enough. This flexibility allows customers to optimize their overall spending while maintaining quality where it matters most.

Importantly, even expensive LLM-based solutions often prove more cost-effective than human moderation alternatives. Depending on content type, human moderation can cost 15 to 50 cents per item reviewed, making automated solutions economically attractive even when LLM inference costs are significant.
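To make the comparison concrete, the arithmetic below combines the per-item human cost range above with the monthly volumes mentioned earlier. It assumes, purely for illustration, that every item would be reviewed by a human, and it uses a hypothetical $300,000 annual GPU budget as one point within "hundreds of thousands of dollars".

```python
def monthly_human_cost_usd(items_per_month: int, cents_per_item: int) -> float:
    """Cost of human review if every item is reviewed (an assumption)."""
    return items_per_month * cents_per_item / 100


def automated_cents_per_item(annual_budget_usd: int, items_per_month: int) -> float:
    """Per-item cost implied by an annual infrastructure budget."""
    return annual_budget_usd * 100 / (items_per_month * 12)


# Low end: 10 million items at 15 cents each.
low_human = monthly_human_cost_usd(10_000_000, 15)
# High end: 100 million items at 50 cents each.
high_human = monthly_human_cost_usd(100_000_000, 50)
# Hypothetical $300,000 per year of GPU spend at 10 million items per month.
gpu_cents = automated_cents_per_item(300_000, 10_000_000)
```

Under these assumptions, full human review would cost between $1.5 million and $50 million per month, while the hypothetical GPU budget works out to 0.25 cents per item.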

Customer Onboarding and Tooling Philosophy

The onboarding process begins with reverse demos where customers demonstrate their existing moderation systems and explain their specific challenges. Common pain points include high false positive rates from generic blackbox scoring models, inability to adapt to evolving attack patterns, and moderation queues that overwhelm human teams.

Musubi’s product philosophy centers on providing customers the same tools the internal team uses rather than building separate simplified interfaces. This empowers trust and safety teams to become more self-sufficient and iterate on their own policies and configurations. The team provides initial onboarding to introduce concepts and best practices, but customers can then successfully iterate independently.

This toolkit approach addresses the reality that every customer needs something different. Rather than attempting to build perfectly flexible systems from the beginning, the team follows a pattern of building solutions for specific customer needs, then generalizing those solutions when they identify patterns applicable to other customers. This pragmatic approach to product development allows the small team to balance customization with scalability.

The team is actively developing more sophisticated orchestration and workflow capabilities that allow non-technical trust and safety team members to configure complex moderation flows without relying on engineering resources. This might include analyzing content first, then escalating for holistic profile review if violations are found, or routing content to different policy rulesets based on context like forum type or user demographics.
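A moderation flow of the kind described above, with context-based routing followed by conditional escalation, might be sketched like this. The ruleset functions, context names, and the `profile_review` step label are all hypothetical.

```python
from typing import Callable, Dict, List

Ruleset = Callable[[str], List[str]]  # returns the names of violated policies


def moderate(
    content: str, context: str, rulesets: Dict[str, Ruleset], default: str
) -> Dict[str, object]:
    """Pick a ruleset by context, then escalate only if violations are found."""
    ruleset = rulesets.get(context, rulesets[default])
    violations = ruleset(content)
    next_step = "profile_review" if violations else "allow"
    return {"violations": violations, "next_step": next_step}


def marketplace_rules(content: str) -> List[str]:
    """Hypothetical ruleset for marketplace listings."""
    return ["spam"] if "buy now" in content.lower() else []


def general_rules(content: str) -> List[str]:
    """Hypothetical catch-all ruleset with no rules configured."""
    return []


rulesets = {"marketplace": marketplace_rules, "general": general_rules}
flagged = moderate("BUY NOW, limited offer", "marketplace", rulesets, default="general")
passed = moderate("BUY NOW, limited offer", "forum", rulesets, default="general")
```

The same content gets different outcomes depending on context, which is the behavior a non-technical team member would be configuring rather than coding.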

Advanced Capabilities and Future Direction

An emerging area of focus is agentic workflows that reduce the complexity customers face while interacting with sophisticated data science tools. Rather than requiring trust and safety teams to understand concepts like confusion matrices or embedding spaces, the vision is to allow natural language interaction where users describe their goals and the system handles the technical implementation details.

For example, policy optimization can be framed as an agentic loop where the system iteratively tests policies, identifies errors, suggests improvements, and validates changes against golden datasets, all while keeping humans in the loop for approval. The challenge is ensuring these agents don’t lead customers astray, such as by overfitting policies to test datasets. The team is experimenting with multi-pass approaches where one agent makes recommendations and another checks those recommendations for potential issues like overfitting.
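The source does not describe how the checking agent detects overfitting. One simple guard, shown here as an assumption rather than Musubi's method, is to accept a recommended policy change only if it improves accuracy on the examples used for tuning without degrading accuracy on examples held back from the optimizer. The function name and tolerance are hypothetical.

```python
def accept_change(
    tune_before: float,
    tune_after: float,
    heldback_before: float,
    heldback_after: float,
    tolerance: float = 0.01,
) -> bool:
    """Accept a policy change only if it helps on the tuning examples and
    does not hurt the held-back examples by more than `tolerance`."""
    improves = tune_after > tune_before
    generalizes = heldback_after >= heldback_before - tolerance
    return improves and generalizes


# Improves on both splits: a plausible genuine improvement.
genuine = accept_change(0.80, 0.90, 0.78, 0.79)
# Large gain on tuning examples but a drop on held-back ones: likely overfit.
overfit = accept_change(0.80, 0.95, 0.78, 0.70)
```

A check like this would still leave the final decision with a human, consistent with the approval step described above.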

The company is also investing in bias analysis tooling that helps ensure moderation is applied fairly across different demographic groups. This involves techniques like modifying golden dataset examples to change language, race references, or other demographic markers, then checking whether the model maintains consistent decisions across these variations.
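The perturbation technique can be sketched as generating variants of an example with a demographic marker swapped, then checking that the decision does not change. This is a minimal sketch using naive string replacement; the swap table and both classifier functions are hypothetical stand-ins for a real model.

```python
from typing import Callable, Dict, List

Classifier = Callable[[str], str]


def counterfactual_variants(text: str, swaps: Dict[str, List[str]]) -> List[str]:
    """Create variants of `text` with each marker replaced by its alternatives.

    Uses plain substring replacement, which is only adequate for a demo.
    """
    variants: List[str] = []
    for marker, alternatives in swaps.items():
        if marker in text:
            variants.extend(text.replace(marker, alt) for alt in alternatives)
    return variants


def is_consistent(text: str, classify: Classifier, swaps: Dict[str, List[str]]) -> bool:
    """True if every variant receives the same decision as the original."""
    baseline = classify(text)
    return all(classify(v) == baseline for v in counterfactual_variants(text, swaps))


def fair_classifier(text: str) -> str:
    """Hypothetical model that looks only at the insult, not the marker."""
    return "remove" if "idiot" in text else "allow"


def biased_classifier(text: str) -> str:
    """Hypothetical model that wrongly keys on a nationality term."""
    return "remove" if "Korean" in text else "allow"


swaps = {"Spanish": ["French", "Korean"]}
example = "My Spanish neighbor is loud"
fair_ok = is_consistent(example, fair_classifier, swaps)
biased_ok = is_consistent(example, biased_classifier, swaps)
```

Running such a check over a whole golden dataset would surface the examples where a decision flips when only the demographic marker changes.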

Looking forward, the team is excited about reducing customer burden through proactive alerting rather than requiring constant manual monitoring, improving observability for LLM-based production systems to match the maturity of their traditional ML monitoring, and enabling more flexible workflow orchestration that adapts to diverse customer needs without requiring custom engineering work.

The overall vision is moving trust and safety teams from reactive mode, where they struggle to keep up with moderation queues, to proactive mode where they can focus on strategic work like refining policies, identifying emerging threats, and fostering positive community dynamics rather than simply triaging harmful content. By automating clear-cut decisions and highlighting areas needing human judgment, the platform aims to transform trust and safety from a cost center into a strategic function that genuinely improves platform health and user experience.
