ZenML

Building a Delicate Text Detection System for Content Safety

Grammarly 2024

Grammarly developed a novel approach to detect delicate text content that goes beyond traditional toxicity detection, addressing a gap in content safety. They created DeTexD, a dataset with 40,000 training samples and a 1,023-paragraph benchmark, and developed a RoBERTa-based classification model that achieved a 79.3% F1 score, significantly outperforming existing toxic text detection methods at identifying potentially triggering or emotionally charged content.

Industry

Tech

Overview

Grammarly, the widely used AI-powered writing assistant, launched a research initiative to address a gap in content safety that existing toxicity detection systems fail to capture. Their work on “delicate text” detection contributes to the broader field of AI safety and is particularly relevant as LLMs become more prevalent in production environments where they may encounter or generate sensitive content.

The core insight driving this research is that harmful text is not limited to explicitly toxic or offensive content. Delicate text, as defined by Grammarly’s researchers, encompasses any text that is emotionally charged or potentially triggering, where engaging with it has the potential to result in harm. This includes content about self-harm, mental health issues, and controversial political topics, as well as discussions of race, gender, religion, and socioeconomic status. Such content may not contain profanity or explicit hate speech but still presents risks for users or AI agents exposed to it.

The Problem with Existing Approaches

The research highlights a significant limitation in current content moderation and safety systems. Traditional toxicity detection methods, including widely used commercial APIs like Google’s Perspective API and OpenAI’s moderation and content filter APIs, are designed to detect explicitly offensive, hateful, or abusive language. However, they systematically underperform at identifying delicate content that falls outside these narrower definitions.

The Grammarly team evaluated multiple existing approaches against their new benchmark, including HateBERT fine-tuned on various datasets (AbusEval, HatEval, OffensEval), Google’s Perspective API, and OpenAI’s content moderation tools. The results were revealing: even the best-performing existing methods achieved F1 scores well below the 79.3% achieved by Grammarly’s purpose-built baseline model. Google’s Perspective API achieved only 42.3% F1, while OpenAI’s moderation API reached just 31.1% F1.

This performance gap has direct implications for LLM operations. Systems that rely solely on toxicity detection may allow delicate content to pass through undetected, potentially exposing users to triggering content or allowing AI systems to generate responses about sensitive topics without appropriate safeguards.

Dataset Construction Methodology

The creation of the DeTexD dataset followed a rigorous methodology that offers valuable lessons for teams building specialized datasets for content safety applications. The data sourcing employed two complementary techniques.

The annotation process addressed the inherent subjectivity of determining what constitutes delicate content. Expert linguists with prior experience in similar annotation tasks performed a two-stage annotation process: first identifying whether texts were delicate or not, then rating the risk level of delicate texts. Final labels were determined by majority vote among annotators. The team provided detailed examples and instructions to annotators to improve consistency.
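The two-stage labeling with majority vote described above can be sketched roughly as follows. This is an illustration only: the function name, the boolean and integer encodings, and the choice to take the most common risk rating are assumptions, not details taken from the paper.

```python
from collections import Counter
from typing import Dict, Optional, Sequence


def aggregate_annotations(
    is_delicate_votes: Sequence[bool],
    risk_votes: Sequence[int],
) -> Dict[str, object]:
    """Combine annotator votes for one text (hypothetical helper).

    Stage 1: the text is delicate if a strict majority says so.
    Stage 2: only delicate texts receive a risk level.
    """
    delicate = sum(is_delicate_votes) > len(is_delicate_votes) / 2

    risk: Optional[int] = None
    if delicate and risk_votes:
        # ASSUMPTION: use the most common rating. On a tie, Counter
        # returns the rating that was seen first.
        risk = Counter(risk_votes).most_common(1)[0][0]

    return {"delicate": delicate, "risk": risk}


if __name__ == "__main__":
    print(aggregate_annotations([True, True, False], [2, 2, 3]))
```

A real annotation pipeline would also need a policy for ties and for texts with too few votes, which this sketch does not cover.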

The resulting dataset includes 40,000 labeled samples for training and 1,023 paragraphs for benchmark evaluation. Both the benchmark dataset and the baseline model have been released publicly through Hugging Face, along with annotation guidelines, demonstrating a commitment to reproducibility and community contribution.

Model Architecture and Training

For their baseline model, the team chose to fine-tune a RoBERTa-based classifier on the DeTexD Training dataset. RoBERTa (Robustly Optimized BERT Pretraining Approach) represents a well-established transformer architecture that has proven effective for text classification tasks. The choice of RoBERTa provides a good balance between performance and computational efficiency, making it suitable for production deployment scenarios.

The fine-tuned model, released as grammarly/detexd-roberta-base on Hugging Face, provides a ready-to-use solution for teams looking to incorporate delicate text detection into their applications. This is a significant operational advantage, as it eliminates the need for other organizations to collect and annotate their own datasets from scratch.
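As a rough sketch of how the released model might be called, the snippet below uses the Hugging Face `transformers` text-classification pipeline. The model name comes from the text above; the label names, the set of labels treated as delicate, and the 0.5 threshold are assumptions that should be checked against the model card before use.

```python
from typing import Dict, FrozenSet, List

MODEL_ID = "grammarly/detexd-roberta-base"  # model name as given in the text

# ASSUMPTION: which class labels count as "delicate". Verify on the model card.
DELICATE_LABELS: FrozenSet[str] = frozenset({"LABEL_3", "LABEL_4", "LABEL_5"})


def binary_score(scores: List[Dict], delicate_labels=DELICATE_LABELS) -> float:
    """Sum the probabilities of the labels treated as delicate."""
    return sum(s["score"] for s in scores if s["label"] in delicate_labels)


def load_classifier():
    """Load the model; needs the `transformers` package and a model download."""
    from transformers import pipeline

    return pipeline("text-classification", model=MODEL_ID)


def is_delicate(text: str, classifier, threshold: float = 0.5) -> bool:
    """Return True if the summed delicate probability exceeds the threshold."""
    # top_k=None asks the pipeline for the score of every class.
    scores = classifier(text, top_k=None)
    # Some versions wrap single-input results in an extra list; unwrap it.
    if scores and isinstance(scores[0], list):
        scores = scores[0]
    # ASSUMPTION: 0.5 is a placeholder threshold, not a calibrated value.
    return binary_score(scores) > threshold
```

Calling `is_delicate("some text", load_classifier())` would run the real model; the classifier is passed in as an argument so that a stub can be substituted in tests.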

Evaluation Results and Analysis

The evaluation results provide important insights for practitioners considering how to implement content safety in production LLM systems. The comparison table in the paper shows that the baseline model achieves 81.4% precision and 78.3% recall, with an F1 score of 79.3%. This balanced performance is notable because many existing methods show extreme trade-offs between precision and recall.

For example, HateBERT fine-tuned on HatEval achieves 95.2% precision but only 6.0% recall at its default threshold, meaning it misses most delicate content even though nearly everything it does flag is correct. When the threshold is calibrated to optimize F1 score, this flips to 41.1% precision and 86.0% recall: it catches more content but produces many false positives.
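To make the precision/recall trade-off concrete, here is a minimal sketch of the F1 computation and of a simple threshold sweep. The helper names and the toy data are hypothetical, and the sweep is a generic calibration approach, not necessarily the procedure the authors used.

```python
from typing import Sequence, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall(
    scores: Sequence[float], labels: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when flagging every score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def best_threshold(
    scores: Sequence[float], labels: Sequence[bool]
) -> Tuple[float, float]:
    """Try each observed score as a threshold; return (threshold, F1) with the highest F1."""
    if not scores:
        raise ValueError("scores must not be empty")
    candidates = sorted(set(scores))
    best = max(candidates, key=lambda t: f1_score(*precision_recall(scores, labels, t)))
    return best, f1_score(*precision_recall(scores, labels, best))


if __name__ == "__main__":
    # Precision and recall figures quoted above for HateBERT (HatEval).
    print(round(f1_score(0.952, 0.060), 3))  # default threshold
    print(round(f1_score(0.411, 0.860), 3))  # F1-calibrated threshold
```

Plugging in the quoted figures gives an F1 of roughly 0.11 at the default threshold and roughly 0.56 after calibration. These values are derived from the rounded precision and recall above and may differ slightly from the numbers reported in the paper.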

The analysis also confirmed the researchers’ hypothesis that delicate text detection and toxic text detection are fundamentally different tasks. The fine-tuned model tends to be more permissive with texts containing profanities unrelated to sensitive topics, while being more likely to flag discussions of race, violence, and sexuality even when not labeled as toxic by traditional metrics. This distinction is crucial for production systems that need nuanced content handling.

LLMOps Implications

This research has several important implications for teams operating LLMs in production:

The first is the recognition that content safety is multi-dimensional. Organizations deploying LLMs should not rely solely on toxicity detection but should consider broader categories of potentially harmful content. The DeTexD benchmark provides a way to evaluate how well existing safety measures capture delicate content.

The public release of artifacts—including the benchmark dataset, the trained model, and annotation guidelines—enables other teams to incorporate delicate text detection into their safety pipelines or to extend this research for their specific domains. The availability of the model on Hugging Face significantly lowers the barrier to adoption.

The paper also emphasizes responsible use of these tools, with the authors explicitly noting that they do not recommend using these artifacts without proper due diligence for privacy, security, sensitivity, legal, and compliance measures. This reflects an understanding that content moderation tools must be deployed thoughtfully within broader governance frameworks.

For teams building LLM-powered applications that may receive or generate content about mental health, medical topics, political issues, or other sensitive areas, the DeTexD approach offers a complementary layer of protection beyond standard toxicity filters. This is particularly relevant for customer-facing applications, content moderation systems, and AI assistants that interact with vulnerable populations.
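One way such a complementary layer could be wired together is sketched below. Everything here is hypothetical: the `SafetyDecision` type, the action names, and the idea of passing the two checks in as plain callables are illustrative choices, not part of the DeTexD release.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SafetyDecision:
    """Result of running both checks on one piece of text (hypothetical type)."""

    toxic: bool
    delicate: bool

    @property
    def action(self) -> str:
        # ASSUMPTION: illustrative policy; real systems define their own actions.
        if self.toxic:
            return "block"
        if self.delicate:
            return "handle_with_care"
        return "allow"


def check_text(
    text: str,
    toxicity_check: Callable[[str], bool],
    delicate_check: Callable[[str], bool],
) -> SafetyDecision:
    """Run the toxicity filter and the delicate-text detector independently."""
    return SafetyDecision(toxic=toxicity_check(text), delicate=delicate_check(text))


if __name__ == "__main__":
    # Stub checks standing in for a real toxicity API and a delicate-text model.
    toxic_stub = lambda t: "insult" in t.lower()
    delicate_stub = lambda t: "diagnosis" in t.lower()
    print(check_text("Questions about my diagnosis", toxic_stub, delicate_stub).action)
```

Keeping the two signals separate, instead of merging them into one score, lets an application respond differently to each, for example by blocking toxic text while routing delicate text to a more careful response path.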

Limitations and Considerations

While this research represents a valuable contribution, practitioners should be aware of certain limitations. The definition of “delicate” text is inherently subjective and culturally dependent—what is considered delicate may vary across communities and contexts. The annotation was performed by expert linguists, but their perspectives may not fully represent the diversity of potential users.

The dataset, with 40,000 training samples and a 1,023-paragraph benchmark, focuses on English-language content from specific online sources. Teams operating in multilingual environments or different cultural contexts may need to develop supplementary datasets.

Additionally, the research was published through an academic workshop, and while performance metrics are provided, there is limited information about inference latency, computational requirements, or how the model performs at scale in production environments. Teams considering adoption would need to conduct their own performance testing for their specific deployment scenarios.
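A starting point for that kind of testing is a small latency harness like the one below. It is a generic sketch: `measure_latency` is a hypothetical helper, the classifier is any callable that takes a string, and the 95th percentile uses the simple nearest-rank method.

```python
import math
import statistics
import time
from typing import Callable, Dict, Sequence


def measure_latency(
    classify: Callable[[str], object],
    texts: Sequence[str],
    warmup: int = 2,
) -> Dict[str, float]:
    """Time one call per text and report the mean and p95 latency in milliseconds."""
    if not texts:
        raise ValueError("texts must not be empty")

    # Warm-up calls are excluded from the timings (caches, lazy loading).
    for text in texts[:warmup]:
        classify(text)

    timings_ms = []
    for text in texts:
        start = time.perf_counter()
        classify(text)
        timings_ms.append((time.perf_counter() - start) * 1000)

    ordered = sorted(timings_ms)
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "count": float(len(ordered)),
        "mean_ms": statistics.fmean(ordered),
        "p95_ms": ordered[p95_index],
    }


if __name__ == "__main__":
    stub = lambda text: len(text) > 10  # stands in for a real model call
    print(measure_latency(stub, ["short", "a somewhat longer paragraph"] * 10))
```

Single-call timings like these say nothing about batching, hardware, or concurrent load, so they are only a first check before a fuller load test.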

Conclusion

Grammarly’s DeTexD research addresses a meaningful gap in content safety for AI systems. By distinguishing delicate text from purely toxic content and providing publicly available tools and benchmarks, the work enables more nuanced and comprehensive safety measures in production LLM deployments. For organizations serious about responsible AI deployment, incorporating delicate text detection alongside traditional toxicity filtering represents a more robust approach to user protection.
