## Overview
Grammarly, a widely used AI writing assistant platform, developed CoEdIT (Collaborative Editing with Instruction Tuning), an open-source instruction-tuned large language model designed specifically for text editing tasks. This case study presents an approach to LLMOps in which the focus shifts from building ever-larger general-purpose models to creating smaller, task-specific models that can outperform their larger counterparts on targeted use cases. The work was accepted as a Findings paper at EMNLP 2023, one of the premier conferences in natural language processing.
The core insight driving this work is that general-purpose LLMs, while capable across a broad range of tasks, may not be optimal for specific use cases like text editing. By narrowing the focus and creating a "specialist" model through instruction tuning on a carefully curated dataset, Grammarly demonstrated that significant performance gains and efficiency improvements can be achieved simultaneously.
## Problem Statement and Motivation
The Grammarly team identified several critical gaps in existing approaches to developing text editing models using LLMs:
- **Lack of instruction tuning for editability**: Most existing models were not trained with instruction tuning, limiting their usability and interpretability for editing tasks where natural language instructions are essential.
- **Reliance on undersized models**: Previous work often used smaller architectures that could not capture the full complexity of text editing tasks.
- **Training on overly general datasets**: Models trained on broad task distributions (what the authors call "sparse task distribution") performed suboptimally on the narrower but more demanding domain of text editing.
- **Lack of open-source availability**: Many high-performing models were proprietary, limiting reproducibility and broader adoption.
The team hypothesized that fine-tuning on a "dense task distribution" — tasks that are closely related to each other within the text editing domain — would enable better performance and generalization to adjacent tasks. This is analogous to training a human specialist who becomes expert in a specific domain rather than a generalist who knows a little about everything.
## Technical Approach and Implementation
### Dataset Construction
A critical aspect of successful instruction tuning is the quality and design of the training dataset. The Grammarly team built upon their previous work with the IteraTeR+ dataset, which contains various text editing tasks focused on non-meaning-changing edits. The process involved several key steps:
The team translated edit categories (Fluency, Coherence, Clarity, Style) into natural language instructions like "Make this more coherent." This translation from categorical labels to natural language is essential for instruction tuning as it teaches the model to respond to human-like commands.
For subjective categories like Style, the team introduced specific sub-intentions including Paraphrasing, Formality Style Transfer, and Neutralization. This granularity helps the model understand nuanced differences between editing intents.
To improve robustness to different phrasings, the team created paraphrases of the instruction templates and added them to the dataset, ensuring, for example, that the model responds equivalently whether an instruction says "write" or "rewrite." This is an important consideration for production systems, where users may phrase their requests in varied ways.
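The paper's exact templating code is not reproduced here; the sketch below illustrates, under assumed template wordings, how categorical edit intents could be mapped to natural language instructions and augmented with paraphrases to build instruction-tuning examples.

```python
import random

# Hypothetical instruction templates per edit intent. The wordings are
# assumptions for illustration, not the released CoEdIT templates; each
# intent maps to several paraphrased phrasings so the model learns to
# treat them as equivalent.
INSTRUCTION_TEMPLATES = {
    "fluency": ["Fix the grammar", "Correct the grammatical errors"],
    "coherence": ["Make this more coherent", "Improve the coherence of this text"],
    "clarity": ["Make this clearer", "Simplify this sentence"],
    "paraphrase": ["Paraphrase this", "Rewrite this with different wording"],
    "formality": ["Make this more formal", "Rewrite this in a formal tone"],
    "neutralization": ["Make this text neutral", "Remove subjective bias from this text"],
}

def build_example(intent: str, source: str, target: str) -> dict:
    """Turn an (intent, source, target) edit pair into an instruction-tuning example."""
    instruction = random.choice(INSTRUCTION_TEMPLATES[intent])
    return {
        "input": f"{instruction}: {source}",  # instruction prepended to the source text
        "output": target,                     # the human-edited reference
    }

print(build_example(
    "coherence",
    "He was hungry. He went to the store. He bought bread.",
    "Because he was hungry, he went to the store and bought bread.",
))
```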
### Model Training
The team fine-tuned pre-trained FLAN-T5 models at three different scales:
- **CoEdIT-L**: 770 million parameters
- **CoEdIT-XL**: 3 billion parameters
- **CoEdIT-XXL**: 11 billion parameters
The choice of FLAN-T5 as the base model is notable because FLAN-T5 is itself an instruction-tuned model, meaning the team performed additional specialized instruction tuning on top of an already instruction-tuned foundation. This approach leverages the general instruction-following capabilities while adding domain-specific expertise.
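The released checkpoints follow the standard T5 interface; assuming the public Hugging Face model ID grammarly/coedit-large for the 770M model (an assumption about the published artifact name), inference with the transformers library would look roughly like this:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Model ID assumed to be the publicly released 770M CoEdIT checkpoint.
tokenizer = AutoTokenizer.from_pretrained("grammarly/coedit-large")
model = T5ForConditionalGeneration.from_pretrained("grammarly/coedit-large")

# The instruction is simply prepended to the text to be edited.
text = "Fix the grammar: She go to school every day."
input_ids = tokenizer(text, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_length=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the base model is an encoder-decoder (T5) rather than a decoder-only architecture, the edited text is produced as a fresh generation conditioned on the full instruction-plus-source input.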
### Evaluation Methodology
The evaluation strategy employed by Grammarly is worth examining closely as it represents a thoughtful approach to assessing LLM quality in production contexts where subjective judgment plays a significant role.
**Comparison Groups**: The team established four comparison groups to contextualize CoEdIT's performance:
- A no-edit baseline (the source text copied verbatim, with the instruction removed)
- Supervised text editing models trained on iterative revision tasks
- Instruction-tuned LLMs like ChatGPT and GPT-3-Edit
- Decoder-only LLMs without instruction tuning, such as GPT-3 and LLaMA
**Quantitative Analysis**: The models were evaluated against standard test sets from multiple text editing benchmarks, covering syntactic, semantic, and stylistic edit requirements. This multi-dimensional evaluation is important for understanding model capabilities across different editing scenarios.
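The write-up does not name the specific metrics used. Text editing benchmarks of this kind are commonly scored with reference-based edit metrics such as SARI, so the following sketch (metric choice assumed, not a claim about the paper's exact setup) shows how such scoring could be scripted with the Hugging Face evaluate library:

```python
import evaluate

# SARI scores a system edit against both the original source and one or
# more human reference edits; it is widely used for text simplification
# and editing benchmarks.
sari = evaluate.load("sari")

sources = ["She go to school every day."]
predictions = ["She goes to school every day."]    # model output
references = [["She goes to school every day."]]   # list of reference edits per source

print(sari.compute(sources=sources, predictions=predictions, references=references))
```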
**Qualitative Analysis (Human Evaluation)**: Recognizing the inherent subjectivity in judging writing quality, the team conducted human evaluations where expert evaluators compared outputs from CoEdIT-XL (3B parameters) and GPT-3-Edit (175B parameters) across fluency, accuracy, and meaning preservation dimensions.
**Adjacent Task Evaluation**: To test generalization capabilities, the team evaluated CoEdIT on tasks it wasn't explicitly trained on, including sentence compression and politeness transfer. This evaluation is particularly important for production systems where users may request variations of trained tasks.
**Composite Task Evaluation**: Real-world editing often involves multi-step instructions like "make the text simpler, paraphrase it, and make it formal." The team developed CoEdIT-Composite by enriching the training set with multi-part tasks and evaluated it separately against the base CoEdIT-XL and GPT-3-Edit.
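How the multi-part training examples were constructed is not detailed here; one plausible approach (an assumed sketch, not the authors' documented procedure) is to chain single-intent instructions into a compound prompt:

```python
# Hypothetical composition of single-intent instructions into one compound
# instruction, mirroring requests like "make the text simpler, paraphrase
# it, and make it formal". Wordings are illustrative assumptions.
SINGLE_INTENTS = {
    "simplify": "make the text simpler",
    "paraphrase": "paraphrase it",
    "formalize": "make it formal",
}

def compose_instruction(intents: list[str]) -> str:
    parts = [SINGLE_INTENTS[i] for i in intents]
    if len(parts) == 1:
        return parts[0].capitalize()
    return (", ".join(parts[:-1]) + ", and " + parts[-1]).capitalize()

print(compose_instruction(["simplify", "paraphrase", "formalize"]))
# -> Make the text simpler, paraphrase it, and make it formal
```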
## Results and Performance
The results demonstrated that task-specific instruction tuning can yield dramatic efficiency gains while simultaneously improving performance:
- Even the smallest model, CoEdIT-L (770M parameters), outperformed supervised text editing models, instruction-tuned models, and general-purpose LLMs.
- CoEdIT achieved these results with **12 to 60 times fewer parameters** than models like GPT-3-Edit (175B parameters).
- In human evaluations, evaluators preferred CoEdIT's output **64% of the time** compared to just **10% for GPT-3-Edit** (with the remaining percentage presumably being ties or no preference).
- On adjacent tasks (sentence compression and politeness transfer), CoEdIT outperformed competitors including GPT-3-Edit, validating the hypothesis that dense task training enables better generalization to related tasks.
- For composite tasks, CoEdIT-Composite was preferred over GPT-3-Edit (38% vs 34%) and the original CoEdIT-XL (34% vs 21%), though the closer margins indicate room for improvement.
## LLMOps Considerations and Implications
This case study offers several valuable lessons for LLMOps practitioners:
**Model Sizing and Efficiency**: The dramatic parameter reduction (up to 60x) while maintaining or improving performance has significant implications for deployment costs, latency, and infrastructure requirements. Smaller models are cheaper to host, faster to run inference on, and can potentially be deployed on edge devices or in resource-constrained environments.
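As a rough, back-of-the-envelope illustration (these figures are not from the paper): at 16-bit precision, model weights alone occupy about 2 bytes per parameter, which already separates the deployment profiles of the models compared above.

```python
# Rough fp16 weight-memory estimates (2 bytes per parameter). Real serving
# footprints also include activations, KV caches, and runtime overhead.
for name, params in [("CoEdIT-L", 770e6), ("CoEdIT-XL", 3e9),
                     ("CoEdIT-XXL", 11e9), ("GPT-3-Edit", 175e9)]:
    print(f"{name:>11}: ~{params * 2 / 1e9:,.1f} GB of weights")
```

On these estimates, the 770M model's weights fit comfortably on a single commodity GPU, while a 175B-parameter model requires a multi-GPU serving cluster.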
**Task-Specific vs. General-Purpose Models**: The "specialist vs. generalist" framing provides a useful mental model for deciding when to use general-purpose LLMs versus fine-tuned models. For well-defined application domains, task-specific instruction tuning can yield substantial benefits.
**Dataset Quality and Design**: The careful attention to dataset construction — including natural language instruction templates, sub-intention categorization, and paraphrase augmentation — highlights the importance of high-quality training data for instruction tuning success.
**Multi-Dimensional Evaluation**: The combination of quantitative benchmarks, human evaluation, adjacent task testing, and composite task assessment provides a comprehensive evaluation framework that accounts for the subjective nature of text quality while still producing actionable metrics.
**Open Source Strategy**: By releasing the models and data publicly, Grammarly enables reproducibility and community contribution while positioning itself as a thought leader in the space. This is a strategic choice that balances competitive advantage with the benefits of open research.
## Limitations and Future Directions
The authors acknowledge several areas for future improvement:
- Expanding the models' capability to handle longer texts, since limited context length is a common constraint of transformer-based models.
- Better accounting for prompt sensitivity in training and testing, addressing the known issue that small variations in prompts can lead to different outputs.
- The closer margins in composite task evaluations suggest there's room for improvement in handling complex multi-step editing instructions.
## Critical Assessment
While the results are impressive, it's worth noting some caveats:
- The comparison with GPT-3-Edit uses OpenAI's API-based model, which may not represent the most current state-of-the-art from OpenAI.
- Human evaluation was conducted with "expert evaluators" whose specific qualifications and potential biases are not detailed.
- The production deployment specifics (latency, throughput, infrastructure) are not discussed, leaving questions about real-world operational characteristics.
- As Grammarly is presenting their own research, there may be inherent biases in how comparisons are framed and results are presented.
Despite these caveats, the work represents a valuable contribution to the LLMOps landscape by demonstrating that thoughtful specialization can achieve better results than brute-force scaling, with significant implications for cost, efficiency, and practical deployment of LLMs in production writing assistance applications.