Company: Atlassian
Title: ML-Based Comment Ranker for LLM Code Review Quality Improvement
Industry: Tech
Year: 2025

Summary (short): Atlassian developed a machine learning-based comment ranker to improve the quality of their LLM-powered code review agent by filtering out noisy, incorrect, or unhelpful comments. The system uses a fine-tuned ModernBERT model trained on proprietary data from over 53K code review comments to predict which LLM-generated comments will lead to actual code changes. The solution improved code resolution rates from ~33% to 40-45%, approaching human reviewer performance of 45%, while maintaining robustness across different underlying LLMs and user bases, ultimately reducing PR cycle times by 30% and serving over 10K monthly active users reviewing 43K+ pull requests.

## Overview

Atlassian's case study demonstrates a sophisticated approach to improving LLM-powered code review systems through the implementation of a machine learning-based comment ranker. The DevAI organization developed this solution as part of their broader Rovo Dev agents ecosystem, specifically targeting the quality challenges inherent in LLM-generated code review comments. The system has achieved significant scale, serving over 10,000 monthly active users and processing more than 43,000 pull requests monthly in its beta phase.

The core problem addressed by this LLMOps implementation centers on the inherent noisiness of LLM-generated content in production environments. Without proper filtering mechanisms, LLM-generated code review comments frequently exhibited issues such as being overly nit-picky, factually incorrect, or simply unhelpful to developers. This challenge is particularly acute in code review scenarios where the quality and relevance of feedback directly impact developer productivity and code quality outcomes.

## Technical Architecture and Implementation

The comment ranker operates as a post-processing layer in the LLM pipeline, taking LLM-generated comments as input and applying machine learning-based filtering to select only the most valuable comments for presentation to users. The system leverages a fine-tuned ModernBERT model, which represents a recent advancement in the BERT family of transformer models. The choice of ModernBERT reflects thoughtful consideration of the natural language processing requirements for understanding code review comments in their contextual setting.
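As a rough illustration of where the ranker sits in the pipeline, the sketch below shows the post-processing step: each LLM-generated comment is scored and only those above a threshold are posted. The `ReviewComment` type, `score_fn` callable, and threshold value are hypothetical stand-ins, not Atlassian's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ReviewComment:
    file_path: str
    line: int
    body: str


def rank_and_filter(
    comments: List[ReviewComment],
    score_fn: Callable[[str], float],  # hypothetical: returns predicted P(code resolution)
    threshold: float = 0.5,            # illustrative; Atlassian tunes this via A/B testing
) -> List[ReviewComment]:
    """Keep only the comments whose predicted propensity clears the threshold."""
    return [c for c in comments if score_fn(c.body) >= threshold]
```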

The training approach demonstrates a sophisticated understanding of LLMOps principles by utilizing proprietary user interaction data as ground truth. Specifically, the system defines "code resolution" as a binary outcome indicating whether a pull request author made code changes in response to a comment, using this behavioral signal as the primary training target. This approach is particularly valuable because it captures actual user utility rather than relying on subjective quality assessments or synthetic labeling approaches.
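The post does not describe Atlassian's data schema, but a minimal sketch of deriving the binary "code resolution" label from observed behavior might look like this, with invented column names and example rows:

```python
import pandas as pd

# One row per posted comment, joined against the PR author's subsequent commits.
# "code_changed_after_comment" is the implicit behavioral signal; the binary
# training label is derived directly from it.
comments = pd.DataFrame(
    {
        "comment_text": [
            "This loop re-reads the file on every iteration; hoist the read out.",
            "Nit: rename `tmp` to something more descriptive.",
            "Possible NPE: `config` can be null on this path.",
        ],
        "code_changed_after_comment": [True, False, True],
    }
)

comments["label"] = comments["code_changed_after_comment"].astype(int)
train_df = comments[["comment_text", "label"]]
```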

The model fine-tuning process follows established best practices for transformer-based models, involving tokenization of comment text, addition of classification layers, and iterative training with backpropagation over multiple epochs. The implementation leverages GPU resources for efficient training, with fine-tuning cycles completing within hours rather than days or weeks. This efficiency is crucial for maintaining model freshness and responsiveness to changing patterns in code review quality.
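The post does not publish training code; the sketch below shows what a fine-tuning loop of this shape could look like with the Hugging Face `transformers` Trainer, assuming the public `answerdotai/ModernBERT-base` checkpoint and a `transformers` version that supports ModernBERT. The hyperparameters and toy dataset are illustrative, not Atlassian's.

```python
import pandas as pd
import torch
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "answerdotai/ModernBERT-base"  # assumed public checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Toy stand-in for the ~53K labeled comments described in the post.
train_df = pd.DataFrame(
    {
        "comment_text": [
            "This loop re-reads the file on every iteration; hoist the read out.",
            "Nit: rename `tmp` to something more descriptive.",
        ],
        "label": [1, 0],
    }
)
dataset = Dataset.from_pandas(train_df)


def tokenize(batch):
    return tokenizer(batch["comment_text"], truncation=True, max_length=512)


dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="comment-ranker",
    num_train_epochs=3,              # "multiple epochs" per the post; the exact number is unknown
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    fp16=torch.cuda.is_available(),  # GPU training keeps fine-tuning cycles to hours
)

trainer = Trainer(model=model, args=args, train_dataset=dataset, tokenizer=tokenizer)
trainer.train()
```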

## Production Deployment and Model Management

One of the most impressive aspects of this LLMOps implementation is its robust approach to model lifecycle management in production. The system addresses the critical challenge of model drift and degradation over time through systematic model refresh strategies. The team observed concrete examples of performance degradation when expanding from limited repository coverage to all Atlassian repositories, with code resolution rates dropping from approximately 40% to 33%. However, the retraining process successfully restored performance levels, demonstrating effective model management practices.

The production deployment strategy shows careful consideration of threshold optimization through A/B testing methodologies. Comments must exceed a propensity score threshold to be posted, with this threshold being empirically determined and continuously optimized based on online performance metrics. This approach balances precision and recall considerations while maintaining user experience quality.
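One plausible way to shortlist thresholds offline before an A/B test, assuming a held-out set of propensity scores paired with observed code-resolution labels, is a precision/recall sweep; the data below is a placeholder:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Placeholder held-out data: model propensities and observed code resolution.
scores = np.array([0.91, 0.40, 0.75, 0.22, 0.63, 0.85, 0.10, 0.57])
labels = np.array([1, 0, 1, 0, 0, 1, 0, 1])

precision, recall, thresholds = precision_recall_curve(labels, scores)

# Offline: shortlist thresholds that keep precision above a chosen floor, then
# confirm the finalists online by A/B testing against code resolution rate.
target_precision = 0.6
candidates = [t for p, t in zip(precision[:-1], thresholds) if p >= target_precision]
print("candidate thresholds for A/B testing:", sorted(candidates)[:3])
```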

The system demonstrates remarkable robustness across different underlying LLM models, successfully maintaining consistent performance when transitioning from GPT-4o to Claude 3.5 Sonnet. This model-agnostic behavior suggests that the comment ranker has learned generalizable patterns about comment quality that transcend specific LLM architectures or training approaches. This characteristic is particularly valuable for LLMOps practitioners who need to maintain system performance while adapting to evolving foundation model capabilities.

## Evaluation and Metrics

The evaluation framework employed in this case study exemplifies best practices in LLMOps measurement and monitoring. The primary metric, code resolution rate (CRR), provides a direct connection between model performance and business value, measuring the percentage of comments that lead to actual code changes. This metric choice demonstrates understanding that traditional machine learning metrics like accuracy or F1 score may not capture the true utility of comments in a production code review environment.
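The metric itself is straightforward to compute. A small sketch, assuming each posted comment carries a boolean `resolved` flag derived from the author's subsequent commits:

```python
def code_resolution_rate(posted_comments):
    """CRR = fraction of posted comments that led to a code change."""
    if not posted_comments:
        return 0.0
    return sum(1 for c in posted_comments if c["resolved"]) / len(posted_comments)


# Example: 9 of 20 posted comments triggered a code change -> CRR of 0.45,
# matching the human-reviewer benchmark cited in the post.
sample = [{"resolved": i < 9} for i in range(20)]
print(code_resolution_rate(sample))  # 0.45
```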

The achievement of 40-45% code resolution rates, approaching the human benchmark of 45%, represents significant progress in LLM quality management. However, it's important to note that these benchmarks are internal to Atlassian and may not generalize across different development environments, coding standards, or organizational cultures. The comparison to human performance provides valuable context but should be interpreted within the specific operational context of Atlassian's development practices.

The system's performance across different user bases, with external beta customers showing even better metrics than internal users, suggests strong generalization capabilities. However, this could also reflect selection bias in the beta customer population or differences in usage patterns that favor the model's strengths.

## Data Strategy and Ground Truth Generation

The approach to ground truth generation represents a sophisticated understanding of user behavior analytics in LLMOps contexts. By leveraging actual user actions (code changes following comments) rather than explicit ratings or surveys, the system captures implicit feedback that may be more reliable and scalable than traditional evaluation approaches. The dataset of over 53,000 comments with associated outcomes provides substantial training signal, though the quality and representativeness of this data depends heavily on the diversity of Atlassian's internal development practices.

The expansion from approximately 10,000 to 53,000 training examples when rolling out to all internal repositories demonstrates the importance of data scale in maintaining model performance across diverse contexts. This scaling approach provides valuable insights into data requirements for similar LLMOps implementations, though organizations with different scales or development practices may require different data volumes.

## Integration with Broader LLM Infrastructure

The comment ranker's integration with the broader code review pipeline demonstrates thoughtful LLMOps architecture design. The system operates downstream from the primary LLM comment generation while upstream from user presentation, creating a clear separation of concerns that enables independent optimization of generation and filtering components. This architectural approach allows for experimentation with different LLM providers without requiring retraining of the ranking model, as evidenced by the smooth transitions between GPT-4o and Claude 3.5 Sonnet.
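That separation of concerns can be made explicit in code. In the sketch below (interface and class names are hypothetical), the generator sits behind a narrow interface, so GPT-4o, Claude 3.5 Sonnet, or any other provider can be swapped in while the ranker and its threshold stay untouched:

```python
from typing import Callable, List, Protocol


class CommentGenerator(Protocol):
    """Any upstream LLM provider (GPT-4o, Claude 3.5 Sonnet, ...) behind one interface."""

    def generate(self, diff: str) -> List[str]: ...


class ReviewPipeline:
    """Generation and ranking as independent stages: swapping the generator
    does not require retraining or redeploying the ranker."""

    def __init__(
        self,
        generator: CommentGenerator,
        score_fn: Callable[[str], float],
        threshold: float,
    ):
        self.generator = generator
        self.score_fn = score_fn
        self.threshold = threshold

    def review(self, diff: str) -> List[str]:
        drafts = self.generator.generate(diff)
        return [c for c in drafts if self.score_fn(c) >= self.threshold]
```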

The system's ability to complement rather than replace LLM capabilities represents a mature approach to LLMOps that recognizes the strengths and limitations of different model types. Rather than attempting to solve quality issues through prompt engineering or LLM fine-tuning alone, the hybrid approach leverages specialized models optimized for specific tasks within the overall pipeline.

## Challenges and Limitations

While the case study presents impressive results, several limitations and challenges warrant consideration. The reliance on code resolution as the primary quality signal may miss comments that provide valuable educational content or prevent future issues without requiring immediate code changes. Additionally, the binary nature of the resolution signal may not capture the nuanced value that different types of comments provide to development teams.

The model currently relies on the comment text alone. While this already performs well, it leaves richer contextual signals untapped, such as code complexity metrics, developer experience levels, or historical patterns in similar code contexts. The future work section acknowledges this limitation and outlines plans for feature engineering to incorporate additional signals.

The system's performance characteristics may be heavily influenced by Atlassian's specific development culture, coding standards, and review practices. Organizations with different approaches to code review, different programming languages, or different quality standards may experience different results when implementing similar approaches.

## Operational Considerations

The case study demonstrates strong attention to operational concerns that are critical for production LLMOps success. The need for regular model retraining to address data drift shows understanding of the dynamic nature of software development patterns and the importance of model freshness in maintaining performance. The timeline for model refresh and the infrastructure requirements for GPU-based training represent significant operational overhead that organizations must consider when implementing similar systems.
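A lightweight way to operationalize retraining decisions, assuming per-comment resolution outcomes are logged, is a rolling CRR monitor that flags when performance drifts below a chosen floor. The window size and alert threshold below are illustrative; the post only reports that CRR fell from roughly 40% to 33% after the rollout to all repositories:

```python
from collections import deque
from statistics import mean


class CRRMonitor:
    """Tracks rolling code resolution rate and flags when a retrain may be due."""

    def __init__(self, window: int = 2000, alert_below: float = 0.35):
        self.outcomes = deque(maxlen=window)  # 1.0 if the comment was resolved, else 0.0
        self.alert_below = alert_below

    def record(self, resolved: bool) -> None:
        self.outcomes.append(1.0 if resolved else 0.0)

    def needs_retrain(self) -> bool:
        # Only alert once the window is full, to avoid noise from small samples.
        return (
            len(self.outcomes) == self.outcomes.maxlen
            and mean(self.outcomes) < self.alert_below
        )
```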

The cost and latency considerations mentioned as goals for the comment ranker highlight important tradeoffs in LLMOps implementations. While filtering can reduce downstream LLM calls and improve quality, it introduces additional model inference overhead and complexity. The net impact on system performance and cost requires careful monitoring and optimization.

## Future Directions and Scalability

The outlined future work around feature engineering and expanded data signals suggests a roadmap for continued improvement that aligns with LLMOps best practices. The planned incorporation of code diff analysis, file extensions, and associated ticket information could significantly enhance model performance by providing richer context for quality assessment.
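As a hedged illustration of what that feature engineering could look like, the sketch below concatenates text features with categorical and numeric signals using scikit-learn rather than the ModernBERT model; the columns and rows are invented, and the post does not specify how these signals would actually be combined:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Invented rows: comment text plus the kinds of extra signals named as future work.
df = pd.DataFrame(
    {
        "comment_text": ["Possible NPE here", "Nit: rename variable", "Query misses an index"],
        "file_extension": [".java", ".py", ".sql"],
        "diff_lines_changed": [120, 4, 35],
        "resolved": [1, 0, 1],
    }
)

features = ColumnTransformer(
    [
        ("text", TfidfVectorizer(), "comment_text"),
        ("ext", OneHotEncoder(handle_unknown="ignore"), ["file_extension"]),
    ],
    remainder="passthrough",  # passes numeric columns such as diff_lines_changed through
)

model = Pipeline([("features", features), ("clf", LogisticRegression())])
model.fit(df.drop(columns="resolved"), df["resolved"])
```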

The system's demonstrated scalability from internal dogfooding to over 400 external beta customers provides confidence in the approach's generalizability, though the long-term sustainability of manual model refresh processes may require additional automation as the system scales further. The integration with broader development workflows and the ability to handle diverse codebases suggests strong architectural foundations for continued growth.

This case study represents a comprehensive example of production LLMOps implementation that addresses real quality challenges through thoughtful application of machine learning, careful evaluation methodology, and robust operational practices. The combination of immediate business impact and technical sophistication makes it a valuable reference for organizations implementing similar LLM-powered development tools.
