Company
Stack Overflow
Title
Hybrid ML and LLM Approach for Automated Question Quality Feedback
Industry
Tech
Year
2025
Summary (short)
Stack Overflow developed Question Assistant to provide automated feedback on question quality for new askers, addressing the repetitive nature of human reviewer comments in their Staging Ground platform. Initial attempts to use LLMs alone to rate question quality failed due to unreliable predictions and generic feedback. The team pivoted to a hybrid approach combining traditional logistic regression models trained on historical reviewer comments to flag quality indicators, paired with Google's Gemini LLM to generate contextual, actionable feedback. While the solution didn't significantly improve approval rates or review times, it achieved a meaningful 12% increase in question success rates (questions that remain open and receive answers or positive scores) across two A/B tests, leading to full deployment in March 2025.
## Overview

Stack Overflow developed Question Assistant as an automated feedback system to help question askers improve their posts before public submission. The use case emerged from their Staging Ground platform, where human reviewers repeatedly provided the same feedback to new users about question quality issues. This case study is particularly noteworthy because it demonstrates the limitations of pure LLM approaches and illustrates how combining traditional machine learning with generative AI can produce more reliable production systems.

The problem Stack Overflow faced was twofold: human reviewers were spending time providing repetitive feedback on common question quality issues, and new askers weren't receiving timely guidance. While Staging Ground had already improved question quality overall, the manual review process was slow and reviewers found themselves repeatedly suggesting similar improvements around context, formatting, error details, and reproducibility.

## Initial LLM Approach and Its Failures

Stack Overflow's first instinct, leveraging their partnership with Google, was to use Gemini to directly evaluate question quality across three categories: context and background, expected outcome, and formatting and readability. These categories were defined in prompts and the team attempted to have the LLM provide quality ratings for questions in each category.

This pure LLM approach revealed several critical production challenges that are important for understanding LLMOps limitations. The LLM could not reliably predict quality ratings that correlated with the feedback it provided. The feedback itself was repetitive and didn't correspond properly with the intended categories—for instance, all three categories would regularly include feedback about library or programming language versions regardless of relevance. More problematically, the quality ratings and feedback wouldn't appropriately change when question drafts were updated, which would have made the system useless for iterative improvement.

The team recognized a fundamental issue: for an LLM to reliably rate question quality, they needed to define through data what a quality question actually is. The subjective nature of "quality" meant the LLM lacked the grounding necessary for consistent predictions. This led them to attempt creating a ground truth dataset through a survey of 1,000 question reviewers, asking them to rate questions on a 1-5 scale across the three categories. However, with only 152 complete responses and a low Krippendorff's alpha score, the labeled data proved unreliable for training and evaluation purposes. The inter-rater disagreement suggested that even human reviewers couldn't consistently agree on numerical quality ratings.

This failed approach yielded an important insight: numerical ratings don't provide actionable feedback. A score of "3" in a category doesn't tell the asker what, how, or where to improve. This realization led to the architectural pivot that defines this case study.

## The Hybrid Architecture: Traditional ML + LLM

Rather than using an LLM alone, Stack Overflow designed a hybrid system where traditional machine learning models perform classification and the LLM generates contextual feedback. The architecture works as follows: individual logistic regression models were built for specific, actionable feedback indicators. Instead of predicting a subjective quality score, each binary classifier determines whether a question needs feedback for a specific issue.

The team started with the "context and background" category, breaking it into four concrete indicators: problem definition (lacking information about goals), attempt details (missing information on what was tried), error details (missing error messages or debugging logs), and missing minimal reproducible example (MRE). These indicators were derived from clustering reviewer comments on historical Staging Ground posts to identify common themes. Conveniently, these themes aligned with existing comment templates and question close reasons, providing a natural source of training data from past human decisions. The reviewer comments and close comments were vectorized using TF-IDF (term frequency-inverse document frequency) before being fed to logistic regression models for classification.

The LLM component enters the workflow after classification. When an indicator model flags a question, the system sends preloaded response text along with the question content to Gemini, accompanied by system prompts. Gemini synthesizes these inputs to produce feedback that addresses the specific indicator but is tailored to the particular question, avoiding the generic responses that plagued the pure LLM approach.
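The case study stops short of publishing implementation details, but the flow it describes is compact enough to sketch. The following is a minimal illustration only, assuming scikit-learn for the TF-IDF and logistic regression indicator models and the google-generativeai client for the Gemini call; the indicator names, guidance text, model name, and helper functions are hypothetical rather than Stack Overflow's actual code.

```python
# Minimal sketch of the hybrid flow described above (names are hypothetical):
# per-indicator logistic regression classifiers decide *whether* feedback is needed,
# and Gemini is only asked to phrase feedback for the indicators that fire.
import os

import google.generativeai as genai  # assumed client library; the case study only says "Gemini"
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# One binary classifier per indicator, trained on labels derived from historical
# reviewer comments and close reasons (the exact training inputs are an assumption here).
INDICATORS = ["problem_definition", "attempt_details", "error_details", "missing_mre"]


def train_indicator_model(texts, needs_feedback_labels):
    """TF-IDF features feeding a logistic regression, as the case study describes."""
    model = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=5)),
        ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
    ])
    model.fit(texts, needs_feedback_labels)
    return model


# Preloaded guidance text per indicator (illustrative wording, not Stack Overflow's).
GUIDANCE = {
    "error_details": "The question is missing error messages or debugging logs.",
    "missing_mre": "The question lacks a minimal reproducible example.",
}


def generate_feedback(question_body, indicator_models, gemini_model="gemini-1.5-pro"):
    """Run the indicator classifiers, then have Gemini tailor feedback for flagged ones."""
    flagged = [name for name, model in indicator_models.items()
               if model.predict([question_body])[0] == 1]
    llm = genai.GenerativeModel(
        gemini_model,
        system_instruction="You help Stack Overflow askers improve a draft question before it is posted.",
    )
    feedback = []
    for name in flagged:
        prompt = (
            f"Issue to address: {GUIDANCE.get(name, name)}\n"
            f"Question draft:\n{question_body}\n\n"
            "Write one short, specific, actionable suggestion."
        )
        feedback.append((name, llm.generate_content(prompt).text))
    return feedback
```

The important design property is that the LLM never decides whether feedback is warranted; the classifiers make that call, and Gemini only phrases feedback for the indicators that fired.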
## Production Infrastructure

The production infrastructure reveals important LLMOps patterns for hybrid systems. Models were trained and stored within Azure Databricks, leveraging their ecosystem for ML model management. For serving, a dedicated service running on Azure Kubernetes downloads models from Databricks Unity Catalog and hosts them to generate predictions when feedback is requested. This separation of training infrastructure (Databricks) from serving infrastructure (Kubernetes) is a common pattern for production ML systems.

The team implemented comprehensive observability and evaluation pipelines. Events were collected through Azure Event Hub, and predictions and results were logged to Datadog to understand whether generated feedback was helpful and to support future model iterations. This instrumentation is critical for understanding production LLM behavior and performance over time.
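The case study names the serving components (models registered in Databricks Unity Catalog, a dedicated prediction service on Azure Kubernetes) without describing the glue code. A minimal sketch of that pattern might look like the following, assuming the indicator models are registered through MLflow and exposed from a small FastAPI app; the catalog path, alias, and request schema are hypothetical.

```python
# Sketch of the described serving pattern: pull the indicator models from Databricks
# Unity Catalog at startup and expose a prediction endpoint from a service on AKS.
# Catalog path, alias, and request schema are hypothetical.
import mlflow
import mlflow.sklearn
from fastapi import FastAPI
from pydantic import BaseModel

mlflow.set_registry_uri("databricks-uc")  # resolve model URIs against Unity Catalog

INDICATORS = ["problem_definition", "attempt_details", "error_details", "missing_mre"]

# One registered model per indicator, e.g. "main.question_assistant.qa_error_details".
models = {
    name: mlflow.sklearn.load_model(f"models:/main.question_assistant.qa_{name}@champion")
    for name in INDICATORS
}

app = FastAPI()


class QuestionDraft(BaseModel):
    body: str


@app.post("/indicators")
def flag_indicators(draft: QuestionDraft):
    """Return the indicators whose classifiers flag this draft as needing feedback."""
    flagged = [name for name, model in models.items()
               if int(model.predict([draft.body])[0]) == 1]
    # In production, the team also streamed events to Azure Event Hub and logged
    # predictions and outcomes to Datadog; that instrumentation is omitted here.
    return {"flagged_indicators": flagged}
```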
## Experimentation and Evaluation

The deployment followed a rigorous two-stage experimental approach with clearly defined success metrics. The first experiment targeted Staging Ground, focusing on new askers who likely needed the most help. It was structured as an A/B test with eligible askers split 50/50 between control (no Gemini assistance) and variant (Gemini assistance) groups. The original goal metrics were increasing question approval rates to the main site and reducing review time.

Interestingly, the results were inconclusive for the original metrics—neither approval rates nor average review times significantly improved. This represents a common scenario in production AI systems where the solution doesn't achieve the initially hypothesized impact. However, rather than considering this a failure, the team examined alternative success metrics and discovered a meaningful finding: questions that received Question Assistant feedback showed increased "success rates," defined as questions that remain open on the site and either receive an answer or achieve a post score of at least +2. This suggests the system improved the actual quality of questions, even if it didn't speed up the review process.

The second experiment expanded to all eligible askers on the main Ask Question page with Ask Wizard, validating findings beyond just new users. This experiment confirmed the results and demonstrated that Question Assistant could help more experienced askers as well. The consistency of findings—a steady +12% improvement in success rates across both experiments—provided confidence for full deployment. The team made Question Assistant available to all askers on Stack Overflow on March 6, 2025, representing the transition from experimentation to full production deployment.
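Because the "success rate" metric has a precise definition, it is straightforward to express directly. The snippet below is a hypothetical illustration of how that outcome measure could be computed for each experiment arm; the field names are invented.

```python
# Hypothetical illustration of the "success rate" outcome: a question counts as
# successful if it remains open and either receives an answer or reaches a score of +2.
from dataclasses import dataclass


@dataclass
class Question:
    is_open: bool      # not closed or deleted
    answer_count: int
    score: int         # net post score


def is_successful(q: Question) -> bool:
    return q.is_open and (q.answer_count > 0 or q.score >= 2)


def success_rate(questions: list[Question]) -> float:
    return sum(is_successful(q) for q in questions) / len(questions) if questions else 0.0

# The reported result is a comparison of this rate between the control and
# Question Assistant variants, which showed a consistent +12% lift.
```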
## LLMOps Insights and Tradeoffs

This case study offers several important lessons for LLMOps practitioners. The most significant is the recognition that pure LLM approaches may fail for tasks requiring consistent, reliable classification, especially when the ground truth is inherently subjective or undefined. Stack Overflow's willingness to pivot from a pure LLM approach to a hybrid architecture demonstrates mature engineering judgment—they used the right tool for each part of the problem rather than forcing an LLM to handle everything.

The hybrid architecture provides important tradeoffs. Traditional ML models (logistic regression) offer reliability, interpretability, and consistency for classification tasks where sufficient training data exists from past human decisions. The LLM component provides flexibility and natural language generation capabilities to make feedback specific and contextual rather than templated. This division of labor plays to each technology's strengths while mitigating weaknesses.

The case study also highlights the importance of proper evaluation methodology in production LLM systems. The team's discovery that their original success metrics weren't improving, but alternative metrics showed meaningful impact, demonstrates the value of comprehensive instrumentation and willingness to examine results from multiple angles. In many organizations, the project might have been canceled when approval rates and review times didn't improve, but Stack Overflow's data-driven approach revealed the actual value being delivered.

The use of existing data sources—historical reviewer comments, close reasons, and comment templates—as training data for the indicator models is an excellent example of leveraging domain-specific knowledge and past human judgments. This approach is likely more reliable than attempting to create new labeled datasets through surveys, as their failed ground truth experiment demonstrated.

The production infrastructure choices reflect pragmatic LLMOps patterns: using managed services (Azure Databricks) for training, containerized deployment (Kubernetes) for serving, centralized model storage (Unity Catalog), and comprehensive observability (Event Hub, Datadog). These choices balance operational complexity with scalability and maintainability requirements.

## Limitations and Context

While the case study reports positive results, the claims should be evaluated critically. The 12% improvement in success rates is meaningful but relatively modest, and the system didn't achieve its original goals of faster reviews or higher approval rates. The feedback quality relies on Gemini's capabilities, which aren't detailed extensively—we don't know about prompt engineering specifics, token costs, latency, or failure modes in production.

The case study also doesn't discuss important operational considerations like monitoring for drift in the logistic regression models, how the system handles Gemini API failures or rate limits, the costs of running predictions at scale, or how harmful or incorrect feedback is prevented. These are critical concerns for any production LLM system.

The generalizability of this approach is also worth considering. Stack Overflow has unique advantages: nearly two decades of historical data on question quality, clear community guidelines for what constitutes a good question, and a large corpus of human reviewer feedback to train on. Organizations without similar resources might struggle to replicate this approach.

## Future Directions

The team indicates that Community Product teams are exploring ways to iterate on the indicator models and further optimize the question-asking experience. This suggests ongoing investment in the hybrid approach rather than returning to pure LLM solutions. Potential improvements might include adding more indicator categories beyond "context and background," refining the models as more feedback data accumulates, or personalizing feedback based on asker experience level.

Overall, this case study represents a mature approach to production LLM deployment that recognizes both the capabilities and limitations of generative AI, combines it appropriately with traditional techniques, and uses rigorous experimentation to validate impact. The willingness to pivot when initial approaches failed and to recognize value in unexpected metrics demonstrates the kind of pragmatic engineering judgment necessary for successful LLMOps.
