Stack Overflow: Building AI Products at Stack Overflow: From Conversational Search to Technical Benchmarking

Company

Stack Overflow

Title

Building AI Products at Stack Overflow: From Conversational Search to Technical Benchmarking

Industry

Tech

Link

https://www.youtube.com/watch?v=6tn8qLDvLNg

Year

2025

Summary (short)

Stack Overflow faced a significant disruption when ChatGPT launched in late 2022, as developers began changing their workflows and asking AI tools questions that would traditionally be posted on Stack Overflow. In response, the company formed an "Overflow AI" team to explore how AI could enhance their products and create new revenue streams. The team pursued two main initiatives: first, developing a conversational search feature that evolved through multiple iterations from basic keyword search to semantic search with RAG, ultimately being rolled back due to insufficient accuracy (below 70%) for developer expectations; and second, creating a data licensing business that involved fine-tuning models with Stack Overflow's corpus and developing technical benchmarks to demonstrate improved model performance. The initiatives showcased rapid iteration, customer-focused evaluation methods, and ultimately led to a new revenue stream while strengthening Stack Overflow's position in the AI era.

## Overview

Stack Overflow's journey into production AI systems began in late 2022, right as ChatGPT launched and fundamentally disrupted the developer Q&A landscape. Ellen Brandenburgger, who led product management at Stack Overflow's data licensing team, describes joining the company just two weeks before ChatGPT's release, a timing that meant she experienced the "before and after" of AI's impact on one of the most beloved developer platforms. Stack Overflow, known as the primary destination where developers seek answers to technical questions (with over 90% of developers using it weekly), suddenly faced a shift in user behavior as some questions began being directed to AI tools instead.

Rather than viewing this as purely catastrophic, Stack Overflow's leadership responded by forming a dedicated team called "Overflow AI" tasked with understanding what was "just now possible" with this new technology. The team's mandate was to explore jobs-to-be-done within their products that AI might help unlock, while leveraging Stack Overflow's core strengths: a rich corpus of 14+ million technical questions and answers, strong community engagement, and a trusted brand among developers.

## First Initiative: Conversational Search

### Problem Context and Initial Approach

The first major AI product effort focused on what the team called "conversational search" or a conversational AI feature. The impetus came from understanding user workflows: most developers actually discover Stack Overflow content through Google searches rather than directly navigating to the site. They land on a specific question page, but if that answer doesn't quite match their context, the experience breaks down. Users might try searching again with keywords, but Stack Overflow's search interface wasn't optimized for this, and alternative paths (clicking related questions, etc.) weren't solving the problem well.
Additionally, developers highlighted a speed problem: asking a question on Stack Overflow meant waiting days for community responses, whereas ChatGPT provided immediate answers, even if they were sometimes less accurate. The team identified they needed to solve for both answer quality and response speed.

The team also recognized contextual challenges inherent in technical Q&A. An answer that's correct for one version of Python might not be correct for a later one. Stack Overflow's reputation system tended to favor older answers that had accumulated more votes over time, but sometimes more recent answers were more relevant. These context and recency issues presented opportunities where AI could help identify the right answer for the right situation.

### Iterative Development Process

What's particularly instructive about Stack Overflow's approach is how they embraced rapid iteration and continuous learning. They organized approximately five different product-engineering-design pods, each focused on different outcome areas (user activation, retention, enterprise features). Teams met weekly to share experiments, creating a culture where failing fast was celebrated. Teams would record demos that got shared across Slack, generating organizational excitement even for experiments that didn't pan out.

Ellen emphasizes that her team wasn't initially expert in any of these AI technologies. They held "lunch and learns" on Fridays where team members would share articles about new technologies and discuss potential applications. This reflects an important principle: teams don't need to master the technology before adopting it. They can learn as they go, taking "one bite of the apple at a time" as Ellen puts it.

### Technical Evolution Through Four Versions

**Version 1: Chatbot on Keyword Search**

The first implementation was remarkably simple: they put a conversational chat interface on top of Stack Overflow's existing lexical (keyword-based) search engine.
The team consisted of a PM, designer, and a very engaged tech lead who was himself a developer and Stack Overflow user. They leveraged existing internal services and search endpoints to quickly prototype. Predictably, this version produced poor results: asking questions conversationally to a keyword search engine yielded answers no better than the regular search box.

**Version 2: Semantic Search**

The team's next insight came from working with Stack Overflow's platform teams: they needed to move from lexical to semantic search. Semantic search allows for more human-like, conversational queries rather than requiring precise keywords. This improved results, allowing users to pose actual questions instead of keyword strings. However, a critical limitation remained: the system could only pull from Stack Overflow's existing corpus, and whenever a question veered outside that technical knowledge base, the product still failed.

**Version 3: Hybrid Approach with External LLM**

The third iteration addressed the knowledge gap by combining semantic search over Stack Overflow's corpus with a fallback to external models (GPT-4 was mentioned as the "premier model" in early 2023). The user experience resembled early 2020s conversational chat interfaces like Intercom. When Stack Overflow's corpus didn't have a good answer, the system would kick the query out to GPT-4.

**Version 4: Adding RAG for Attribution**

The fourth version introduced Retrieval Augmented Generation (RAG) specifically to solve a trust problem. Stack Overflow's annual developer survey had identified what they called the "trust gap": as AI usage increased, developer trust in AI actually decreased. The team found that attribution was critical: developers needed to know where answers came from, whether from community responses or from AI models.
Since answers typically blended multiple sources (various Stack Overflow questions and AI-generated content), they needed proportional attribution rather than just holistic sourcing. Implementing RAG allowed the system to retrieve specific bits of content from the vectorized semantic search layer and then reaggregate it, maintaining links to sources throughout. This provided the transparency developers required to trust the answers. While the initial thinking was that RAG would primarily enable attribution, the team recognized it could also improve answer quality, a dual benefit Stack Overflow has continued to pursue.

### Evaluation Methodology

Stack Overflow's evaluation approach for conversational search provides a model for teams building AI products. They started remarkably simply: Google spreadsheets with straightforward percentage calculations. They didn't have sophisticated tooling initially; they built up their evaluation practices iteratively, just like the product itself.

The team launched a limited beta with approximately 1,000 users who were existing Stack Overflow community members. Importantly, they selected users across diverse technical domains, from data science and analytics to front-end engineering to DevOps, because Stack Overflow's corpus spans this breadth and they needed representative coverage.

Their evaluation methodology included several components:

**Subject Matter Expert Ratings**: They asked domain experts to pose questions and rate the answers as correct or incorrect. They logged questions and LLM responses in spreadsheets, tracked user ratings, and calculated simple percentages. Critically, they grouped results by Stack Overflow "tags" (essentially programming languages or areas of expertise like Python, DevOps, etc.) to understand where the system performed well versus poorly.
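
The spreadsheet-style calculation described above (percent of answers rated correct, grouped by tag) can be sketched in a few lines. This is a minimal illustration, not Stack Overflow's tooling: the `ratings` rows, the field names `tag` and `correct`, and the function `accuracy_by_tag` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical rating log: one row per answer rated by a subject matter expert.
# Field names and sample rows are illustrative assumptions, not real data.
ratings = [
    {"tag": "python", "correct": True},
    {"tag": "python", "correct": False},
    {"tag": "python", "correct": True},
    {"tag": "devops", "correct": False},
    {"tag": "devops", "correct": True},
    {"tag": "javascript", "correct": True},
]


def accuracy_by_tag(rows):
    """Return {tag: percent of answers rated correct} for the given rows."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for row in rows:
        totals[row["tag"]] += 1
        if row["correct"]:
            correct[row["tag"]] += 1
    return {tag: 100.0 * correct[tag] / totals[tag] for tag in totals}


per_tag = accuracy_by_tag(ratings)

# Print weakest tags first so problem areas surface at the top.
for tag, pct in sorted(per_tag.items(), key=lambda kv: kv[1]):
    print(f"{tag:12s} {pct:5.1f}%")
```

Sorting by ascending accuracy is a design choice for this sketch: it puts the domains where the system performs worst at the top, which is where follow-up error analysis would likely start.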

**Qualitative Research**: A researcher on the team conducted interviews with users who gave high ratings and low ratings, comparing themes between the groups to understand what differentiated good answers from poor ones.

**Log Analysis**: Beyond just evaluating correctness, they examined logs for trust and safety issues, given Stack Overflow's scale (millions of daily users) and need for strong moderation. They looked for patterns that breached acceptable answer guardrails.

**Refined Metrics**: Through error analysis, the team realized that simple "correct/incorrect" ratings were insufficient. They broke evaluation into three dimensions:

- **Accuracy**: Is the answer technically correct?
- **Relevance**: Is it appropriate for this specific context (e.g., the right version of a programming language)?
- **Completeness**: Does it provide all necessary information without extraneous details?

This granular breakdown helped them better understand different types of failures and where improvements were needed. Ellen's background as a qualitative researcher proved invaluable here: she applied research methods for categorizing and finding patterns in unstructured data to the AI evaluation challenge.

### The Decision to Roll Back

Despite all the iteration and learning, conversational search never achieved sufficient accuracy. Even after four versions and refined evaluation metrics, the system wasn't reaching 70% accuracy, well below the standards developers expect from technical answers. The team assessed the ROI of continuing to extend the feature and improve accuracy further, ultimately deciding there were "bigger, better opportunities to pursue."

This decision to roll back a feature after substantial investment is noteworthy and praiseworthy. Many products ship AI features that don't truly work, degrading user experience. Stack Overflow demonstrated discipline in recognizing when something wasn't meeting their quality bar and pulling it back.
Importantly, the tech lead on the project later reflected that going through this experience was necessary to reach their eventual success; the learning wasn't wasted.

## Second Initiative: Data Licensing and Technical Benchmarking

### Market Opportunity Recognition

Moving forward 6-9 months to late 2023/early 2024, Stack Overflow identified a different AI opportunity driven by inbound demand. Their instrumentation showed attempted scraping of their site was "off the charts": AI providers were trying to access Stack Overflow's Q&A data for their models. Moreover, some providers were approaching Stack Overflow directly asking to purchase or license access to the data.

This created what Ellen describes as "a gift from heaven" for a product person: customers proactively asking to buy something before you've even built a formal product around it.

### Strategic Positioning

Stack Overflow faced a strategic question: would licensing data to AI labs cannibalize their core business? The company's approach centered on creating "virtuous cycles of community engagement." The thesis was that foundation models didn't actually have the level of accuracy being advertised in the market, particularly for technical Q&A. Developer expectations remained higher than model reality.

Stack Overflow's strategy involved not just licensing data, but also working with partners on product integrations that would drive traffic or engagement back to Stack Overflow, restarting the knowledge creation cycle. This could take enterprise-facing forms or developer tools integrations, but the principle remained: license data in ways that ultimately strengthen rather than weaken Stack Overflow's community ecosystem.

### Building Technical Benchmarks

Ellen's team developed a novel approach: creating benchmarks that could demonstrate whether and how much Stack Overflow data improved model performance on technical questions.
This served dual purposes: validating the value proposition for customers while also providing Stack Overflow with tools to evaluate different models.

**Initial Proof of Concept**: Working with Prosus (Stack Overflow's parent company) and their AI lab, Ellen's team took an open-source model (Llama 3) and fine-tuned it with Stack Overflow data. Fine-tuning here means actual training that produces a model with different weights, not RAG or prompt engineering. They could then measure whether model performance improved across their three key dimensions (accuracy, relevance, completeness) when Stack Overflow knowledge was added. The answer was yes: all three metrics improved.

**Comparative Evaluation**: The team then used this benchmark to evaluate various third-party models (those not including Stack Overflow data) to understand their relative performance on technical questions. This provided Stack Overflow with a tool for assessing the landscape and identifying which models might benefit most from their data.

**Addressing Data Leakage Concerns**: A critical technical challenge was preventing data leakage, the situation where benchmark test data inadvertently appears in training data, artificially inflating performance metrics. This was especially sensitive because Stack Overflow was both creating the benchmark and had a business interest in models performing better with their data.
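
To make the leakage definition concrete, here is a minimal sketch of one simple check: flagging benchmark questions whose normalized text also appears in a training set. This is an illustrative assumption, not the method the team describes; the function names `fingerprint` and `find_leaks` and the sample strings are hypothetical, and an exact-match check like this only catches near-verbatim duplicates, not paraphrases.

```python
import hashlib
import re


def fingerprint(text):
    """Hash the lowercased text with whitespace collapsed, so trivial
    formatting differences do not hide a duplicate."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_leaks(benchmark_questions, training_texts):
    """Return the benchmark questions that also occur in the training texts."""
    seen_in_training = {fingerprint(t) for t in training_texts}
    return [q for q in benchmark_questions if fingerprint(q) in seen_in_training]


# Hypothetical sample data for illustration only.
training_texts = [
    "How do I reverse a list in Python?",
    "What is a Kubernetes pod?",
]
benchmark_questions = [
    "how do I  reverse a list in python?",  # same question, different casing/spacing
    "How do I pin a Terraform provider version?",
]

leaked = find_leaks(benchmark_questions, training_texts)
print(f"{len(leaked)} of {len(benchmark_questions)} benchmark questions overlap with training data")
```

A check of this kind is only useful when the training data is available for inspection, which is rarely the case for third-party models.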

Their approach to preventing leakage included:

- Using questions and answers not publicly available to any model being tested, ensuring all models encountered the data for the first time during benchmarking
- Leveraging content that had been removed from the Stack Overflow site or existed in "walled gardens" (like their enterprise product Stack Overflow for Teams)
- Providing full transparency by sharing their dataset and research papers so data scientists could examine their methods and poke holes in the approach

**Benchmark Design and Validation**: The benchmark operated on approximately 250 questions that were refreshed monthly to keep it "living and breathing." As new content emerged or ratings changed, they incorporated new material and adjusted weights to ensure the benchmark stayed current. This regular refresh also helped prevent overfitting, as models might otherwise optimize specifically for a static benchmark.

Quality assurance involved three layers:

- A "golden dataset" from Stack Overflow's curated content
- Subject matter experts (10-15 people) validating AI-generated answers across different domains
- Automated checks as a third validation layer

For determining which questions to include, Ellen used relatively straightforward quality indicators from Stack Overflow's platform:

- User vote scores on answers and questions
- Presence of an "accepted answer" (validated by a community expert)
- Comment activity (used more technically than socially on Stack Overflow)
- Page view counts
- Growth rate of page views (as an indicator of emerging relevance)

Questions needed to meet thresholds across multiple indicators to qualify for the benchmark. This approach succeeded because Stack Overflow's existing community signals, built over years, naturally identified high-quality, representative technical content.

### Business Model Validation

The data licensing business validated Stack Overflow's strategic positioning in several ways.
Multiple potential customers told the Stack Overflow team that if someone had set out 15 years ago to create the perfect dataset for training LLMs on code, it would look like Stack Overflow's corpus. The community norms, voting systems, expert validation, and breadth of technical coverage that made Stack Overflow valuable to developers also made it exceptionally valuable for training AI models.

This created an interesting future possibility: as LLMs consume the entire internet and face data scarcity, Stack Overflow's model of human-driven knowledge creation with strong quality signals might represent a path forward for continued AI improvement. The company potentially sits at the intersection of where high-quality human knowledge contribution meets AI training needs.

## Key LLMOps Principles and Lessons

### Embracing Iterative Learning

Both initiatives showcase the power of rapid iteration and learning in the face of uncertainty. Stack Overflow didn't try to design the perfect AI product upfront. They ran experiments, shared learnings weekly, celebrated both successes and failures publicly, and continuously evolved their approach based on what they discovered. The conversational search feature went through at least four distinct technical implementations, each addressing specific limitations discovered in the previous version.

Ellen's advice to "time box your learning and experimentation" reflects this principle. Rather than trying to master all AI technologies before starting, teams should focus on taking manageable bites (what can we learn this week?) and build knowledge incrementally.

### Starting Simple and Adding Complexity Judiciously

The conversational search evolution demonstrates starting with the simplest possible implementation (chat UI on keyword search) and only adding technical complexity when specific problems demanded it. Teams shouldn't jump immediately to sophisticated techniques like RAG, fine-tuning, or multi-agent systems.
Start with what you can build today, identify where it breaks, and apply the right technology to address that specific failure mode.

This approach also helps teams learn the technologies gradually rather than being overwhelmed. Stack Overflow's team learned about semantic search, RAG, and fine-tuning in sequence as each became necessary, not all at once.

### Evaluation Must Be Use-Case Specific

While vendor-provided evaluation metrics can provide starting templates, truly understanding product quality requires custom evaluation tailored to your specific use case. Stack Overflow's realization that they needed to separate accuracy, relevance, and completeness (three dimensions that might initially seem similar) proved critical to understanding their product's performance.

Furthermore, their evaluation combined multiple approaches:

- Quantitative metrics (percentage correctness) broken down by domain
- Qualitative research to understand why users rated answers as good or poor
- Log analysis for trust and safety issues
- Subject matter expert validation

No single evaluation method suffices for production AI systems. Teams need multiple lenses to truly understand quality and trust.

### Human Evaluation Remains Essential

Despite the appeal of fully automated evaluation, both Stack Overflow initiatives relied heavily on human judgment. Subject matter experts validated answers, researchers conducted qualitative interviews, and domain experts helped design benchmarks. Ellen explicitly noted she wasn't a software engineer and couldn't judge answer correctness herself; finding users who could was a necessity, not a luxury.

Human evaluation serves multiple purposes: it provides ground truth for automated systems, it uncovers nuanced quality dimensions machines might miss, and it helps teams understand the "why" behind successes and failures, not just the "what."

### The Importance of Domain Expertise and Existing Assets

Stack Overflow's AI initiatives succeeded in part because they leveraged existing assets: community-validated Q&A content, quality signals from voting and engagement, domain diversity, and an engaged user base willing to provide feedback. Teams should inventory their existing assets (whether data, domain expertise, user relationships, or platform capabilities) and consider how these can strengthen AI product development rather than starting from scratch.

Ellen noted that Stack Overflow was essentially "created for LLMs" even though it preceded them by 15 years. The same factors that made it valuable to developers (validated knowledge, diverse expertise, strong quality signals) made it perfect for training AI systems.

### Managing Non-Determinism and Probabilistic Thinking

Ellen's closing insight addresses a fundamental shift in product development: teams must now "think about building products and evaluating products in probabilities rather than certainties." Everything becomes non-deterministic: what the product will do, how users will interact with it, what outcomes it will drive. Getting comfortable with ranges of outcomes rather than single certainties represents perhaps the biggest mindset shift product managers and engineers must make when working with LLMs in production.

This affects everything from how teams set success criteria to how they communicate expectations to stakeholders to how they design user experiences that account for variable quality.

### The Value of Disciplined Product Decisions

Perhaps the most underappreciated lesson is Stack Overflow's willingness to roll back conversational search despite substantial investment. In an era where many companies ship AI features that don't truly work, degrading user experience in the name of appearing "AI-forward," Stack Overflow demonstrated discipline in recognizing when something didn't meet their quality bar.
This decision was enabled by clear metrics (sub-70% accuracy against developer expectations) and honest ROI assessment. The team could weigh the resources required for further improvement against other opportunities. Importantly, leadership recognized the learning value even in the "failed" product: it set foundations for subsequent success.

### Trust and Transparency in Developer Tools

The "trust gap" Stack Overflow identified (rising AI usage coupled with declining trust) highlights why transparency matters particularly for technical audiences. Developers demand to understand where answers come from, how systems work, and what limitations exist. This drove the RAG implementation for attribution even before considering RAG for answer quality improvement.

Product teams building for technical users should expect higher skepticism and greater demands for explainability than they might encounter in consumer contexts. This isn't a barrier but rather an opportunity: solving for transparency and trust can differentiate products in ways pure accuracy improvements cannot.

Stack Overflow's journey from disruption to new revenue streams illustrates how companies can respond to AI-driven market changes not just defensively but by finding new value creation opportunities. The initiatives showcase mature LLMOps practices even as the team was learning: rapid iteration, customer-focused evaluation, human-in-the-loop validation, technical sophistication applied judiciously, and disciplined product decisions. Most importantly, they demonstrate that teams don't need to be experts before starting. They can learn as they go, taking one bite of the apple at a time, building both products and expertise iteratively.
