ZenML

Building AI Products at Stack Overflow: From Conversational Search to Technical Benchmarking

Stack Overflow 2025

Stack Overflow faced a significant disruption when ChatGPT launched in late 2022, as developers began changing their workflows and asking AI tools questions that would traditionally be posted on Stack Overflow. In response, the company formed an "Overflow AI" team to explore how AI could enhance their products and create new revenue streams. The team pursued two main initiatives. The first was a conversational search feature that evolved through multiple iterations, from basic keyword search to semantic search with RAG, and was ultimately rolled back because its accuracy (below 70%) fell short of developer expectations. The second was a data licensing business that involved fine-tuning models with Stack Overflow's corpus and developing technical benchmarks to demonstrate improved model performance. Together, the initiatives showcased rapid iteration and customer-focused evaluation methods, and ultimately led to a new revenue stream while strengthening Stack Overflow's position in the AI era.

Industry

Tech

Overview

Stack Overflow’s journey into production AI systems began in late 2022, right as ChatGPT launched and fundamentally disrupted the developer Q&A landscape. Ellen Brandenburgger, who led product management at Stack Overflow’s data licensing team, describes joining the company just two weeks before ChatGPT’s release—a timing that meant she experienced the “before and after” of AI’s impact on one of the most beloved developer platforms. Stack Overflow, known as the primary destination where developers seek answers to technical questions (with over 90% of developers using it weekly), suddenly faced a shift in user behavior as some questions began being directed to AI tools instead.

Rather than viewing this as purely catastrophic, Stack Overflow’s leadership responded by forming a dedicated team called “Overflow AI” tasked with understanding what was “just now possible” with this new technology. The team’s mandate was to explore jobs-to-be-done within their products that AI might help unlock, while leveraging Stack Overflow’s core strengths: a rich corpus of 14+ million technical questions and answers, strong community engagement, and a trusted brand among developers.

Problem Context and Initial Approach

The first major AI product effort focused on what the team called “conversational search” or a conversational AI feature. The impetus came from understanding user workflows: most developers actually discover Stack Overflow content through Google searches rather than directly navigating to the site. They land on a specific question page, but if that answer doesn’t quite match their context, the experience breaks down. Users might try searching again with keywords, but Stack Overflow’s search interface wasn’t optimized for this, and alternative paths (clicking related questions, etc.) weren’t solving the problem well.

Additionally, developers highlighted a speed problem: asking a question on Stack Overflow meant waiting days for community responses, whereas ChatGPT provided immediate answers, even if they were sometimes less accurate. The team concluded that they needed to solve for both answer quality and response speed.

The team also recognized contextual challenges inherent in technical Q&A. An answer that’s correct for Python 2 might not be correct for Python 3. Stack Overflow’s reputation system tended to favor older answers that had accumulated more votes over time, but sometimes more recent answers were more relevant. These context and recency issues presented opportunities where AI could help identify the right answer for the right situation.

Iterative Development Process

What’s particularly instructive about Stack Overflow’s approach is how they embraced rapid iteration and continuous learning. They organized approximately five different product-engineering-design pods, each focused on different outcome areas (user activation, retention, enterprise features). Teams met weekly to share experiments, creating a culture where failing fast was celebrated. Teams would record demos that got shared across Slack, generating organizational excitement even for experiments that didn’t pan out.

Ellen emphasizes that her team wasn’t initially expert in any of these AI technologies. They held “lunch and learns” on Fridays where team members would share articles about new technologies and discuss potential applications. This reflects an important principle: teams don’t need to master the technology before adopting it—they can learn as they go, taking “one bite of the apple at a time” as Ellen puts it.

Technical Evolution Through Four Versions

Version 1: Chatbot on Keyword Search
The first implementation was remarkably simple: they put a conversational chat interface on top of Stack Overflow’s existing lexical (keyword-based) search engine. The team consisted of a PM, designer, and a very engaged tech lead who was himself a developer and Stack Overflow user. They leveraged existing internal services and search endpoints to quickly prototype. Predictably, this version produced poor results—asking questions conversationally to a keyword search engine yielded answers no better than the regular search box.

Version 2: Semantic Search
The team’s next insight came from working with Stack Overflow’s platform teams: they needed to move from lexical to semantic search. Semantic search allows for more human-like, conversational queries rather than requiring precise keywords. This improved results, allowing users to pose actual questions instead of keyword strings. However, a critical limitation remained: the system could only pull from Stack Overflow’s existing corpus, and whenever a question veered outside that technical knowledge base, the product still failed.

Version 3: Hybrid Approach with External LLM
The third iteration addressed the knowledge gap by combining semantic search over Stack Overflow’s corpus with a fallback to external models (GPT-4 was mentioned as the “premier model” in early 2023). The user experience resembled early 2020s conversational chat interfaces like Intercom. When Stack Overflow’s corpus didn’t have a good answer, the system would kick the query out to GPT-4.
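
The fallback behavior described for Version 3 can be illustrated with a minimal sketch. It assumes the corpus is searched by cosine similarity over embeddings and that a fixed similarity threshold decides when to hand the question to an external model. The threshold value, the `(post_id, embedding, text)` tuple shape, the two-dimensional toy vectors, and the `stub_llm` callable are all hypothetical; the source does not describe how Stack Overflow made this routing decision.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer(query: str, query_vec: Vector, corpus: list,
           external_llm: Callable[[str], str], min_score: float = 0.75) -> dict:
    """Return the best corpus match, or fall back to the external model.

    corpus is a list of (post_id, embedding, text) tuples (hypothetical shape).
    min_score is an assumed cutoff, not a value from the source.
    """
    best = max(corpus, key=lambda item: cosine(query_vec, item[1]), default=None)
    if best is not None and cosine(query_vec, best[1]) >= min_score:
        return {"source": "corpus", "post_id": best[0], "text": best[2]}
    # No sufficiently similar post: pass the raw question to the external model.
    return {"source": "external_llm", "post_id": None, "text": external_llm(query)}


# Toy data: two-dimensional vectors stand in for real embeddings.
corpus = [
    ("post-1", [1.0, 0.0], "Use a list comprehension."),
    ("post-2", [0.0, 1.0], "Set the PATH variable."),
]
stub_llm = lambda q: f"(model-generated answer to: {q})"  # stand-in, no real API call

in_corpus = answer("filter a list", [0.9, 0.1], corpus, stub_llm)
out_of_corpus = answer("unrelated topic", [0.7, 0.7], corpus, stub_llm)
```

Recording the `source` field on every response is what makes the later attribution work possible: the caller always knows whether an answer came from community content or from the model.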

Version 4: Adding RAG for Attribution
The fourth version introduced Retrieval Augmented Generation (RAG) specifically to solve a trust problem. Stack Overflow’s annual developer survey had identified what they called the “trust gap”—as AI usage increased, developer trust in AI actually decreased. The team found that attribution was critical: developers needed to know where answers came from, whether from community responses or from AI models. Since answers typically blended multiple sources (various Stack Overflow questions and AI-generated content), they needed proportional attribution rather than just holistic sourcing.

Implementing RAG allowed the system to retrieve specific bits of content from the vectorized semantic search layer and then reaggregate it, maintaining links to sources throughout. This provided the transparency developers required to trust the answers. While the initial thinking was that RAG would primarily enable attribution, the team recognized it could also improve answer quality—a dual benefit Stack Overflow has continued to pursue.
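
One way to picture source-preserving retrieval and proportional attribution is the sketch below. It assumes each retrieved passage carries a source identifier into the prompt, and that the final answer can be split into segments tagged with the source they came from (or `"model"` for generated text). Measuring each source's share by word count is an illustrative choice only, and the identifiers such as `so:101` are made up; the source does not say how Stack Overflow computed proportions.

```python
from collections import defaultdict


def build_prompt(question: str, passages: list) -> str:
    """Number each retrieved passage and keep its source id next to the text."""
    lines = [
        f"[{i}] ({p['source_id']}) {p['text']}"
        for i, p in enumerate(passages, start=1)
    ]
    return (
        "Answer using the numbered sources and cite them as [n].\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}"
    )


def attribution_shares(segments: list) -> dict:
    """Share of the answer (by word count) attributed to each source.

    segments is a list of (text, source_id) pairs; source_id is "model"
    for content that did not come from a retrieved passage.
    """
    totals = defaultdict(int)
    for text, source_id in segments:
        totals[source_id] += len(text.split())
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {source_id: n / grand_total for source_id, n in totals.items()}


# Hypothetical passages and answer segments.
passages = [
    {"source_id": "so:101", "text": "Create a virtual environment first."},
    {"source_id": "so:202", "text": "Pin dependency versions."},
]
prompt = build_prompt("How do I isolate dependencies?", passages)

segments = [("use venv", "so:101"), ("pin", "so:202"), ("rebuild", "model")]
shares = attribution_shares(segments)
```

With the toy segments above, half of the answer is attributed to the first post and a quarter each to the second post and the model.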

Evaluation Methodology

Stack Overflow’s evaluation approach for conversational search provides a model for teams building AI products. They started remarkably simply: Google spreadsheets with straightforward percentage calculations. They didn’t have sophisticated tooling initially—they built up their evaluation practices iteratively, just like the product itself.

The team launched a limited beta with approximately 1,000 users who were existing Stack Overflow community members. Importantly, they selected users across diverse technical domains—from data science and analytics to front-end engineering to DevOps—because Stack Overflow’s corpus spans this breadth and they needed representative coverage.

Their evaluation methodology included several components:

Subject Matter Expert Ratings: They asked domain experts to pose questions and rate the answers as correct or incorrect. They logged questions and LLM responses in spreadsheets, tracked user ratings, and calculated simple percentages. Critically, they grouped results by Stack Overflow “tags” (essentially programming languages or areas of expertise like Python, DevOps, etc.) to understand where the system performed well versus poorly.

Qualitative Research: A researcher on the team conducted interviews with users who gave high ratings and low ratings, comparing themes between the groups to understand what differentiated good answers from poor ones.

Log Analysis: Beyond just evaluating correctness, they examined logs for trust and safety issues, given Stack Overflow’s scale (millions of daily users) and need for strong moderation. They looked for patterns that breached acceptable answer guardrails.

Refined Metrics: Through error analysis, the team realized that simple “correct/incorrect” ratings were insufficient. They broke evaluation into three dimensions: accuracy, relevance, and completeness.

This granular breakdown helped them better understand different types of failures and where improvements were needed.

Ellen’s background as a qualitative researcher proved invaluable here—she applied research methods for categorizing and finding patterns in unstructured data to the AI evaluation challenge.
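
The spreadsheet-style calculation described in this section (simple percentages, grouped by tag) can be sketched in a few lines. The sketch assumes each expert rating is a pass/fail judgment on the three dimensions the article names (accuracy, relevance, completeness) plus the tag of the question. The record layout and the sample ratings are hypothetical.

```python
from collections import defaultdict

DIMENSIONS = ("accuracy", "relevance", "completeness")


def pass_rates_by_tag(ratings: list) -> dict:
    """Percentage of answers rated as passing, per tag and per dimension.

    Each rating is a dict with a "tag" key and one boolean per dimension
    (an assumed layout, not the team's actual spreadsheet schema).
    """
    counts = defaultdict(lambda: {"n": 0, **{d: 0 for d in DIMENSIONS}})
    for rating in ratings:
        row = counts[rating["tag"]]
        row["n"] += 1
        for dim in DIMENSIONS:
            row[dim] += 1 if rating[dim] else 0
    return {
        tag: {dim: 100.0 * row[dim] / row["n"] for dim in DIMENSIONS}
        for tag, row in counts.items()
    }


# Made-up ratings for illustration only.
ratings = [
    {"tag": "python", "accuracy": True, "relevance": True, "completeness": False},
    {"tag": "python", "accuracy": True, "relevance": False, "completeness": False},
    {"tag": "devops", "accuracy": True, "relevance": True, "completeness": True},
]
rates = pass_rates_by_tag(ratings)
```

Keeping the three dimensions separate per tag makes it visible when, for example, answers in one area are accurate but incomplete, which a single correct/incorrect column would hide.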

The Decision to Roll Back

Despite all the iteration and learning, conversational search never achieved sufficient accuracy. Even after four versions and refined evaluation metrics, the system wasn’t reaching 70% accuracy—well below the standards developers expect from technical answers. The team assessed the ROI of continuing to extend the feature and improve accuracy further, ultimately deciding there were “bigger, better opportunities to pursue.”

This decision to roll back a feature after substantial investment is noteworthy and praiseworthy. Many products ship AI features that don’t truly work, degrading user experience. Stack Overflow demonstrated discipline in recognizing when something wasn’t meeting their quality bar and pulling it back. Importantly, the tech lead on the project later reflected that going through this experience was necessary to reach their eventual success—the learning wasn’t wasted.

Second Initiative: Data Licensing and Technical Benchmarking

Market Opportunity Recognition

Moving forward 6-9 months to late 2023/early 2024, Stack Overflow identified a different AI opportunity driven by inbound demand. Their instrumentation showed attempted scraping of their site was “off the charts”—AI providers were trying to access Stack Overflow’s Q&A data for their models. Moreover, some providers were approaching Stack Overflow directly asking to purchase or license access to the data.

This created what Ellen describes as “a gift from heaven” for a product person: customers proactively asking to buy something before you’ve even built a formal product around it.

Strategic Positioning

Stack Overflow faced a strategic question: would licensing data to AI labs cannibalize their core business? The company’s approach centered on creating “virtuous cycles of community engagement.” The thesis was that foundation models didn’t actually have the level of accuracy being advertised in the market, particularly for technical Q&A. Developer expectations remained higher than model reality.

Stack Overflow’s strategy involved not just licensing data, but also working with partners on product integrations that would drive traffic or engagement back to Stack Overflow, restarting the knowledge creation cycle. This could take enterprise-facing forms or developer tools integrations, but the principle remained: license data in ways that ultimately strengthen rather than weaken Stack Overflow’s community ecosystem.

Building Technical Benchmarks

Ellen’s team developed a novel approach: creating benchmarks that could demonstrate whether and how much Stack Overflow data improved model performance on technical questions. This served dual purposes—validating the value proposition for customers while also providing Stack Overflow with tools to evaluate different models.

Initial Proof of Concept: Working with Prosus (Stack Overflow’s parent company) and their AI lab, Ellen’s team took an open-source model (Llama 3) and fine-tuned it with Stack Overflow data. Fine-tuning here means actual training that produces a model with different weights, not RAG or prompt engineering. They could then measure whether model performance improved across their three key dimensions (accuracy, relevance, completeness) when Stack Overflow knowledge was added. The answer was yes: all three metrics improved.

Comparative Evaluation: The team then used this benchmark to evaluate various third-party models (those not including Stack Overflow data) to understand their relative performance on technical questions. This provided Stack Overflow with a tool for assessing the landscape and identifying which models might benefit most from their data.

Addressing Data Leakage Concerns: A critical technical challenge was preventing data leakage—the situation where benchmark test data inadvertently appears in training data, artificially inflating performance metrics. This was especially sensitive because Stack Overflow was both creating the benchmark and had a business interest in models performing better with their data.

Their approach to preventing leakage included:

Benchmark Design and Validation: The benchmark operated on approximately 250 questions that were refreshed monthly to keep it “living and breathing.” As new content emerged or ratings changed, they incorporated new material and adjusted weights to ensure the benchmark stayed current. This regular refresh also helped prevent overfitting, since models might otherwise optimize specifically for a static benchmark.

Quality assurance involved three layers:

For determining which questions to include, Ellen used relatively straightforward quality indicators from Stack Overflow’s platform:

Questions needed to meet thresholds across multiple indicators to qualify for the benchmark. This approach succeeded because Stack Overflow’s existing community signals—built over years—naturally identified high-quality, representative technical content.
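
A threshold-based selection of this kind can be sketched as follows. The indicator names (`indicator_a`, `indicator_b`), the threshold values, and the sample records are placeholders, since the actual indicators are not listed here. Excluding questions whose IDs appear in the fine-tuning set is shown as one common guard against leakage; it is an assumption for illustration, not a description of Stack Overflow's method.

```python
def select_benchmark(questions: list, thresholds: dict,
                     training_ids: set, size: int) -> list:
    """Pick up to `size` questions that clear every threshold.

    questions    -- list of dicts with an "id" and one numeric value per indicator
    thresholds   -- minimum value required for each indicator (placeholder names)
    training_ids -- ids used for fine-tuning, held out to avoid leakage
    """
    eligible = [
        q for q in questions
        if q["id"] not in training_ids
        and all(q[name] >= minimum for name, minimum in thresholds.items())
    ]
    return eligible[:size]


# Hypothetical records and thresholds.
questions = [
    {"id": 1, "indicator_a": 10, "indicator_b": 3},
    {"id": 2, "indicator_a": 2, "indicator_b": 9},   # fails indicator_a
    {"id": 3, "indicator_a": 12, "indicator_b": 5},  # present in training data
    {"id": 4, "indicator_a": 20, "indicator_b": 8},
]
selected = select_benchmark(
    questions,
    thresholds={"indicator_a": 5, "indicator_b": 3},
    training_ids={3},
    size=2,
)
selected_ids = [q["id"] for q in selected]
```

Rerunning the same selection on a schedule with fresh data is one simple way to get the periodic refresh the benchmark relied on.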

Business Model Validation

The data licensing business validated Stack Overflow’s strategic positioning in several ways. Multiple potential customers told the Stack Overflow team that if someone had set out 15 years ago to create the perfect dataset for training LLMs on code, it would look like Stack Overflow’s corpus. The community norms, voting systems, expert validation, and breadth of technical coverage that made Stack Overflow valuable to developers also made it exceptionally valuable for training AI models.

This created an interesting future possibility: as LLMs consume the entire internet and face data scarcity, Stack Overflow’s model of human-driven knowledge creation with strong quality signals might represent a path forward for continued AI improvement. The company potentially sits at the intersection of where high-quality human knowledge contribution meets AI training needs.

Key LLMOps Principles and Lessons

Embracing Iterative Learning

Both initiatives showcase the power of rapid iteration and learning in the face of uncertainty. Stack Overflow didn’t try to design the perfect AI product upfront. They ran experiments, shared learnings weekly, celebrated both successes and failures publicly, and continuously evolved their approach based on what they discovered. The conversational search feature went through at least four distinct technical implementations, each addressing specific limitations discovered in the previous version.

Ellen’s advice to “time box your learning and experimentation” reflects this principle. Rather than trying to master all AI technologies before starting, teams should focus on taking manageable bites—what can we learn this week?—and build knowledge incrementally.

Starting Simple and Adding Complexity Judiciously

The conversational search evolution demonstrates starting with the simplest possible implementation (chat UI on keyword search) and only adding technical complexity when specific problems demanded it. Teams shouldn’t jump immediately to sophisticated techniques like RAG, fine-tuning, or multi-agent systems. Start with what you can build today, identify where it breaks, and apply the right technology to address that specific failure mode.

This approach also helps teams learn the technologies gradually rather than being overwhelmed. Stack Overflow’s team learned about semantic search, RAG, and fine-tuning in sequence as each became necessary, not all at once.

Evaluation Must Be Use-Case Specific

While vendor-provided evaluation metrics can provide starting templates, truly understanding product quality requires custom evaluation tailored to your specific use case. Stack Overflow’s realization that they needed to separate accuracy, relevance, and completeness—three dimensions that might initially seem similar—proved critical to understanding their product’s performance.

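Furthermore, their evaluation combined multiple approaches: subject matter expert ratings, qualitative interviews with users, and log analysis for trust and safety issues.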

No single evaluation method suffices for production AI systems. Teams need multiple lenses to truly understand quality and trust.

Human Evaluation Remains Essential

Despite the appeal of fully automated evaluation, both Stack Overflow initiatives relied heavily on human judgment. Subject matter experts validated answers, researchers conducted qualitative interviews, and domain experts helped design benchmarks. Ellen explicitly noted she wasn’t a software engineer and couldn’t judge answer correctness herself—finding users who could was a necessity, not a luxury.

Human evaluation serves multiple purposes: it provides ground truth for automated systems, it uncovers nuanced quality dimensions machines might miss, and it helps teams understand the “why” behind successes and failures, not just the “what.”

The Importance of Domain Expertise and Existing Assets

Stack Overflow’s AI initiatives succeeded in part because they leveraged existing assets: community-validated Q&A content, quality signals from voting and engagement, domain diversity, and an engaged user base willing to provide feedback. Teams should inventory their existing assets—whether data, domain expertise, user relationships, or platform capabilities—and consider how these can strengthen AI product development rather than starting from scratch.

Ellen noted that Stack Overflow was essentially “created for LLMs” even though it preceded them by 15 years. The same factors that made it valuable to developers—validated knowledge, diverse expertise, strong quality signals—made it perfect for training AI systems.

Managing Non-Determinism and Probabilistic Thinking

Ellen’s closing insight addresses a fundamental shift in product development: teams must now “think about building products and evaluating products in probabilities rather than certainties.” Everything becomes non-deterministic—what the product will do, how users will interact with it, what outcomes it will drive.

Getting comfortable with ranges of outcomes rather than single certainties represents perhaps the biggest mindset shift product managers and engineers must make when working with LLMs in production. This affects everything from how teams set success criteria to how they communicate expectations to stakeholders to how they design user experiences that account for variable quality.
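
To make "ranges rather than certainties" concrete, a measured pass rate can be reported with a confidence interval instead of a single number. The sketch below uses the standard Wilson score interval for a binomial proportion. The counts (65 correct out of 100 rated answers) are invented for illustration and are not figures from Stack Overflow's evaluation.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical sample: 65 of 100 answers rated correct.
low, high = wilson_interval(65, 100)
clears_bar = low >= 0.70          # is the whole interval above a 70% bar?
misses_bar = high < 0.70          # is the whole interval below it?
```

With a sample this small, the interval spans the 70% mark, so the data alone would neither confirm nor rule out meeting that bar. Larger samples narrow the range.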

The Value of Disciplined Product Decisions

Perhaps the most underappreciated lesson is Stack Overflow’s willingness to roll back conversational search despite substantial investment. In an era where many companies ship AI features that don’t truly work—degrading user experience in the name of appearing “AI-forward”—Stack Overflow demonstrated discipline in recognizing when something didn’t meet their quality bar.

This decision was enabled by clear metrics (sub-70% accuracy against developer expectations) and honest ROI assessment. The team could weigh the resources required for further improvement against other opportunities. Importantly, leadership recognized the learning value even in the “failed” product—it set foundations for subsequent success.

Trust and Transparency in Developer Tools

The “trust gap” Stack Overflow identified—rising AI usage coupled with declining trust—highlights why transparency matters particularly for technical audiences. Developers demand to understand where answers come from, how systems work, and what limitations exist. This drove the RAG implementation for attribution even before considering RAG for answer quality improvement.

Product teams building for technical users should expect higher skepticism and greater demands for explainability than they might encounter in consumer contexts. This isn’t a barrier but rather an opportunity: solving for transparency and trust can differentiate products in ways pure accuracy improvements cannot.

Stack Overflow’s journey from disruption to new revenue streams illustrates how companies can respond to AI-driven market changes not just defensively but by finding new value creation opportunities. The initiatives showcase mature LLMOps practices even as the team was learning: rapid iteration, customer-focused evaluation, human-in-the-loop validation, technical sophistication applied judiciously, and disciplined product decisions. Most importantly, they demonstrate that teams don’t need to be experts before starting—they can learn as they go, taking one bite of the apple at a time, building both products and expertise iteratively.
