ZenML

Building a Production RAG-Based Slackbot for Developer Support

Vespa 2024

Vespa developed an intelligent Slackbot to handle increasing support queries in their community Slack channel. The solution combines RAG (Retrieval-Augmented Generation) with Vespa's search capabilities and OpenAI, leveraging both past conversations and documentation. The bot features user consent management, feedback mechanisms, and automated user anonymization, while continuously learning from new interactions to improve response quality.

Industry

Tech

Overview

This case study documents a summer internship project at Vespa.ai where two NTNU students built an intelligent Slackbot to handle the growing volume of community questions. The project represents a practical implementation of Retrieval-Augmented Generation (RAG) in a production setting, combining Vespa’s search infrastructure with OpenAI’s language models to create a self-improving community support tool.

The context for this project stems from Vespa.ai’s significant growth in late 2023, when Docker pulls of the vespaengine/vespa image surged from 2 million to 11 million in just a few months. This growth led to a corresponding increase in questions on their community Slack channel, creating an opportunity to leverage the accumulated knowledge in past conversations as a foundation for automated responses.

Technical Architecture

The solution architecture centers on a Slackbot built in Kotlin using the Slack SDK for Java, integrated with a Vespa application for document storage and retrieval, and OpenAI for generating natural language summaries from retrieved context.

Slackbot Implementation

The Slackbot operates on two primary event handlers. The AppMentionEvent handler triggers when users invoke the bot with @Vespa Bot <question>, initiating the RAG pipeline to generate answers. The MessageEvent handler captures all messages (with user consent) for indexing into the Vespa application, enabling the system to continuously expand its knowledge base.
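Before the question reaches the RAG pipeline, the mention handler has to strip the bot's own mention tag from the raw event text, since Slack delivers mentions as `<@USERID> question text`. A minimal sketch of that step, written here in Java for illustration (the class and method names are not from the original code):

```java
// Sketch: extracting the question from an AppMentionEvent payload.
// Slack renders mentions as "<@U12345> question text", so the handler
// strips the leading mention before the question enters the RAG pipeline.
// Names here are illustrative, not taken from the original implementation.
public class MentionParser {
    public static String extractQuestion(String rawText) {
        // Slack user IDs are uppercase alphanumeric strings.
        return rawText.replaceAll("<@[A-Z0-9]+>", "").trim();
    }

    public static void main(String[] args) {
        System.out.println(extractQuestion("<@U0BOT42> How do I configure a hybrid rank profile?"));
        // prints: How do I configure a hybrid rank profile?
    }
}
```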

The bot can run in two modes: Socket Mode for development (where the bot opens an outbound WebSocket connection to Slack, requiring no public endpoint) and HTTP Server mode for production deployment (where Slack delivers events to a publicly reachable endpoint). The implementation includes slash commands for help functionality and reaction handling for gathering user feedback through 👍 and 👎 emoji responses.

Vespa Schema and Document Model

The Vespa application schema defines a slack_message document type with fields for message_id, thread_id, and text. The design captures the hierarchical nature of Slack conversations, where messages belong to threads, enabling grouped retrieval of contextually related messages.

A key feature is the synthesized text_embedding field, a 384-dimensional float tensor that Vespa generates automatically with an embedded embedding model whenever documents are inserted or updated. This embedding enables semantic similarity search alongside traditional keyword-based retrieval. The schema uses the angular distance metric for vector comparison, which is appropriate for normalized embeddings.
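Based on the fields described above and Vespa's documented schema syntax, the schema plausibly looks like the following sketch. The enable-bm25 setting and the exact indexing statements are assumptions; only the field names, the tensor shape, and the angular distance metric come from the source.

```
schema slack_message {
    document slack_message {
        field message_id type string {
            indexing: attribute | summary
        }
        field thread_id type string {
            indexing: attribute | summary
        }
        field text type string {
            indexing: index | summary
            index: enable-bm25
        }
    }

    # Synthesized field: Vespa runs the embedded model over `text`
    # at feed time, so clients never send embeddings themselves.
    field text_embedding type tensor<float>(x[384]) {
        indexing: input text | embed | attribute | index
        attribute {
            distance-metric: angular
        }
    }
}
```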

Hybrid Ranking Strategy

The ranking approach combines semantic search with classical information retrieval, implementing what the team calls a “hybrid2” rank profile. The first-phase ranking expression weights semantic similarity at 70% and scaled BM25 at 30%, reflecting a preference for conceptual similarity while still leveraging exact term matching.

The semantic component uses cosine distance between the query embedding and document embeddings. The BM25 component is normalized through an arctangent-based scaling function that maps the unbounded BM25 scores to a -1 to 1 range, making the two signals comparable. This normalization is crucial for combining different ranking signals effectively.
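The combination described above can be sketched numerically. The exact scaling constants in the real rank profile are not given in the source, so the version below (atan scaled by 2/π, which maps non-negative BM25 scores into [0, 1)) only illustrates the shape of the math:

```java
// Sketch of the first-phase hybrid ranking expression described above.
// The atan-based scaling squashes the unbounded BM25 score into a bounded
// range comparable to the cosine-similarity term. The exact constants in
// the real rank profile may differ.
public class HybridScore {
    static double scaledBm25(double bm25) {
        return 2.0 / Math.PI * Math.atan(bm25);
    }

    static double hybridScore(double cosineSimilarity, double bm25) {
        // 70% semantic similarity, 30% scaled BM25, per the blog.
        return 0.7 * cosineSimilarity + 0.3 * scaledBm25(bm25);
    }

    public static void main(String[] args) {
        // A strong semantic match with a moderate keyword match:
        System.out.println(hybridScore(0.9, 8.0));
    }
}
```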

An additional ranking factor incorporates the ratio of positive to negative reactions (👍 vs 👎), providing a form of implicit human feedback that biases the system toward historically helpful responses. This acts as a lightweight, ranking-time analogue of learning from human feedback, achieved without any model training.
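One natural way to turn reaction counts into a ranking signal is a smoothed ratio. The smoothing constants and weighting below are assumptions; the source does not detail how the original rank profile computes or weights this factor:

```java
// Sketch: folding emoji feedback into a ranking boost. Laplace-style
// smoothing (+1 in the numerator, +2 in the denominator) keeps threads
// with no reactions at a neutral 0.5 and avoids division by zero.
public class FeedbackBoost {
    static double feedbackBoost(int thumbsUp, int thumbsDown) {
        return (thumbsUp + 1.0) / (thumbsUp + thumbsDown + 2.0);
    }

    public static void main(String[] args) {
        System.out.println(feedbackBoost(3, 1));  // well-received thread: boost > 0.5
        System.out.println(feedbackBoost(0, 0));  // no reactions: neutral 0.5
    }
}
```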

Rather than creating a standalone application, the Slackbot extends Vespa’s existing Documentation Search backend (search.vespa.ai), which already provides RAG capabilities over documentation, sample apps, and blog posts. This integration allows the bot to potentially draw on both conversational history and official documentation when answering questions.

The existing system breaks down all documentation into paragraphs, retrieves relevant ones based on queries, and sends them to OpenAI for summarization. The Slack message corpus adds another layer of practical, user-generated knowledge to this foundation.

LLMOps Considerations

The implementation includes several user-centric privacy controls required for production deployment. Users must explicitly consent to having their questions sent to OpenAI, acknowledging the third-party data sharing involved in LLM inference. Additionally, users can opt into having their messages indexed to improve the bot’s capabilities. User anonymization is implemented to protect privacy in the stored conversation data.

Continuous Learning Architecture

The system is designed to improve over time through continuous indexing of new messages. As users interact with the Slack channel, their messages (with consent) are fed into Vespa, expanding the knowledge base. This represents a form of continuous data collection that enhances retrieval quality without requiring model retraining.
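Feeding a consented message into Vespa goes through the /document/v1 HTTP API, where a document put is a JSON body of fields addressed by namespace, document type, and ID. The sketch below shows the shape of such a request; the "slack" namespace and the escaping helper are illustrative, and a real feeder would use an HTTP client and a JSON library. Note that text_embedding is synthesized by Vespa at feed time, so only the raw fields are sent.

```java
// Sketch: building a Vespa /document/v1 put for a consented Slack message.
public class SlackMessageFeed {
    // Minimal JSON string escaping for illustration only.
    static String escapeJson(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    static String documentPutPath(String messageId) {
        return "/document/v1/slack/slack_message/docid/" + messageId;
    }

    static String documentPutBody(String messageId, String threadId, String text) {
        return "{\"fields\": {"
            + "\"message_id\": \"" + escapeJson(messageId) + "\", "
            + "\"thread_id\": \"" + escapeJson(threadId) + "\", "
            + "\"text\": \"" + escapeJson(text) + "\"}}";
    }

    public static void main(String[] args) {
        System.out.println(documentPutPath("m-001"));
        System.out.println(documentPutBody("m-001", "t-42", "How do I tune bm25?"));
    }
}
```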

The feedback mechanism through emoji reactions provides a signal for ranking adjustments, though the blog does not detail whether this feedback is used for more sophisticated model improvement beyond the ranking modifications.

Infrastructure and Deployment

The deployment utilizes Google Cloud Platform (GCP) with Terraform for infrastructure as code (IaC). The team used Spacelift for managing Terraform state and simplifying the deployment process. This reflects standard DevOps practices for LLM application deployment, ensuring reproducible infrastructure and manageable state across environments.

The transition from Socket Mode (development) to HTTP Server mode (production) represents a common pattern in bot development, where local development uses simpler connection methods while production requires proper endpoint exposure and availability.

Limitations and Considerations

It’s worth noting that this case study is primarily a learning narrative from summer interns, so the depth of production hardening and long-term operational insights is limited. The blog focuses more on the development journey than on production metrics, monitoring strategies, or lessons learned from actual usage.

The hybrid ranking weights (70% semantic, 30% BM25) appear to be chosen somewhat arbitrarily, and the blog does not discuss how these were tuned or evaluated. Similarly, the 384-dimensional embedding model choice is not justified beyond following Vespa’s documentation patterns.

The consent and anonymization mechanisms are mentioned but not detailed, leaving questions about implementation specifics and compliance considerations for handling user data in a production LLM pipeline.

Technical Stack Summary

The complete technical stack includes:

- Kotlin with the Slack SDK for Java for the bot itself
- Vespa for document storage, embedding generation, and hybrid retrieval and ranking
- OpenAI for natural-language answer generation
- Google Cloud Platform for hosting, with Terraform for infrastructure as code and Spacelift for state management

This project demonstrates a practical approach to building RAG-powered applications with production considerations including user consent, feedback collection, and cloud deployment, representing valuable LLMOps patterns for community support automation.
