## Overview
Vectorize, a company providing RAG pipeline infrastructure for AI applications, developed an in-product AI assistant to help users navigate their platform without leaving the user interface. This case study provides valuable insights into practical LLMOps challenges around building production RAG systems, particularly focusing on context-aware retrieval, relevance filtering, prompt engineering, and monitoring strategies. The company chose to "eat their own dogfood" by using their own platform to build the assistant, which serves as both a customer support tool and a validation of their product capabilities.
The core problem Vectorize identified was straightforward but significant: despite having comprehensive documentation, users still asked questions about features that were already documented. This wasn't surprising given the human tendency to try solving problems before consulting documentation, but the friction of context-switching from the product UI to a separate documentation site created a barrier to information access. While they had an Intercom chat interface for direct support, this wasn't always immediately available and some users were hesitant to reach out. The solution needed to provide immediate, contextual help without requiring users to leave their workflow.
## Architecture and RAG Pipeline Design
The technical implementation centers on a RAG pipeline that Vectorize built using their own platform. The architecture involves ingesting unstructured content from multiple sources, transforming it into embedding vectors, and storing these in a vector database for similarity search. When a user asks a question, the system retrieves relevant context and provides it to an LLM for response generation.
A particularly noteworthy aspect of their implementation is the **multi-source data ingestion strategy**. Rather than limiting themselves to just their documentation website, they configured their pipeline to pull from three distinct sources: the documentation site (via web crawler connector), Discord (for community discussions and tagged answers), and Intercom (for support ticket responses). This multi-source approach creates what they describe as a "self-improving system" - as support staff answer questions and tag approved answers in Discord or Intercom, that knowledge automatically flows into the vector index. Similarly, when documentation is updated or incorrect information is removed, the vector embeddings are automatically updated in real-time. This continuous data refresh mechanism represents a sophisticated approach to keeping RAG systems current, addressing one of the common challenges in production LLM deployments where knowledge bases quickly become stale.
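As a rough illustration of what this multi-source setup looks like conceptually, the sketch below models the three connectors as a declarative configuration. The connector types and the "only tagged/approved answers" behavior come from the description above; the interface, field names, and URLs are hypothetical, since the actual pipeline is configured inside the Vectorize platform rather than in application code.

```typescript
// Hypothetical sketch of the assistant pipeline's source configuration.
// The connector types mirror the article; everything else is illustrative.
interface SourceConnector {
  type: "web-crawler" | "discord" | "intercom";
  // Only content the team has explicitly approved (e.g. tagged answers)
  // should flow into the vector index.
  filter?: { tags?: string[] };
  config: Record<string, string>;
}

const assistantPipelineSources: SourceConnector[] = [
  { type: "web-crawler", config: { startUrl: "https://docs.example.com" } }, // placeholder URL
  { type: "discord", filter: { tags: ["approved-answer"] }, config: { server: "community" } },
  { type: "intercom", filter: { tags: ["approved-answer"] }, config: { workspace: "support" } },
];
```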
## Context-Sensitive Retrieval
One of the most innovative aspects of this implementation is how Vectorize leveraged the fact that their AI assistant is embedded within the product UI itself. Rather than treating the assistant as a standalone chatbot, they recognized that the user's location within the application provides valuable contextual information about what they're trying to accomplish. This led to their development of a **context-aware retrieval mechanism**.
Each instance of the AiAssistant React component is configured with two key properties: a `contextualQuestion` (a seed question likely to be relevant on that page) and a `topic` (describing what the user is doing on that page). For example, on the page for managing retrieval endpoint tokens, the component is initialized with the question "How to manage retrieval endpoint tokens?" and the topic "Retrieval endpoint token management." The seed question serves two purposes: it demonstrates the capability of the assistant and encourages users to try it by presenting a question they likely have.
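A minimal sketch of how the component might be embedded on the token management page, assuming a standard React/TypeScript setup. The `AiAssistant` name and the `contextualQuestion` and `topic` props come from the article; the import path and surrounding page component are placeholders.

```tsx
// Sketch of embedding the assistant on the retrieval endpoint token page.
// The import path and page component are assumptions, not Vectorize's code.
import { AiAssistant } from "@/components/AiAssistant";

export function RetrievalEndpointTokensPage() {
  return (
    <div>
      {/* ...token management UI... */}
      <AiAssistant
        contextualQuestion="How to manage retrieval endpoint tokens?"
        topic="Retrieval endpoint token management"
      />
    </div>
  );
}
```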
More critically, the `topic` parameter significantly improves retrieval quality. The author discovered that users don't always formulate good questions for semantic search. In their example, a user on the Retrieval Endpoint Token page might simply ask "When does it expire?" without mentioning what "it" refers to. Without additional context, this vague question would produce poor retrieval results. While traditional RAG applications might use query rewriting (sending the question plus chat history to an LLM to reformulate it), Vectorize took a simpler approach: they append the topic to the user's question when calling the retrieval endpoint. The code is a one-line template literal: ``const contextualizedQuestion = `(${topic}) ${question}`;``. This ensures that similarity search considers both the semantic meaning of the question and the contextual topic, dramatically improving result relevance. The author notes this simple technique "really improved the quality of the responses" and prevented the assistant from giving unrelated answers.
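Putting the pieces together, the retrieval step might look like the sketch below. The topic-prefixing line is taken from the article; the endpoint URL handling, request body shape, and authentication details are assumptions for illustration.

```typescript
// The topic-prefixing line is from the article; the rest is an assumed
// request shape for calling a retrieval endpoint.
async function retrieveContext(
  question: string,
  topic: string,
  endpointUrl: string,
  token: string
) {
  // Prepend the page topic so similarity search sees both the question and
  // the context the user is currently working in.
  const contextualizedQuestion = `(${topic}) ${question}`;

  const response = await fetch(endpointUrl, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`, // retrieval endpoint token (assumed auth scheme)
    },
    body: JSON.stringify({
      question: contextualizedQuestion,
      numResults: 5, // assumed value
    }),
  });
  return response.json();
}
```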
## Relevance Filtering with Reranking Models
A critical insight from this case study concerns the limitations of pure vector similarity search. When you request N results from a vector database, you get exactly N results - the most similar ones available, regardless of whether they're actually relevant. The author provides a concrete example: when a user asks about API keys while configuring Elasticsearch, the similarity search returns results about Elasticsearch but also about other integrations (S3, Couchbase) that use API keys. The similarity scores range from 0.71 to 0.74 - virtually indistinguishable. These marginally similar but irrelevant results can confuse the LLM, sometimes causing it to generate responses about S3 or Couchbase API keys when the user only cared about Elasticsearch.
The solution Vectorize implemented was **reranking with relevance scoring**. Their retrieval endpoint has built-in support for passing results through a reranking model when the `rerank` parameter is set to true. Reranking models are specialized systems trained to calculate how relevant a particular result is to a specific question, producing a relevance score. When they applied reranking to the Elasticsearch API key example, the results were striking: the Elasticsearch-related results had relevance scores between 0.91 and 0.97, while the S3 and Couchbase results scored below 0.01. This dramatic difference made it trivial to filter out irrelevant information.
Through experimentation, the author determined that a relevance threshold of 0.5 worked well for their use case - any retrieved data with a relevance score below this threshold is excluded from the context sent to the LLM. This approach provides a much more deterministic way to ensure data quality compared to relying on similarity scores alone. The case study demonstrates that while vector similarity is useful for initial retrieval, reranking is essential for production RAG systems to maintain response quality.
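A minimal sketch of that filtering step is shown below, assuming the retrieval call was made with `rerank` set to true and each returned chunk carries a reranker score; the result field names are hypothetical, while the 0.5 threshold is the one the author settled on.

```typescript
// Sketch of relevance filtering on reranked retrieval results.
// The chunk shape is an assumption; the 0.5 threshold is from the article.
interface RetrievedChunk {
  text: string;
  similarity: number;
  relevancy?: number; // populated when the rerank option is enabled
}

const RELEVANCE_THRESHOLD = 0.5;

function filterRelevantChunks(chunks: RetrievedChunk[]): RetrievedChunk[] {
  // Drop anything the reranker scored below the threshold so marginally
  // similar but off-topic chunks (e.g. S3 or Couchbase docs for an
  // Elasticsearch question) never reach the LLM.
  return chunks.filter((chunk) => (chunk.relevancy ?? 0) >= RELEVANCE_THRESHOLD);
}
```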
## Prompt Engineering and Hallucination Prevention
With high-quality retrieval established, the next challenge was response generation. Vectorize uses the Llama 3.1 70B model hosted on Groq for generation. The choice of a 70B parameter model (rather than a larger one) is notable - the author suggests that with proper relevance filtering, a smaller model performs well, offering better latency and lower costs than larger alternatives. Groq's inference speed (described as "lightning fast") further improves the user experience.
The prompting strategy demonstrates several production LLMOps best practices. First, they incorporate the same `topic` parameter used in retrieval into the system prompt: "Unless the question specifies otherwise, assume the question is related to this general topic: {topic}." This helps keep the LLM focused on the relevant domain. The responses are formatted in Markdown for better presentation in the UI.
Most importantly, they implemented **anti-hallucination prompting** - arguably the most critical aspect of their prompt design. The instruction reads: "This is very important: if there is no relevant information in the texts or there are no available texts, respond with 'I'm sorry, I couldn't find an answer to your question.'" The author explicitly emphasizes that without this instruction (and the additional emphasis phrase "This is very important"), the LLM would fabricate answers to questions about undocumented features. They observed this behavior in their development environment for features still in progress that lacked documentation.
An additional prompt addresses the multi-source nature of their data: "These knowledge bases texts come from various sources, including documentation, support tickets, and discussions on platforms such as Discord. Some of the texts may contain the names of the authors of the resource or a person asking a question. Never include these names in your answer." This prevents the model from inappropriately quoting users or including names that leaked into the source content (like "Thanks for your help, Chris"). These prompt engineering details reflect a mature understanding of how LLMs behave in production and the specific guardrails needed for customer-facing applications.
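A sketch of how these instructions might be assembled into a single system prompt follows. The quoted instructions are taken from the article; the opening framing lines, the ordering, and the way context chunks are appended are assumptions.

```typescript
// Sketch of system prompt assembly. Quoted instructions come from the
// article; the framing lines and template structure are assumptions.
function buildSystemPrompt(topic: string, contextChunks: string[]): string {
  return [
    "You are the in-product assistant for the Vectorize platform.", // assumed framing
    "Answer using only the knowledge base texts provided below. Format your answer in Markdown.", // assumed framing
    `Unless the question specifies otherwise, assume the question is related to this general topic: ${topic}.`,
    "This is very important: if there is no relevant information in the texts or there are no available texts, respond with \"I'm sorry, I couldn't find an answer to your question.\"",
    "These knowledge bases texts come from various sources, including documentation, support tickets, and discussions on platforms such as Discord. Some of the texts may contain the names of the authors of the resource or a person asking a question. Never include these names in your answer.",
    "",
    "Knowledge base texts:",
    ...contextChunks.map((chunk, i) => `[${i + 1}] ${chunk}`),
  ].join("\n");
}
```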
## Monitoring and Continuous Improvement
The case study acknowledges the importance of monitoring but is candid about what they have and haven't implemented. They deployed standard **user feedback mechanisms** - thumbs-up and thumbs-down buttons after each response. These clicks are logged to their analytics system to track user satisfaction with the AI assistant's performance. This represents a basic but essential feedback loop for any production LLM application.
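As a minimal sketch, the feedback hook could look something like the following, assuming a generic analytics client with a `track` method; the event name and logged properties are hypothetical.

```typescript
// Hypothetical feedback logging; the analytics client, event name, and
// properties are assumptions rather than Vectorize's actual setup.
type Feedback = "thumbs_up" | "thumbs_down";

interface AnalyticsClient {
  track(event: string, props: Record<string, unknown>): void;
}

function logAssistantFeedback(
  analytics: AnalyticsClient,
  question: string,
  topic: string,
  feedback: Feedback
): void {
  analytics.track("ai_assistant_feedback", {
    feedback,
    topic, // which page/context the question came from
    question, // useful later for spotting knowledge-base gaps
    timestamp: new Date().toISOString(),
  });
}
```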
More interestingly, the author discusses a monitoring feature they plan to add: **tracking the relevance of retrieval results over time**. While this capability is on the Vectorize product roadmap (and thus not yet implemented in their own assistant), the author emphasizes its importance. This metric would show how well the system retrieves relevant information based on actual user questions. Low relevance scores could indicate two issues: either the need for query rewriting to improve question formulation, or blind spots in the knowledge base itself. For example, if a popular new feature has thin documentation, questions about it would yield poor retrieval results, and declining relevance scores over time would alert the team to this documentation gap. This forward-looking discussion demonstrates awareness that production RAG systems require ongoing monitoring beyond just user satisfaction metrics - you need to understand the health of your retrieval pipeline itself.
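Although this capability is not yet built, a rough sketch of what such tracking could compute from logged retrieval results is shown below; the types, field names, and grouping by topic are all assumptions about how one might implement it.

```typescript
// Not implemented in the assistant today; a sketch of one way to turn
// logged reranker scores into a per-topic relevance trend signal.
interface LoggedRetrieval {
  timestamp: string;
  topic: string;
  relevancyScores: number[]; // reranker scores for the returned chunks
}

// Average of the best relevance score per question, grouped by topic.
// A declining value for a topic would suggest a documentation gap there.
function avgTopRelevanceByTopic(logs: LoggedRetrieval[]): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const log of logs) {
    // Best score per question as a proxy for "did the knowledge base have
    // anything genuinely relevant for this query?"
    const top = log.relevancyScores.length ? Math.max(...log.relevancyScores) : 0;
    const entry = sums.get(log.topic) ?? { total: 0, count: 0 };
    entry.total += top;
    entry.count += 1;
    sums.set(log.topic, entry);
  }
  const averages = new Map<string, number>();
  for (const [topic, { total, count }] of sums) {
    averages.set(topic, total / count);
  }
  return averages;
}
```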
## Critical Analysis and Tradeoffs
While the case study presents an effective implementation, it's worth examining some of the claims and tradeoffs with a balanced perspective. The author works for Vectorize (as co-founder and CTO) and acknowledges this is an "eat-your-own-dogfood" scenario. The enthusiastic endorsement of using Vectorize for building RAG pipelines should be viewed in this context - while it likely was genuinely convenient for their use case, alternative approaches using frameworks like LangChain, LlamaIndex, or Haystack would also be viable.
The context injection approach (appending the topic to questions) is clever and simple, but it's essentially a shortcut around more sophisticated query rewriting techniques. The author acknowledges they "may do this in the future," suggesting the current approach has limitations. For more complex scenarios or longer conversations, proper query rewriting with an LLM might become necessary. The current approach works because they explicitly designed the assistant for single-shot question answering rather than extended dialogues.
The reranking threshold of 0.5 was determined through experimentation, which is appropriate, but the case study doesn't discuss how this threshold might need adjustment over time or across different types of questions. Production systems typically need ongoing calibration of such parameters. Additionally, while they mention using relevance filtering made it possible to use a smaller (70B) model effectively, there's no quantitative comparison of performance metrics (accuracy, latency, cost) between different model sizes or with/without relevance filtering.
The monitoring discussion reveals a gap in their current implementation - they track user satisfaction but not retrieval quality. While they acknowledge this and plan to address it, this represents a common challenge in RAG deployments: comprehensive observability requires instrumentation at multiple levels (retrieval, generation, user satisfaction), and building this infrastructure takes significant effort.
## Practical Lessons for LLMOps
The case study concludes with six concrete lessons learned that provide practical guidance for others building production RAG systems:
- **Multiple real-time sources**: Don't limit RAG pipelines to static documentation; pull from support systems, community forums, and other knowledge sources, with automatic updates to create self-improving systems
- **Context improves quality**: Leverage any available contextual information (like UI location) to help LLMs stay focused and generate better responses
- **Context in retrieval**: Include contextual clues in similarity search queries, whether through query rewriting or direct concatenation
- **Reranking for filtering**: Use reranking models to score relevance and filter out low-scoring results rather than relying solely on similarity scores
- **Prompt engineering matters**: Clear instructions, especially anti-hallucination prompts, enable smaller models to perform well
- **Monitor results**: Track both user feedback and retrieval quality to identify system issues and knowledge gaps
These lessons reflect hard-won experience deploying LLMs in production and avoiding common pitfalls. The emphasis on reranking is particularly valuable, as many RAG tutorials and frameworks treat it as optional, while this case study demonstrates it's essential for production quality. Similarly, the specific anti-hallucination prompting techniques provide concrete guidance beyond generic "use good prompts" advice.
Overall, this case study provides a realistic and technically detailed view of building a production RAG application, with honest discussion of both successes and areas still needing improvement. The integration of the assistant directly into the product UI, rather than as a standalone chatbot, represents thoughtful UX design informed by understanding user behavior and context-switching costs.