## Overview
Elastic, the company behind Elasticsearch and the Elastic Stack, developed ElasticGPT as an internal generative AI assistant for its workforce. This case study represents a classic "dogfooding" or "customer zero" approach, where a technology company uses its own products to build internal solutions. The goal was twofold: to deliver secure, context-aware knowledge discovery for Elastic employees (called "Elasticians"), and to validate Elastic's generative AI capabilities while gathering feedback for product teams.
The core component of ElasticGPT is SmartSource, an internally built RAG (Retrieval-Augmented Generation) system that retrieves relevant context from internal data sources and passes it to OpenAI's large language models. This approach grounds answers in proprietary company data rather than relying solely on the LLM's pre-trained knowledge.
## Backend Architecture and Vector Database Implementation
At the heart of ElasticGPT is Elasticsearch, which serves a dual purpose in the architecture. First, it functions as the vector database for SmartSource's RAG capabilities, storing embeddings (numerical representations) of internal data from sources including Elastic's Wiki, ServiceNow Knowledge Articles, and ServiceNow News Articles. Second, it acts as a robust repository for chat data across all models, logging every interaction including user messages, timestamps, feedback, and metadata.
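As a rough illustration of this dual role, the sketch below (using the official Python Elasticsearch client) creates one index for SmartSource chunks with a `dense_vector` field and one for chat logs. The index names, vector dimensions, and field layout are assumptions for illustration, not Elastic's actual internal schema.

```python
# Illustrative index setup with the official Python Elasticsearch client.
# Index names, dimensions, and fields are assumptions, not Elastic's schema.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="...")

# Index for SmartSource document chunks: text plus a dense_vector for kNN search.
es.indices.create(
    index="smartsource-chunks",
    mappings={
        "properties": {
            "content": {"type": "text"},
            "source": {"type": "keyword"},   # e.g. wiki, servicenow-kb, servicenow-news
            "url": {"type": "keyword"},
            "embedding": {
                "type": "dense_vector",
                "dims": 1536,                # depends on the embedding model used
                "index": True,
                "similarity": "cosine",
            },
        }
    },
)

# Index for chat interaction logs: messages, timestamps, feedback, metadata.
es.indices.create(
    index="elasticgpt-chat-logs",
    mappings={
        "properties": {
            "user_id": {"type": "keyword"},
            "model": {"type": "keyword"},    # smartsource, gpt-4o, gpt-4o-mini
            "message": {"type": "text"},
            "response": {"type": "text"},
            "feedback": {"type": "keyword"}, # thumbs_up / thumbs_down
            "timestamp": {"type": "date"},
        }
    },
)
```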
The data ingestion pipeline leverages Elastic's Enterprise Connectors to bring in data from various internal sources. The text then goes through a chunking process to break it into searchable segments, followed by embedding generation for semantic search capabilities. When a user submits a query, SmartSource performs vector search in Elasticsearch to retrieve the most relevant context, which is then fed to GPT-4o for response generation.
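A minimal sketch of that ingestion path, assuming the Azure OpenAI SDK for embeddings and the Python Elasticsearch bulk helper; the chunking function, deployment name, and document fields are illustrative placeholders.

```python
# Sketch of the ingestion path: chunk connector output, embed each chunk,
# and bulk-index into Elasticsearch. Deployment and index names are assumed.
from elasticsearch import Elasticsearch, helpers
from openai import AzureOpenAI

es = Elasticsearch("https://localhost:9200", api_key="...")
aoai = AzureOpenAI(
    azure_endpoint="https://<tenant>.openai.azure.com",
    api_key="...",
    api_version="2024-06-01",
)

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Naive fixed-size character chunking with overlap, for illustration."""
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

def ingest(doc_id: str, body: str, source: str, url: str) -> None:
    actions = []
    for i, piece in enumerate(chunk(body)):
        emb = aoai.embeddings.create(
            model="text-embedding-3-small",  # assumed embedding deployment
            input=piece,
        ).data[0].embedding
        actions.append({
            "_index": "smartsource-chunks",
            "_id": f"{doc_id}-{i}",
            "content": piece,
            "source": source,
            "url": url,
            "embedding": emb,
        })
    helpers.bulk(es, actions)
```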
One important operational detail is that user chat data is deleted every 30 days, with only metrics retained. This approach balances data retention needs with cost management and privacy considerations, a practical LLMOps concern that many organizations face when logging LLM interactions.
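One simple way to implement such a policy is sketched below as a scheduled `delete_by_query` against the assumed chat-log index; in practice an ILM policy or a scheduled job could serve the same purpose.

```python
# Hedged sketch of a 30-day retention job: delete chat documents older than
# 30 days while leaving aggregated metrics indices untouched.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="...")

es.delete_by_query(
    index="elasticgpt-chat-logs",
    query={"range": {"timestamp": {"lt": "now-30d/d"}}},
)
```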
## Frontend Architecture and User Experience
The frontend is built using React and EUI (Elastic's own UI framework), hosted on Kubernetes within Elastic Cloud. The article notes an interesting architectural decision point: early in development, the team experimented with Hugging Face's Chat UI for a quick start, but switched to EUI when users demanded custom features. This highlights a common trade-off in LLMOps between using off-the-shelf UI components versus building custom interfaces that integrate better with existing enterprise systems.
Key features of the frontend include real-time streaming of responses (so users see answers unfold naturally rather than waiting for complete responses), source attribution and linking to build trust in the RAG outputs, and feedback buttons for users to rate answer quality. The latter is particularly relevant for LLMOps as it provides a mechanism for continuous improvement and evaluation of the system's performance.
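One possible shape for the backend side of that feedback loop, assuming feedback events are indexed into Elasticsearch for later analysis in Kibana; the index and field names are hypothetical.

```python
# Illustrative handler for the feedback buttons: persist the rating alongside
# the logged interaction so it can be analyzed later. Field names are assumed.
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="...")

def record_feedback(chat_id: str, rating: str, comment: str | None = None) -> None:
    es.index(
        index="elasticgpt-feedback",
        document={
            "chat_id": chat_id,
            "rating": rating,          # "thumbs_up" or "thumbs_down"
            "comment": comment,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
    )
```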
Security is handled through Elastic's Okta SSO integration for authentication, with end-to-end encryption for data protection. The Kubernetes orchestration on Elastic Cloud enables zero-downtime deployments, which is essential for maintaining availability of a production AI system.
## API Layer and RAG Pipeline
The API serves as the integration layer between the React frontend and Elasticsearch backend, using a stateless, streaming design optimized for real-time response delivery. For SmartSource queries, the API orchestrates a multi-step process: triggering a vector search in Elasticsearch to fetch relevant context, sending that context along with the user query to GPT-4o (hosted on Azure), and streaming the generated response back to the frontend.
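A hedged sketch of that orchestration, assuming FastAPI for the stateless streaming API and the Azure OpenAI SDK; the endpoint path, index name, deployment names, and system prompt are illustrative rather than ElasticGPT's actual implementation.

```python
# Minimal sketch of the stateless streaming path for SmartSource queries:
# embed the query, kNN-search Elasticsearch, stream GPT-4o's grounded answer.
from elasticsearch import Elasticsearch
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AzureOpenAI

app = FastAPI()
es = Elasticsearch("https://localhost:9200", api_key="...")
aoai = AzureOpenAI(
    azure_endpoint="https://<tenant>.openai.azure.com",
    api_key="...",
    api_version="2024-06-01",
)

@app.post("/smartsource/chat")
def smartsource_chat(payload: dict):
    question = payload["message"]

    # 1. Embed the query and fetch the most relevant chunks via kNN search.
    query_vec = aoai.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    hits = es.search(
        index="smartsource-chunks",
        knn={"field": "embedding", "query_vector": query_vec,
             "k": 5, "num_candidates": 50},
    )["hits"]["hits"]
    context = "\n\n".join(h["_source"]["content"] for h in hits)

    # 2. Stream the grounded answer from GPT-4o back to the frontend.
    def token_stream():
        stream = aoai.chat.completions.create(
            model="gpt-4o",  # Azure deployment name (assumed)
            stream=True,
            messages=[
                {"role": "system",
                 "content": f"Answer using only this context:\n{context}"},
                {"role": "user", "content": question},
            ],
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content

    return StreamingResponse(token_stream(), media_type="text/plain")
```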
For queries to GPT-4o and GPT-4o-mini that don't require internal data context, the API bypasses the RAG pipeline entirely, routing queries directly to the Azure-hosted models. This architectural decision demonstrates a thoughtful approach to system design—not all queries need the overhead of retrieval, and routing logic can optimize for different use cases.
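Continuing the sketch above, a direct (non-RAG) endpoint might simply proxy the request to the Azure-hosted deployment without touching Elasticsearch; the model identifiers and endpoint path are assumptions.

```python
# Direct path for queries that don't need internal context: no retrieval,
# just a streamed call to the requested Azure deployment (names assumed).
@app.post("/chat")
def direct_chat(payload: dict):
    model = payload.get("model", "gpt-4o-mini")
    if model not in {"gpt-4o", "gpt-4o-mini"}:
        return {"error": f"unsupported model: {model}"}

    def token_stream():
        stream = aoai.chat.completions.create(
            model=model,
            stream=True,
            messages=[{"role": "user", "content": payload["message"]}],
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content

    return StreamingResponse(token_stream(), media_type="text/plain")
```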
## LangChain for RAG Orchestration
LangChain serves as the orchestration layer for SmartSource's RAG capabilities, managing the entire pipeline end-to-end. This includes chunking ingested data, generating embeddings, retrieving context from Elasticsearch, and crafting prompts for GPT-4o. The article specifically mentions that LangChain ensures only the relevant chunks from documents are retrieved rather than entire documents, keeping answers concise and relevant—an important consideration for managing context window limits and response quality.
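A minimal LangChain version of that pipeline, using the `langchain-elasticsearch` and `langchain-openai` integrations; the chunk sizes, deployment names, and prompt are assumptions, and Azure credentials are read from the standard `AZURE_OPENAI_*` environment variables.

```python
# LangChain sketch of the SmartSource pipeline: split documents, index them
# in Elasticsearch, then retrieve relevant chunks and prompt GPT-4o.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_elasticsearch import ElasticsearchStore
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Endpoint, key, and API version come from AZURE_OPENAI_* environment variables.
embeddings = AzureOpenAIEmbeddings(azure_deployment="text-embedding-3-small")
llm = AzureChatOpenAI(azure_deployment="gpt-4o", streaming=True)

# Chunk ingested documents so only relevant passages are retrieved later.
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)

store = ElasticsearchStore(
    es_url="https://localhost:9200",
    es_api_key="...",
    index_name="smartsource-chunks",
    embedding=embeddings,
)

def ingest(texts: list[str]) -> None:
    store.add_documents(splitter.create_documents(texts))

def format_docs(docs) -> str:
    return "\n\n".join(d.page_content for d in docs)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the provided context:\n\n{context}"),
    ("user", "{question}"),
])

retriever = store.as_retriever(search_kwargs={"k": 5})
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Example usage (hypothetical question):
# answer = chain.invoke("How do I request a new laptop?")
```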
The choice of LangChain is justified by its flexibility and compatibility with the Elastic Stack. While the article presents this integration positively (as one would expect from a company-authored piece), LangChain is indeed a widely-used framework for building LLM applications and provides useful abstractions for RAG pipelines.
## Monitoring and Observability
Elastic APM (Application Performance Monitoring) is used to track every API transaction, including query latency, error rates, and other performance metrics. Kibana dashboards provide visibility into API performance, model usage, and system health. This observability setup is critical for production LLM systems, enabling the team to identify and resolve issues before they affect users.
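Wiring this up is straightforward if the API is a FastAPI/Starlette service, as assumed in the earlier sketches; the service name, APM server URL, and custom span below are placeholders.

```python
# Hedged sketch of instrumenting the (assumed) FastAPI service with the
# Elastic APM Python agent so every API transaction appears in Kibana.
import elasticapm
from elasticapm.contrib.starlette import ElasticAPM, make_apm_client
from fastapi import FastAPI

apm = make_apm_client({
    "SERVICE_NAME": "elasticgpt-api",
    "SERVER_URL": "https://apm.example.com:8200",  # placeholder APM server
    "ENVIRONMENT": "production",
})

app = FastAPI()
app.add_middleware(ElasticAPM, client=apm)

# Custom spans can break out the expensive steps (retrieval, LLM call)
# within each transaction.
def retrieve_with_tracing(question: str):
    with elasticapm.capture_span("vector-search", span_type="db.elasticsearch"):
        ...  # kNN query against the chunks index, as in the earlier sketch
```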
The analytics capabilities also enable tracking of usage patterns, identification of common queries, and discovery of areas for refinement. This feedback loop is essential for continuous improvement of LLM systems—understanding how users interact with the system and where it falls short.
## Extended LLM Access and Shadow AI Mitigation
Beyond the RAG-based SmartSource, ElasticGPT provides secure access to OpenAI's GPT-4o and GPT-4o-mini models hosted on a private Azure tenant. These models are available for tasks that don't require internal data retrieval, such as general queries, content drafting, or creative brainstorming. Importantly, because these models are hosted in a secure environment, employees can share private company data without worrying about compliance violations.
The article mentions that Elastic's IT team is using this approach to reduce the potential impact of "shadow AI"—employees using consumer AI tools with company data. By providing sanctioned, secure access to multiple LLMs (with plans to add Anthropic's Claude and Google's Gemini models), the organization can better control how AI is used while still enabling productivity gains. This represents a pragmatic LLMOps strategy for enterprise adoption.
## Critical Assessment
While this case study provides useful insights into building an internal RAG-based AI assistant, it's important to note that this is a company-authored piece that naturally presents Elastic's technologies in a favorable light. The article claims the platform has slashed redundant IT queries and created employee efficiencies, but it provides no specific metrics or quantitative results to substantiate these claims.
The architecture described is sound and follows established patterns for enterprise RAG implementations. The use of Elasticsearch as a vector database is a natural choice given Elastic's product portfolio, though organizations without existing Elasticsearch infrastructure might consider purpose-built vector databases or other alternatives. Similarly, the choice of EUI makes sense for internal consistency but isn't necessarily the right choice for other organizations.
The 30-day data retention policy for chat data is mentioned as a cost-effective approach, but the implications for long-term analytics and model improvement could be explored further. Additionally, while the article mentions plans to incorporate agentic AI capabilities, no details are provided on how these would be implemented or governed.
## Future Directions
The article mentions several planned enhancements: adopting Elasticsearch's `semantic_text` field type, inference endpoints, and LLM observability features. The team also plans to incorporate specialized AI agents for workflow automation. These represent the ongoing evolution of the platform, though specific timelines or implementation details are not provided.
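For context, a minimal sketch of what a `semantic_text` mapping could look like: the field type delegates chunking and embedding to an attached inference endpoint at index time, which could simplify the custom pipeline described above. The index name and inference endpoint ID are hypothetical.

```python
# Sketch of a semantic_text mapping; Elasticsearch handles chunking and
# embedding via the referenced inference endpoint (hypothetical ID).
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="...")

es.indices.create(
    index="smartsource-semantic",
    mappings={
        "properties": {
            "content": {
                "type": "semantic_text",
                "inference_id": "my-embedding-endpoint",
            }
        }
    },
)
```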