## Overview
Co-op is one of the world's largest consumer cooperatives and the UK's fifth-largest food retailer, operating more than 2,500 stores across the country. Beyond food retail, Co-op is also the UK's leading funeral services provider, a major general insurer, and has a growing legal services business. In 2023, the company identified an opportunity to use GenAI to dramatically improve how store employees access essential operational information, leading to the development of a RAG-based virtual assistant designed to streamline policy and process document searches.
The case study represents a practical example of applying LLMOps principles to an internal enterprise use case, with particular emphasis on model evaluation, experimentation, and the importance of building production-ready ML infrastructure around GenAI applications.
## The Problem
Co-op employees working across their UK-wide food stores have access to a comprehensive library of over 1,000 web-based guides covering all store policies and procedures. However, the existing keyword-based search engine presented several significant challenges that hurt operational efficiency.
Navigating these documents was slow and cumbersome, requiring precise search terms to surface the right information. As noted by Joe Wretham, Senior Data Scientist at Co-op, employees often need to find and navigate information while working under pressure in busy store environments. Query volume was substantial: approximately 50,000 to 60,000 questions were asked weekly, roughly 23,000 initial queries plus 35,000 follow-up questions.
Even when employees found the correct document, locating the specific piece of information within it took time, since the guides are long and thorough. This inefficient discovery process frequently drove employees to rely on the company's support centers for assistance, increasing operational costs and reducing overall efficiency.
## Technical Solution Architecture
Co-op's data science team, already longtime users of the Databricks Data Intelligence Platform for data warehousing, engineering, and analytics, embarked on their first GenAI venture called the "How Do I?" project. The solution leverages several key technical components working together in a production pipeline.
### Document Processing and Vector Storage
The team used Databricks Lakeflow Jobs to automate the daily extraction and embedding of documents from Contentful, a popular content management system used by Co-op for content storage. This automated pipeline ensures that the RAG application always has access to up-to-date information, which is critical for operational use cases where policy and procedure changes need to be immediately reflected in search results.
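As an illustration of what such a pipeline step might look like, here is a minimal Python sketch that pulls published entries from Contentful's Content Delivery API and chunks them for embedding. The space ID, access token, content type, and chunking parameters are all hypothetical; the case study only tells us that Lakeflow Jobs orchestrates a daily extract-and-embed run.

```python
import requests

# All identifiers below are hypothetical; swap in real values for your space.
SPACE_ID = "your-space-id"
CDA_TOKEN = "your-delivery-api-token"
CONTENT_TYPE = "storeGuide"  # assumed Contentful content model for the guides

def fetch_guides(limit: int = 100) -> list[dict]:
    """Pull published guide entries from Contentful's Content Delivery API."""
    url = f"https://cdn.contentful.com/spaces/{SPACE_ID}/environments/master/entries"
    resp = requests.get(url, params={
        "access_token": CDA_TOKEN,
        "content_type": CONTENT_TYPE,
        "limit": limit,
    }, timeout=30)
    resp.raise_for_status()
    return resp.json().get("items", [])

def chunk_text(text: str, max_chars: int = 1500, overlap: int = 200) -> list[str]:
    """Split a long guide into overlapping chunks sized for embedding."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks
```

In a typical Databricks setup, the resulting chunks would land in a Delta table that a Vector Search index syncs from, with embeddings computed as part of the sync.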
The embedded documents are stored in Databricks Vector Search, described as an optimized storage solution that manages and retrieves vector embeddings for semantic search. This allows the system to quickly retrieve relevant document chunks based on semantic similarity rather than exact keyword matching, addressing the core limitation of the previous search system.
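For illustration, retrieval against such an index via the `databricks-vectorsearch` Python client might look like the sketch below; the endpoint, index, and column names are placeholders, as the case study does not disclose them.

```python
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()  # reads workspace credentials from the environment

index = client.get_index(
    endpoint_name="how-do-i-endpoint",            # hypothetical endpoint name
    index_name="ops.guides.guide_chunks_index",   # hypothetical index name
)

def retrieve(question: str, k: int = 5) -> list[list]:
    """Return the k guide chunks most semantically similar to the question."""
    results = index.similarity_search(
        query_text=question,
        columns=["guide_id", "chunk_text"],  # assumed column names
        num_results=k,
    )
    # Each row holds the requested columns plus a similarity score.
    return results.get("result", {}).get("data_array", [])
```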
### Model Experimentation and Selection
A significant portion of the LLMOps effort focused on systematic model evaluation and experimentation. The development process involved testing various AI models, including DBRX (Databricks' own open model), Mistral, and OpenAI's GPT models. The team adopted MLflow, an open source platform that facilitates experiment tracking and model swapping, which proved vital for tuning their system.
The team built an evaluation module within the Databricks Data Intelligence Platform to measure the accuracy of different models. This evaluation framework fired hundreds of test questions at the application across many configurations, assessing three key dimensions: accuracy of responses, response times, and built-in safeguarding features. This systematic approach represents a mature LLMOps practice, moving beyond ad-hoc testing to structured, repeatable evaluation.
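The case study does not publish the evaluation code, but a harness in the spirit described, firing a fixed test set at each candidate configuration and logging accuracy, latency, and safeguard behaviour to MLflow, could look roughly like this (the scoring heuristics and test-set schema are illustrative stand-ins):

```python
import time
import mlflow

def evaluate_config(config_name: str, answer_fn, test_set: list[dict]) -> None:
    """Run every test question through answer_fn and log metrics to MLflow.

    answer_fn(question) -> answer string; each test case holds 'question',
    'expected', and an optional 'unsafe' flag (hypothetical schema).
    """
    with mlflow.start_run(run_name=config_name):
        mlflow.log_param("config", config_name)
        correct, refusals, latencies = 0, 0, []
        for case in test_set:
            start = time.perf_counter()
            answer = answer_fn(case["question"])
            latencies.append(time.perf_counter() - start)
            if case["expected"].lower() in answer.lower():  # crude accuracy proxy
                correct += 1
            if case.get("unsafe") and "can't help" in answer.lower():  # safeguard check
                refusals += 1
        mlflow.log_metric("accuracy", correct / len(test_set))
        mlflow.log_metric("avg_latency_s", sum(latencies) / len(latencies))
        mlflow.log_metric("safeguard_refusals", refusals)
```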
Although all models scored well in the evaluation, OpenAI's GPT-3.5 was ultimately selected because it provided the best balance of performance, speed, cost, and security. This pragmatic approach to model selection, balancing multiple factors rather than optimizing purely for accuracy, reflects the real-world production constraints that LLMOps practitioners must navigate.
### Model Serving and Infrastructure
The solution utilizes Databricks Model Serving to simplify model deployment, ensuring seamless integration into Co-op's existing infrastructure. Additionally, Databricks' serverless computing provides scalable and efficient processing power to handle the high volume of queries (50,000-60,000 weekly).
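For illustration, a served model on Databricks is queried over a simple REST interface; the sketch below assumes standard Databricks auth environment variables, a placeholder endpoint name, and a pyfunc-style request/response schema.

```python
import os
import requests

# Assumes standard Databricks auth env vars; the endpoint name is a placeholder.
WORKSPACE_URL = os.environ["DATABRICKS_HOST"]   # e.g. https://<workspace>.cloud.databricks.com
TOKEN = os.environ["DATABRICKS_TOKEN"]

def ask(question: str, endpoint: str = "how-do-i-assistant") -> str:
    """Send one question to a Model Serving endpoint and return the answer."""
    resp = requests.post(
        f"{WORKSPACE_URL}/serving-endpoints/{endpoint}/invocations",
        headers={"Authorization": f"Bearer {TOKEN}"},
        # The payload schema depends on the model's signature; this assumes
        # a pyfunc model that accepts a list of {"question": ...} records.
        json={"inputs": [{"question": question}]},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["predictions"][0]
```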
The development team also leveraged Databricks Assistant, a context-aware AI assistant, for resolving syntax queries and simple issues during development, demonstrating how AI assistants can accelerate the development process itself.
### Prompt Engineering and Optimization
The case study explicitly mentions significant experimentation with prompt engineering strategies: refining prompt wording to improve response accuracy and relevance, adjusting generation parameters to control outputs, and iterating on phrasing to improve the model's understanding. This emphasis on prompt engineering as a key optimization lever is characteristic of production RAG applications, where response quality depends heavily on how questions and context are presented to the LLM.
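A minimal sketch of what such a grounded prompt might look like, assuming retrieval returns plain-text chunks; both the wording and the generation parameters are assumptions, not Co-op's actual prompt, and illustrate the levers being tuned.

```python
# Illustrative grounding prompt for a store-operations RAG assistant.
PROMPT_TEMPLATE = """You are an assistant for Co-op store colleagues.
Answer the question using ONLY the policy excerpts below.
If the excerpts do not contain the answer, say you don't know and
suggest contacting the support centre.

Policy excerpts:
{context}

Question: {question}
Answer:"""

def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble retrieved chunks and the user question into one prompt."""
    return PROMPT_TEMPLATE.format(context="\n\n---\n\n".join(chunks),
                                  question=question)

# Parameters typically adjusted alongside the wording (illustrative values):
GENERATION_PARAMS = {"temperature": 0.1, "max_tokens": 400}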
### Integration and Flexibility
A key factor enabling rapid development was Databricks' seamless integration with external tools from OpenAI and Hugging Face, along with their commitment to open standards. This allowed quick setup and iteration of AI models without being locked into a single vendor's ecosystem. The flexibility to swap between different model providers during experimentation was crucial for finding the optimal solution.
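As an illustration of that flexibility, candidate models can sit behind a common prompt-to-answer interface so they plug interchangeably into an evaluation harness like the one sketched above. The wrapper below uses the OpenAI Python SDK; the registry entries are illustrative, not Co-op's actual setup.

```python
from typing import Callable

def make_openai_fn(model: str = "gpt-3.5-turbo") -> Callable[[str], str]:
    """Wrap an OpenAI chat model behind a plain prompt -> answer function."""
    from openai import OpenAI
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    def answer(prompt: str) -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
    return answer

# Candidates can then be swapped freely in the evaluation harness above:
CANDIDATES: dict[str, Callable[[str], str]] = {
    "gpt-3.5": make_openai_fn("gpt-3.5-turbo"),
    # "dbrx": a function calling a Databricks serving endpoint, as sketched earlier
    # "mistral": a function calling a Mistral endpoint
}
```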
## Data Governance and Security
The case study notes that Databricks solution architects provided essential support in helping navigate technical challenges and ensuring data security compliance. As Co-op transitions to Unity Catalog, they will enhance data governance and access controls for more secure data handling. This focus on governance is particularly important for enterprise deployments where policy documents may contain sensitive operational information.
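For context on what such access controls look like in practice, Unity Catalog privileges are granted with standard SQL. A minimal sketch, assuming a Databricks notebook where `spark` is predefined and using placeholder catalog, schema, table, and group names:

```python
# All object and group names below are placeholders, not Co-op's actual setup.
spark.sql("GRANT USE CATALOG ON CATALOG ops TO `how_do_i_readers`")
spark.sql("GRANT USE SCHEMA ON SCHEMA ops.guides TO `how_do_i_readers`")
spark.sql("GRANT SELECT ON TABLE ops.guides.guide_chunks TO `how_do_i_readers`")
```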
## Current Status and Results
The project is described as still being in the proof-of-concept stage, which is important context for evaluating the claimed benefits. Initial feedback from internal tests is described as "overwhelmingly positive," with employees finding the AI-powered application intuitive and significantly quicker at retrieving necessary information than the previous setup.
The anticipated benefits include faster and more accurate access to information for team members, reduced workload on support centers, and encouraging more self-service among employees. Co-op plans to conduct a trial of the "How Do I?" project in selected stores, with potential for full-scale deployment if proven successful. The trial phase is explicitly described as crucial for collecting user feedback and making necessary adjustments to optimize system performance before full implementation.
## Future Applications
The success of this initial GenAI project has opened the door for Co-op to consider additional GenAI applications, including automating legal document processing and personalizing customer offers. This expansion demonstrates how an initial LLMOps project can serve as a foundation for broader AI adoption across an organization.
## Critical Assessment
While the case study presents a compelling narrative of GenAI adoption, several aspects warrant consideration. First, the project remains in proof-of-concept stage, so the claimed benefits are largely anticipated rather than proven at scale. The transition from POC to production deployment often surfaces additional challenges that may not be apparent in controlled testing environments.
Second, the case study is presented on a Databricks customer stories page, so there is inherent promotional intent. The emphasis on Databricks tools and the quote about Databricks "removing barriers" should be understood in this context.
Third, the evaluation framework, while described as systematic, is somewhat opaque in terms of specific metrics and benchmarks used. The claim that "all models scored well" before selecting GPT-3.5 would benefit from more specific performance data.
That said, the case study does provide a reasonable template for how enterprises can approach RAG implementation, including the emphasis on automated data pipelines, systematic model evaluation, and phased rollout through trials before full deployment. The acknowledgment that the project is still in POC stage also lends credibility to the overall narrative.