A detailed case study of implementing LLMs in a supplier discovery product at Scoutbee, evolving from simple API integration to a sophisticated LLMOps architecture. The team tackled challenges of hallucinations, domain adaptation, and data quality through multiple stages: initial API integration, open-source LLM deployment, RAG implementation, and finally a comprehensive data expansion phase. The result was a production-ready system combining knowledge graphs, Chain of Thought prompting, and custom guardrails to provide reliable supplier discovery capabilities.
This case study comes from a QCon London 2024 presentation by Nischal HP, VP of Data Science and ML Engineering at Scoutbee, a company operating in the supply chain intelligence space. Scoutbee provides enterprise customers like Unilever and Walmart with a semantic search platform for supplier discovery—essentially a specialized search engine that helps organizations find manufacturers and suppliers that meet specific criteria. The presentation chronicles an 18-month journey of integrating LLMs into their existing product, moving from initial POC to production deployment through four distinct evolutionary stages.
The context is particularly relevant for enterprise LLMOps because it addresses the challenges of bringing LLMs to production in environments where data privacy, reliability, and domain expertise are paramount. Unlike consumer-facing applications where occasional errors might be tolerable, enterprise supply chain decisions involve significant business risk, making trust and explainability essential requirements.
The journey began in early 2023 with a straightforward approach: connecting the existing application to ChatGPT via API, using LangChain for prompt engineering. This represented the minimum viable integration that many organizations attempted during the initial LLM wave. The existing infrastructure included knowledge graphs populated through distributed inference with Spark and smaller ML models.
The immediate problems that emerged were instructive for understanding LLMOps challenges: sending proprietary supplier data to a third-party API raised privacy and security concerns, and the model hallucinated answers it could not ground in the company's data.
Despite these issues, user feedback indicated genuine enthusiasm for the experience, validating the market opportunity while highlighting the substantial work required for production readiness.
The privacy and security concerns drove the decision to self-host open-source models. The team deployed LLaMA-13B (initially obtained through unofficial channels before Hugging Face availability) with FastChat API. This immediately increased complexity and cost—the organization became responsible for infrastructure that was expensive to operate compared to API-based approaches.
A critical LLMOps learning emerged: prompt engineering is not portable across models. The prompts developed for ChatGPT required substantial rework to function with LLaMA, meaning any future model changes would require similar effort.
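The portability problem can be made concrete with a small sketch. The helper names and the exact LLaMA-2-chat template below are illustrative assumptions, not Scoutbee's code: the point is that the same system and user content must be rendered completely differently for an OpenAI-style chat API versus a raw LLaMA-2-chat checkpoint served behind FastChat.

```python
# Sketch of prompt non-portability: identical content, two renderings.
# Function names and the template details are illustrative assumptions.

def to_openai_messages(system: str, user: str) -> list:
    """OpenAI-style chat payload: a list of role/content dicts."""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

def to_llama2_prompt(system: str, user: str) -> str:
    """LLaMA-2-chat expects a single string with special delimiters."""
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

system = "You are a supplier-discovery assistant. Answer only from provided data."
user = "Find EU-based suppliers of food-grade packaging."

print(to_openai_messages(system, user)[0]["role"])          # system
print(to_llama2_prompt(system, user).startswith("<s>[INST]"))  # True
```

Any guardrail or few-shot text embedded in these prompts has to be re-tuned per template, which is why a model swap implies a prompt-engineering rework.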
For domain adaptation, the team explored several approaches including zero-shot learning, in-context learning, few-shot learning, and agents. They implemented agents that could understand tasks, make queries to different systems, and synthesize answers. Heavy prompt engineering was required to feed domain knowledge into these agents.
The guardrails implementation was particularly innovative. Rather than using a linear validation approach, they developed a Graphs of Thought methodology inspired by research from ETH Zurich. This modeled the business process as a graph, allowing different guardrails to be invoked depending on where the user was in their workflow. This was necessary because supplier discovery involves dynamic, non-linear business processes rather than simple query-response patterns.
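A minimal sketch of the idea, with an invented workflow and stub guardrails (none of this is Scoutbee's actual code): the business process is a graph of stages, and each stage activates only the guardrails relevant at that point, instead of one linear validation pipeline.

```python
# Workflow-aware guardrails sketch: stages form a graph, and each stage
# runs only its own guardrails. Stage names and checks are invented.

WORKFLOW = {  # stage -> stages reachable next
    "query_intake": ["criteria_refinement"],
    "criteria_refinement": ["supplier_shortlist"],
    "supplier_shortlist": ["criteria_refinement", "final_report"],
    "final_report": [],
}

def no_pii(text): return "@" not in text            # crude email check
def on_topic(text): return "supplier" in text.lower() or "manufactur" in text.lower()
def grounded(text): return "according to" in text.lower()  # demands cited sources

GUARDRAILS = {  # stage -> guardrails active at that stage
    "query_intake": [no_pii, on_topic],
    "criteria_refinement": [on_topic],
    "supplier_shortlist": [no_pii],
    "final_report": [no_pii, grounded],
}

def check(stage: str, llm_output: str) -> bool:
    """Run only the guardrails attached to the current workflow stage."""
    return all(g(llm_output) for g in GUARDRAILS[stage])

print(check("query_intake", "Suppliers of aluminium cans in Poland"))  # True
```

The same output can pass at one stage and fail at another, which is exactly the non-linear behavior a single linear validation chain cannot express.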
However, testing agents proved to be a nightmare. The reasoning process was opaque—agents would sometimes invoke data APIs appropriately, sometimes make up answers, and debugging was extremely difficult without the ability to set breakpoints in the “thinking” process. This made the team uncomfortable with bringing agents to production.
The persistence of hallucinations led to implementing Retrieval-Augmented Generation (RAG). This significantly expanded the engineering stack, adding a retrieval layer over the company's data, Chain of Thought prompting, query transformation, and new observability tooling.
The Chain of Thought approach was chosen specifically to address the agent opacity problem. Instead of a direct question-to-answer path, the LLM now followed an explicit reasoning process that could be observed and validated. This reasoning process could be taught through few-shot examples, providing a “roadmap” for the LLM without requiring full model retraining.
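The mechanics can be sketched with a toy prompt builder. The example content and format below are assumptions for illustration: each few-shot example exposes the reasoning path (parse criteria, query data, synthesize), so the model's intermediate steps become observable text rather than opaque agent internals.

```python
# Chain-of-Thought via few-shot examples: the reasoning "roadmap" is
# taught in the prompt, no retraining needed. Examples are invented.

FEW_SHOT = [
    {
        "question": "Who can supply ISO-9001 certified steel fasteners in Germany?",
        "reasoning": (
            "Step 1: criteria = {product: steel fasteners, "
            "certification: ISO-9001, region: Germany}. "
            "Step 2: query the supplier index with these filters. "
            "Step 3: answer only from the returned records."
        ),
        "answer": "List matching suppliers with certification evidence.",
    },
]

def build_cot_prompt(question: str) -> str:
    """Prepend worked examples, then stop at 'Reasoning:' so the model
    must emit its reasoning before any answer."""
    parts = []
    for ex in FEW_SHOT:
        parts.append(
            f"Question: {ex['question']}\n"
            f"Reasoning: {ex['reasoning']}\n"
            f"Answer: {ex['answer']}\n"
        )
    parts.append(f"Question: {question}\nReasoning:")
    return "\n".join(parts)

prompt = build_cot_prompt("Find vegan-certified cosmetics manufacturers in France.")
print(prompt.endswith("Reasoning:"))  # True: reasoning is forced first
```

Because the reasoning is emitted as text, it can be logged, validated by guardrails, and debugged, which is what the agent approach lacked.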
The team also implemented query transformation to handle the diversity of user inputs. Some users typed keywords (Google-style), while others provided extensive narrative context. The system needed to normalize these inputs into standard forms and potentially split complex queries into multiple sub-queries.
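A rough sketch of that normalization step, with heuristics standing in for what would in practice be LLM-driven classification and decomposition (the function names and split rule are invented):

```python
# Query transformation sketch: detect keyword vs. narrative style and
# split multi-intent queries into sub-queries. Heuristics are stand-ins
# for an LLM-based transformer.

def classify(query: str) -> str:
    """Short inputs look like Google-style keywords; long ones, narrative."""
    return "narrative" if len(query.split()) > 12 else "keywords"

def split_subqueries(query: str) -> list:
    # Naive split on " and " as a placeholder for LLM-driven decomposition.
    return [q.strip() for q in query.split(" and ") if q.strip()]

def transform(query: str) -> dict:
    return {
        "style": classify(query),
        "sub_queries": split_subqueries(query),
    }

print(transform("packaging suppliers in Spain and logistics partners in Portugal"))
```

Each sub-query can then be retrieved and answered independently before the results are merged.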
Observability became critical at this stage. The team identified two distinct observation requirements: the quality of retrieval (is the right data being fetched?) and the quality of generation (is the answer faithful to that data?).
The Ragas framework (open-source) was adopted to generate scores for generation and retrieval quality, providing actionable insights for system improvement.
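To make the two score families concrete, here are toy token-overlap stand-ins for the kind of retrieval and generation metrics Ragas produces (context precision, faithfulness). This is not the Ragas API; the real framework uses LLM-based judges, and these functions are illustrative only.

```python
# Toy retrieval/generation scores, inspired by Ragas-style metrics but
# implemented as simple token overlap for illustration.

def _tokens(text: str) -> set:
    return set(text.lower().split())

def context_precision(question: str, contexts: list) -> float:
    """Fraction of retrieved contexts sharing vocabulary with the question."""
    q = _tokens(question)
    hits = sum(1 for c in contexts if q & _tokens(c))
    return hits / len(contexts) if contexts else 0.0

def faithfulness(answer: str, contexts: list) -> float:
    """Fraction of answer tokens supported by some retrieved context."""
    support = set().union(*(_tokens(c) for c in contexts)) if contexts else set()
    a = _tokens(answer)
    return len(a & support) / len(a) if a else 0.0

ctxs = ["Acme Corp supplies food-grade packaging in Spain."]
print(context_precision("packaging suppliers in Spain", ctxs))  # 1.0
print(faithfulness("Acme Corp supplies packaging", ctxs))
```

Low retrieval scores point at the index or query transformation; low generation scores point at the prompt or model, which is what makes the split actionable.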
Key outcomes from RAG implementation:
New challenges emerged: users began interrogating the data more deeply, requiring conversational capabilities the system didn’t yet support, and latency increased significantly with more processing steps.
With the LLM layer stabilized, the bottleneck shifted to data quality and coverage. The effectiveness of RAG depends entirely on having comprehensive, high-quality data to retrieve.
The team chose knowledge graphs over pure vector embeddings for several reasons:
The knowledge graph ontology design, which originally took 6-9 months, could potentially be compressed to a few months using LLMs, an interesting example of using LLMs to improve LLM-powered systems.
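A minimal sketch of why explicit triples suit this domain (the ontology and data here are invented for illustration): facts are stored as subject-predicate-object triples, so a multi-hop supplier question gets an exact, explainable answer rather than a nearest-neighbour guess from an embedding index.

```python
# Toy triple store: multi-hop filtering over explicit facts.
# Data and predicate names are invented for illustration.

TRIPLES = [
    ("AcmeCorp", "produces", "packaging"),
    ("AcmeCorp", "located_in", "Spain"),
    ("AcmeCorp", "certified", "ISO-9001"),
    ("BoltWorks", "produces", "fasteners"),
    ("BoltWorks", "located_in", "Spain"),
]

def query(product, country, cert=None):
    """Suppliers producing `product` in `country`, optionally certified."""
    def has(s, p, o):
        return (s, p, o) in TRIPLES
    suppliers = {s for s, p, o in TRIPLES if p == "produces" and o == product}
    suppliers = {s for s in suppliers if has(s, "located_in", country)}
    if cert:
        suppliers = {s for s in suppliers if has(s, "certified", cert)}
    return sorted(suppliers)

print(query("packaging", "Spain", "ISO-9001"))  # ['AcmeCorp']
```

Every result can be traced back to the specific triples that produced it, which is the explainability property enterprise buyers demanded.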
For populating the knowledge graph at scale, the team employed a teacher-student approach: using a superior LLM to generate high-quality training data, then fine-tuning smaller, more economical models on this data. Human validation remained in the loop, but the effort reduction was 10-20x compared to purely human annotation. This approach was motivated by practical constraints: large models are expensive to operate, and GPU availability in AWS regions (Frankfurt, Ireland, North Virginia) was already constrained.
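The teacher-student loop can be sketched as follows. The teacher and the human check are stubs (in practice the teacher is an expensive LLM call and the validation is a human reviewer); only the shape of the pipeline is being illustrated.

```python
# Teacher-student sketch: a large "teacher" model labels raw text, humans
# spot-check, and accepted pairs become fine-tuning data for a smaller
# "student" model. Both teacher and reviewer are stubs here.

def teacher_label(text: str) -> dict:
    """Stub for an expensive teacher-LLM call that extracts facts."""
    return {"input": text, "facts": [w for w in text.split() if w.istitle()]}

def human_approves(example: dict) -> bool:
    """Stub for the human-in-the-loop check; far cheaper than labeling
    everything by hand, hence the 10-20x effort reduction."""
    return bool(example["facts"])

raw_docs = ["AcmeCorp manufactures packaging in Spain", "lorem ipsum"]
training_set = [ex for ex in map(teacher_label, raw_docs) if human_approves(ex)]

# `training_set` would now fine-tune a smaller, cheaper student model.
print(len(training_set))  # 1
```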
To handle language variation in user queries, they used LLMs to generate facts with synonyms and alternative phrasings, expanding the knowledge graph’s query understanding.
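A sketch of that expansion step, with a fixed lookup standing in for LLM-generated paraphrases (the synonym table is an invented example): each canonical fact is duplicated under its alternative phrasings so varied user wording still matches the graph.

```python
# Synonym expansion sketch: LLM-proposed paraphrases (stubbed as a dict)
# are materialized as extra facts in the knowledge graph.

SYNONYMS = {  # stand-in for LLM-generated alternative phrasings
    "packaging": ["wrapping", "containers"],
    "fasteners": ["bolts", "screws"],
}

def expand_facts(facts):
    """Yield each fact plus one copy per synonym of its object term."""
    for subj, pred, obj in facts:
        yield (subj, pred, obj)
        for alt in SYNONYMS.get(obj, []):
            yield (subj, pred, alt)

facts = [("AcmeCorp", "produces", "packaging")]
print(list(expand_facts(facts)))
```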
The scaling challenges required a fundamental infrastructure change. The team identified several pain points, including the distributed-systems expertise demanded of data scientists, the cost of operating large models, and constrained GPU availability.
The solution was adopting Ray (open-source from UC Berkeley, with Anyscale providing the enterprise platform). Ray provides a universal compute framework that allows data scientists to scale code using simple decorators rather than understanding distributed systems. The platform also provided optimized LLM hosting on smaller GPUs, running on the company’s own infrastructure to maintain privacy requirements.
The presentation concluded with practical wisdom that reflects hard-won operational experience:
Business Considerations:
Technical Practices:
Guardrails and Safety:
Team Management:
The overall arc of the case study demonstrates that moving LLMs to production is an evolutionary journey requiring substantial architectural changes, new observability practices, data infrastructure investment, and organizational adaptation. The 18-month timeline reflects the reality that enterprise LLM deployment is far more complex than initial POCs suggest.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Stripe, which processes approximately 1.3% of global GDP, has evolved from traditional ML-based fraud detection to transformer-based foundation models for payments that score every transaction in under 100ms. The company built a domain-specific foundation model that treats charges as tokens and behavior sequences as context windows, ingesting tens of billions of transactions to power fraud detection and improving card-testing detection accuracy from 59% to 97% for large merchants. Stripe also launched the Agentic Commerce Protocol (ACP) jointly with OpenAI to standardize how agents discover and purchase from merchant catalogs. Internally, AI adoption has reached 8,500 employees using LLM tools daily, with 65-70% of engineers using AI coding assistants and notable productivity gains, such as reducing payment method integrations from 2 months to 2 weeks.
Yahoo! Finance built a production-scale financial question answering system using a multi-agent architecture to address the information asymmetry between retail and institutional investors. The system leverages Amazon Bedrock AgentCore and employs a supervisor-subagent pattern in which specialized agents handle structured data (stock prices, financials), unstructured data (SEC filings, news), and various APIs. The solution processes heterogeneous financial data from multiple sources, handles the temporal complexities of fiscal years, and maintains context across sessions. Through a hybrid evaluation approach combining human and AI judges, the system achieves strong accuracy and coverage metrics while answering queries in 5-50 seconds at a cost of 2-5 cents per query, demonstrating production viability at scale with support for 100+ concurrent users.