A data scientist shares their experience transitioning from traditional ML to implementing LLM-based recommendation systems at a private equity company. The case study focuses on building a recommendation system for boomer-generation users, with the business requirement that users find a match within the first five suggestions. The implementation uses OpenAI APIs for data cleaning, text embeddings, and similarity search, while addressing the challenges of production deployment on AWS.
This case study comes from a conference talk given by Annie, a data scientist working at Bainbridge Capital, a private equity company. The presentation offers a candid, practitioner’s perspective on the challenges and realities of deploying LLM-powered recommendation systems in production. What makes this case study particularly valuable is its honest assessment of the gap between theoretical LLM capabilities and the practical challenges of operationalizing them in a real business context.
Annie describes herself as someone experiencing an “identity crisis” in the rapidly evolving AI landscape, having transitioned from traditional data science work involving regression and tree-based models to now working extensively with LLMs. This perspective resonates with many practitioners who find themselves navigating the shift from classical ML to generative AI systems.
The core business problem centers on building a recommendation system for a private equity company’s application. The specific business requirement articulated is compelling: users should find a match within the first five recommendations they see. This constraint is particularly stringent because many of their users are described as “Boomers”—people who may not have grown up with technology and thus have limited patience with applications that don’t immediately deliver value. The team has approximately one minute to create a positive user experience before users may disengage.
This business constraint immediately shapes several LLMOps considerations: the recommendations must be high quality from the start, the system must be fast enough to generate recommendations quickly, and the user experience for inputting data must be carefully designed to ensure sufficient information is collected without overwhelming users.
Interestingly, the team is not yet at the stage of fine-tuning LLMs or implementing RAG (Retrieval-Augmented Generation) architectures. They are primarily using inference—calling pre-trained models via APIs and cloud services. This represents a common early-stage pattern in enterprise LLM adoption where teams leverage existing model capabilities rather than customizing models.
The team uses LLMs across nearly every step of their data science lifecycle, primarily because their input data consists of unprocessed text scraped from the internet, and they lack labeled data. This makes LLMs particularly valuable for their use case. Specific applications mentioned include:
Text Data Cleaning: Using LLMs to clean and process text data scraped from the internet. This is a practical application where LLMs can handle the messiness and variability of real-world text data better than traditional rule-based approaches.
Feature Engineering: Leveraging LLMs to extract and create features from unstructured text, transforming raw text into structured representations useful for downstream tasks.
Similarity Search: Implementing similarity searches using text embeddings to match users with recommendations. This involves tokenization and embedding generation, which Annie notes is quite different from traditional data science preprocessing.
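The similarity-search step can be sketched in plain Python. The embedding vectors below are toy values; in the team's pipeline they would come from an embedding model such as OpenAI's embeddings endpoint. The function names are illustrative, not from the talk:

```python
import math

def cosine_similarity(a, b):
    # standard cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k_matches(user_vec, candidates, k=5):
    # candidates: list of (candidate_id, embedding) pairs
    scored = [(cid, cosine_similarity(user_vec, vec)) for cid, vec in candidates]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]
```

The default of `k=5` mirrors the business requirement that a match should appear within the first five recommendations shown.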
The team is deploying their models on AWS Cloud, chosen simply because it's the cloud service their organization already uses rather than for any particular technical preference. Annie highlights several AWS integration points that make LLM deployment more accessible:
AWS SageMaker with Hugging Face: This integration allows teams to deploy LLMs without manually downloading model artifacts. The hosted model approach significantly reduces the operational burden of managing model files and dependencies.
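The hosted-model pattern looks roughly like the sketch below: the model is referenced by its Hugging Face Hub id via the `HF_MODEL_ID` environment variable, so no artifacts are downloaded manually. The model id, framework versions, and instance type are illustrative, and actually running this requires the `sagemaker` SDK plus AWS credentials and an execution role:

```python
def deploy_hub_model(role_arn, model_id="sentence-transformers/all-MiniLM-L6-v2"):
    """Deploy a Hugging Face Hub model to a SageMaker endpoint (sketch)."""
    # imported lazily so the sketch can be read without the SDK installed
    from sagemaker.huggingface import HuggingFaceModel

    model = HuggingFaceModel(
        role=role_arn,
        # SageMaker pulls the model from the Hub at container startup
        env={"HF_MODEL_ID": model_id, "HF_TASK": "feature-extraction"},
        transformers_version="4.26",
        pytorch_version="1.13",
        py_version="py39",
    )
    return model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
```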
AWS Bedrock: The Bedrock runtime allows invoking LLMs as a managed service, abstracting away infrastructure concerns. Annie mentions that DeepLearning.AI had just released a free course on creating serverless LLM applications with Bedrock, indicating the timeliness of this technology stack.
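A minimal Bedrock invocation sketch, assuming an Anthropic Claude model id as an example (the request body follows Bedrock's messages format for Anthropic models; actually calling `invoke_llm` requires AWS credentials and model access in the account):

```python
import json

def build_claude_request(prompt, max_tokens=512):
    # request body in the Anthropic messages format Bedrock expects
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    })

def invoke_llm(prompt, model_id="anthropic.claude-3-haiku-20240307-v1:0"):
    # boto3 imported lazily so the sketch is readable without AWS installed
    import boto3
    client = boto3.client("bedrock-runtime")
    resp = client.invoke_model(modelId=model_id, body=build_claude_request(prompt))
    return json.loads(resp["body"].read())["content"][0]["text"]
```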
The mention of serverless approaches is notable because it suggests the team is considering or exploring event-driven architectures where LLM inference can be invoked on-demand without maintaining persistent compute resources.
Annie’s talk is refreshingly honest about the challenges of moving LLMs into production, explicitly pushing back against the marketing narrative that LLM deployment is simple. She enumerates several critical production considerations:
The team uses OpenAI’s API for data cleaning, which introduces constraints around rate limits and costs. When processing data for multiple users simultaneously, these limitations become significant operational concerns. Rate limiting can create bottlenecks in data processing pipelines, and API costs can escalate quickly with high-volume usage.
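A common mitigation for rate limits (a standard pattern, not one described in the talk) is exponential backoff with jitter around each API call; `clean_text` below is a hypothetical wrapper around an OpenAI chat-completion call for the data-cleaning step:

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry fn on exceptions, doubling the delay each attempt, with jitter."""
    def wrapper(*args, **kwargs):
        for attempt in range(max_retries):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == max_retries - 1:
                    raise
                # jittered exponential backoff spreads retries out
                time.sleep(base_delay * (2 ** attempt) * random.random())
    return wrapper

def clean_text(raw):
    # hypothetical: send scraped text to a chat model for cleanup
    from openai import OpenAI
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Clean up this scraped text:\n{raw}"}],
    )
    return resp.choices[0].message.content

clean_text_with_retries = with_backoff(clean_text, base_delay=2.0)
```

Batching records per request and caching cleaned outputs are complementary ways to keep both the request count and the token bill down.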
Running pre-trained models like BERT, performing tokenization, and generating text embeddings all require compute resources that differ substantially from traditional ML preprocessing. Understanding and budgeting for these compute requirements is essential for production deployments.
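Budgeting can start from a rough token estimate; the ~4-characters-per-token heuristic for English text and the per-1K-token price below are illustrative placeholders, not figures from the talk:

```python
def estimate_embedding_cost(texts, price_per_1k_tokens=0.0001, chars_per_token=4):
    """Rough cost estimate for embedding a batch of texts.

    Uses a crude characters-per-token heuristic; a real estimate would
    tokenize with the target model's tokenizer.
    """
    total_tokens = sum(max(1, len(t) // chars_per_token) for t in texts)
    return total_tokens, total_tokens / 1000 * price_per_1k_tokens
```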
A critical question raised is how to evaluate LLM outputs and collect the right data points to determine if predictions and similarity searches align with desired user experiences. Unlike traditional ML where evaluation metrics are often well-established, LLM evaluation—particularly for tasks like recommendations—requires thoughtful design of feedback loops and success metrics.
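One concrete feedback-loop metric that maps directly onto the stated business requirement is hit rate at 5: the fraction of sessions in which the user accepted one of the first five recommendations. The session schema here is assumed for illustration, not taken from the talk:

```python
def hit_rate_at_k(sessions, k=5):
    """sessions: list of dicts with 'shown' (ordered rec ids) and
    'accepted' (the id the user chose, or None if they disengaged)."""
    if not sessions:
        return 0.0
    hits = sum(1 for s in sessions
               if s["accepted"] is not None and s["accepted"] in s["shown"][:k])
    return hits / len(sessions)
```

Logging the full ordered list of shown recommendations, not just the accepted one, is what makes this metric computable after the fact.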
The quality of input data directly affects recommendation quality. If a user provides only a one-sentence description, the system may not have sufficient information to generate good recommendations. This creates a UX challenge: how much guidance should users receive when inputting their data? How do you balance collecting enough information with not overwhelming users?
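One lightweight way to balance the two is a minimum-detail check that nudges, rather than blocks, the user; the 30-word threshold and wording below are invented examples, not part of the case study:

```python
MIN_WORDS = 30  # illustrative threshold, to be tuned against recommendation quality

def input_guidance(profile_text):
    """Return a nudge message if the profile looks too thin, else None."""
    words = len(profile_text.split())
    if words >= MIN_WORDS:
        return None
    return ("Could you tell us a little more? A few extra details help us "
            "get your first five recommendations right.")
```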
Annie poignantly describes watching data scientists’ “eyes glaze over” when they realize that the impressive results they achieved manually with an LLM need to be reproduced and automated within an application. The gap between interactive exploration and production automation is significant and often underestimated.
The talk emphasizes thinking from the end—starting with business requirements and working backward to understand what technical capabilities are needed. This product-oriented thinking is essential for LLMOps because it forces teams to consider the full user experience rather than just model performance in isolation.
Annie’s self-deprecating description of her learning journey—signing up for 15 Udemy courses on generative AI but never completing them, then trying to build an LLM app in a weekend—reflects the reality many practitioners face. The field is moving quickly, and there’s constant pressure to upskill while simultaneously delivering on production requirements.
The closing metaphor comparing LLM deployment readiness to deciding to have children (“there’s never a right time”) captures an important truth: organizations that wait for perfect conditions before deploying LLMs may wait indefinitely. The recommendation is to start experimenting, even if it feels scary or unfamiliar.
This case study offers several valuable insights for practitioners:
The journey from traditional ML to LLMOps involves a genuine paradigm shift, not just learning new tools. The operational concerns—rate limits, tokenization costs, evaluation approaches—are fundamentally different from classical ML deployment patterns.
Cloud provider integrations can significantly lower the barrier to entry for LLM deployment. Services like SageMaker’s Hugging Face integration and Bedrock’s managed inference reduce the need for deep infrastructure expertise.
The gap between “it works in a notebook” and “it works in production at scale” is substantial for LLM applications. Automation, reproducibility, and handling edge cases require significant additional engineering effort.
User experience considerations are tightly coupled with LLM system design. The quality of input data, the latency of responses, and the accuracy of outputs all directly impact user satisfaction.
Starting with inference-only approaches using pre-trained models is a legitimate path to production. Not every organization needs to immediately jump to fine-tuning or RAG architectures—there’s value in first understanding the operational challenges of LLM inference before adding complexity.
It’s worth noting that this case study comes from a conference talk rather than a polished marketing case study, which gives it additional credibility. Annie is candid about being in the early stages of deployment (“where I’m at right now”) rather than claiming complete success. The acknowledgment that there’s “not really a quick and dirty way” of deploying LLMs is a valuable counterpoint to vendor marketing that often oversimplifies these challenges. For organizations beginning their LLMOps journey, this realistic perspective is arguably more useful than polished success stories that omit the messy details of production deployment.