A former Apple messaging team lead shares five crucial insights for deploying LLMs in production, based on real-world experience. The presentation covers essential aspects including handling inappropriate queries, managing prompt diversity across different LLM providers, dealing with subtle technical changes that can impact performance, understanding the current limitations of function calling, and the critical importance of data quality in LLM applications.
This case study is drawn from a conference talk by a speaker with significant industry experience, having co-founded the messaging apps team at Apple before transitioning to work with LLMs starting in early 2023. The speaker subsequently joined LlamaIndex, where they created LlamaIndex TS (TypeScript version) and led partnerships. The talk distills practical lessons learned from building and deploying LLM applications in production environments, offering a candid and somewhat humorous look at the challenges practitioners face.
The presentation is structured around “five things you might want to know before putting LLMs in production,” providing actionable insights that go beyond theoretical concepts to address real-world pain points. It’s worth noting that this is a practitioner-focused talk rather than a formal case study with quantified business results, so the lessons are primarily experiential rather than data-driven.
The speaker opens with an unconventional but memorable heuristic for determining if an LLM application is truly in production: whether users have attempted to input inappropriate or unexpected content. The example given is users typing offensive or irrelevant queries into chatbots. This leads to a critical point about query classification as a first line of defense in production systems.
The key recommendation is that regardless of how sophisticated the underlying system is—whether it’s a complex multi-agent architecture with planning capabilities, code generation, or image analysis—production LLM applications need robust input filtering and classification. The speaker emphasizes that there are “lots and lots of questions that your users might want to ask you that you just don’t want to answer.” Rather than attempting to handle all possible inputs with complex agent logic, the pragmatic approach is to implement query classification that can identify and reject problematic queries with a simple “I can’t answer that” response.
This insight speaks to a broader LLMOps principle: production systems need explicit boundaries and guardrails that protect both the system and the users from unintended behaviors. The complexity of multi-agent systems shouldn’t be applied to every input; instead, a classification layer should route or filter queries before they enter the more expensive and potentially problematic reasoning chains.
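A minimal sketch of such a classification layer is shown below. The blocklist patterns and the `run_agent_pipeline` stub are hypothetical illustrations, not the speaker's implementation; a production system might use a small, cheap LLM classifier instead of regexes, but the routing structure is the same.

```python
import re

# Hypothetical blocklist; real systems would tune these rules or replace
# them with a lightweight LLM classifier.
OFF_TOPIC_PATTERNS = [
    re.compile(r"\bignore (?:\w+ )*instructions\b", re.IGNORECASE),
    re.compile(r"\b(crypto price|lottery numbers)\b", re.IGNORECASE),
]

REFUSAL = "I can't answer that."


def classify_query(query: str) -> str:
    """Return 'reject' for queries the system should not handle, else 'answer'."""
    if not query.strip():
        return "reject"
    for pattern in OFF_TOPIC_PATTERNS:
        if pattern.search(query):
            return "reject"
    return "answer"


def run_agent_pipeline(query: str) -> str:
    # Placeholder for the expensive multi-agent system.
    return f"(agent answer for: {query})"


def handle(query: str) -> str:
    # Filter before the expensive reasoning chain ever runs.
    if classify_query(query) == "reject":
        return REFUSAL
    return run_agent_pipeline(query)
```

The key design choice is that the refusal path costs one cheap check, while the agent pipeline only ever sees queries that passed it.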
The second lesson addresses the often-overlooked reality that different LLMs expect prompts formatted in different ways. The speaker notes that LLMs “talk in different languages” including JSON, Markdown, YAML, and surprisingly, XML. This last format is particularly relevant for Anthropic’s Claude models, which work best with XML-structured prompts.
The speaker illustrates this point with examples from Meta's Llama family, whose chat prompt templates changed substantially between Llama 2 and Llama 3.
The speaker points out a telling detail: Meta’s official documentation for Llama 3 function calling contained a typo, suggesting that function calling “was not top of mind for the researchers when they launched this thing.” This observation highlights a common LLMOps challenge—documentation quality and feature maturity vary significantly across providers and even across different capabilities within the same model family.
The practical implication is that LLMOps teams need to invest time in understanding the specific prompting conventions for each model they use. What works for OpenAI’s GPT-4 may not work optimally for Claude or Llama models, and even different versions of the same model family (Llama 2 vs Llama 3) can have substantially different requirements.
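To make the divergence concrete, here is a sketch of the same system/user message rendered for three model families, following each provider's published conventions (Llama 2's `[INST]`/`<<SYS>>` blocks, Llama 3's header tokens, and Anthropic's recommendation to delimit prompt sections with XML tags). In practice you would use the tokenizer's built-in chat template rather than hand-building strings:

```python
def to_llama2_prompt(system: str, user: str) -> str:
    # Llama 2 chat template: [INST] blocks with an embedded <<SYS>> section.
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"


def to_llama3_prompt(system: str, user: str) -> str:
    # Llama 3 switched to special header tokens -- a materially different format.
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )


def to_claude_prompt(context: str, user: str) -> str:
    # Anthropic's docs recommend XML tags to delimit sections of a prompt.
    return f"<document>\n{context}\n</document>\n\n<question>\n{user}\n</question>"
```

A prompt tuned against one of these formats will often degrade silently when pointed at another model, which is why per-model prompt templates belong in version control alongside the rest of the application.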
Perhaps the most technically specific lesson concerns how seemingly minor implementation changes can have outsized effects on system behavior. The speaker shares a concrete example involving OpenAI’s embedding API:
The legacy `get_embedding` helper in OpenAI's `embedding_utils` replaced newlines with spaces before calling the `embeddings.create` method, a workaround for older embedding models that handled newlines poorly. The speaker references Boris Power from OpenAI, who confirmed that newer embedding models no longer have the newline issue. However, when the speaker attempted to remove this preprocessing step from LlamaIndex (since it was no longer necessary), it immediately broke their most basic demo.
This example underscores several important LLMOps principles: seemingly redundant preprocessing steps can silently become load-bearing; behavior documented for one model generation may not hold for the next; and any change to an embedding pipeline, however small, needs regression testing against known-good behavior before it ships.
The fourth lesson provides a sobering assessment of the current state of function calling (tool use) in LLMs, which underpins many agent-based architectures. The speaker cites Mistral’s testing when they launched Mistral Large, which showed that even the best models achieved only around 50% accuracy on function calling tasks.
The speaker notes that Mistral’s benchmarks showed their model “almost got to 50% accuracy,” positioning themselves as better than GPT-4 and Claude. However, the practical implication is stark: “50% of the times is correct” is not a compelling reliability story for production applications.
The speaker raises additional reliability concerns about GPT-4 and GPT-4o Mini specifically, reinforcing that even flagship models are not immune to function calling failures.
The recommendation is clear: function calling capabilities are “still very early.” LLMOps teams should be cautious about relying heavily on agent-based architectures that depend on reliable function calling, and should implement robust error handling and validation for function call outputs.
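One concrete form such validation can take is checking a model-emitted call against the tool schema before executing anything. The `TOOLS` registry below is a hypothetical stand-in; the point is that a failed check becomes a retry or fallback rather than a bad tool execution:

```python
import json

# Hypothetical tool registry; real systems would derive this from their
# actual tool definitions.
TOOLS = {
    "get_weather": {"required": {"city"}, "optional": {"units"}},
}


def validate_function_call(raw: str):
    """Validate a model-emitted function call before executing it.

    Returns (name, args) on success; raises ValueError so the caller can
    retry, fall back, or surface an error instead of running a bad call.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}")
    if not isinstance(call, dict):
        raise ValueError("call must be a JSON object")
    name = call.get("name")
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        raise ValueError("arguments must be an object")
    spec = TOOLS[name]
    missing = spec["required"] - set(args)
    unexpected = set(args) - spec["required"] - spec["optional"]
    if missing or unexpected:
        raise ValueError(f"bad arguments: missing={missing}, unexpected={unexpected}")
    return name, args
```

With roughly 50% accuracy in the benchmarks the speaker cites, treating every function call as untrusted input is the pragmatic default.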
The final and perhaps most emphasized lesson is the importance of inspecting data quality, particularly when using document parsing for RAG (Retrieval-Augmented Generation) systems. The speaker references guidance from the AI Engineer conference: “always look at your data.”
A practical example is provided where the speaker and a colleague ran the same PDF (their slide deck) through two different document parsers and got markedly different results, with one extraction noticeably cleaner than the other.
The speaker emphasizes that no amount of sophisticated multi-agent RAG architecture can compensate for garbage input data. If the underlying parsed text is corrupted or poorly extracted, the entire system will fail regardless of how well-designed the retrieval and generation components are.
This connects to a fundamental LLMOps principle: data quality is upstream of model quality. Teams often focus on prompt engineering, model selection, and architecture design while underinvesting in the data pipeline. Document parsing is particularly treacherous because failures tend to be silent: extraction can scramble reading order, drop tables, or mangle characters without raising a single error.
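"Always look at your data" can be partly automated with cheap sanity checks on parser output. The heuristics and thresholds below are illustrative, not a standard; the idea is to flag suspicious extractions for human inspection rather than let them flow straight into the index:

```python
def parse_quality_flags(text: str) -> list:
    """Cheap heuristics for spotting a bad document parse.

    Thresholds are illustrative; tune them against known-good extractions
    from your own corpus.
    """
    flags = []
    if not text.strip():
        flags.append("empty output")
        return flags
    printable = sum(ch.isprintable() or ch in "\n\t" for ch in text)
    if printable / len(text) < 0.95:
        flags.append("unprintable characters (possible encoding damage)")
    letters = sum(ch.isalpha() for ch in text)
    if letters / len(text) < 0.5:
        flags.append("low letter ratio (tables or figures may have shredded)")
    if "\ufffd" in text:
        flags.append("replacement characters (mis-decoded bytes)")
    return flags
```

Running checks like these on both parsers' output of the same PDF makes the kind of discrepancy the speaker describes visible before a single retrieval query is issued.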
While this talk doesn’t provide quantified business outcomes or detailed technical architectures, it offers valuable practitioner wisdom that reflects the current maturity level of LLM deployments. Several themes emerge:
Defensibility over capability: Rather than maximizing what an LLM system can do, production deployments often benefit from clearly defining what the system should not do and implementing robust boundaries.
Model heterogeneity is real: The ecosystem includes models from OpenAI, Anthropic, Meta, Mistral, and others, each with different prompting conventions, strengths, and quirks. LLMOps practices need to accommodate this diversity.
Reliability lags capability: Features like function calling are exciting but not yet reliable enough for many production use cases. Setting appropriate expectations and implementing fallbacks is essential.
Data quality requires vigilance: Even with sophisticated tooling, fundamental data quality issues can undermine entire systems. Regular inspection and validation of data pipelines remains necessary.
The speaker’s background—transitioning from traditional software engineering at Apple to LLM development—also reflects a broader industry trend of experienced engineers discovering that many established software engineering practices need adaptation for the probabilistic, often unpredictable nature of LLM-based systems.