ZenML

Overcoming LLM Production Deployment Challenges

Neeva 2023

A comprehensive analysis of the challenges and solutions in deploying LLMs to production, presented by a machine learning expert from Neeva. The presentation covers both infrastructural challenges (speed, cost, API reliability, evaluation) and output-related challenges (format variability, reproducibility, trust and safety), along with practical solutions and strategies for successful LLM deployment, emphasizing the importance of starting with non-critical workflows and planning for scale.

Industry

Tech

Overview

This presentation comes from Tanmay, a machine learning practitioner who has worked at both TikTok and Neeva, a search AI startup. The talk takes a pragmatic and somewhat contrarian view of LLM deployment, acknowledging the significant gap between the impressive demos flooding social media and the reality of production deployments. The speaker self-describes as a “buzzkill about LLMs,” signaling an intentionally grounded perspective that contrasts with the hype cycle. This context matters because the speaker works at an AI-first search company where LLM deployment challenges bear directly on the core product.

The central thesis is that while Twitter and other platforms showcase thousands of LLM demos, very few of these actually make it to production. For practitioners working in industry, this observation resonates strongly. The talk systematically categorizes the obstacles into two major buckets: infrastructural challenges and output-linked challenges.

Infrastructural Challenges

Latency and Speed

One of the most significant barriers to production deployment is that LLMs are inherently slower than many status quo experiences users have become accustomed to. The speaker uses search as a prime example, which is particularly relevant given Neeva’s focus on search AI. Users expect near-instantaneous results, but LLM-generated outputs take significantly longer to complete compared to traditional retrieval-based approaches.

The solutions proposed fall into two categories: making models genuinely faster, or making them seem faster through UX techniques. On the technical side, conventional ML optimization techniques like model distillation and pruning can reduce latency. Using smaller models where larger ones aren’t strictly necessary is also recommended. However, these approaches lean toward the “build” side of the spectrum, requiring more technical investment.

For those using API-based solutions, human-computer interaction (HCI) techniques become crucial. The speaker recommends loading animations, streaming output (showing tokens as they’re generated rather than waiting for completion), and parallelizing complementary tasks rather than making them blocking operations. This acknowledgment that perceived performance can be as important as actual performance is a sophisticated insight that bridges technical and product considerations.
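
The streaming and parallelization ideas above can be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation: `fake_token_stream` and `fetch_related_links` are hypothetical stand-ins for a streaming LLM API and a complementary task, and the example URL is a placeholder.

```python
import asyncio
from typing import AsyncIterator, List, Tuple


async def fake_token_stream(text: str, delay: float = 0.01) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming LLM API: yields one token at a time."""
    for token in text.split():
        await asyncio.sleep(delay)  # simulated per-token generation latency
        yield token


async def stream_answer(prompt: str, shown: List[str]) -> str:
    """Append each token to `shown` as it arrives instead of waiting for the full answer."""
    async for token in fake_token_stream(f"echo: {prompt}"):
        shown.append(token)  # a real UI would render the token here
    return " ".join(shown)


async def fetch_related_links(prompt: str) -> List[str]:
    """Hypothetical complementary task that does not depend on the LLM output."""
    await asyncio.sleep(0.02)
    return [f"https://example.com/search?q={prompt.replace(' ', '+')}"]


async def handle_query(prompt: str) -> Tuple[str, List[str]]:
    shown: List[str] = []
    # Run generation and the complementary task concurrently rather than sequentially.
    answer, links = await asyncio.gather(
        stream_answer(prompt, shown),
        fetch_related_links(prompt),
    )
    return answer, links
```

Calling `asyncio.run(handle_query("fast search"))` runs both coroutines concurrently, so the total wait is close to the slower of the two rather than their sum.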

Buy vs. Build Decisions

The talk frames this as one of the most consequential decisions in LLM deployment. “Buying” refers to purchasing API access to foundation models from providers such as OpenAI and Anthropic, while “building” typically means fine-tuning open-source LLMs for specific use cases.

The economics differ substantially: buying presents lower upfront costs but creates scaling challenges as usage grows, while building requires higher upfront investment with more uncertainty about whether the resulting model will meet quality thresholds. The speaker’s recommended approach is nuanced: “buy while you build.” This strategy involves using API access to get to market quickly, validate MVPs, and collect real user data. Meanwhile, teams should simultaneously work on fine-tuning in-house models to ensure long-term cost sustainability as adoption scales.
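
The trade-off can be made concrete with a simple break-even calculation. The talk gives no cost figures, so the function below is only a sketch: all parameter names are hypothetical, and any numbers passed in are illustrative assumptions rather than data from the presentation.

```python
import math
from typing import Optional


def break_even_volume(
    api_cost_per_unit: float,
    build_upfront_cost: float,
    self_hosted_cost_per_unit: float,
) -> Optional[int]:
    """Usage volume at which building becomes cheaper than buying.

    A "unit" is whatever volume the per-unit costs are quoted for
    (for example, 1,000 requests). Returns None if self-hosting never
    pays back because it is not cheaper per unit.
    """
    saving_per_unit = api_cost_per_unit - self_hosted_cost_per_unit
    if saving_per_unit <= 0:
        return None
    return math.ceil(build_upfront_cost / saving_per_unit)
```

With made-up inputs of 10.0 per unit for the API, 2.0 per unit self-hosted, and 50,000 upfront, `break_even_volume(10.0, 50000.0, 2.0)` returns 6250 units. Below that volume buying is cheaper, which is the window in which "buy while you build" applies.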

This is particularly astute advice for startups and growth-stage companies that need to balance speed-to-market with long-term operational sustainability. It acknowledges that the optimal choice may evolve over time rather than being a one-time decision.

API Reliability

An emerging challenge that the speaker highlights is the reliability of foundational model APIs. As these providers build out their serving infrastructure, users can quickly lose trust when they experience even infrequent downtime. This is especially problematic for production deployments where reliability expectations are high.

The recommended mitigation strategy draws an analogy to multi-cloud approaches: implementing fallbacks across different foundation model providers. The speaker notes that, at the time of the talk, there had not been an instance where two major foundation model providers failed simultaneously, making this a reasonable redundancy strategy.

Additionally, the concept of “failing gracefully” is emphasized. Users generally understand that LLM technology is still in active development, so thoughtful error handling and fallback experiences can maintain user trust even when failures occur. Planning for the “last resort case” is framed as essential rather than optional.
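
A provider fallback chain with a graceful last resort might look like the sketch below. The provider functions are hypothetical placeholders for real API clients, and the fallback message is an assumption about what a friendly failure could say.

```python
from typing import Callable, Sequence

FALLBACK_MESSAGE = "Sorry, the assistant is temporarily unavailable. Please try again shortly."


def complete_with_fallback(
    prompt: str,
    providers: Sequence[Callable[[str], str]],
    fallback_message: str = FALLBACK_MESSAGE,
) -> str:
    """Try each provider in order; return a friendly message if all of them fail."""
    for call_provider in providers:
        try:
            return call_provider(prompt)
        except Exception:
            # In production, log the error and the provider name before moving on.
            continue
    return fallback_message  # the "last resort case"


def primary_provider(prompt: str) -> str:
    """Hypothetical provider that is currently down."""
    raise TimeoutError("simulated outage")


def secondary_provider(prompt: str) -> str:
    """Hypothetical backup provider."""
    return f"[secondary] answer to: {prompt}"
```

Passing `[primary_provider, secondary_provider]` yields the secondary provider's answer; if every provider raises, the caller still receives a readable message instead of an unhandled error.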

Evaluation Infrastructure

The speaker acknowledges that evaluation of LLM outputs remains a significant challenge, with the industry still leaning “relatively heavily on the manual side.” There’s an ongoing search for clearer quality metrics for LLM output, but this remains an unsolved problem.

The practical advice offered is to embrace a “fail fast with fail-safes” mentality. This means ensuring that trust and safety failures are prevented (the fail-safe), while being more permissive about iterating on core product quality in production. Strong user feedback loops become essential, and the recommendation is to link LLM integrations to top-line metrics like session duration or stay duration to evaluate real-world impact.

Output-Linked Challenges

Output Format Variability

This is described as “probably the largest chunk of challenge” in the output category. Since LLMs are generative models, there’s inherent unpredictability in their responses, which creates significant problems when integrating them into pipelines that expect specific output formats.

The solutions offered include few-shot prompting, where output examples are provided directly in the prompt to guide the model toward desired formats. The speaker also mentions libraries like Guardrails and LMQL (referred to as “realm” in the transcript) that can validate output formats and retry LLM calls when validation fails. This represents an important pattern in production LLM systems: treating LLM outputs as potentially malformed and building validation and retry logic around them.
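
The validate-and-retry pattern can be sketched in plain Python without relying on any particular library's API. The required keys and the `llm` callable below are hypothetical; tools such as Guardrails offer richer versions of the same idea.

```python
import json
from typing import Callable, Optional

REQUIRED_KEYS = {"title", "summary"}  # hypothetical schema, for illustration only


def parse_if_valid(raw: str) -> Optional[dict]:
    """Return the parsed object if `raw` is a JSON object with the required keys, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and REQUIRED_KEYS.issubset(data):
        return data
    return None


def call_with_validation(
    llm: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> dict:
    """Call `llm` until its output validates or the attempts run out."""
    for _ in range(max_attempts):
        parsed = parse_if_valid(llm(prompt))
        if parsed is not None:
            return parsed
    raise ValueError(f"no valid output after {max_attempts} attempts")
```

Bounding the number of attempts matters here: each retry adds latency and cost, so the caller should decide what to do when validation keeps failing rather than loop indefinitely.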

Lack of Reproducibility

The speaker characterizes this as “absolutely the easiest one to solve.” The issue is that the same input might produce different outputs even from the same model. The simple solution is to set the temperature parameter to zero, which makes the model always pick its most likely next token and so makes outputs effectively deterministic (hosted APIs can still show small run-to-run differences). While this reduces creativity and variety, it provides the consistency needed for many production use cases.
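
To see why a temperature of zero removes sampling randomness, consider this toy sampler. It is a simplified sketch over a hypothetical three-token vocabulary, not how any specific provider implements decoding.

```python
import math
import random
from typing import Dict


def sample_token(logits: Dict[str, float], temperature: float, rng: random.Random) -> str:
    """Sample from a temperature-scaled softmax; temperature 0 means greedy argmax."""
    if temperature == 0:
        # No sampling at all: always return the highest-scoring token.
        return max(logits, key=logits.get)
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]
```

With `logits = {"cat": 2.0, "dog": 1.5, "eel": 0.1}`, every call at temperature 0 returns `"cat"`, while higher temperatures flatten the distribution and let other tokens through.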

Adversarial Attacks and Trust & Safety

The speaker groups prompt hijacking and trust & safety concerns together, reasoning that the primary negative outcome of prompt hijacking is generating outputs that a trust and safety layer could prevent. This framing suggests that robust content filtering and safety guardrails can address multiple attack vectors simultaneously.

Strategic Recommendations for First Production Deployment

The talk concludes with practical advice for getting “your first LLM to prod” successfully:

Project Positioning: Focus on deploying to non-critical workflows where the LLM can add value without becoming a dependency. This acknowledges the current state of the technology where output variability and API downtimes make critical-path deployments risky. The goal is to “add value but not become a dependency” while more reliable serving infrastructure is built.

Higher Latency Use Cases: Targeting use cases where users have lower expectations for response speed creates more room to deliver value despite the inherent latency of LLM generation. This is about finding product-market fit for LLM capabilities given current technical constraints.

Plan to Build While You Buy: Reiterated as crucial for long-term success, this ensures that initial deployments using API access have a path to sustainable scaling as adoption grows.

Do Not Underestimate HCI: The human-computer interaction component is emphasized as a significant determinant of LLM success. Responding “seemingly faster” through streaming and progressive loading, failing gracefully when errors occur, and enabling large-scale user feedback are all critical to production success. This underscores that LLMOps isn’t purely a backend engineering discipline; it intersects heavily with product design and user experience.

Critical Assessment

This talk provides a valuable reality check for the LLM deployment space. The speaker’s experience at both TikTok (a large-scale consumer platform) and Neeva (an AI-native startup) lends credibility to the observations. However, it’s worth noting that some of the challenges described may have evolved since the talk was given, as the LLM infrastructure space is developing rapidly. API reliability has generally improved, new evaluation frameworks continue to emerge, and the tooling ecosystem for output validation has matured.

The “buy while you build” strategy is sound but assumes organizations have the resources and expertise to pursue parallel tracks, which may not be realistic for smaller teams. Additionally, the recommendation to target non-critical workflows, while pragmatic, could limit the impact and business value of initial LLM deployments.

Overall, this represents a practitioner-focused, battle-tested perspective on LLMOps that appropriately tempers enthusiasm with operational reality.
