ZenML

Large Language Models in Production Round Table Discussion: Latency, Cost and Trust Considerations

Various 2023

A panel of experts from various companies and backgrounds discusses the challenges and solutions of deploying LLMs in production. They explore three main themes: latency considerations in LLM deployments, cost optimization strategies, and building trust in LLM systems. The discussion includes practical examples from Digits, which uses LLMs for financial document processing, and insights from other practitioners about model optimization, deployment strategies, and the evolution of LLM architectures.


Overview

This case study is derived from a roundtable discussion featuring multiple experts on the practical challenges and solutions of deploying large language models (LLMs) in production. The panel includes Rebecca (research engineer at Facebook AI Research, focusing on NLP robustness), David (VP at Unusual Ventures, with extensive MLOps experience), Hannes (ML engineer at Digits, a fintech company), and James (CEO of Bountiful, which builds monitoring and testing tools for foundation model workflows). The discussion is moderated by Diego Oppenheimer, who brings experience from DataRobot and Algorithmia.

The conversation provides valuable insights into real-world LLM deployments, covering everything from foundational concepts to specific production challenges around cost, latency, and trust. This is particularly valuable as it represents multiple perspectives: a researcher, a practitioner actively deploying LLMs, a VC seeing many companies attempt deployment, and a tooling provider.

Key Definitions and Context

Rebecca provides important historical context, positioning large language models as an evolution of transfer learning rather than an entirely new paradigm. She traces the progression from statistical models in the 1990s through the rise of deep learning around 2010, the Transformer architecture in 2017, and BERT in 2018. Parameter counts have since scaled exponentially, from BERT Large (340 million) through GPT-2 (1.5 billion) and T5 (11 billion) to GPT-3 (175 billion).

A key insight shared is that the “large” in LLMs may be somewhat misleading. The discussion references the Chinchilla paper, which showed that compute-optimal (“Chinchilla optimal”) training requires training data to scale proportionally with parameter count. By that standard, many first-generation large models were undertrained, and smaller models trained on more data (like Meta’s Llama models) can outperform larger undertrained ones.
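The Chinchilla result is often summarized with a rough rule of thumb of about 20 training tokens per parameter (the exact ratio depends on the fitted scaling laws and compute budget). A minimal sketch of that back-of-the-envelope calculation:

```python
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal training tokens for a given model size.

    Uses the ~20 tokens-per-parameter rule of thumb popularized by the
    Chinchilla paper (Hoffmann et al., 2022); the precise ratio varies
    with the compute budget and the fitted scaling laws.
    """
    return n_params * tokens_per_param

# A GPT-3-scale model (175B parameters) would want ~3.5 trillion tokens,
# far more than the ~300B tokens GPT-3 was actually trained on.
print(f"{chinchilla_optimal_tokens(175e9):.2e}")  # 3.50e+12
```

This is why a smaller model trained on more tokens can beat a larger undertrained one at the same compute budget.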

The panel identifies several characteristics that make LLMs unique for production: they can be accessed via natural language (no-code interface), they are general rather than task-specific, and they don’t suffer from traditional ML cold-start problems—you can begin working with them without pre-training data.

Production Complexity Tiers

David provides an extremely useful framework for thinking about LLM production complexity in three tiers:

Tier 1 represents the simplest applications: essentially UI wrappers that make a single API call to OpenAI or similar providers. Examples include Jasper and Copy.ai for text completion, and sales tools like Regie.ai and Autobound for generating sales copy. These require minimal infrastructure and were the first to achieve production success.

Tier 2 involves incorporating external information about the world or about users into the models. This requires building infrastructure, databases, and integrations. Examples include ChatGPT (which requires conversation context), Mem (which reads historical documents and emails for personalized completion), and aptEdge (which connects to Jira, Zendesk, and Slack for customer service). Most companies are still building at this tier, which David describes as a “sea of engineering work” that people are finally completing.

Tier 3 represents true model complexity—chaining calls, building agents that parse responses and call models iteratively, fine-tuning models, or implementing RLHF. This requires ML engineering expertise and remains challenging for most teams. David notes there’s “a big difference when you give an LLM the keys to the car versus the autopilot.”
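The three tiers can be sketched as code shapes. Everything below is illustrative: `call_llm` is a hypothetical stand-in for any hosted-model API, not a real SDK call.

```python
def call_llm(prompt: str) -> str:
    """Stub for a hosted LLM API (hypothetical helper)."""
    return f"<completion for: {prompt[:30]}>"

# Tier 1: a thin wrapper around a single API call.
def tier1(user_input: str) -> str:
    return call_llm(user_input)

# Tier 2: external context (documents, CRM data, chat history) is
# retrieved and injected into the prompt before the single call.
def tier2(user_input: str, retrieve_context) -> str:
    context = retrieve_context(user_input)
    return call_llm(f"Context:\n{context}\n\nQuestion: {user_input}")

# Tier 3: chained calls, where the model's own output drives further
# calls (agents, iterative refinement).
def tier3(user_input: str, max_steps: int = 3) -> str:
    result = user_input
    for _ in range(max_steps):
        result = call_llm(f"Refine or act on: {result}")
    return result
```

The engineering burden grows at each tier: Tier 1 needs almost no infrastructure, Tier 2 needs retrieval and integrations, and Tier 3 needs response parsing, control flow, and often fine-tuning expertise.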

James adds that his company is observing very large companies being built in Tier 2 complexity, noting that Jasper (a Tier 1 example) became an enormous company, and a second wave of companies is now emerging in Tier 2.

Digits Case Study: Self-Hosted LLMs for Financial Data

Hannes provides the most detailed production case study, from Digits, a fintech company offering real-time bookkeeping for business owners and accountants. Digits uses LLMs for multiple purposes, including text generation and custom embeddings.

Decision Process: When evaluating whether to use OpenAI’s API versus self-hosting, Digits considered several factors:

Technical Approach: Digits uses a teacher-student approach where larger models are used to train smaller, domain-specific models. The process involves:

Hannes notes that their latency improved by “orders of magnitude” between initial deployment and the optimized system, and that the domain-specific focus was the “biggest latency saver.”
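The teacher-student approach described here is a form of knowledge distillation. A minimal sketch of the standard (Hinton-style) distillation loss, not necessarily Digits' exact recipe:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher T gives softer distributions."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_label, T=2.0, alpha=0.5):
    """Blend of cross-entropy on the hard label and KL divergence toward
    the large teacher model's softened output distribution."""
    s_soft = softmax(student_logits, T)
    t_soft = softmax(teacher_logits, T)
    # KL(teacher || student) on temperature-softened distributions
    kl = sum(t * math.log(t / s) for t, s in zip(t_soft, s_soft))
    # Standard cross-entropy against the hard label at T = 1
    ce = -math.log(softmax(student_logits)[true_label])
    return alpha * ce + (1 - alpha) * (T * T) * kl
```

Trained this way, a small domain-specific student can recover much of the teacher's quality at a fraction of the inference cost, which is the mechanism behind the latency gains described above.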

Hallucination Mitigation: For text generation use cases, Digits implements multiple layers of protection:

Cost-Quality-Latency Triangle

James introduces the “cost-quality-latency triangle” framework for thinking about LLM applications. Different use cases prioritize different vertices:

A real example shared: one founder’s workflow involved seven different model instances (four fine-tuned Babbage models plus embedding models) and initially took 15 seconds end-to-end. By fine-tuning smaller models instead of using large ones, this was reduced to 3.5 seconds, which is still potentially too slow for some use cases.

Latency Challenges

The panel agrees that latency is significantly underappreciated. State-of-the-art inference latency in “lab environments” is around 29 milliseconds, but production deployments often accumulate far more latency by chaining calls into workflows.

Rebecca explains the fundamental architectural constraint: Transformers are optimized for training (parallel processing of all tokens) but not for sequential inference. RNN-based encoding would be faster for inference, but Transformers have become dominant since 2017. She points to “State Space Models” from Stanford (the “Hungry Hungry Hippos” paper) as potential future architectures that might address this, though they currently underperform Transformers.
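The sequential-inference constraint Rebecca describes can be made concrete: autoregressive decoding produces tokens one at a time, each conditioned on all previous ones, so an n-token completion costs n dependent forward passes (KV caching reduces the cost of each pass but not the number of sequential steps). A toy sketch, with a hypothetical `next_token` standing in for the model's forward pass:

```python
def next_token(context: list[str]) -> str:
    """Stand-in for a full forward pass; the real model conditions on
    every token generated so far, which is what forces sequentiality."""
    return f"tok{len(context)}"

def generate(prompt_tokens: list[str], n_new: int) -> list[str]:
    tokens = list(prompt_tokens)
    for _ in range(n_new):  # n_new strictly sequential steps
        tokens.append(next_token(tokens))
    return tokens

print(generate(["hello"], 3))  # ['hello', 'tok1', 'tok2', 'tok3']
```

During training, by contrast, all positions of a known sequence can be scored in one parallel pass, which is exactly the asymmetry the panel attributes to the Transformer architecture.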

The discussion acknowledges that truly real-time use cases (sub-15ms) are currently out of reach for most LLM applications, though this may change rapidly.

Trust and Hallucinations

The panel discusses “hallucinations”—outputs that aren’t factually accurate. Rebecca notes this has existed in language models for years but matters more now that models are used for real-world tasks. The panel discusses several mitigation approaches:
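One widely used guard against numeric hallucination in generated financial text (a hypothetical sketch, not necessarily one of the approaches the panel had in mind) is to reject any output that mentions numbers absent from the source document:

```python
import re

def numbers_in(text: str) -> set[str]:
    """Extract numeric tokens (amounts, counts, IDs) from text."""
    return set(re.findall(r"\d[\d,]*(?:\.\d+)?", text))

def is_grounded(generated: str, source: str) -> bool:
    """Accept only generations whose numbers all appear in the source,
    a simple defense against invented amounts in financial text."""
    return numbers_in(generated) <= numbers_in(source)

src = "Invoice #1042 for $3,500.00 due March 2023"
assert is_grounded("You owe 3,500.00 on invoice 1042", src)
assert not is_grounded("You owe 4,200.00", src)
```

Checks like this are cheap to run on every generation and fail closed, which matters in domains like bookkeeping where a single fabricated figure destroys user trust.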

Open Source Ecosystem

The discussion emphasizes the growing open-source LLM ecosystem beyond OpenAI. James recommends the Discord servers of EleutherAI, Cohere, and LAION as resources for understanding open-source alternatives. The panelists note that while open-source models aren’t as accessible as hosted APIs (they require fine-tuning and engineering work), they offer portability and privacy benefits for use cases that can’t send data to external APIs.

Key Takeaways for LLMOps Practitioners

The discussion surfaces several important considerations for teams deploying LLMs:

The roundtable represents a snapshot of a rapidly evolving field—multiple panelists explicitly reserve the right to “eat their words” as the technology progresses.
