ZenML

Panel Discussion on LLMOps Challenges: Model Selection, Ethics, and Production Deployment

Google, Databricks, 2023

A panel discussion featuring leaders from various AI companies discussing the challenges and solutions in deploying LLMs in production. Key topics included model selection criteria, cost optimization, ethical considerations, and architectural decisions. The discussion highlighted practical experiences from companies like Interactly.ai's healthcare deployment, Inflection AI's emotionally intelligent models, and insights from Google and Databricks on responsible AI deployment and tooling.

Industry: Tech

Overview

This case study is derived from a panel discussion featuring practitioners from multiple companies discussing their approaches to deploying LLMs in production. The panel includes representatives from Interactly.ai (healthcare conversational AI), Inflection AI (emotionally intelligent AI), Google Vertex AI, Amberflo (usage and cost tracking), and Databricks. The discussion provides valuable insights into real-world LLMOps challenges across different verticals, with a particular focus on healthcare applications, cost management, and responsible AI deployment.

Interactly.ai: Healthcare Conversational AI

Shiva, CTO of Interactly.ai (formerly at Netflix and Google), shared extensive insights about building conversational AI for healthcare front-office use cases. The primary use case involves handling patient calls for routine tasks like appointment cancellations at dental clinics. This represents a narrow but critical automation opportunity, since the efficiency of handling these calls directly impacts clinic revenue.

Model Selection Criteria

The criteria for selecting LLMs in this healthcare context are particularly stringent.

Shiva emphasized that public benchmarks like MMLU and MedQA are useful as initial screening tools but are ultimately insufficient for production decisions. The team builds custom benchmarks based on their own curated, annotated datasets derived from anonymized conversation logs. Human QA evaluation and monitoring conversion metrics post-deployment provide the most meaningful quality signals.
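The custom-benchmark idea can be sketched as a small evaluation harness run over curated, annotated examples. This is an illustrative sketch only: the dataset shape, the `toy_model` stub, and the intent labels are hypothetical, not Interactly.ai's actual pipeline.

```python
# Minimal custom-benchmark harness: score a model against curated
# (prompt, expected_intent) pairs derived from annotated logs.
def evaluate(model_fn, benchmark):
    """Return accuracy of model_fn over the annotated benchmark."""
    correct = 0
    for example in benchmark:
        prediction = model_fn(example["prompt"])
        if prediction.strip().lower() == example["expected_intent"]:
            correct += 1
    return correct / len(benchmark)

# Hypothetical annotated examples (in practice: anonymized conversations).
benchmark = [
    {"prompt": "I need to cancel my cleaning on Friday",
     "expected_intent": "cancel_appointment"},
    {"prompt": "Can I move my appointment to next week?",
     "expected_intent": "reschedule_appointment"},
]

# Stub standing in for a real LLM call.
def toy_model(prompt):
    return ("cancel_appointment" if "cancel" in prompt.lower()
            else "reschedule_appointment")

print(evaluate(toy_model, benchmark))  # 1.0
```

The same harness can be re-run after each deployment alongside human QA review and conversion metrics, which the panel identified as the most meaningful quality signals.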

Multi-Model Architecture

Interactly.ai employs a multi-model strategy in which different LLMs serve different purposes.

The team explores multiple approaches to model customization including prompt engineering, fine-tuning, and RAG-based applications, finding that all approaches can work but ultimately success depends on data quality.
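A multi-model strategy of this kind typically comes down to a routing policy. The sketch below is an assumption about how such routing might look; the task types and model names are placeholders, not the architecture described on the panel.

```python
# Illustrative task-based model router: send latency-sensitive tasks
# to a small model and open-ended dialogue to a larger one.
ROUTES = {
    "intent_classification": "small-fast-model",    # latency-sensitive
    "free_form_dialogue":    "large-capable-model", # quality-sensitive
}

def route(task_type: str) -> str:
    # Fall back to the capable model when the task is unrecognized.
    return ROUTES.get(task_type, "large-capable-model")

print(route("intent_classification"))  # small-fast-model
print(route("unknown_task"))           # large-capable-model
```

Keeping routing in one place makes it easy to swap models as costs or quality change, which matters given the cost gaps discussed later in the panel.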

Compliance and Privacy

For HIPAA compliance, the team relies on Business Associate Agreements (BAAs) with model providers who offer zero data retention policies. AWS and other major cloud providers support these agreements for their AI services.

Amberflo: LLM Cost and Usage Metering

Punit Gupta, founder of Amberflo and a former AWS General Manager who launched Amazon CloudSearch and the Amazon Elasticsearch Service, presented a compelling perspective on the evolution of pricing models in the LLM era.

The Evolution of Pricing

Pricing models have grown significantly more complex over time.

Multi-Model Cost Complexity

The panelists highlighted a concrete cost example: GPT-4 costs approximately $8,000 for 300 chats per month compared to just $300 for GPT-3.5. This 26x cost difference for similar volumes demonstrates why multi-model strategies are necessary but also why cost tracking becomes complex.

The challenge intensifies because enterprises inevitably work with multiple LLMs and multiple versions of those LLMs. Tracking cost per query, per tenant, and per transaction becomes essential for optimization and customer billing.
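Per-query, per-tenant cost attribution can be sketched as a small metering layer. The per-1K-token prices below are placeholders for illustration, not current vendor rates, and the function names are hypothetical.

```python
# Minimal per-tenant cost meter for multi-model LLM usage.
from collections import defaultdict

PRICE_PER_1K = {                 # (input, output) USD per 1K tokens; placeholder rates
    "gpt-4":   (0.03, 0.06),
    "gpt-3.5": (0.0005, 0.0015),
}

usage = defaultdict(float)       # tenant -> accumulated USD

def record(tenant, model, tokens_in, tokens_out):
    """Attribute the cost of one request to a tenant; return that cost."""
    p_in, p_out = PRICE_PER_1K[model]
    cost = tokens_in / 1000 * p_in + tokens_out / 1000 * p_out
    usage[tenant] += cost
    return cost

record("clinic-a", "gpt-4", 1200, 400)    # 0.06 USD
record("clinic-a", "gpt-3.5", 1200, 400)  # 0.0012 USD
print(round(usage["clinic-a"], 4))        # 0.0612
```

Even this toy example shows the order-of-magnitude gap between model tiers for identical token volumes, which is why the panel argued cost must be tracked per query and per tenant rather than read off a monthly vendor invoice.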

Metering as Observability

Amberflo positions metering as a new category within observability: not just tracking usage like traditional tools (Splunk, for example) but also instrumenting cost, so that spend can be attributed per query, per tenant, and per transaction.

The recommendation is clear: companies must take control of their own usage and cost tracking rather than relying on individual LLM vendors, since the production landscape will inevitably be multi-model.

Google Vertex AI: Responsible AI Deployment

Sha, a TPM Manager at Google Vertex AI, provided extensive guidance on responsible AI deployment, emphasizing that these considerations should be baked into the product development process from the beginning rather than treated as launch checklist items.

Safety Filters and Content Moderation

Key technical safeguards discussed include safety filters and content moderation controls.

Many-Shot Jailbreaking Vulnerability

The panel highlighted recent research on “many-shot jailbreaking”—a technique that exploits long context windows (which have grown from 4,000 to 1 million tokens). At around 256 shots within a context window, researchers were able to manipulate model behavior and override previous training. This underscores the importance of continuous monitoring even after deployment.
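One simple monitoring heuristic suggested by this research is to flag prompts that embed an unusually large number of in-context demonstrations. The sketch below is a toy illustration under assumed conventions (Q:/A: formatting, an arbitrary threshold), not a production defense.

```python
# Toy heuristic: count Q:/A:-style demonstration pairs in a prompt and
# flag suspiciously many-shot inputs. Threshold is illustrative only.
import re

def shot_count(prompt: str) -> int:
    # Count lines that begin a "Q:" demonstration.
    return len(re.findall(r"(?m)^Q:", prompt))

def looks_like_many_shot(prompt: str, threshold: int = 64) -> bool:
    return shot_count(prompt) >= threshold

demo = "Q: example?\nA: yes\n" * 100
print(shot_count(demo))            # 100
print(looks_like_many_shot(demo))  # True
```

A real defense would combine prompt-level checks like this with output-side safety filters and ongoing red-teaming, since attackers will not always use a recognizable demonstration format.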

Ethical and Fairness Considerations

The discussion covered several dimensions of responsible AI, including ethics and fairness.

Cost Optimization Strategies

Sha outlined multiple approaches to reducing LLM costs, chief among them right-sizing the model to the use case.

A key strategic question: “Do you really need a 405 billion parameter model?” For many use cases, a 4 billion parameter model trained on outputs from larger models may be sufficient, especially for edge cases and agentic workflows.
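Training a small model on outputs from a larger one is a distillation-style workflow. The sketch below shows only the dataset-building step, with a stubbed-out `big_model` call and a hypothetical JSONL record shape; real pipelines would call an actual LLM API and a fine-tuning service.

```python
# Sketch: collect a large model's outputs on representative prompts
# as training data for fine-tuning a much smaller model.
import json

def big_model(prompt: str) -> str:
    return f"answer to: {prompt}"   # placeholder for a real LLM call

prompts = [
    "Summarize this visit note",
    "Classify this patient request",
]

records = [{"prompt": p, "completion": big_model(p)} for p in prompts]

# JSONL is the shape many fine-tuning pipelines accept.
jsonl = "\n".join(json.dumps(r) for r in records)
print(len(records))  # 2
```

The economics follow directly: the expensive model is queried once per training example, while the cheap distilled model serves ongoing production traffic.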

Inflection AI: Emotionally Intelligent LLMs

Ted Shelton, COO of Inflection AI, described their differentiated approach to LLM development focused on emotional intelligence and human-computer interaction.

Design Principles Matter

Inflection’s thesis is that each model builder embeds specific design principles into their models. While most LLMs optimize for answering questions accurately, Inflection optimizes for supportive, emotionally intelligent interaction. The illustrative example: when asked “What should I do if I’m getting pulled over for a DUI?”, ChatGPT provides a bulleted list of advice, while Pi (Inflection’s assistant) first asks “Wait, is this happening right now? Are you okay?”

Training Methodology: RLHF with Professionals

Rather than using low-cost annotation labor, Inflection hired 25,000 professors at fair wages to provide human feedback for reinforcement learning. This investment in quality annotation directly impacts the model’s ability to interact with emotional intelligence.

Enterprise Applications

The pivot from consumer to enterprise applications opens a range of new use cases.

The enterprise approach involves tuning the base emotionally-intelligent model to be “culturally appropriate” for specific organizations—understanding that McDonald’s has different communication norms than other enterprises.

Databricks: Compound AI Systems and Data Governance

Heather Kuo, VP of Sales and Partnerships at Databricks, emphasized data governance and the compound AI systems approach.

Data Intelligence Platform

Databricks positions their platform as providing end-to-end governance and lineage from data inception through model training to output. This is critical for responsible AI because models are products of their training data.

DBRX: Cost-Efficient Model Building

Databricks demonstrated with their DBRX model that GPT-3.5-quality models can be built for approximately $10 million end-to-end, including data acquisition. This proof of concept shows that custom model development with specific intellectual property is achievable at reasonable cost.

Compound AI Systems

The panel converged on the idea that production AI systems are increasingly "compound AI systems"—chains of multiple models working together. Databricks co-authored research making the case for this approach.

Key Takeaways for LLMOps

The panel surfaced several important themes for production LLM deployment:

Model Selection: Public benchmarks are useful screening tools but insufficient for production decisions. Custom benchmarks based on real production data and human evaluation are essential.

Multi-Model Strategy: Enterprises will inevitably use multiple models for different purposes (cost, latency, capability). Planning for this complexity from the start is important.

Cost as First-Class Metric: Usage-based and cost-based observability is becoming as important as traditional performance metrics.

Responsible AI: Safety filters, bias mitigation, and continuous monitoring should be integrated throughout the development lifecycle, not added as launch checklist items.

Right-Sizing: Not every use case needs the largest model. Smaller, specialized models (potentially distilled from larger ones) can dramatically reduce costs while maintaining adequate performance.

Cultural Tuning: For enterprise deployments especially, the ability to tune models to specific organizational cultures and norms is becoming a differentiator.

Governance and Lineage: End-to-end tracking of data from ingestion through model output is essential for maintaining accountability and meeting compliance requirements.
