
Integrating Live-Staffed AI Chat with LLM-Powered Customer Service

Smith.ai 2024

Smith.ai transformed their customer service platform by implementing a next-generation chat system powered by large language models (LLMs). The solution combines AI automation with human supervision, allowing the system to handle routine inquiries autonomously while enabling human agents to focus on complex cases. The system leverages website data for context-aware responses and seamlessly integrates structured workflows with free-flowing conversations, resulting in improved customer experience and operational efficiency.

Industry: Tech
Overview

Smith.ai is a company that provides virtual receptionist and customer engagement services, offering both AI-powered and human-staffed solutions for businesses across various industries including law firms, home services, healthcare, and more. This case study describes their transition from traditional rule-based AI chat systems to a generative AI-powered web chat product that combines large language models with human agent supervision.

The announcement, written by Travis Corrigan (Head of Product at Smith.ai), positions this as a major product evolution that fulfills a long-standing company goal of creating more natural, human-like AI interactions. While the text is inherently promotional in nature, it does provide useful insights into the architectural decisions and operational approach behind their LLM-powered chat system.

The Problem with Previous AI Approaches

Smith.ai states that their previous AI technology, developed approximately 5-7 years ago, was fundamentally limited in its conversational capabilities.

This meant that while their previous chat systems could handle basic queries, they frequently required human intervention for anything beyond the most straightforward interactions, reducing efficiency and potentially frustrating customers who expected more natural conversations.

The Generative AI Solution

Core Technical Approach

Smith.ai’s new system leverages large language models to enable more natural, context-aware conversations. The key technical elements described include:

Just-in-time context injection: The AI is infused with contextual information from within the ongoing chat conversation along with external content sources. This approach resembles what is commonly known in the industry as Retrieval-Augmented Generation (RAG), where relevant information is retrieved and provided to the LLM at query time to ground its responses in accurate, business-specific data.
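The just-in-time injection pattern can be sketched as follows. This is a minimal illustration of the RAG-style design the text describes, not Smith.ai's actual implementation: the retriever is a toy word-overlap scorer standing in for a real embedding search, and all names are invented.

```python
from dataclasses import dataclass

@dataclass
class Snippet:
    source: str   # e.g. the page of the client website it came from
    text: str

def retrieve(query: str, snippets: list[Snippet], k: int = 2) -> list[Snippet]:
    """Rank snippets by naive word overlap with the customer query."""
    q_words = set(query.lower().split())
    return sorted(
        snippets,
        key=lambda s: len(q_words & set(s.text.lower().split())),
        reverse=True,
    )[:k]

def build_prompt(query: str, history: list[str], snippets: list[Snippet]) -> str:
    """Inject retrieved business context and chat history at query time."""
    context = "\n".join(f"[{s.source}] {s.text}" for s in snippets)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        "History:\n" + "\n".join(history) + "\n"
        f"Customer: {query}\nAgent:"
    )

snippets = [
    Snippet("pricing page", "Plans start at $140 per month."),
    Snippet("hours page", "Our chat hours are 24/7, every day of the year."),
]
prompt = build_prompt(
    "What are your hours",
    ["Customer: Hi"],
    retrieve("What are your hours", snippets, k=1),
)
```

The key property is that the grounding context is assembled per query, so the same base model can serve many client businesses without per-client fine-tuning.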

Business-specific training data: The LLM is “steered” using the client business’s own data. Initially, the primary data source is the business’s website, which the AI ingests and can reference when formulating responses; the company indicates plans to expand to additional client-provided data sources over time.

This approach addresses a common challenge in deploying LLMs for customer service: ensuring responses are accurate and specific to the business rather than generic or potentially hallucinated.
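One crude way to make this grounding concern operational is a post-hoc check that flags draft answers whose content words are mostly absent from the retrieved context. This is a hypothetical sketch under assumed thresholds, not a mechanism the source describes; production systems typically use stronger signals such as entailment or LLM-judge scoring.

```python
def is_grounded(answer: str, context: str, min_overlap: float = 0.5) -> bool:
    """Return False when most content words of the answer never appear
    in the retrieved business context (a hallucination red flag)."""
    content = [w for w in answer.lower().split() if len(w) > 3]
    if not content:
        return True  # nothing substantive to check
    hits = sum(1 for w in content if w in context.lower())
    return hits / len(content) >= min_overlap

context = "Our chat hours are 24/7, every day of the year."
```

A failed check could trigger a regeneration attempt or hand the conversation to a human agent.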

Hybrid Free-flow and Structured Workflows

One notable architectural decision is the combination of free-form conversation handling with structured “playbooks” (task-specific workflows): the system can move between open-ended dialogue and deterministic, step-by-step task flows within a single conversation.

This hybrid approach suggests a system where the LLM handles the natural language understanding and generation components, while deterministic workflows handle critical business processes that require consistent data capture.
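A simple router makes this hybrid concrete: trigger words hand the conversation to a deterministic playbook, and anything else falls through to free-form LLM generation. The playbook names, triggers, and steps below are invented for illustration; Smith.ai's actual routing logic is not described in the source.

```python
import re
from typing import Optional

PLAYBOOKS = [
    {"name": "lead_capture",
     "triggers": {"quote", "pricing", "consultation"},
     "steps": ["What's your name?", "Best callback number?"]},
    {"name": "appointment",
     "triggers": {"appointment", "schedule", "book"},
     "steps": ["Which day works for you?", "Morning or afternoon?"]},
]

def route(message: str) -> Optional[dict]:
    """Return the first playbook whose triggers appear in the message,
    or None to fall through to free-form LLM generation."""
    words = set(re.findall(r"[a-z0-9]+", message.lower()))
    for pb in PLAYBOOKS:
        if words & pb["triggers"]:
            return pb
    return None
```

Keeping the playbook steps outside the LLM guarantees consistent data capture (name, phone number, appointment slot) even when the surrounding conversation is open-ended.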

Human-in-the-Loop Architecture

A central aspect of Smith.ai’s LLMOps approach is their continued use of human agents in a supervisory and intervention role. This is positioned as a key differentiator from purely automated solutions and addresses common concerns about LLM reliability in production environments.

The human agents supervise AI-led conversations and step in when the system encounters complex or sensitive cases it cannot resolve on its own.

The text describes this as allowing humans to focus on higher-value activities by offloading “repetitive and mundane” tasks to the AI. Agents enter conversations “later and only when necessary,” which suggests a system where the AI handles the initial interaction and escalates to humans based on certain triggers or thresholds—though the specific escalation criteria are not detailed.
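Since the source leaves the escalation criteria unspecified, the following is purely a sketch of what such triggers might look like: explicit requests for a human, sensitive topics, or repeated low-confidence AI turns. The keywords and thresholds are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    text: str
    ai_confidence: float  # model-reported confidence in [0, 1]

ESCALATION_KEYWORDS = {"human", "agent", "refund", "complaint", "emergency"}

def should_escalate(turns: list[Turn],
                    min_confidence: float = 0.6,
                    max_low_turns: int = 2) -> bool:
    """Escalate to a human on explicit requests, sensitive topics,
    or repeated low-confidence AI turns."""
    low = 0
    for turn in turns:
        words = set(turn.text.lower().split())
        if words & ESCALATION_KEYWORDS:
            return True
        if turn.ai_confidence < min_confidence:
            low += 1
            if low >= max_low_turns:
                return True
    return False
```

A policy like this lets the AI absorb the routine opening turns while guaranteeing a human enters before a frustrated or high-stakes conversation goes off the rails.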

Production Considerations and Observations

What the Case Study Addresses

The announcement touches on several LLMOps-relevant concerns: grounding responses in business-specific data, combining deterministic workflows with generative output, and keeping humans in the loop for production reliability.

What the Case Study Doesn’t Address

It’s worth noting that the case study lacks technical depth in several areas: the underlying model is never named, the criteria for escalating conversations to human agents are not specified, and no quantitative performance results are shared.

Critical Assessment

The case study is fundamentally a product announcement and marketing piece, so it naturally emphasizes benefits while omitting challenges and technical complexities. The claims about improved customer experience and AI capability are not substantiated with specific metrics or customer testimonials within this particular text.

That said, the architectural decisions described—particularly the human-in-the-loop approach and the use of business-specific data for grounding responses—align with widely-recognized best practices for deploying LLMs in customer-facing applications where accuracy and reliability are important.

The hybrid approach of combining free-form LLM conversation with structured workflows is a pragmatic solution that acknowledges current LLM limitations while leveraging their strengths in natural language understanding and generation. Similarly, maintaining human oversight addresses both quality concerns and the reality that fully autonomous AI customer service remains challenging for complex or sensitive interactions.

Industry Context

Smith.ai serves businesses across multiple industries including legal, healthcare, home services, and more. This makes their LLMOps implementation particularly interesting because it must handle diverse domain-specific vocabularies and customer service scenarios while maintaining accuracy. The modular approach of grounding on client-specific website data and documentation suggests a system designed to be customizable across these different verticals without requiring completely different models for each.

The 24/7 availability mentioned indicates this is a production system handling real customer interactions at scale, making the human oversight layer an important safety mechanism for maintaining service quality around the clock.
