Company: Lubu Labs
Title: Deploying an AI SDR Chatbot for Lead Qualification with Production-Grade Observability
Industry: Tech
Year: 2026

Summary (short): Lubu Labs deployed an AI SDR (Sales Development Representative) chatbot for a loyalty platform to qualify inbound leads, answer product questions, and route conversations appropriately. The implementation faced challenges around quality drift on real traffic, debugging complex tool and model interactions, and occasional duplicate CRM actions that could damage revenue operations. The team used LangSmith's tracing, feedback loops, and evaluation workflows to make the system debuggable and production-ready, implementing idempotent tool calls, structured state management with LangGraph, and regression testing against representative conversation datasets to ensure reliable operation.
## Overview

Lubu Labs built and deployed an AI SDR (Sales Development Representative) chatbot for a loyalty platform client, transitioning from demo to production while confronting the operational challenges that separate a proof of concept from a revenue-critical system. The use case centered on qualifying inbound leads, answering product-specific questions, collecting essential information conversationally, routing prospects to appropriate next steps (self-serve resources, meeting scheduling, support tickets), and writing structured outcomes to downstream CRM and operational systems. The case study emphasizes that production deployment of LLM-based systems requires comprehensive observability infrastructure, not just better prompts.

The implementation illustrates several core LLMOps principles: treating AI agents as stateful workflows rather than simple request-response systems, instrumenting multi-step processes for end-to-end visibility, implementing idempotency safeguards for side-effecting operations, establishing regression testing frameworks with representative data, and closing feedback loops between production outcomes and system improvements. The team positioned LangSmith as the central observability and evaluation platform that made the system "shippable" once complexity moved beyond simple chatbot interactions into tool-using agents with real business consequences.

## Technical Architecture and Implementation

The chatbot was architected as a LangGraph state machine, explicitly modeling the SDR workflow as nodes representing distinct stages, with conversation state carried forward between transitions. The team defined a custom `SDRState` class extending `MessagesState` to track lead identifiers, lead data structures, qualification scores, next-action determinations, knowledge base context from retrieval, and thread IDs used for both idempotency and tracing correlation. This structured state management contrasts with simpler stateless chatbot patterns and reflects the complexity required for production agent systems.

The workflow nodes included distinct stages for knowledge base retrieval (using a Qdrant vector store), qualification logic, product Q&A when needed, sales briefing generation, and CRM write operations. Conditional edges determined routing based on conversation state, such as whether product-specific questions required additional retrieval. The compiled graph used a PostgreSQL-backed checkpointer (`PostgresSaver`) for state persistence and reliability, enabling conversation resumption and providing a foundation for retry handling. Thread IDs served as stable identifiers linking conversation turns across the system.

The retrieval implementation connected to the Qdrant vector store with explicit error handling and validation, addressing the common production challenge that retrieval can fail silently or return low-quality results. The team implemented retry policies with exponential backoff, configuring maximum attempts, initial intervals, backoff factors, and maximum intervals while specifying which exception types trigger retries. This reflects a mature approach to handling transient failures in production LLM systems, where external dependencies (vector databases, APIs, model endpoints) introduce failure modes absent from prototype implementations.
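The write-up does not reproduce the graph code, but based on the components it names (an `SDRState` extending `MessagesState`, nodes for retrieval, qualification, Q&A, briefing, and CRM writes, conditional routing, a `PostgresSaver` checkpointer, and per-node retry policies), a minimal LangGraph sketch might look like the following. Node names, state fields, the routing helper, and the connection string are illustrative assumptions, and the exact `RetryPolicy` import path and `add_node` keyword vary across LangGraph versions.

```python
from typing import Any, Optional

from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.types import RetryPolicy  # some releases expose this as langgraph.pregel.RetryPolicy


class SDRState(MessagesState):
    """Conversation state carried between workflow nodes; field names are illustrative."""
    lead_id: Optional[str]
    lead_data: dict[str, Any]
    qualification_score: Optional[float]
    next_action: Optional[str]        # e.g. "self_serve", "book_meeting", "support_ticket"
    kb_context: list[str]             # chunks returned by Qdrant retrieval
    sales_briefing: Optional[str]
    crm_write_result: Optional[str]
    thread_id: str                    # stable ID reused for checkpointing, tracing, idempotency
    turn_id: int


def retrieve_kb(state: SDRState) -> dict:
    ...  # query Qdrant, validate results, return {"kb_context": [...]}

def qualify_lead(state: SDRState) -> dict:
    ...  # score the lead, return {"qualification_score": ..., "next_action": ...}

def answer_product_question(state: SDRState) -> dict:
    ...  # grounded product Q&A over kb_context

def generate_briefing(state: SDRState) -> dict:
    ...  # produce the sales briefing for handoff

def write_to_crm(state: SDRState) -> dict:
    ...  # side-effecting CRM update (hardened separately; see the idempotency sketch below)


def route_after_qualification(state: SDRState) -> str:
    # Hypothetical router: detour into product Q&A only when the qualifier flags a product question.
    return "answer_product_question" if state.get("next_action") == "product_question" else "generate_briefing"


builder = StateGraph(SDRState)

# Exponential backoff for transient failures (vector store, model endpoint, CRM API timeouts);
# the keyword for attaching a policy to a node differs slightly across LangGraph releases.
retry = RetryPolicy(max_attempts=3, initial_interval=0.5, backoff_factor=2.0,
                    max_interval=8.0, retry_on=(TimeoutError, ConnectionError))

builder.add_node("retrieve_kb", retrieve_kb, retry=retry)
builder.add_node("qualify_lead", qualify_lead)
builder.add_node("answer_product_question", answer_product_question)
builder.add_node("generate_briefing", generate_briefing)
builder.add_node("write_to_crm", write_to_crm)

builder.add_edge(START, "retrieve_kb")
builder.add_edge("retrieve_kb", "qualify_lead")
builder.add_conditional_edges("qualify_lead", route_after_qualification)
builder.add_edge("answer_product_question", "generate_briefing")
builder.add_edge("generate_briefing", "write_to_crm")
builder.add_edge("write_to_crm", END)

# PostgreSQL-backed checkpointing: state persists per thread_id, so conversations can resume.
with PostgresSaver.from_conn_string("postgresql://user:pass@host:5432/sdr") as checkpointer:
    checkpointer.setup()
    graph = builder.compile(checkpointer=checkpointer)
    # graph.invoke({"messages": [("user", "Hi, what does pricing look like?")]},
    #              config={"configurable": {"thread_id": "lead-123"}})
```

Compiling with a checkpointer keyed by `thread_id` is what lets later turns resume the same conversation state and gives tracing a stable identifier to correlate on.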
Tool calling emerged as a critical complexity point. The agent needed to perform actions with side effects: writing lead updates to CRM systems, triggering marketing automation, and potentially scheduling meetings. The team learned that debugging "the model responded weirdly" required seeing the exact sequence of model calls and tool invocations with their inputs, outputs, timing, and retry behavior. Without trace-level visibility into this multi-step execution, failures appeared as generic errors disconnected from root causes, making systematic improvement impossible.

## Production Challenges and Failure Modes

The case study identifies three primary challenges encountered when moving from staging to production, each illustrating a common LLMOps pitfall.

First, quality drift on real traffic manifested when prompts that "worked" in controlled testing began failing once users posed messy, multi-part questions reflecting actual conversational patterns. This highlights the gap between synthetic test cases and production distribution shifts, a fundamental challenge in deployed LLM systems where evaluation datasets rarely capture the full range of user behavior.

Second, debugging tool and model interactions proved difficult without granular observability. When failures occurred in workflows involving retrieval, reasoning, and CRM writes, the team couldn't distinguish whether problems originated from model outputs, tool execution errors, or the interaction between components. This opacity is characteristic of agent systems where multiple LLM calls, retrieval operations, and external API interactions create complex failure surfaces that single-call logging cannot adequately capture.

Third, and most insidiously, occasional duplicate actions created operational noise despite being rare enough to evade manual quality assurance. A small percentage of conversations resulted in two lead updates instead of one, causing downstream confusion and undermining trust in the system. This failure mode illustrates how side-effecting operations in production AI systems require different guarantees than read-only inference, demanding idempotency controls typically absent from prototype implementations.

The duplicate action bug specifically emerged under latency spikes, when CRM write operations timed out and triggered retries. The first attempt would eventually complete after the timeout while the retry also succeeded, resulting in duplicate writes from a single user turn. The team notes this looked innocuous in assistant message logs, since the visible output appeared correct while the damage manifested in side effects. This exemplifies how production observability must extend beyond model outputs to include tool execution timing, retry behavior, and external system interactions.

## LangSmith as Observability and Evaluation Infrastructure

LangSmith's tracing capability provided end-to-end timelines for each conversation turn, capturing model calls, tool invocations, inputs and outputs, latencies, and metadata. The team describes this as a "flight recorder" enabling reproduction of failures by run ID and comparison of behavior across prompt versions. The practical value emerged in moving from reactive debugging based on anecdotes to systematic investigation anchored in complete execution traces. Each trace included sufficient context to answer questions about which prompt version produced the observed behavior, what retrieval actually returned, which tool calls failed or returned unexpected data, and whether operations were retries potentially causing duplication.

The team emphasized tagging every run with the metadata needed for post-hoc analysis: environment (staging vs. production), channel (web, mobile, API), lead type hypothesis, prompt version identifiers, and stable lead/thread IDs. This metadata strategy reflects production observability maturity, where "what happened?" cannot be answered independently of "under what conditions?" Without systematic tagging, traces become difficult to segment and analyze across the dimensions most relevant to debugging and improvement.
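The case study does not show how this tagging was wired up; with LangGraph and LangSmith, the usual mechanism is to pass metadata and tags through the run config so they propagate to the trace and its child runs, and to wrap non-graph helpers with `@traceable`. The project name and metadata values below (`channel`, `lead_type_hypothesis`, `prompt_version`) are illustrative assumptions matching the dimensions listed above.

```python
import os

from langsmith import traceable

# Enable LangSmith tracing; older SDK versions use LANGCHAIN_TRACING_V2 / LANGCHAIN_PROJECT.
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_PROJECT"] = "sdr-chatbot-prod"  # project name is an assumption


@traceable(run_type="retriever", name="qdrant_retrieve")
def retrieve_kb_chunks(query: str) -> list[str]:
    """Non-graph helper traced on its own so retrieval quality stays inspectable."""
    ...


def run_turn(graph, user_message: str, thread_id: str, turn_id: int):
    # Metadata and tags passed in the config are recorded on the trace,
    # which is what makes post-hoc segmentation by environment, channel, and version possible.
    config = {
        "configurable": {"thread_id": thread_id},
        "metadata": {
            "environment": "production",
            "channel": "web",
            "lead_type_hypothesis": "smb",  # illustrative value
            "prompt_version": "sdr-v14",    # illustrative value
            "turn_id": turn_id,
        },
        "tags": ["sdr", "production"],
    }
    return graph.invoke({"messages": [("user", user_message)]}, config=config)
```

Because the metadata rides along with the run config, traces can later be filtered by environment, channel, or prompt version without a separate logging pipeline.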
LangSmith's evaluation workflows enabled the team to build regression testing against representative conversation datasets. They curated examples including objections, pricing questions, ambiguous intent, and messy multi-part requests, then ran the agent against this dataset as they iterated on prompts, retrieval configurations, and tool schemas. This transformed quality from a subjective assessment into a measurable property that could be tracked across deployments, preventing silent quality degradation between releases.

The evaluation strategy started simple, with checks for correct routing, required field extraction, groundedness in retrieved context, and avoidance of unsafe side effects. The team advises starting with these foundational evaluations before adding nuance, acknowledging that comprehensive evaluation frameworks evolve incrementally rather than being designed completely upfront. This pragmatic approach contrasts with perfectionist evaluation strategies that delay deployment while pursuing comprehensive metrics.

LangSmith's feedback and annotation system closed the loop between production outcomes and system improvements. The team collected structured feedback on production runs, marking good and bad outcomes with a taxonomy of failure categories: wrong routing, hallucination, missed required fields, tone mismatches, and tool failures. This categorized feedback created a prioritized queue of engineering work and fed high-quality examples back into evaluation datasets. The team warns against collecting only thumbs-up/thumbs-down signals, arguing that structured failure taxonomies convert user feedback into actionable engineering tasks.

## Solving the Duplicate Actions Problem

The duplicate lead update issue demonstrates how observability reveals what to harden in production systems. LangSmith traces made the exact tool call timing and retry behavior visible, revealing that under latency spikes, CRM write timeouts triggered retries while the original attempt eventually completed. With the root cause identified through traces, the team implemented three changes: making tool calls idempotent through idempotency keys, separating "retry the model call" from "retry the side-effecting tool call" so that different policies could be applied, and adding specific evaluation cases for "no duplicate writes" under simulated timeout conditions.

The concrete implementation added idempotency keys derived from the thread ID and turn ID to CRM write operations. The code example shows constructing a payload with lead data, score, and sales briefing, then invoking the CRM write with an idempotency key formatted as `{thread_id}:{turn_id}`. This ensures that retries of the same logical operation can be deduplicated by downstream systems or prevented by the tool implementation itself. The team positioned the CRM write boundary as the critical place to harden, reflecting the principle that side-effecting operations require stricter guarantees than read operations.
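The team's actual tool code is not reproduced in this summary, but the described pattern (a payload of lead data, score, and sales briefing written with an idempotency key of the form `{thread_id}:{turn_id}`) might look roughly like this sketch. The tool name, payload fields, and the in-memory deduplication store are assumptions for illustration; in production the deduplication would live in the CRM integration layer or use the CRM API's own idempotency-key support.

```python
from langchain_core.tools import tool

# Illustrative in-memory dedupe store; in production this check belongs in the CRM
# integration layer or uses the CRM API's native idempotency-key support.
_seen_keys: set[str] = set()


@tool
def crm_write_lead_update(lead_id: str, payload: dict, idempotency_key: str) -> str:
    """Write a lead update to the CRM at most once per logical conversation turn."""
    if idempotency_key in _seen_keys:
        return f"skipped: duplicate write for {idempotency_key}"
    _seen_keys.add(idempotency_key)
    # ... actual CRM API call goes here; forward idempotency_key if the CRM supports it ...
    return f"ok: lead {lead_id} updated"


def write_to_crm(state: dict) -> dict:
    """Side-effecting graph node: a retry of this node reuses the same key, so a
    timed-out-but-eventually-successful first attempt and its retry collapse into one write."""
    payload = {
        "lead_data": state["lead_data"],
        "qualification_score": state["qualification_score"],
        "sales_briefing": state["sales_briefing"],
    }
    idempotency_key = f'{state["thread_id"]}:{state["turn_id"]}'
    result = crm_write_lead_update.invoke(
        {"lead_id": state["lead_id"], "payload": payload, "idempotency_key": idempotency_key}
    )
    return {"crm_write_result": result}
```

Because the key is derived from the conversation turn rather than from the request attempt, a timed-out first attempt that later completes and its retry collapse into a single logical write, which is exactly the failure mode the traces exposed.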
This solution illustrates a broader LLMOps pattern: distinguishing retry policies for model inference (which is generally safe to retry) from tool execution with side effects (which requires idempotency guarantees). Many LLM frameworks default to uniform retry behavior, but production systems must differentiate between retriable operations that cannot cause duplicate effects and operations that modify external state. The case study frames this as "separating thinking from doing," positioning model calls as stateless reasoning and tool calls as stateful actions requiring different operational disciplines.

## Production Deployment Lessons and Recommendations

The case study synthesizes several recommendations for production LLM deployments based on lessons learned.

First, make grounding inspectable by using explicit knowledge base retrieval steps with full tracing of what was retrieved and under which configuration (prompt version, retrieval parameters, thread ID). The team used Qdrant for vector search but emphasizes that the choice matters less than ensuring retrieval is observable rather than opaque. This enables debugging questions about whether answer quality issues stem from retrieval problems or from reasoning failures.

Second, separate thinking from doing by treating retries as normal for model calls but dangerous for side effects. The recommendation is to make tool calls idempotent and to distinguish "retry the model" from "retry the tool," applying different policies and safeguards to each. This architectural principle prevents common production issues where naive retry logic causes duplicate database writes, duplicate emails, or duplicate charge attempts.

Third, treat quality as a regression problem by maintaining representative conversation datasets and re-running evaluations on every meaningful change to prompts, models, retrieval settings, or tool schemas. This transforms quality assurance from subjective review into systematic measurement, creating accountability for changes and early detection of regressions. The team notes this provided product and sales stakeholders with concrete evidence for improvements rather than debates over individual conversation transcripts.
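The regression runs behind this third recommendation are described only at a high level; with LangSmith's Python SDK they might be wired up roughly as follows. The dataset name, example schema, placeholder target, and the two evaluators (routing correctness and a duplicate-write check) are illustrative assumptions about how the checks named earlier could be expressed, not the team's actual evaluators.

```python
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# Curated, representative conversations: objections, pricing questions, ambiguous
# intent, messy multi-part requests. Dataset name and schema are assumptions.
dataset = client.create_dataset("sdr-regression-v1")
client.create_examples(
    inputs=[{"message": "We're comparing two other vendors, and what does pricing look like for 50k members?"}],
    outputs=[{"expected_route": "book_meeting"}],
    dataset_id=dataset.id,
)


def run_agent(inputs: dict) -> dict:
    """Target under test: run one turn of the compiled graph and report what it did."""
    # result = graph.invoke({"messages": [("user", inputs["message"])]}, config=...)
    return {"route": "book_meeting", "crm_write_keys": ["thread-1:0"]}  # placeholder output


def correct_routing(run, example) -> dict:
    predicted = (run.outputs or {}).get("route")
    return {"key": "correct_routing", "score": int(predicted == example.outputs["expected_route"])}


def no_duplicate_writes(run, example) -> dict:
    # Pass only if every CRM idempotency key produced during the turn is unique.
    keys = (run.outputs or {}).get("crm_write_keys", [])
    return {"key": "no_duplicate_writes", "score": int(len(keys) == len(set(keys)))}


results = evaluate(
    run_agent,
    data="sdr-regression-v1",
    evaluators=[correct_routing, no_duplicate_writes],
    experiment_prefix="prompt-v15",  # tie each run of the suite to the change being tested
)
```

Re-running an experiment like this on every prompt, model, retrieval, or tool-schema change is what turns quality drift from an anecdote into a visible regression.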
The overarching message positions observability as non-optional for AI agents touching revenue operations. The case study explicitly states LangSmith didn't "make the model smarter" but rather made the system "understandable enough to improve" and therefore safe to ship. This framing challenges the common assumption that production readiness primarily requires better prompts or models, arguing instead that the critical enabler is infrastructure for systematic understanding, testing, and improvement of complex multi-step agent behaviors.

## Critical Assessment and Limitations

While the case study provides valuable practical insights, several aspects warrant balanced consideration. The narrative strongly emphasizes LangSmith as the enabling technology, which reflects both the genuine value the team found and the promotional context of a case study likely coordinated with LangChain/LangSmith marketing. Alternative observability approaches exist (Arize Phoenix, Langfuse, custom instrumentation with OpenTelemetry) that could provide similar capabilities, and the case study doesn't explore the tradeoffs between these options.

The technical depth is limited in certain areas. The retrieval implementation mentions Qdrant and retrieval tracing but provides minimal detail about embedding strategies, chunking approaches, reranking, or retrieval evaluation methodology. The evaluation framework is described at a high level (routing correctness, field extraction, groundedness) without specifying how these properties are measured or what thresholds determine pass/fail. The CRM integration details are minimal, and the specific loyalty platform context is left vague, limiting the transferability of lessons to other domains.

The duplicate actions problem, while interesting, represents a relatively standard distributed systems challenge around idempotency that is well understood outside the LLM context. The solution (idempotency keys based on conversation turn) is correct but not particularly novel. The case study could have explored more LLM-specific challenges such as prompt injection vulnerabilities, handling retrieval of outdated information, or managing context window limitations in long conversations.

The focus on observability and testing is valuable but may underweight other critical production concerns. The case study doesn't discuss cost management and token usage optimization, latency requirements and optimization strategies, model version management and controlled rollout strategies, data retention and privacy compliance for conversation logs, or security considerations around tool calling and CRM access. For a production SDR system handling potentially sensitive business information, these omissions are notable.

The results section is largely qualitative, describing the system as "shippable" and emphasizing that observability made systematic improvement possible, but providing limited quantitative metrics. We don't learn about actual lead qualification accuracy, conversation completion rates, handoff quality as rated by sales teams, false positive rates for routing decisions, or the magnitude of improvements from prompt iterations. The lack of metrics makes it difficult to assess the actual business impact and to validate the strong claims about production readiness.

## Broader Context and LLMOps Implications

Despite these limitations, the case study effectively illustrates several important LLMOps principles that extend beyond the specific tooling choices. The distinction between demo-quality and production-quality systems centers on observability, testing infrastructure, and operational safeguards rather than core model capabilities. Many teams underinvest in these areas while over-focusing on prompt engineering and model selection, only to encounter deployment challenges when systems face real traffic patterns and business consequences.

The architecture pattern of modeling agent workflows as explicit state machines with distinct nodes and persistent state reflects emerging best practices for complex LLM applications. This approach provides clearer boundaries for testing, debugging, and reasoning about system behavior than monolithic prompt chains or purely reactive chatbot patterns. The checkpointing strategy enables conversation resumption and provides audit trails, both critical for production systems.

The emphasis on structured feedback loops and continuous evaluation demonstrates maturity in thinking about LLM systems as living artifacts requiring ongoing tuning rather than one-time deployments.
The recommendation to maintain curated datasets of representative interactions and to run regression tests on changes reflects software engineering discipline applied to ML systems, bridging the gap between traditional software practices and the unique challenges of generative AI.

The case study implicitly addresses the challenge of stakeholder alignment in AI projects by describing how systematic evaluations provided product and sales teams with concrete evidence for improvements. This highlights how good LLMOps practices facilitate cross-functional collaboration by making system behavior measurable and discussable in shared terms, moving beyond subjective assessments of individual outputs.

Overall, while presented through the lens of specific tooling (LangSmith, LangGraph), the case study captures genuine lessons about what production LLM deployment requires: comprehensive observability, systematic evaluation, careful handling of side effects, structured feedback collection, and disciplined regression testing. These principles transcend particular platforms and represent foundational LLMOps capabilities that teams building production AI systems should prioritize.
