ZenML

Rebuilding a Production Chatbot with Direct API Access and Multi-Agent Architecture

LangChain 2025

LangChain rebuilt their public documentation chatbot after discovering their support engineers preferred using their own internal workflow over the existing tool. The original chatbot used traditional vector embedding retrieval, which suffered from fragmented context, constant reindexing, and vague citations. The solution involved building two distinct architectures: a fast CreateAgent for simple documentation queries delivering sub-15-second responses, and a Deep Agent with specialized subgraphs for complex queries requiring codebase analysis. The new approach replaced vector embeddings with direct API access to structured content (Mintlify for docs, Pylon for knowledge base, and ripgrep for codebase search), enabling the agent to search iteratively like a human. Results included dramatically faster response times, precise citations with line numbers, elimination of reindexing overhead, and internal adoption by support engineers for complex troubleshooting.

Industry: Tech

Overview

LangChain’s case study documents their experience rebuilding their production documentation chatbot (chat.langchain.com) after recognizing that their own support engineers weren’t actively using the tool they had built. This self-reflective case study offers valuable insights into the realities of deploying LLM-based agents in production, particularly around the gap between theoretical capabilities and practical usability. The company discovered that their internal team had developed a three-step manual workflow—searching documentation, checking the knowledge base, and examining actual code implementation—that proved more reliable than their chatbot. This observation became the foundation for their redesign.

The original chatbot followed a conventional RAG pattern: documents were chunked, embedded as vectors, and retrieved by similarity. While functional, it suffered from several production issues: fragmented context due to chunking, constant reindexing as documentation updated multiple times daily, and citations that lacked specificity. The engineers’ preference for manual search revealed that the chatbot wasn’t meeting real user needs, particularly for complex technical troubleshooting questions.

Architecture Design: Two Modes for Different Query Types

A key architectural decision was recognizing that not all queries require the same depth of investigation. LangChain implemented two distinct agent architectures optimized for different use cases:

CreateAgent for Fast Documentation Queries: For straightforward documentation questions, they use LangChain’s CreateAgent abstraction, which prioritizes speed by eliminating planning overhead and orchestration complexity. This agent performs immediate tool calls—typically 3-6 per query—and returns answers in under 15 seconds. The system offers multiple model options (Claude Haiku 4.5, GPT-4o Mini, and GPT-4o-nano), with Haiku 4.5 demonstrating exceptional performance for rapid tool calling while maintaining accuracy. This mode handles the majority of user queries where documentation and knowledge base searches suffice.

Deep Agent with Subgraphs for Complex Queries: For questions requiring codebase analysis, they built a Deep Agent architecture with specialized subgraphs. Each subgraph operates as a domain expert: one handles documentation search, another manages knowledge base queries, and a third performs codebase analysis. This architecture takes 1-3 minutes for complex queries but provides comprehensive answers that synthesize information across multiple domains. The subgraph approach is crucial for managing context—each subagent filters information in its domain and extracts only the “golden data” before passing refined insights to the main orchestrator agent.
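The split between the two modes can be sketched as a small lookup. This is an illustration only: the case study does not describe how a query is routed, so the `needs_codebase` flag, and the names `AgentMode`, `ModeProfile`, and `profile_for`, are hypothetical. The time bounds are the ones quoted above.

```python
from dataclasses import dataclass
from enum import Enum


class AgentMode(Enum):
    FAST = "create_agent"  # documentation and knowledge base only
    DEEP = "deep_agent"    # adds codebase analysis through subgraphs


@dataclass(frozen=True)
class ModeProfile:
    tools: tuple[str, ...]
    uses_subgraphs: bool
    expected_max_seconds: int  # upper bound quoted in the case study


PROFILES = {
    AgentMode.FAST: ModeProfile(("docs", "knowledge_base"), False, 15),
    AgentMode.DEEP: ModeProfile(("docs", "knowledge_base", "codebase"), True, 180),
}


def profile_for(needs_codebase: bool) -> ModeProfile:
    """Pick the deep mode only when the question needs codebase analysis."""
    mode = AgentMode.DEEP if needs_codebase else AgentMode.FAST
    return PROFILES[mode]
```

Keeping the per-mode tool list and latency expectation in one place makes the cost of choosing the deep mode explicit at the call site.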

Moving Away from Vector Embeddings

The case study presents a critical evaluation of vector embeddings for structured content. While acknowledging that embeddings work well for unstructured content like PDFs, LangChain identifies three specific problems they encountered with product documentation:

Structural fragmentation: Chunking documentation into 500-token fragments destroys the inherent organization—headers, subsections, and contextual relationships. This resulted in citations like “set streaming=True” without explaining when or why, forcing users to hunt through pages for context.

Operational overhead: With documentation updating multiple times daily, the constant cycle of re-chunking, re-embedding, and re-uploading created significant maintenance burden and deployment friction.

Citation quality: Similarity-based retrieval produced vague citations that users couldn’t easily verify or trace back to authoritative sources.
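The structural fragmentation problem can be reproduced with a few lines of code. The sketch below uses naive whitespace tokens instead of a real tokenizer and a tiny chunk size instead of 500 tokens; the sample page text is made up for illustration.

```python
def chunk_by_tokens(text: str, chunk_size: int) -> list[str]:
    """Naive fixed-size chunking on whitespace tokens (stand-in for a real tokenizer)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


# Made-up documentation page: a heading, the "when", then the "how".
page = (
    "## Streaming "
    "Use streaming when you want tokens to appear as they are generated. "
    "To enable it, set streaming=True on the model."
)

chunks = chunk_by_tokens(page, chunk_size=8)

# The chunk a similarity search would return for "streaming=True"
# no longer carries the heading or the sentence explaining when to use it.
hit = next(chunk for chunk in chunks if "streaming=True" in chunk)
```

Here `hit` contains the instruction but neither the heading nor the explanation, which is the kind of context-free citation described above.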

The breakthrough insight was that they were “solving the wrong problem.” Documentation already has structure, knowledge bases are already categorized, and codebases are already navigable. Rather than building sophisticated retrieval mechanisms, they gave the agent direct access to existing organizational structures.

Tool Design: Mirroring Human Workflows

The tools LangChain built explicitly mirror how humans actually search for information rather than how retrieval algorithms typically work:

Documentation Search with Mintlify API: Instead of retrieving document chunks, the agent queries Mintlify’s API and receives complete pages with all headers, subsections, and code examples intact. Critically, the agent is prompted to evaluate whether initial results actually answer the question and to refine searches iteratively. With a budget of 4-6 tool calls, the agent can search for “memory,” recognize the ambiguity between different memory types, then search specifically for “checkpointing” and “store API” to provide comprehensive coverage.

Knowledge Base Two-Step Search: The knowledge base search (powered by Pylon) implements a scan-then-read pattern. First, the agent retrieves article titles—sometimes dozens—and scans them to identify relevance. Then it reads only the most promising articles in full. This prevents context window overload while ensuring the agent thoroughly understands the articles it does choose to examine.

Codebase Search Tools: The codebase search toolset mirrors the workflow of experienced engineers using tools like Claude Code. Three tools work in sequence: search_public_code uses ripgrep for pattern matching, list_public_directory understands file structure with tree commands, and read_public_file extracts specific implementations with line-number precision. For example, when investigating streaming token issues, the codebase subagent can search for “streaming buffer,” navigate to the relevant file, and return lines 47-83 where the default buffer size is implemented.
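A rough sketch of the search and read steps follows. It is not LangChain's implementation: Python's `re` module stands in for ripgrep so the example runs without external tools, and the function names `search_code` and `read_file_lines` are hypothetical.

```python
import re
from pathlib import Path


def search_code(root: Path, pattern: str) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, line) for every regex match under root.

    Stand-in for a ripgrep call; only scans *.py files to keep the sketch small.
    """
    regex = re.compile(pattern)
    hits = []
    for path in sorted(root.rglob("*.py")):
        for number, line in enumerate(path.read_text().splitlines(), start=1):
            if regex.search(line):
                hits.append((str(path.relative_to(root)), number, line.strip()))
    return hits


def read_file_lines(path: Path, start: int, end: int) -> str:
    """Return lines start..end (1-indexed, inclusive), each prefixed with its number."""
    lines = path.read_text().splitlines()
    selected = lines[start - 1:end]
    return "\n".join(f"{start + i}: {line}" for i, line in enumerate(selected))
```

Returning line numbers from both functions is what lets the final answer cite an exact range instead of a whole file.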

A central aspect of the system is prompting the agent to judge whether it has sufficient information. Rather than performing a single search and returning whatever it finds, the agent is instructed to refine queries when results are ambiguous, search for concepts mentioned but not explained, and narrow down when multiple interpretations exist. This iterative search process happens in seconds with CreateAgent but fundamentally changes response quality: the agent isn’t just retrieving, it’s reasoning about what the user actually needs.

When a user asks “How do I add memory to my agent?”, the agent recognizes this could refer to persisting conversation state within a thread or storing facts across conversations. It searches for “checkpointing” to understand thread-level persistence, fetches a relevant support article, recognizes it doesn’t cover cross-thread memory, then searches for “store API” to fill the gap. The final answer comprehensively covers both use cases with precise citations.
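The memory example can be sketched as a budgeted search loop. In the real system a language model decides which follow-up queries to issue; here a plain keyword match stands in for that judgment, and `FAKE_DOCS` and `iterative_search` are hypothetical names with made-up page text.

```python
# Hypothetical in-memory index standing in for a documentation search API.
FAKE_DOCS = {
    "memory": "Memory can mean thread-level persistence (see checkpointing) "
              "or cross-thread storage (see store API).",
    "checkpointing": "Checkpointing persists conversation state within a thread.",
    "store API": "The store API keeps facts available across conversations.",
}


def iterative_search(start_query: str, budget: int, docs: dict[str, str]) -> dict[str, str]:
    """Search, inspect each result, and queue follow-ups until the budget is spent."""
    queue, seen, findings = [start_query], set(), {}
    while queue and len(seen) < budget:
        query = queue.pop(0)
        if query in seen:
            continue
        seen.add(query)  # each search spends one unit of the tool-call budget
        page = docs.get(query, "")
        if not page:
            continue
        findings[query] = page
        # Stand-in for the model's judgment: follow up on any indexed
        # concept that the page mentions but does not explain.
        queue.extend(term for term in docs if term in page and term not in seen)
    return findings


findings = iterative_search("memory", budget=4, docs=FAKE_DOCS)
```

With a budget of four calls the loop covers both interpretations of "memory"; with a budget of two it stops after "checkpointing", which shows why the budget and the follow-up behavior have to be tuned together.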

Managing Context with Subgraphs

The initial implementation of their deep agent gave it access to all three tools simultaneously, resulting in context explosion—five documentation pages, twelve knowledge base articles, and twenty code snippets all at once. The agent would either produce bloated responses or miss key insights.

Restructuring with specialized subgraphs solved this problem. Each subagent operates independently, searching its domain, asking follow-up questions, filtering results, and extracting only essential information. The main orchestrator never sees raw search results, only refined insights. The documentation subagent might read five full pages but return only two key paragraphs; the knowledge base subagent might scan twenty titles but return only three relevant summaries; the codebase subagent might search fifty files but return only specific implementation details with line numbers. This curated information enables the main agent to synthesize comprehensive answers without drowning in irrelevant detail.
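The filtering contract between a subagent and the orchestrator can be sketched as follows. This is an assumption-laden illustration: `Insight`, `knowledge_base_subagent`, and `orchestrate` are hypothetical names, a keyword match stands in for the model's relevance judgment, and taking the first sentence stands in for summarization.

```python
from dataclasses import dataclass


@dataclass
class Insight:
    source: str   # which subagent produced it
    summary: str  # the only text the orchestrator ever sees


def knowledge_base_subagent(articles: dict[str, str], keyword: str,
                            max_reads: int = 3) -> list[Insight]:
    """Scan titles first, then read only the most promising articles in full."""
    # Step 1: scan titles only, which is cheap.
    promising = [title for title in articles if keyword.lower() in title.lower()]
    # Step 2: read a few bodies and keep one short summary for each.
    insights = []
    for title in promising[:max_reads]:
        body = articles[title]
        first_sentence = body.split(". ")[0].rstrip(".") + "."
        insights.append(Insight("knowledge_base", f"{title}: {first_sentence}"))
    return insights


def orchestrate(insights_by_agent: list[list[Insight]]) -> str:
    """Combine refined insights; raw article bodies never reach this step."""
    lines = [f"[{i.source}] {i.summary}" for group in insights_by_agent for i in group]
    return "\n".join(lines)
```

Because `orchestrate` accepts only `Insight` objects, the size of the main agent's context is bounded by the number of summaries rather than by the volume of raw search results.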

Production Infrastructure and Middleware

The case study emphasizes that elegant agent design requires robust production infrastructure. LangChain implemented modular middleware to handle operational concerns that would otherwise clutter prompts, such as guardrails, retries, fallbacks, and caching.

These layers are invisible to users but essential for reliability, allowing the agent to focus on reasoning while infrastructure handles failure modes, cost optimization, and quality control.
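One way to picture such layers is as wrappers around a model call. The sketch below is not LangChain's middleware API; `with_retries`, `with_fallback`, and `with_cache` are hypothetical helpers, and a model is reduced to a function from prompt to answer.

```python
from typing import Callable

ModelCall = Callable[[str], str]


def with_retries(call: ModelCall, attempts: int) -> ModelCall:
    """Retry a failing call up to `attempts` times before giving up."""
    def wrapped(prompt: str) -> str:
        last_error = None
        for _ in range(attempts):
            try:
                return call(prompt)
            except RuntimeError as error:
                last_error = error
        raise RuntimeError(f"gave up after {attempts} attempts") from last_error
    return wrapped


def with_fallback(primary: ModelCall, fallback: ModelCall) -> ModelCall:
    """Use the fallback model when the primary still fails after its retries."""
    def wrapped(prompt: str) -> str:
        try:
            return primary(prompt)
        except RuntimeError:
            return fallback(prompt)
    return wrapped


def with_cache(call: ModelCall) -> ModelCall:
    """Return the stored answer for a prompt that was already answered."""
    cache: dict[str, str] = {}

    def wrapped(prompt: str) -> str:
        if prompt not in cache:
            cache[prompt] = call(prompt)
        return cache[prompt]
    return wrapped
```

Composed as `with_cache(with_fallback(with_retries(primary, 2), backup))`, the layers keep failure handling and cost control out of the prompt and out of the agent's reasoning loop.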

Observability and Optimization with LangSmith

LangSmith played a central role in both development and optimization. The team traced every conversation to identify unnecessary tool calls and refine prompts. The observability data revealed that most questions could be answered with 3-6 tool calls if the agent was taught to ask better follow-up questions. LangSmith’s evaluation suite enabled A/B testing of different prompting strategies with measurements of both speed and accuracy improvements. One example trace shows a 30-second interaction with 7 tool calls: 4 documentation searches, a knowledge base article lookup, and 2 article reads, with 20 seconds devoted to streaming the final response.

Streaming and State Management

The user experience leverages the LangGraph SDK for streaming and state management. When users open Chat LangChain, their conversation history is fetched using thread metadata filtering. Each thread stores the user’s ID, ensuring conversations remain private and persistent across sessions.

When a user sends a message, responses stream in real-time using three stream modes: messages shows tokens appearing progressively, updates reveals tool calls as the agent searches, and values shows the final complete state. Users can watch the agent think, search documentation, check the knowledge base, and build responses token-by-token without loading spinners. The system also enables streamSubgraphs: true to show nested agent activity.

Conversation memory is handled by passing the same thread_id across messages, with LangGraph’s checkpointer automatically storing history, retrieving context, and maintaining state. A 7-day TTL manages data retention.
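The thread behavior described above can be mimicked with an in-memory store. This is a stand-in for illustration, not the LangGraph checkpointer: `ThreadStore` and its methods are hypothetical, and the sketch assumes the 7-day TTL is measured from thread creation, which the case study does not specify.

```python
import time
from dataclasses import dataclass, field

SEVEN_DAYS = 7 * 24 * 60 * 60  # TTL quoted in the case study, in seconds


@dataclass
class Thread:
    thread_id: str
    user_id: str
    created_at: float
    messages: list[str] = field(default_factory=list)


class ThreadStore:
    """In-memory stand-in for a checkpointer keyed by thread_id."""

    def __init__(self, ttl_seconds: float = SEVEN_DAYS):
        self.ttl_seconds = ttl_seconds
        self._threads: dict[str, Thread] = {}

    def append(self, thread_id: str, user_id: str, message: str, now=None) -> None:
        """Reusing the same thread_id appends to the existing history."""
        now = time.time() if now is None else now
        thread = self._threads.setdefault(thread_id, Thread(thread_id, user_id, now))
        thread.messages.append(message)

    def threads_for_user(self, user_id: str, now=None) -> list[Thread]:
        """Filter by the stored user ID and hide threads older than the TTL."""
        now = time.time() if now is None else now
        return [
            t for t in self._threads.values()
            if t.user_id == user_id and now - t.created_at < self.ttl_seconds
        ]
```

Filtering on the stored user ID is what keeps one user's history out of another user's sidebar, and the TTL check bounds how long any conversation is retained.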

Results and Business Impact

Since launching the new system, Chat LangChain delivers sub-15-second responses with precise citations for public users. The direct API approach eliminated reindexing overhead—documentation updates automatically without deployment cycles. Users can immediately verify answers through links to specific documentation pages or knowledge base articles.

Internally, support engineers now actively use the Deep Agent for complex tickets. The system searches documentation, cross-references known issues, and dives into private codebases to find implementation details that explain production behavior. The case study emphasizes that “the agent doesn’t replace our engineers—it amplifies them,” handling research so engineers can focus on problem-solving.

Critical Assessment and Tradeoffs

While the case study presents compelling results, several aspects warrant balanced consideration:

Architecture complexity: The dual-mode approach (CreateAgent for simple queries, Deep Agent for complex ones) introduces classification complexity. The text mentions that Deep Agent mode is “only enabled for a subset of users at launch,” suggesting challenges in determining when to invoke the more expensive architecture.

Latency tradeoffs: The 1-3 minute response time for Deep Agent queries, while justified for complex investigations, represents a significant user experience tradeoff. The case study doesn’t discuss how users perceive or react to these longer wait times, or how the system manages user expectations.

Private vs. public codebase: The internal version searches private repositories while the public version searches only public code. This creates a feature disparity that may limit the public tool’s effectiveness for users investigating behaviors that depend on private implementations.

Model selection burden: Offering multiple models (Haiku 4.5, GPT-4o Mini, GPT-4o-nano) shifts optimization decisions to end users. While flexibility is valuable, it assumes users understand performance tradeoffs and can make informed choices.

Cost implications: The case study mentions caching middleware for cost optimization but doesn’t provide detailed cost analysis. Iterative search with 3-6 tool calls per query, and more when the Deep Agent fans out across subgraphs, likely consumes considerably more tokens than single-shot retrieval.

Generalizability: The success of direct API access depends heavily on content being well-structured and APIs being available. Organizations with less organized documentation or without existing APIs may not achieve similar benefits without substantial upfront investment in content organization.

Key Takeaways for LLMOps Practitioners

The case study offers several valuable lessons for production LLM deployments:

User observation over assumptions: The trigger for the entire redesign was observing that internal users preferred their manual workflow. This highlights the importance of monitoring actual usage patterns rather than assuming tools are effective.

Architecture diversity: Different query types benefit from different architectures. The fast CreateAgent serves the majority of simple queries, while the slower Deep Agent handles edge cases requiring comprehensive investigation.

Structure over similarity: For structured content, giving agents direct access to existing organization often outperforms similarity-based retrieval with all its overhead.

Iterative search with reasoning: Prompting agents to evaluate result quality and refine searches produces significantly better outcomes than single-shot retrieval.

Context management through specialization: Subgraphs that filter and extract “golden data” before passing information to orchestrator agents prevent context overload while maintaining comprehensiveness.

Production middleware as first-class concern: Guardrails, retries, fallbacks, and caching aren’t afterthoughts but essential infrastructure for reliable production systems.

Observability drives optimization: LangSmith’s tracing and evaluation capabilities enabled data-driven prompt refinement and performance optimization.

The case study shows LangChain demonstrating their own tools in a real production scenario, which inevitably gives it a promotional slant. However, the technical details, candid discussion of initial failures, and specific architectural decisions provide valuable insights for practitioners building production LLM systems. The emphasis on observing user behavior, the willingness to abandon conventional approaches like vector embeddings when they don’t serve the use case, and the investment in production infrastructure all reflect mature LLMOps thinking.
