
Leveraging LangSmith for Debugging Tools & Actions in Production LLM Applications

Mendable 2024

Mendable.ai enhanced their enterprise AI assistant platform with Tools & Actions capabilities, enabling automated tasks and API interactions. They faced challenges with debugging and observability of agent behaviors in production. By implementing LangSmith, they successfully debugged agent decision processes, optimized prompts, improved tool schema generation, and built evaluation datasets, resulting in a more reliable and efficient system that has already achieved $1.3 million in savings for a major tech company client.

Industry

Tech

Overview

Mendable.ai is a platform that helps enterprise teams answer technical questions using AI. This case study, authored by Nicolas Camara (CTO of Mendable), describes how they leveraged LangSmith from LangChain to debug and optimize their “Tools & Actions” feature, which enables AI agents to perform automated actions beyond simple question-and-answer interactions. The case study provides valuable insights into the challenges of building production-grade agentic AI systems and the importance of observability tooling in LLMOps.

Business Context and Use Case

The case study highlights a significant enterprise deployment where Mendable equipped approximately 1,000 customer success and sales personnel at a $20+ billion tech company with GTM (Go-To-Market) assistants. These assistants provide technical guidance, process help, and industry expertise. The reported results include $1.3 million in savings over five months, with projections of $3 million in annual savings due to decreased research time and reduced dependency on technical resources. While these numbers come from the company itself and should be viewed with appropriate skepticism, they do illustrate the scale and potential impact of the deployment.

The Tools & Actions feature represents an evolution from simple Q&A to action-based AI assistants. A typical workflow involves a salesperson asking about key initiatives for a prospect, where the assistant can not only answer but also carry out follow-up actions automatically, such as calling a news-search tool on the salesperson's behalf.

This expansion in capabilities introduces significant complexity in terms of reliability, debugging, and observability.

Technical Challenges

The core technical challenge Mendable faced was the “black box” nature of agent execution. When building applications that depend on agentic behavior, reliability and observability become critical concerns. Understanding the key interactions and decisions within an agent loop is inherently tricky, especially when the agent has access to multiple resources and is embedded in a production pipeline.

A specific design feature that introduced complexity was allowing users to create custom tools via API calls. Users could input tags like <ai-generated-value> when creating API requests, which the AI would fill at request time based on the user’s question and the schema. This “just-in-time” AI input/output pattern created numerous moving parts that were difficult to debug.
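The mechanics of that pattern can be illustrated with a small sketch. All names below are hypothetical, not Mendable's actual implementation, and the named-slot variant of the tag is an illustrative extension: a user-defined request template containing AI-fill tags is completed at request time with values produced by the model.

```typescript
// Hypothetical sketch of the "just-in-time" fill pattern: tags in a
// user-defined API request template are replaced at request time with
// values supplied by the model. The ":slot" suffix is an illustrative
// extension of the <ai-generated-value> tag described in the text.
type AiValues = Record<string, string>;

const TAG = /<ai-generated-value:(\w+)>/g;

// Replace each tagged slot with the model-generated value for that slot.
// Unfilled slots are left intact so they surface clearly in traces.
function fillTemplate(template: string, values: AiValues): string {
  return template.replace(TAG, (match, slot: string) =>
    slot in values ? encodeURIComponent(values[slot]) : match
  );
}

// Example: a user-defined news-search request with one AI-filled slot.
const newsTemplate = "https://api.example.com/news?q=<ai-generated-value:q>";
const filledUrl = fillTemplate(newsTemplate, { q: "Acme Corp earnings" });
```

Leaving unfilled slots visible, rather than silently dropping them, is exactly the kind of behavior that becomes easy to spot once each request appears in a trace.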

Before implementing proper observability tooling, the development process was cluttered with console.log statements throughout the codebase and suffered from high-latency debugging runs. Trying to understand why a tool wasn’t being called or why an API request had failed became what the author describes as a “nightmare.” There was no proper visibility into the agentic behavior or whether custom tools were working as expected.

LangSmith Integration and Solution

Mendable was already using LangChain’s OpenAI tool agents, which made the integration with LangSmith straightforward. LangSmith provided the observability capabilities needed to understand agent behavior in production.
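For reference, LangSmith tracing in a LangChain application is switched on through environment variables rather than code changes to the agent itself; a minimal Node.js sketch (the key and project name are placeholders):

```typescript
// LangSmith tracing is enabled via environment variables; the agent code
// itself does not change. Values below are placeholders.
process.env.LANGCHAIN_TRACING_V2 = "true";           // enable tracing
process.env.LANGCHAIN_API_KEY = "<your-api-key>";    // LangSmith API key
process.env.LANGCHAIN_PROJECT = "tools-and-actions"; // hypothetical project name
```

This low integration cost is likely why the team describes the adoption as straightforward: an existing LangChain agent starts emitting traces without structural changes.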

Tracing and Call Hierarchy Visualization

When tracing is enabled in LangChain, the application captures and displays a detailed visualization of the run's call hierarchy. This lets developers explore each step of a run: which prompts were sent, which tools were selected, and the inputs and outputs at every level of the chain.

This visibility proved immediately valuable. The case study describes an example where, upon connecting LangSmith to the Tools & Actions module, the team quickly spotted problems that were previously invisible. One trace revealed that a call to ChatOpenAI was taking 7.23 seconds—an unusually long time. Upon investigation, they discovered that the prompt had concatenated all of their RAG pipeline prompts and sources with the Tools & Actions content, leading to significant delays in the streaming process. This insight allowed them to optimize which chunks of the prompt needed to be used by the Tools & Actions module, reducing overall latency.
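The fix they describe, passing only the prompt chunks the Tools & Actions module actually needs instead of the full concatenated RAG prompt, can be sketched roughly as follows (chunk tags and the allowlist are hypothetical):

```typescript
// Hypothetical sketch: instead of concatenating every RAG pipeline chunk
// into the Tools & Actions prompt, keep only the chunks relevant to tool
// selection, so the prompt stays short and streaming stays fast.
interface PromptChunk {
  tag: string; // e.g. "rag-sources", "rag-instructions", "tools", "persona"
  text: string;
}

const TOOLS_PROMPT_TAGS = new Set(["tools", "persona"]); // hypothetical allowlist

function buildToolsPrompt(chunks: PromptChunk[]): string {
  return chunks
    .filter((c) => TOOLS_PROMPT_TAGS.has(c.tag))
    .map((c) => c.text)
    .join("\n\n");
}

const prompt = buildToolsPrompt([
  { tag: "rag-sources", text: "SOURCE: long retrieved document text" },
  { tag: "persona", text: "You are a GTM assistant." },
  { tag: "tools", text: "Available tools: recent_news." },
]);
```

The point is not the filtering logic, which is trivial, but that the trace made it obvious which component needed it.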

Tool Inspection and Schema Validation

A particularly valuable aspect of LangSmith’s tracing capabilities was the ability to inspect tool inputs. Since Mendable allows users to create custom tools, they needed to ensure that the tool-building process in the UI was both easy and performant. This meant that when a tool was created in the backend, it needed to have the correct schema defined—partially by user input (API request details) and partially by what the AI would automatically provide at request time.

The case study provides an example of a “Recent News Tool” where the query parameter inside {query: {q}} was generated by the AI. Ensuring this query was accurate to user intent while also being optimized for the specific tool being used was challenging. LangSmith made it easy to verify this by running the same tool with different queries approximately 20 times and quickly scrolling through traces to confirm output and schema accuracy.
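A sketch of what such a tool definition might look like (the schema shape and API endpoint are illustrative, not Mendable's actual format):

```typescript
// Illustrative tool definition, not Mendable's actual format: the user
// supplies the request details, while the q parameter is marked as
// AI-generated and filled from the user's question at request time.
const recentNewsTool = {
  name: "recent_news",
  description:
    "Search recent news articles. Use a short, specific search query.",
  request: {
    method: "GET",
    url: "https://newsapi.example.com/search",
    query: { q: "<ai-generated-value>" }, // the model fills this per request
  },
};

// A trivial guard catches obviously malformed model-written queries before
// the API call; trace review in LangSmith covers the subtler failures.
function isPlausibleQuery(q: string): boolean {
  const trimmed = q.trim();
  return trimmed.length > 0 && trimmed.length <= 100;
}
```

Note that the tool's `description` field is doing real work here: as the team later discovered, it strongly shapes the schema and query the model generates.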

For traces that weren’t accurate, the team could either drill deeper into the trace to understand the issue or annotate it within LangSmith for later review. This led to an important insight: the tool descriptions were critical for generating correct schemas and inputs. Armed with this knowledge, they improved the AI-generated portions of tool creation and updated their product to emphasize the importance of providing detailed descriptions when users create tools.

Dataset Building for Evaluation

As optimization experiments accumulated, the need to quickly save inputs and outputs for further evaluation became apparent. LangSmith’s dataset functionality allowed the team to select specific runs and add them to a dataset with a single button click. This created a centralized repository of trace data that could be used for ongoing evaluation, either manually or through LangSmith’s built-in evaluation capabilities.
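The shape of such a dataset entry can be sketched as below. The run structure here is simplified and hypothetical; in LangSmith the same thing is done with one click per run in the UI, or programmatically via the SDK's dataset APIs.

```typescript
// Sketch of shaping a traced agent run into an input/output example for
// an evaluation dataset. The AgentRun shape is hypothetical; real traces
// carry far more detail (timings, intermediate steps, token usage).
interface AgentRun {
  question: string;
  toolName: string;
  toolInput: Record<string, string>;
  finalAnswer: string;
}

function toDatasetExample(run: AgentRun) {
  return {
    inputs: { question: run.question },
    outputs: {
      tool: run.toolName,
      tool_input: run.toolInput,
      answer: run.finalAnswer,
    },
  };
}

const example = toDatasetExample({
  question: "What are the key initiatives at Acme Corp?",
  toolName: "recent_news",
  toolInput: { q: "Acme Corp key initiatives" },
  finalAnswer: "Acme Corp recently announced a platform expansion.",
});
```

Capturing both the chosen tool and its input, not just the final answer, is what makes such a dataset useful for regression-testing agent behavior rather than only response quality.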

LLMOps Best Practices Demonstrated

This case study illustrates several important LLMOps best practices:

Observability as a First-Class Concern: The transition from console.log debugging to proper tracing infrastructure represents a maturation in how AI applications should be developed and maintained. Production AI systems require visibility into their decision-making processes.

Performance Optimization Through Visibility: The 7.23-second latency issue would have been extremely difficult to diagnose without proper tracing. The ability to identify which component was slow and why (prompt concatenation) enabled targeted optimization.

Iterative Testing and Validation: Running tools multiple times with different inputs and reviewing traces allowed for systematic quality improvement. This approach is more rigorous than ad-hoc testing.

Data Collection for Future Evaluation: Building datasets from production runs creates a foundation for regression testing, A/B testing, and ongoing model evaluation.

User Experience Considerations: The insight that tool descriptions are critical for proper AI behavior led to both technical improvements and UX changes to guide users toward better tool creation.

Limitations and Considerations

It’s worth noting that this case study comes from the LangChain blog and was written by a customer of LangSmith, so there is an inherent promotional element. The specific metrics cited ($1.3 million in savings, $3 million projected) are self-reported and not independently verified. The case study also doesn’t discuss any limitations or challenges with LangSmith itself, which suggests a somewhat one-sided perspective.

Additionally, the Tools & Actions feature is described as early-stage (“we are still early in the process”), so long-term production reliability data isn’t available. The case study focuses primarily on the debugging and development phase rather than ongoing production monitoring and maintenance.

Conclusion

Mendable’s experience demonstrates the critical importance of observability tooling in production LLM applications, particularly for agentic systems that make complex decisions and interact with external APIs. LangSmith provided the visibility needed to move from chaotic debugging to systematic optimization, enabling faster development cycles and more reliable deployments. For teams building similar agentic AI systems, this case study underscores the value of investing in proper tracing and debugging infrastructure early in the development process.
