ZenML

LangSmith Integration for Automated Feedback and Improved Iteration in SDLC

Factory 2024

Factory AI implemented self-hosted LangSmith to address observability challenges in their SDLC automation platform, particularly for their Code Droid system. By integrating LangSmith with AWS CloudWatch logs and utilizing its Feedback API, they achieved comprehensive LLM pipeline monitoring, automated feedback collection, and streamlined prompt optimization. This resulted in a 2x improvement in iteration speed, 20% reduction in open-to-merge time, and 3x reduction in code churn.

Industry

Tech

Overview

Factory is an enterprise AI company focused on automating the software development lifecycle (SDLC) through what they call “Droids” — autonomous AI agents that handle different stages of software development. Their flagship product, Code Droid, is designed to automate complex software development tasks. The company raised $15 million in Series A funding led by Sequoia Capital, indicating significant investor confidence in their approach. This case study, published by LangChain (the vendor of the tools Factory uses), documents how Factory implemented LLM observability and feedback mechanisms to improve their development iteration speed.

It’s important to note that this case study originates from LangChain’s marketing materials, so the claims should be viewed with appropriate skepticism as they naturally emphasize the benefits of LangChain’s products. That said, the technical implementation details provide useful insights into real-world LLMOps challenges and solutions.

The Core Challenge: Observability in Secure Enterprise Environments

Factory operates in enterprise contexts where customers have strict data controls and security requirements. This creates a fundamental tension in LLMOps: modern LLM applications require extensive observability to debug and optimize complex multi-step workflows, but many observability solutions require sending sensitive data to third-party services, which is unacceptable in high-security environments.

Factory’s specific challenges followed from this tension: they needed detailed tracing of multi-step agentic workflows, correlation of LLM behavior with their existing application logs, and systematic collection of user feedback on agent outputs — all without sensitive data leaving customer-controlled environments.

Technical Implementation: Self-Hosted LangSmith with CloudWatch Integration

Factory’s solution centered on deploying self-hosted LangSmith, which allowed them to maintain observability infrastructure within controlled environments. The key architectural decision was integrating LangSmith with AWS CloudWatch logs, creating a unified logging and tracing infrastructure.
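The case study does not include deployment details, but from the client side, pointing at a self-hosted LangSmith deployment is a matter of redirecting the SDK's endpoint via the standard LangSmith environment variables (the hostname below is illustrative, not Factory's):

```shell
# Enable tracing and direct the LangSmith SDK at a self-hosted instance
# instead of the SaaS endpoint. The hostname is a placeholder.
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_ENDPOINT="https://langsmith.internal.example.com/api"
export LANGCHAIN_API_KEY="<self-hosted-instance-key>"
```

With these set, application code instrumented for LangSmith sends traces only to infrastructure the operator controls, which is what makes the approach viable in high-security environments.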

The integration works by exporting LangSmith traces to CloudWatch, enabling Factory engineers to correlate LangSmith events and steps with their existing CloudWatch logs. This is particularly valuable for debugging agentic workflows where the LLM makes multiple calls and decisions. By linking these systems, engineers can pinpoint exactly where in the agentic pipeline an issue occurred, maintaining what Factory describes as “a single source of truth for data flow in LLM from one step to the next.”
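Factory's integration code is not published, but the correlation pattern described above can be sketched simply: embed the LangSmith run id in every CloudWatch log event, so traces and application logs can be joined on a single key. The function and field names here are assumptions for illustration:

```python
import json
import time


def correlation_record(run_id: str, step: str, message: str, level: str = "INFO") -> dict:
    """Build a CloudWatch-ready log event that embeds a LangSmith run id.

    Carrying run_id in every application log line lets engineers join
    CloudWatch logs with LangSmith traces on one key, pinpointing where
    in an agentic pipeline an issue occurred.
    """
    return {
        "timestamp": int(time.time() * 1000),  # CloudWatch expects epoch milliseconds
        "message": json.dumps(
            {"run_id": run_id, "step": step, "level": level, "msg": message}
        ),
    }


# Shipping the event with boto3 (log group/stream names are illustrative):
# import boto3
# boto3.client("logs").put_log_events(
#     logGroupName="/factory/code-droid",
#     logStreamName="agent-steps",
#     logEvents=[correlation_record(run_id, "plan", "generated refactoring plan")],
# )
```

The design choice worth noting is structured JSON messages rather than free text: CloudWatch Logs Insights can then filter on `run_id` directly when tracing a single agent execution across systems.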

The custom tracing capability via LangSmith’s first-party API was highlighted as a key feature that enabled Factory to work around the challenges posed by their custom LLM tooling. Rather than being constrained by pre-built integrations that might not fit their architecture, they could implement tracing in a way that matched their specific system design.
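In Python, LangSmith's first-party tracing is exposed as a `@traceable` decorator. A minimal pure-Python stand-in below shows the shape of what such custom tracing captures per call — inputs, output, and latency for each named pipeline step (all names are illustrative, and the in-memory list stands in for spans shipped to a tracing backend):

```python
import functools
import time
import uuid

# Stand-in for spans that would be shipped to a tracing backend like LangSmith.
TRACE_LOG: list[dict] = []


def traced(step_name: str):
    """Minimal illustration of a @traceable-style decorator: record the
    inputs, output, and latency of each step in an agent pipeline."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            span = {
                "id": str(uuid.uuid4()),
                "step": step_name,
                "inputs": {"args": args, "kwargs": kwargs},
            }
            start = time.perf_counter()
            span["output"] = fn(*args, **kwargs)
            span["latency_s"] = time.perf_counter() - start
            TRACE_LOG.append(span)
            return span["output"]
        return wrapper
    return decorator


@traced("summarize_diff")
def summarize_diff(diff: str) -> str:
    # Placeholder for an LLM call inside an agentic workflow.
    return f"summary of {len(diff)} chars"
```

Because the decorator is applied per function, teams with custom LLM tooling can instrument exactly the steps that matter to their architecture rather than relying on pre-built framework integrations.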

Debugging Context-Awareness and Hallucination Issues

One of the more technically interesting aspects of Factory’s implementation is how they use LangSmith to debug context-awareness issues. In autonomous coding agents, hallucinations can be particularly problematic because the agent might generate code that looks plausible but doesn’t match the actual codebase or requirements.

Factory’s approach links feedback directly to each LLM call within LangSmith. When a user provides negative feedback on a Droid’s output (such as a code comment or suggestion), that feedback is attached to the specific LLM call that generated the problematic output. This eliminates the need for a proprietary logging system and provides immediate visibility into which prompts and contexts are producing poor results.

This feedback-to-call linking is crucial for debugging complex agentic systems because it allows engineers to see not just that something went wrong, but exactly what the LLM “saw” (its input context) when it produced the erroneous output. This contextual debugging is essential for understanding whether problems stem from insufficient context, poorly formatted context, or issues with the prompt itself.
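A sketch of the feedback-to-call link, assuming a thumbs-up/down signal in the product UI (the record shape and helper name are illustrative; the commented SDK call uses LangSmith's real `Client.create_feedback` API):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FeedbackEvent:
    """Feedback tied to the exact LLM call (run) that produced an output."""
    run_id: str              # LangSmith run id of the call being rated
    key: str                 # feedback dimension, e.g. "user_rating"
    score: int               # 1 = thumbs up, 0 = thumbs down
    comment: Optional[str] = None


def feedback_from_ui(run_id: str, thumbs_up: bool,
                     comment: Optional[str] = None) -> FeedbackEvent:
    """Map a thumbs-up/down signal on a Droid comment to a feedback record."""
    return FeedbackEvent(
        run_id=run_id,
        key="user_rating",
        score=1 if thumbs_up else 0,
        comment=comment,
    )


# In production the record would be posted with the LangSmith SDK, roughly:
# from langsmith import Client
# evt = feedback_from_ui(run_id, thumbs_up=False, comment="referenced nonexistent API")
# Client().create_feedback(evt.run_id, evt.key, score=evt.score, comment=evt.comment)
```

The key is the `run_id`: because the rating is attached to one specific call rather than to the session, an engineer reviewing the trace sees exactly the input context that produced the rated output.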

Automated Feedback Loop and Prompt Optimization

Perhaps the most significant LLMOps innovation described in this case study is Factory’s automated feedback loop for prompt optimization. Traditional prompt engineering involves manually reviewing outputs, identifying problems, hypothesizing about causes, and iteratively refining prompts. This process is time-consuming and often relies heavily on human intuition.

Factory implemented a more systematic approach using LangSmith’s Feedback API:

The workflow begins when a Droid posts a comment (such as a code review comment) and users provide positive or negative feedback. This feedback is appended to various stages of Factory’s workflows through the Feedback API. The feedback is then exported to datasets where it can be analyzed for patterns.
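The export step can be sketched as a simple partition, under the assumption that each traced run carries its prompt, its output, and the feedback score attached via the Feedback API (field names are illustrative):

```python
def split_by_feedback(runs: list[dict]) -> dict[str, list[dict]]:
    """Partition traced runs into good/bad example sets by user feedback.

    Each run dict is assumed to carry the prompt, the output, and a
    feedback_score (1 = positive, 0 = negative) attached to that call.
    """
    datasets: dict[str, list[dict]] = {"good": [], "bad": []}
    for run in runs:
        bucket = "good" if run.get("feedback_score") == 1 else "bad"
        datasets[bucket].append({"prompt": run["prompt"], "output": run["output"]})
    return datasets
```

The resulting good/bad example sets are what makes pattern analysis possible: individual failures are noisy, but a dataset of them can be examined systematically.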

The clever part of their approach is using the LLM itself to analyze why certain prompts produced bad outputs. They have the LLM examine a prompt alongside examples of good and bad outputs, then generate hypotheses about why the prompt might have caused the bad example but not the good example. This meta-analysis helps identify systematic issues with prompts that might not be obvious from looking at individual failures.
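The meta-analysis step amounts to assembling a critique prompt from the prompt under review plus its good and bad outputs. A minimal sketch (the exact wording Factory uses is not published; this illustrates the structure):

```python
def build_critique_prompt(prompt: str, good: list[str], bad: list[str]) -> str:
    """Assemble a meta-prompt asking an LLM to hypothesize why `prompt`
    yielded the bad outputs but not the good ones."""
    good_block = "\n".join(f"- {ex}" for ex in good)
    bad_block = "\n".join(f"- {ex}" for ex in bad)
    return (
        "You are reviewing a prompt used by an autonomous coding agent.\n\n"
        f"PROMPT UNDER REVIEW:\n{prompt}\n\n"
        f"OUTPUTS RATED GOOD:\n{good_block}\n\n"
        f"OUTPUTS RATED BAD:\n{bad_block}\n\n"
        "List concrete hypotheses for why this prompt could produce the bad "
        "outputs but not the good ones, and suggest one revision per hypothesis."
    )
```

The critique prompt is then sent to an LLM like any other call; its hypotheses feed the next round of prompt revisions, closing the loop.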

This approach essentially automates the “pattern recognition” step of prompt optimization that traditionally requires significant human effort. By benchmarking examples and automating this analysis, Factory claims to have increased control over accuracy while reducing the mental overhead and infrastructure requirements for processing feedback.

Reported Results and Metrics

The case study reports several quantitative improvements, though as with any vendor-published case study, these should be viewed as claims rather than independently verified results: a 2x improvement in iteration speed for Factory’s engineers, a 20% reduction in customers’ open-to-merge time, and a 3x reduction in code churn.

The iteration speed improvement is the most directly attributable to the LangSmith implementation described in this case study. The customer-facing metrics (open-to-merge time, code churn, development hours saved) are broader product success metrics that would be influenced by many factors beyond just the observability and feedback infrastructure.

Critical Assessment

While the technical approach described is sound and represents reasonable LLMOps best practices, there are some limitations to consider:

The case study is published by LangChain, so it naturally emphasizes positive outcomes and the value of LangSmith. Independent verification of the claimed metrics is not provided.

The 2x iteration speed improvement, while impressive, is somewhat vague in terms of what exactly is being measured. Iteration speed could refer to many different aspects of the development process, and the baseline comparison (previous manual methods) may not have been optimized.

The technical details, while useful, are presented at a relatively high level. Specific implementation challenges, tradeoffs, or limitations are not discussed, which would provide a more balanced view.

That said, the core architectural decisions — self-hosted observability for security-sensitive environments, integration with existing logging infrastructure (CloudWatch), feedback-to-call linking for debugging, and LLM-assisted analysis of prompt failures — represent practical patterns that other teams building autonomous LLM agents could learn from.

Broader LLMOps Lessons

This case study illustrates several important themes in production LLM operations:

Security-conscious observability: As LLMs move into enterprise contexts with strict data requirements, self-hosted or on-premises observability solutions become essential. This is particularly true for applications that process sensitive code or business data.

Unified logging and tracing: Connecting LLM-specific observability tools with existing infrastructure (like CloudWatch) helps teams avoid observability fragmentation and enables correlation of LLM behavior with other system events.

Feedback as a first-class citizen: Production LLM systems benefit enormously from systematic feedback collection and analysis. Building feedback directly into the product experience and linking it to specific LLM calls enables data-driven optimization.

LLM-assisted prompt debugging: Using LLMs to analyze their own failures is an emerging pattern that can accelerate the prompt engineering process, though it requires careful design to avoid simply reinforcing existing biases or issues.

Agentic system complexity: Autonomous agents that make multiple LLM calls and decisions present unique debugging challenges. Being able to trace through the full sequence of decisions and understand the context at each step is critical for maintaining reliability.

Factory’s implementation demonstrates a mature approach to LLMOps that goes beyond basic logging to create a continuous improvement cycle for their AI agents. While the specific metrics should be taken with appropriate caution given the source, the architectural patterns and approach provide valuable reference points for teams building similar autonomous LLM systems.
