Company
Factory
Title
LangSmith Integration for Automated Feedback and Improved Iteration in SDLC
Industry
Tech
Year
2024
Summary (short)
Factory AI implemented self-hosted LangSmith to address observability challenges in their SDLC automation platform, particularly for their Code Droid system. By integrating LangSmith with AWS CloudWatch logs and utilizing its Feedback API, they achieved comprehensive LLM pipeline monitoring, automated feedback collection, and streamlined prompt optimization. This resulted in a 2x improvement in iteration speed, 20% reduction in open-to-merge time, and 3x reduction in code churn.
## Overview

Factory is an enterprise AI company focused on automating the software development lifecycle (SDLC) through what they call "Droids" — autonomous AI agents that handle different stages of software development. Their flagship product, Code Droid, is designed to automate complex software development tasks. The company raised $15 million in Series A funding led by Sequoia Capital, indicating significant investor confidence in their approach.

This case study, published by LangChain (the vendor of the tools Factory uses), documents how Factory implemented LLM observability and feedback mechanisms to improve their development iteration speed. It's important to note that this case study originates from LangChain's marketing materials, so the claims should be viewed with appropriate skepticism, as they naturally emphasize the benefits of LangChain's products. That said, the technical implementation details provide useful insights into real-world LLMOps challenges and solutions.

## The Core Challenge: Observability in Secure Enterprise Environments

Factory operates in enterprise contexts where customers have strict data controls and security requirements. This creates a fundamental tension in LLMOps: modern LLM applications require extensive observability to debug and optimize complex multi-step workflows, but many observability solutions require sending sensitive data to third-party services, which is unacceptable in high-security environments.

Factory's specific challenges included:

- Tracking data flow across LLM pipelines operating in customer environments with tight data controls
- Debugging context-awareness issues in generated responses (such as hallucinations)
- Integrating observability with their custom LLM tooling, which made most off-the-shelf LLM observability tools difficult to configure
- Maintaining enterprise-level security and privacy while still achieving the observability needed to operate autonomous LLM systems reliably

## Technical Implementation: Self-Hosted LangSmith with CloudWatch Integration

Factory's solution centered on deploying self-hosted LangSmith, which allowed them to maintain observability infrastructure within controlled environments. The key architectural decision was integrating LangSmith with AWS CloudWatch logs, creating a unified logging and tracing infrastructure.

The integration works by exporting LangSmith traces to CloudWatch, enabling Factory engineers to correlate LangSmith events and steps with their existing CloudWatch logs. This is particularly valuable for debugging agentic workflows where the LLM makes multiple calls and decisions. By linking these systems, engineers can pinpoint exactly where in the agentic pipeline an issue occurred, maintaining what Factory describes as "a single source of truth for data flow in LLM from one step to the next."

The custom tracing capability via LangSmith's first-party API was highlighted as a key feature that enabled Factory to work around the challenges posed by their custom LLM tooling. Rather than being constrained by pre-built integrations that might not fit their architecture, they could implement tracing in a way that matched their specific system design.
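The case study stays at the architecture level and does not show code, so the following is only a minimal sketch of what this pattern can look like: runs traced through the langsmith SDK's first-party API are mirrored into CloudWatch Logs with boto3 so both systems share one timeline. The project, log group, and stream names, as well as the `export_runs_to_cloudwatch` helper, are illustrative assumptions rather than details from Factory's system.

```python
# Illustrative sketch only: mirror LangSmith run metadata into CloudWatch Logs
# so agent steps can be correlated with existing application logs.
# Assumes LangSmith and AWS credentials are configured via environment variables
# and that the log group/stream below already exist.
import json

import boto3
from langsmith import Client, traceable

langsmith_client = Client()
cloudwatch = boto3.client("logs")

LOG_GROUP = "/factory/llm-pipeline"   # hypothetical log group
LOG_STREAM = "code-droid-traces"      # hypothetical log stream


@traceable(run_type="chain", name="code_droid_step")
def droid_step(task: str, context: str) -> str:
    """A single agent step; the decorator records inputs/outputs as a LangSmith run."""
    # ... call the model and tools here ...
    return f"proposed change for: {task}"


def export_runs_to_cloudwatch(project_name: str) -> None:
    """Copy recent LangSmith runs into CloudWatch so both systems share one timeline."""
    events = []
    for run in langsmith_client.list_runs(project_name=project_name):
        events.append({
            "timestamp": int(run.start_time.timestamp() * 1000),
            "message": json.dumps({
                "run_id": str(run.id),
                "name": run.name,
                "run_type": run.run_type,
                "error": run.error,
            }),
        })
    if events:
        # CloudWatch requires events in chronological order.
        cloudwatch.put_log_events(
            logGroupName=LOG_GROUP,
            logStreamName=LOG_STREAM,
            logEvents=sorted(events, key=lambda e: e["timestamp"]),
        )
```

A periodic export job like this keeps CloudWatch as the shared timeline, while LangSmith remains the place to inspect the full inputs and outputs of each run.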
## Debugging Context-Awareness and Hallucination Issues

One of the more technically interesting aspects of Factory's implementation is how they use LangSmith to debug context-awareness issues. In autonomous coding agents, hallucinations can be particularly problematic because the agent might generate code that looks plausible but doesn't match the actual codebase or requirements.

Factory's approach links feedback directly to each LLM call within LangSmith. When a user provides negative feedback on a Droid's output (such as a code comment or suggestion), that feedback is attached to the specific LLM call that generated the problematic output. This eliminates the need for a proprietary logging system and provides immediate visibility into which prompts and contexts are producing poor results.

This feedback-to-call linking is crucial for debugging complex agentic systems because it allows engineers to see not just that something went wrong, but exactly what the LLM "saw" (its input context) when it produced the erroneous output. This contextual debugging is essential for understanding whether problems stem from insufficient context, poorly formatted context, or issues with the prompt itself.

## Automated Feedback Loop and Prompt Optimization

Perhaps the most significant LLMOps innovation described in this case study is Factory's automated feedback loop for prompt optimization. Traditional prompt engineering involves manually reviewing outputs, identifying problems, hypothesizing about causes, and iteratively refining prompts. This process is time-consuming and often relies heavily on human intuition.

Factory implemented a more systematic approach using LangSmith's Feedback API. The workflow begins when a Droid posts a comment (such as a code review comment) and users provide positive or negative feedback. This feedback is appended to various stages of Factory's workflows through the Feedback API and then exported to datasets, where it can be analyzed for patterns.

The clever part of their approach is using the LLM itself to analyze why certain prompts produced bad outputs. They have the LLM examine a prompt alongside examples of good and bad outputs, then generate hypotheses about why the prompt might have caused the bad example but not the good one. This meta-analysis helps identify systematic issues with prompts that might not be obvious from looking at individual failures.

This approach essentially automates the "pattern recognition" step of prompt optimization that traditionally requires significant human effort. By benchmarking examples and automating this analysis, Factory claims to have increased control over accuracy while reducing the mental overhead and infrastructure requirements for processing feedback.
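As above, the source contains no implementation code; the sketch below illustrates, under assumed names, the two mechanisms just described: attaching a user's reaction to the specific LangSmith run that produced the output via the Feedback API, then prompting an LLM to hypothesize why the prompt allowed a bad output but not a good one. The feedback key `user_rating`, the helper functions, and the model choice are placeholders, not Factory's actual code.

```python
# Illustrative sketch only: feedback-to-call linking plus LLM-assisted analysis
# of prompt failures. Assumes LANGSMITH_API_KEY and OPENAI_API_KEY are set.
from langsmith import Client
from openai import OpenAI

ls_client = Client()
llm = OpenAI()


def record_user_feedback(run_id: str, thumbs_up: bool, comment: str = "") -> None:
    """Link a user's reaction to the exact LLM call that generated the output."""
    ls_client.create_feedback(
        run_id,
        key="user_rating",            # hypothetical feedback key
        score=1.0 if thumbs_up else 0.0,
        comment=comment,
    )


def hypothesize_prompt_issue(prompt: str, good_output: str, bad_output: str) -> str:
    """Ask the LLM why the prompt may have permitted the bad output but not the good one."""
    analysis_request = (
        "You are reviewing a prompt used by an autonomous coding agent.\n"
        f"PROMPT:\n{prompt}\n\n"
        f"GOOD OUTPUT:\n{good_output}\n\n"
        f"BAD OUTPUT (received negative user feedback):\n{bad_output}\n\n"
        "Explain what about the prompt could have produced the bad output "
        "but not the good one, and suggest a concrete revision."
    )
    response = llm.chat.completions.create(
        model="gpt-4o",               # illustrative model choice
        messages=[{"role": "user", "content": analysis_request}],
    )
    return response.choices[0].message.content


def analyze_downvoted_run(run_id: str, good_output: str) -> str:
    """Pull the run's recorded input/output and generate a failure hypothesis."""
    run = ls_client.read_run(run_id)
    prompt = str(run.inputs)          # the context the model actually "saw"
    bad_output = str(run.outputs or {})
    return hypothesize_prompt_issue(prompt, good_output, bad_output)
```

In a production loop, the downvoted runs would typically be exported into datasets first, as the case study describes, so this kind of analysis can run over batches of examples rather than one failure at a time.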
## Reported Results and Metrics

The case study reports several quantitative improvements, though as with any vendor-published case study, these should be viewed as claims rather than independently verified results:

- A 2x improvement in iteration speed compared to their previous method of manual data collection and human-driven prompt iteration
- An approximately 20% reduction in open-to-merge time for the average customer
- A 3x reduction in code churn on code impacted by Droids in the first 90 days
- Over 550,000 hours of development time claimed to be saved across various organizations

The iteration speed improvement is the most directly attributable to the LangSmith implementation described in this case study. The customer-facing metrics (open-to-merge time, code churn, development hours saved) are broader product success metrics that would be influenced by many factors beyond just the observability and feedback infrastructure.

## Critical Assessment

While the technical approach described is sound and represents reasonable LLMOps best practices, there are some limitations to consider. The case study is published by LangChain, so it naturally emphasizes positive outcomes and the value of LangSmith, and independent verification of the claimed metrics is not provided. The 2x iteration speed improvement, while impressive, is somewhat vague in terms of what exactly is being measured: iteration speed could refer to many different aspects of the development process, and the baseline comparison (previous manual methods) may not have been optimized. The technical details, while useful, are presented at a relatively high level; specific implementation challenges, tradeoffs, or limitations are not discussed, which would provide a more balanced view.

That said, the core architectural decisions — self-hosted observability for security-sensitive environments, integration with existing logging infrastructure (CloudWatch), feedback-to-call linking for debugging, and LLM-assisted analysis of prompt failures — represent practical patterns that other teams building autonomous LLM agents could learn from.

## Broader LLMOps Lessons

This case study illustrates several important themes in production LLM operations:

- **Security-conscious observability**: As LLMs move into enterprise contexts with strict data requirements, self-hosted or on-premises observability solutions become essential. This is particularly true for applications that process sensitive code or business data.
- **Unified logging and tracing**: Connecting LLM-specific observability tools with existing infrastructure (like CloudWatch) helps teams avoid observability fragmentation and enables correlation of LLM behavior with other system events.
- **Feedback as a first-class citizen**: Production LLM systems benefit enormously from systematic feedback collection and analysis. Building feedback directly into the product experience and linking it to specific LLM calls enables data-driven optimization.
- **LLM-assisted prompt debugging**: Using LLMs to analyze their own failures is an emerging pattern that can accelerate the prompt engineering process, though it requires careful design to avoid simply reinforcing existing biases or issues.
- **Agentic system complexity**: Autonomous agents that make multiple LLM calls and decisions present unique debugging challenges. Being able to trace through the full sequence of decisions and understand the context at each step is critical for maintaining reliability.

Factory's implementation demonstrates a mature approach to LLMOps that goes beyond basic logging to create a continuous improvement cycle for their AI agents. While the specific metrics should be taken with appropriate caution given the source, the architectural patterns and approach provide valuable reference points for teams building similar autonomous LLM systems.
