Etsy developed a gifting assistant agent to address challenges in searching through their unique, unstructured inventory of handcrafted and vintage items. The agent uses LangChain and LangGraph to enable conversational search, helping shoppers iteratively refine gift recommendations through natural dialogue. The team built the system with a focus on engineering reliability, evaluation rigor, and streamlined deployment, launching a beta version in production within six weeks with a small team of three senior engineers and one designer. Early results showed high-quality search results and relatively high purchase rates in the limited release.
Etsy, a two-sided marketplace connecting millions of merchants selling handcrafted or vintage items to over 86 million buyers, developed a conversational AI gifting assistant to tackle unique search challenges inherent to their platform. The company’s inventory presents special difficulties because items lack fixed attribute schemas, instead relying on unstructured descriptions that can be cluttered and noisy. Additionally, important product details are often captured in visual elements like listing images, and effective search requires domain expertise across diverse niches ranging from fine arts to fandoms. The Etsy team believed that an agentic approach would be particularly valuable for gifting scenarios, where shoppers often know who they want to buy for but lack specific product ideas. The vision was to create an agent that could collaboratively and iteratively refine recommendations with the shopper to find the perfect gift.
The project was led by Derrick Kondo, an engineer on the GenAI Enablement team, and involved approximately three senior engineers and one designer. The team successfully moved from initial development to a beta production launch in just six weeks, a timeline they attribute to leveraging LangChain abstractions and platform capabilities. Early production results indicated that the agent returns high-quality search results with a relatively thin harness and achieves relatively high purchase rates in the limited release, though specific metrics were not disclosed in the presentation.
The team adopted a philosophy of starting as simple as possible and adding complexity only when justified. For the core agent framework, they selected LangChain for several reasons: they regarded it as best-in-class and first-to-market in the agentic space, it provided agent-native observability and evaluation capabilities, and it offered vendor optionality, an important consideration given rising token costs and model capacity issues. Maintaining vendor optionality reflects a pragmatic approach to running LLMs in production, where cost and availability can be significant concerns.
The solution architecture centers on a LangChain v1 ReAct agent, chosen for its balance of abstraction and control. The ReAct pattern involves the model deciding which tools to invoke based on a request, reasoning about the results after invocation, and then either continuing the tool-calling loop or returning a curated list of listing recommendations. The system includes two main points of customization: middleware that can intercept and modify agent behavior, and the tools themselves.
The tool set includes several categories:
For long-term memory management of recipient information, the team uses a PostgreSQL key-value store, providing persistent storage of user preferences and recipient profiles across sessions.
One of the earliest reliability problems the team encountered was what they termed “spin”: repeated calls by the model to the same tool. To address this, they created middleware with matching conditions and graduated intervention strategies. The middleware monitors for repeated tool calls and intervenes at different severity levels. For example, when the model makes five repeated calls to the search listings tool, the middleware adds instructions to the system prompt telling the model to synthesize its findings or ask the user for more input. If the pattern continues to ten repeated tool calls, the middleware raises an error and asks the user to try again. This graduated approach leaves the model room to explore while preventing infinite loops and wasteful computation.
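The graduated intervention can be sketched in plain Python, independent of any framework. This is a minimal illustration, not Etsy's middleware: `SpinGuard`, `SpinError`, and `NUDGE_TEXT` are hypothetical names, and the sketch assumes that "repeated" means consecutive calls to the same tool, which the talk does not specify. Only the thresholds of five and ten come from the presentation.

```python
from dataclasses import dataclass
from typing import Optional

NUDGE_AFTER = 5   # repeated calls before the prompt nudge (figure from the talk)
ABORT_AFTER = 10  # repeated calls before giving up (figure from the talk)
NUDGE_TEXT = (
    "You have called the same tool several times. "
    "Synthesize your findings or ask the user for more input."
)


class SpinError(RuntimeError):
    """Signals that the run should stop and the user should be asked to retry."""


@dataclass
class SpinGuard:
    """Counts consecutive calls to the same tool (hypothetical helper)."""

    last_tool: Optional[str] = None
    streak: int = 0

    def observe(self, tool_name: str) -> Optional[str]:
        """Record one tool call.

        Returns extra system-prompt text once the nudge threshold is reached,
        None below it, and raises SpinError at the hard limit.
        """
        if tool_name == self.last_tool:
            self.streak += 1
        else:
            # Assumption: switching tools resets the count.
            self.last_tool, self.streak = tool_name, 1

        if self.streak >= ABORT_AFTER:
            raise SpinError(f"{tool_name} called {self.streak} times in a row")
        if self.streak >= NUDGE_AFTER:
            return NUDGE_TEXT
        return None
```

A caller would invoke `observe` before each tool execution, append any returned text to the system prompt, and translate `SpinError` into a "please try again" message for the shopper.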
Another model unreliability issue involved hallucination of listing IDs, where the model would sometimes return truncated or invalid IDs. The solution involved creating a ledger system within the tools. For instance, the search listings tool records all observed listing IDs in a ledger during execution. After the model completes its run, middleware compares the curated set of IDs the model returns with the observed IDs in the ledger and fixes discrepancies in a best-effort manner. This pattern demonstrates a pragmatic approach to handling LLM unreliability through deterministic validation and correction layers.
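A ledger of this kind might look like the sketch below. The presentation does not say how discrepancies are repaired, so the repair rule here is an assumption: a returned ID that is a prefix of exactly one observed ID is treated as a truncation and expanded, and anything else that was never observed is dropped. `ListingLedger` and its method names are hypothetical.

```python
from typing import Dict, Iterable, List


class ListingLedger:
    """Remembers every listing ID the tools actually returned (hypothetical helper)."""

    def __init__(self) -> None:
        # A dict keeps insertion order and gives fast membership checks.
        self._seen: Dict[str, None] = {}

    def record(self, listing_ids: Iterable[str]) -> None:
        """Called by a tool (for example, search) for each batch of results."""
        for listing_id in listing_ids:
            self._seen[listing_id] = None

    def reconcile(self, returned: Iterable[str]) -> List[str]:
        """Best-effort repair of the IDs the model put in its final answer."""
        fixed: Dict[str, None] = {}
        for rid in returned:
            if rid in self._seen:
                fixed[rid] = None
                continue
            # Assumed repair rule: a truncated ID is a prefix of exactly one
            # observed ID. Ambiguous or unknown IDs are dropped.
            candidates = [s for s in self._seen if rid and s.startswith(rid)]
            if len(candidates) == 1:
                fixed[candidates[0]] = None
        return list(fixed)
```

For example, after recording `"1234567890"` and `"9876543210"`, reconciling `["1234567890", "98765", "4242"]` keeps the first ID, expands the second, and discards the third.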
The team discovered that the model would sometimes corrupt recipient memory data, such as storing a t-shirt size under the interest field rather than the appropriate size field. To debug and resolve these issues, they created a terminal UI agent debugger that provides interactive visibility into the agent’s state and memory store and integrates with Python’s native debugger. This tool supports semantic breakpoints, such as breakpoints at specific nodes in the agent graph, in addition to traditional program stack inspection. Using this debugger, the team identified and resolved seven or eight different issues, leading to solutions like creating broader, richer, and more descriptive recipient memory schemas that better guide the model’s memory management behavior.
The team wanted to stream results from the LangGraph agent back to clients for better user experience, but faced infrastructure challenges. Their LangGraph self-deployment existed in a relatively new Kubernetes cluster that they didn’t want to expose directly to the internet, as this would require re-implementing authentication endpoints and adding redundant security protections. Meanwhile, their existing longstanding Etsy web cluster ran Apache, which was not designed for long-running requests.
The architecture team developed an innovative socket-passing pattern that allows reuse of existing web infrastructure. When a client request arrives, PHP and the Apache worker authenticate the request as usual, then pass the client connection’s file descriptor to a long-running daemon in a sidecar container. After the handoff, the Apache worker is released to serve other requests, while the daemon streams update events back to the client. This solution demonstrates creative problem-solving that respects existing infrastructure constraints while enabling modern streaming capabilities.
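The handoff relies on the operating system's ability to pass an open file descriptor between processes over a Unix domain socket. The sketch below shows the mechanism with Python's standard library (`socket.send_fds` and `socket.recv_fds`, Unix only, Python 3.9+). It is an illustration only: Etsy's worker side is PHP under Apache, and the function names and the single-process demo are assumptions made so the example can run on its own.

```python
import socket


def hand_off(control: socket.socket, client: socket.socket) -> None:
    """Worker side: send the client's file descriptor to the daemon."""
    socket.send_fds(control, [b"stream"], [client.fileno()])
    # The worker's copy can be closed; the descriptor in flight keeps the
    # connection open until the daemon receives it.
    client.close()


def receive_client(control: socket.socket) -> socket.socket:
    """Daemon side: receive the descriptor and wrap it in a socket object."""
    _msg, fds, _flags, _addr = socket.recv_fds(control, 1024, 1)
    return socket.socket(fileno=fds[0])


def demo() -> bytes:
    """Simulate browser, worker, and daemon inside one process."""
    worker_ctl, daemon_ctl = socket.socketpair()   # worker <-> daemon channel
    browser, worker_conn = socket.socketpair()     # stands in for the HTTP connection

    hand_off(worker_ctl, worker_conn)              # worker is now free
    daemon_conn = receive_client(daemon_ctl)
    daemon_conn.sendall(b"event: update\n\n")      # daemon streams to the client

    data = browser.recv(1024)
    for s in (worker_ctl, daemon_ctl, browser, daemon_conn):
        s.close()
    return data
```

In a real deployment the two ends of the control channel would live in different processes (the web worker and the sidecar daemon), typically connected through a Unix socket on a shared volume.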
The team developed a comprehensive evaluation approach covering both the agent trajectory (the path to a result) and the final outcome quality. This two-dimensional evaluation reflects mature thinking about agent systems, recognizing that both the process and the end result matter.
Beyond standard integration and unit tests, the team created pass-K tests for non-deterministic agent behavior. These tests invoke a non-deterministic test K times and ensure that the empirical pass rate exceeds some threshold. For example, given a question about a seller, they have a test that checks whether the agent successfully calls the get-shop tool across multiple invocations. This approach acknowledges the stochastic nature of LLM-based agents while still maintaining quality standards through statistical validation.
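A pass-K check reduces to running a trial K times and comparing the empirical pass rate with a threshold. The helper below is a minimal sketch; `pass_rate` and `assert_pass_rate` are hypothetical names, and the comparison uses "at least the threshold", which is one reasonable reading of the description.

```python
from typing import Callable


def pass_rate(trial: Callable[[], bool], k: int) -> float:
    """Run a non-deterministic trial k times and return the fraction that passed."""
    if k <= 0:
        raise ValueError("k must be positive")
    passes = sum(1 for _ in range(k) if trial())
    return passes / k


def assert_pass_rate(trial: Callable[[], bool], k: int, threshold: float) -> float:
    """Fail the test if the empirical pass rate falls below the threshold."""
    rate = pass_rate(trial, k)
    assert rate >= threshold, f"pass rate {rate:.2f} is below threshold {threshold:.2f}"
    return rate
```

In the get-shop example, `trial` would invoke the agent with a question about a seller and return `True` when the recorded trajectory contains a call to that tool. Choosing K and the threshold is a trade-off between test cost and tolerance for flakiness.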
For evaluating final outcomes, the team wanted to ensure that listings were relevant to both the recipient profile and shopper constraints like budget. They recognized the need for an LLM judge aligned to a golden dataset but faced significant methodological challenges:
Rather than leaving these decisions ad-hoc, the team created an Etsy Agents CLI that leverages LangSmith APIs to streamline the entire evaluation workflow. This tooling operationalizes best practices so they can be used by all engineers, including those who are ML-nascent product engineers rather than ML specialists.
The team built a batch processing capability into the Etsy Agents CLI to run multi-user simulations for dataset generation. This same infrastructure serves multiple purposes beyond evaluation: batch generation for running agents across the entire inventory, and load testing for productionization. The architecture uses parallel and distributed workers that send requests to the agent as a service in LangSmith deployments. Importantly, the same service handles both real-time and batch requests, avoiding offline versus online drift. This ensures consistent governance and observability through LangSmith and follows the DRY (Don’t Repeat Yourself) principle by using the same framework for any use case and agent.
Once the dataset was generated, the merchant team labeled listing relevance for each example. The team emphasized the importance of reviewer calibration to ensure consistent labeling, recommending the use of statistics like Cohen’s Kappa to measure alignment while discounting alignment by chance. This is especially important when the distribution of samples across classes is skewed, which is common in relevance labeling tasks.
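Cohen's Kappa compares the observed agreement between two reviewers with the agreement expected by chance given each reviewer's label frequencies. A small self-contained implementation, written here for illustration (the function name is an assumption), makes the effect of skewed classes concrete:

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability both raters pick the same label
    # if each labeled independently at their own observed frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

With skewed labels, raw agreement can look strong while kappa does not: if one reviewer marks nine of ten listings relevant and the other marks all ten relevant, they agree on 90% of items, yet chance agreement is also 90%, so kappa is 0.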
The team created train, validation, and test splits in LangSmith with a 20/40/40 ratio, which they described as following industry best practices for LLM prompt optimization. They developed an alignment tool that uses automated prompt optimization techniques such as GEPA (which uses an LLM to reflect on the judge’s errors and adjust the prompt accordingly) to align the judge to the golden dataset. The tool outputs standard metrics related to precision and recall. Throughout this process, LangSmith was essential for understanding both the judgments made by the LLM judge and the stability of those judgments over time.
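Judge alignment is ultimately scored by comparing the judge's labels with the human labels in the golden dataset. The sketch below computes precision and recall for a binary relevance label; the function name and the `"relevant"` label value are assumptions for illustration, not the output format of Etsy's tool.

```python
from typing import Sequence, Tuple


def precision_recall(
    golden: Sequence[str], judged: Sequence[str], positive: str = "relevant"
) -> Tuple[float, float]:
    """Precision and recall of the judge's labels against human golden labels."""
    if len(golden) != len(judged):
        raise ValueError("golden and judged must have the same length")
    pairs = list(zip(golden, judged))
    tp = sum(g == positive and j == positive for g, j in pairs)  # judge and humans agree: relevant
    fp = sum(g != positive and j == positive for g, j in pairs)  # judge too lenient
    fn = sum(g == positive and j != positive for g, j in pairs)  # judge too strict
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Computing these on the validation split during prompt optimization, and only once on the test split at the end, keeps the final numbers an honest estimate of how the judge will behave on unseen examples.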
Once the LLM judge was aligned and validated, the team could use it with their classification tool to run evaluations on both the holdout set and other production examples, providing ongoing quality monitoring.
The deployment infrastructure reflects sophisticated DevOps practices adapted for LLM applications. The system operates within an Etsy Agents monorepo, which enables automatic agent project discovery and creates CI/CD pipelines dynamically. The team emphasizes that no Terraform configuration or handwritten pipelines are required—developers only need to specify a YAML file for deployment resources. The build system uses LangSmith deployment APIs to deploy in a scalable and reliable way.
This approach significantly reduces the operational burden of deploying new agents or agent versions, allowing product engineers to focus on application logic rather than infrastructure concerns. The six-week timeline from start to beta production launch suggests that this streamlined deployment approach, combined with LangChain abstractions, significantly accelerated development.
The case study demonstrates several markers of LLMOps maturity. At the application level, the team ensures reliability through deterministic checks implemented in middleware code. The modularity of this middleware design has forward-looking benefits: as LLMs improve and subsume functionality currently handled by middleware, the modular architecture makes it easy to swap components in and out.
At the platform level, LangSmith provides foundational services and tools, and the team found it straightforward to build integrations and customizations for their internal workloads and systems. The approach of having platform engineers integrated with the product team and co-developing the agent proved effective for defining requirements for integration and platform development, creating a tight feedback loop between platform capabilities and product needs.
While the presentation describes impressive engineering accomplishments and a rapid path to production, some caution is warranted in interpreting the claims. The presenter mentions “high-quality search results” and “relatively high purchase rates” but provides no specific metrics, benchmarks against baseline search, or statistical significance testing. The six-week development timeline is presented as a success story enabled by LangChain, but it’s unclear how much of this speed came from the framework versus the team’s expertise, existing infrastructure, or the limited scope of a beta release.
The heavy reliance on LangChain and LangSmith creates vendor lock-in risk, even though the team values “vendor optionality” at the LLM model level. While they can swap underlying LLM providers, their entire agent architecture, evaluation pipeline, and deployment system are tightly coupled to the LangChain ecosystem. The long-term implications of this dependency aren’t addressed.
The middleware-based solutions for model reliability (handling spin, fixing hallucinated IDs, managing memory corruption) represent workarounds for fundamental model limitations. While pragmatic and necessary for production systems, these patches add complexity and may mask problems that would be better addressed through improved prompting, better model selection, or fundamental architectural changes. The team acknowledges this somewhat by noting that middleware can be removed as models improve, but the current system appears to require significant deterministic scaffolding around the LLM to function reliably.
The evaluation methodology is sophisticated, particularly the LLM judge alignment workflow and the use of pass-K tests for stochastic behavior. However, the reliance on an LLM judge to evaluate LLM outputs introduces potential circularity, and while they mention avoiding data leakage, the specifics of how they validated that their judge genuinely generalizes are not detailed.
The socket-passing pattern for streaming is clever but represents technical debt—a workaround for infrastructure limitations rather than a clean architectural solution. While it unblocks the immediate need, it adds complexity at the boundaries between systems that may create maintenance burden.
Overall, this represents a solid production LLM application with thoughtful engineering and evaluation practices, built rapidly by leveraging an established framework. The team demonstrates awareness of common pitfalls in LLM applications and has developed reasonable mitigations. However, the lack of quantitative results, heavy framework dependency, and reliance on corrective middleware suggest this is an early-stage production system that will likely require continued refinement and possibly architectural evolution as the technology and their understanding of user needs mature.