Company: Etsy
Title: LLM-Powered Product Attribute Extraction from Unstructured Marketplace Data
Industry: E-commerce
Year: 2025

Summary (short): Etsy faced the challenge of understanding and categorizing over 100 million unique, handmade items listed by 5 million sellers, where most product information existed only as unstructured text and images rather than structured attributes. The company deployed large language models to extract product attributes at scale from listing titles, descriptions, and photos, transforming unstructured data into structured attributes that could power search filters and product comparisons. The implementation increased complete attribute coverage from 31% to 91% in target categories, improved engagement with search filters, and increased overall post-click conversion rates, while establishing robust evaluation frameworks using both human-annotated ground truth and LLM-generated silver labels.
## Overview

Etsy's case study presents a compelling production LLM implementation addressing a fundamental marketplace challenge: converting vast amounts of unstructured product data into structured attributes at scale. With over 100 million items listed by 5 million independent sellers, Etsy lacks the standardized SKUs and centralized product catalogs typical of traditional e-commerce platforms. The uniqueness that defines Etsy's value proposition—handmade, customized, and often one-of-a-kind items—creates significant technical challenges for product discovery and search functionality. The company's solution leverages foundational LLMs to perform attribute extraction across millions of listings, though the case study appropriately balances enthusiasm with acknowledgment of real-world complexities.

## The Core Problem and Business Context

Etsy's marketplace contains items spanning thousands of categories with highly variable and often niche product attributes. While standard e-commerce attributes like color and material are common, Etsy must also handle specialized attributes such as "bead hole size" and "slime additives." The platform collects both structured data (through optional seller-provided attribute fields) and unstructured data (titles, descriptions, and images). However, to reduce friction in the listing process and accommodate unique items, most attribute fields are optional, so sellers predominantly provide only unstructured information.

This creates a fundamental tension: Etsy's search and discovery algorithms need structured data for filtering, comparison, and low-latency operations, but most product information exists in formats that are computationally expensive to process at query time. The case study cites a particularly illustrative example of a porcelain sculpture that appears to be a t-shirt at first glance, highlighting how visual and textual ambiguity can mislead both humans and machines. The business imperative is clear: extract structured attributes from unstructured content to power search filters, product comparisons, and improved buyer experiences without adding seller burden.

## Evolution from Traditional ML to LLMs

Before adopting LLMs, Etsy explored multiple machine learning approaches to attribute extraction. Supervised classification models struggled with the long tail of sparse attributes—even if all possible attributes could be enumerated, many would appear too infrequently for traditional models to learn effectively. Sequence tagging approaches similarly failed to scale across multiple attributes. Transformer-based question-answering models like AVEQA and MAVEQA showed promise by generalizing to unseen attribute values, but still required substantial application-specific training data, which was costly and time-consuming to obtain.

The advent of foundational LLMs changed the calculus entirely. These models brought three critical capabilities: vast general knowledge from pre-training that could handle diverse product types, the ability to process large context windows quickly and affordably, and instruction-following capabilities with minimal examples. This represented a qualitative shift from requiring large labeled datasets to leveraging few-shot learning and prompt engineering. However, the case study appropriately doesn't position LLMs as a silver bullet—instead, it acknowledges that deploying them at scale required building robust evaluation frameworks, monitoring systems, and quality assurance processes.

## Evaluation Strategy and Silver Labels

One of the most technically interesting aspects of Etsy's implementation is the evolution of its evaluation methodology. Initially, the team worked with third-party labeling vendors to create human-annotated datasets for calculating standard metrics like precision, recall, and Jaccard index. These ground truth labels served as benchmarks for iterating on prompts and context engineering. However, Etsy encountered significant limitations with this approach: human annotators made mistakes (the case study cites an example where a human incorrectly labeled a light fixture as ½ inch when it was actually 5.5 inches), and the process was expensive and time-consuming, making it impractical for scaling across thousands of categories.

The solution was to shift toward "silver labels"—using high-performance, state-of-the-art LLMs to generate ground truth annotations. This isn't a fully automated process; humans remain in the loop. Etsy domain experts review the silver labels and iterate on prompts to ensure quality before generating larger datasets for evaluating more scalable production models. This represents a pragmatic approach to the evaluation challenge: accepting that LLM-generated labels may not be perfect ground truth, but acknowledging they can be more consistent and cost-effective than human annotation, especially when combined with expert oversight. The case study demonstrates mature thinking about evaluation tradeoffs rather than claiming perfect accuracy.
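To make the scoring concrete, the sketch below shows one reasonable formulation of how per-listing attribute predictions could be compared against reference labels (human-annotated or silver) using micro-averaged precision, recall, and Jaccard index. The attribute schema and values are hypothetical illustrations, not Etsy's actual data or code.

```python
# Minimal sketch: scoring predicted attribute sets against reference labels
# (human-annotated or LLM-generated "silver" labels). The attribute names
# and values below are hypothetical, not Etsy's actual schema.

def score_attributes(predicted: dict[str, set[str]],
                     reference: dict[str, set[str]]) -> dict[str, float]:
    """Micro-averaged precision, recall, and Jaccard over attribute values."""
    tp = fp = fn = 0
    for attr in predicted.keys() | reference.keys():
        pred_vals = predicted.get(attr, set())
        ref_vals = reference.get(attr, set())
        tp += len(pred_vals & ref_vals)   # values both sides agree on
        fp += len(pred_vals - ref_vals)   # extracted but not in the label
        fn += len(ref_vals - pred_vals)   # labeled but missed by extraction
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "jaccard": jaccard}

# Example: one listing's extraction vs. its silver label.
predicted = {"color": {"blue"}, "material": {"porcelain", "ceramic"}}
reference = {"color": {"blue"}, "material": {"porcelain"}}
print(score_attributes(predicted, reference))
# {'precision': 0.667, 'recall': 1.0, 'jaccard': 0.667} (approximately)
```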
## Context Engineering and Inference Architecture

The core of Etsy's LLM pipeline is what they term "context engineering"—carefully curating the information provided to the model for attribute extraction. Working with partners across product, merchandising, and taxonomy teams, they assemble comprehensive context including seller-provided listing data (titles, descriptions, images), few-shot examples hand-selected by domain experts, business logic from Etsy's product taxonomy, and category-specific extraction rules. This cross-functional collaboration is notable, as it reflects the reality that production LLM systems require domain expertise beyond engineering.

Each listing is represented as a JSON string containing this contextual information, which is then injected into multiple prompts for parallel attribute extraction. The architecture uses LiteLLM to route requests across different regions, achieving higher parallelization and avoiding single-region dependencies. This is a practical infrastructure decision that addresses both throughput and reliability concerns. Responses from the LLM are parsed into Pydantic dataclasses, which provide both basic type validation and custom validation based on business logic. This layered validation approach—first using Pydantic for structure, then applying business rules—reflects mature software engineering practices adapted for LLM outputs. After inference and validation, a post-processing job formats the structured outputs for consumption by various downstream systems: filestores, database tables, and search platforms. The case study describes a well-architected pipeline that goes beyond simple LLM API calls to include robust data engineering concerns.
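To ground the description, here is a condensed sketch of what one extraction step could look like under the architecture described: listing context serialized as JSON, a completion request sent through LiteLLM, and the response parsed into a Pydantic model that layers a business-rule check on top of type validation. The model name, prompt wording, attribute schema, and taxonomy check are illustrative assumptions, not Etsy's actual implementation.

```python
# Sketch of one extraction step: assemble listing context as JSON, call an
# LLM via LiteLLM, and validate the response with Pydantic. The model name,
# schema, prompt, and taxonomy list are illustrative assumptions.
import json
import litellm
from pydantic import BaseModel, field_validator

# Hypothetical category taxonomy used for business-logic validation.
ALLOWED_COLORS = {"red", "blue", "green", "white", "black"}

class ExtractedAttributes(BaseModel):
    color: list[str] = []
    material: list[str] = []

    @field_validator("color")
    @classmethod
    def color_in_taxonomy(cls, values: list[str]) -> list[str]:
        # Custom validation beyond basic type checks: drop any value
        # outside the category's allowed taxonomy.
        return [v for v in values if v.lower() in ALLOWED_COLORS]

def extract(listing: dict) -> ExtractedAttributes:
    context = json.dumps(listing)  # listing context as a JSON string
    response = litellm.completion(
        model="gpt-4o-mini",  # LiteLLM can route such calls across providers/regions
        messages=[
            {"role": "system", "content": "Extract product attributes as JSON "
             'like {"color": [...], "material": [...]}. Output JSON only.'},
            {"role": "user", "content": context},
        ],
        response_format={"type": "json_object"},
    )
    raw = response.choices[0].message.content
    # Pydantic provides structural/type validation; the field validator
    # above then enforces the business rule.
    return ExtractedAttributes.model_validate_json(raw)

attrs = extract({"title": "Handmade blue porcelain mug",
                 "description": "Wheel-thrown, cobalt glaze."})
print(attrs.model_dump())
```

In the actual pipeline, many such prompts would run in parallel per listing, with LiteLLM distributing requests across regions for throughput and reliability.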
## Monitoring and Reliability Practices

Etsy's approach to monitoring demonstrates sophisticated thinking about LLM reliability in production. The team recognizes that inference can fail for numerous reasons: code bugs, permissions issues, transient errors, quota-exceeded errors, and safety filters. Rather than failing the entire pipeline when individual inferences fail (which would be impractical at scale), they log errors and surface error metrics through their observability platform. The team receives alerts only when failed inferences exceed defined thresholds, balancing awareness with avoiding alert fatigue. For debugging, they log sample traces to Honeycomb, a distributed tracing platform. This represents best practices in observability—using structured logging and distributed tracing to understand system behavior without overwhelming storage or analysis capacity.

However, Etsy goes beyond basic error monitoring to implement model performance evaluation directly in the pipeline. Before running production-scale inference, they run the LLM on a sample of their ground-truth dataset and calculate metrics like precision and recall, comparing these to baseline scores. If metrics deviate significantly, the pipeline terminates automatically. This automated performance regression detection is a critical production safeguard that many organizations overlook. It protects against scenarios where third-party LLM providers change model behavior, or where subtle prompt changes inadvertently degrade quality. The combination of error monitoring, distributed tracing, metric tracking, automated performance evaluation, and alerting provides comprehensive visibility into both pipeline health and model quality—a mature LLMOps practice that goes well beyond simply calling an API.
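Such a pre-flight regression gate can be sketched in a few lines: score the current model and prompt on a ground-truth sample, compare against stored baselines, and abort before production-scale inference if quality has slipped. The baseline values, tolerance, and abort mechanism here are illustrative assumptions, not Etsy's actual tooling.

```python
# Sketch of a pre-flight regression gate: evaluate the current model/prompt
# on a held-out ground-truth sample and abort the batch run if precision or
# recall drops too far below the stored baseline. Values are illustrative.
import sys

BASELINE = {"precision": 0.90, "recall": 0.85}  # from a previously accepted run
MAX_DROP = 0.05  # tolerated absolute degradation per metric

def regression_gate(sample_metrics: dict[str, float]) -> None:
    for name, baseline in BASELINE.items():
        observed = sample_metrics[name]
        if observed < baseline - MAX_DROP:
            # Terminate before full-scale inference spends money running a
            # degraded model or prompt across millions of listings.
            sys.exit(f"Aborting: {name} {observed:.3f} vs baseline {baseline:.3f}")
    print("Gate passed; proceeding to full-scale inference.")

# sample_metrics would come from scoring the ground-truth sample, e.g. with
# a function like score_attributes() from the evaluation sketch above.
regression_gate({"precision": 0.91, "recall": 0.86})
```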
## Results and Impact

The case study reports tangible improvements from the LLM implementation. In target categories, complete attribute coverage increased from 31% to 91%, a dramatic improvement in data completeness. When LLM-inferred attributes were added to search filters, buyer engagement with relevant filters increased, and overall post-click conversion rates improved. Most recently, the system enabled displaying LLM-inferred color swatches on search results pages, providing at-a-glance information to help buyers find items faster.

The results are presented with appropriate restraint: a specific coverage figure alongside directional improvements in engagement and conversion, without claims of unrealistic magnitudes. The case study notes these are "promising results" and that the work was applied "in target categories," suggesting a measured rollout rather than universal deployment. This tempered presentation is credible and suggests the team is being honest about both successes and limitations.

## Critical Assessment and Limitations

While the case study presents a compelling implementation, several aspects warrant balanced consideration. First, the shift from human ground truth to silver labels introduces potential circularity—using LLMs to evaluate LLMs. While the human-in-the-loop process with domain experts provides oversight, there's inherent risk that systematic biases in the label-generating LLM could propagate to the production model without detection. The case study acknowledges imperfect human annotation but doesn't deeply explore the tradeoffs of this evaluation approach. Second, the cost structure isn't discussed. Processing 100+ million listings through commercial LLM APIs, even with efficient routing and batching, likely represents substantial ongoing costs. The business case implicitly rests on improved conversion rates outweighing inference costs, but no financial analysis is provided. Third, the case study doesn't address how the system handles attribute extraction errors that reach production—are there mechanisms for seller correction or buyer feedback? Fourth, while parallel attribute extraction is mentioned, there's no discussion of potential inconsistencies when extracting multiple related attributes independently. Finally, the system's handling of multimodal information (both text and images) is mentioned but not detailed—it's unclear how vision capabilities are integrated and whether they add meaningful value beyond text processing.

## Technical Architecture Maturity

Despite these limitations, Etsy's implementation demonstrates several markers of production LLM maturity. The use of Pydantic for output validation shows understanding that LLM outputs need structured validation before entering downstream systems. The LiteLLM routing strategy addresses both performance and reliability through geographic distribution. The automated performance regression testing in the pipeline prevents silent quality degradation. The graduated evaluation approach—from expensive human annotation to silver labels with expert oversight—shows pragmatic adaptation to real-world constraints.

The emphasis on context engineering rather than model fine-tuning reflects current best practices for leveraging foundational models. By incorporating taxonomy knowledge, business rules, and few-shot examples into prompts, Etsy achieves domain adaptation without expensive retraining. The cross-functional collaboration with product, merchandising, and taxonomy teams indicates recognition that LLM systems require diverse expertise beyond engineering.

## Broader Implications for LLMOps

Etsy's case study illustrates several important lessons for organizations deploying LLMs in production. First, the shift from requiring perfect ground truth to accepting validated silver labels represents a pragmatic approach to an evaluation challenge many organizations face. Second, the comprehensive monitoring approach—combining error tracking, performance evaluation, and distributed tracing—demonstrates that LLM observability requires layered strategies beyond simple API monitoring. Third, the emphasis on validation and post-processing shows that raw LLM outputs rarely meet production requirements without additional structure and business logic enforcement. Fourth, the parallel processing architecture with regional routing illustrates infrastructure considerations for achieving acceptable latency and throughput at scale. Finally, the measured rollout to "target categories" with specific metric tracking demonstrates responsible deployment practices that allow for learning and iteration.

The case study represents a mature production LLM implementation that goes beyond proof-of-concept to address real operational challenges: evaluation at scale, reliability under diverse failure modes, performance regression detection, and integration with existing data infrastructure. While some details remain unspecified and certain tradeoffs aren't fully explored, the overall approach provides valuable insights for practitioners building similar systems. Etsy's experience suggests that success with production LLMs requires not just model capabilities, but comprehensive engineering around evaluation, monitoring, validation, and operational processes.
