Company
Amazon
Title
Building a Commonsense Knowledge Graph for E-commerce Product Recommendations
Industry
E-commerce
Year
2024
Summary (short)
Amazon developed COSMO, a framework that leverages LLMs to build a commonsense knowledge graph for improving product recommendations in e-commerce. The system uses LLMs to generate hypotheses about commonsense relationships from customer interaction data, validates these through human annotation and ML filtering, and uses the resulting knowledge graph to enhance product recommendation models. Tests showed up to 60% improvement in recommendation performance when using the COSMO knowledge graph compared to baseline models.
## Overview

Amazon's COSMO (COmmon Sense MOdeling) framework represents a sophisticated application of large language models in production e-commerce systems, specifically designed to enhance product recommendations through commonsense knowledge graph construction. The work was presented at SIGMOD 2024, one of the premier database and data management conferences, highlighting both the research rigor and practical scalability of the approach.

The fundamental problem COSMO addresses is the gap between customer intent and literal product matching. When a customer searches for "shoes for pregnant women," a traditional recommendation system might struggle to connect this query to "slip-resistant shoes" without explicit commonsense reasoning. COSMO bridges this gap by constructing knowledge graphs that encode relationships between products and human contexts—functions, audiences, locations, and similar semantic dimensions.

## Technical Architecture and LLM Integration

The COSMO framework employs LLMs in a carefully orchestrated pipeline that balances automated generation with quality control mechanisms. This represents a mature approach to LLMOps where the model is not simply deployed end-to-end but is integrated into a larger system with multiple validation checkpoints.

### Data Sources and Preprocessing

The system begins with two primary data sources from customer behavior:

- **Query-purchase pairs**: These combine customer queries with subsequent purchases made within a defined time window or number of clicks. This captures explicit customer intent and its resolution.
- **Co-purchase pairs**: These combine products purchased during the same shopping session, capturing implicit relationships between products that customers associate together.

Before feeding this data to the LLM, COSMO applies preprocessing heuristics to reduce noise. For example, co-purchase pairs where the product categories are too distant in Amazon's product taxonomy are removed. This preprocessing step is critical for production systems as it reduces the computational burden on the LLM and improves the signal-to-noise ratio of generated hypotheses.

### Iterative LLM Prompting Strategy

The LLM is used in a multi-stage, iterative process that exemplifies sophisticated prompt engineering practices:

In the first stage, the LLM receives data pairs and is asked to describe relationships using a small set of base relations: *usedFor*, *capableOf*, *isA*, and *cause*. From the outputs, the team extracts frequently recurring relationship patterns and codifies them into a finer-grained taxonomy with canonical formulations such as *used_for_function*, *used_for_event*, and *used_for_audience*. This iterative refinement represents a key LLMOps pattern—using model outputs to inform better prompting strategies, creating a virtuous cycle of improvement. The team then repeats the process, prompting the LLM with the expanded relationship vocabulary.

### Quality Filtering Mechanisms

A significant challenge in production LLM systems is handling low-quality or vacuous outputs. COSMO addresses this through multiple filtering layers:

**Heuristic Filtering**: The team developed automated heuristics to identify problematic LLM outputs. For instance, if the LLM's answer is semantically too similar to the question itself (essentially paraphrasing the input), the question-answer pair is filtered out. This addresses the tendency of LLMs to generate "empty rationales" such as "customers bought them together because they like them."
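The source does not detail how this similarity check is implemented. A minimal sketch of one plausible approach, comparing sentence embeddings with a cosine-similarity threshold, is shown below; the `sentence-transformers` model choice and the 0.85 cutoff are assumptions, not details from the paper.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed embedding model and threshold; the paper specifies neither.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
SIMILARITY_THRESHOLD = 0.85  # hypothetical cutoff for "answer merely paraphrases the question"


def is_empty_rationale(question: str, answer: str) -> bool:
    """Flag LLM answers that are semantically too close to the question itself."""
    q_emb, a_emb = encoder.encode([question, answer], convert_to_tensor=True)
    similarity = util.cos_sim(q_emb, a_emb).item()
    return similarity >= SIMILARITY_THRESHOLD


# Example: a vacuous rationale that restates the co-purchase gets filtered out,
# while a rationale carrying new commonsense information is kept.
question = "Why do customers buy a camera case and a screen protector together?"
vacuous = "Customers buy the camera case and the screen protector together."
useful = "Both items are used for protecting a camera from scratches and drops."

print(is_empty_rationale(question, vacuous))  # likely True  -> filtered
print(is_empty_rationale(question, useful))   # likely False -> kept
```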
**Human Annotation**: A representative subset of candidates that survive heuristic filtering is sent to human annotators for assessment on two dimensions:

- *Plausibility*: Whether the posited inferential relationship is reasonable
- *Typicality*: Whether the target product is one that would commonly be associated with either the query or the source product

**Machine Learning Classification**: Using the annotated data, the team trains a classifier to predict plausibility and typicality scores for the remaining candidates. Only candidates exceeding defined thresholds are retained. This approach scales the human judgment across the full dataset, a common pattern in production ML systems where human annotation cannot cover all data.

### Instruction Extraction and Refinement

From the high-quality candidates, the team extracts syntactic and semantic patterns that can be encoded as LLM instructions. For example, an extracted instruction might be "generate explanations for the search-buy behavior in the domain d using the *capableOf* relation." These instructions are then used to prompt the LLM in a final pass over all candidate pairs, improving the consistency and quality of the generated relationships.

This instruction extraction process demonstrates a meta-learning approach to prompt engineering—rather than manually crafting prompts, the system learns effective prompting patterns from successful examples.

## Knowledge Graph Construction

The output of the COSMO pipeline is a set of entity-relation-entity triples that form a knowledge graph. An example triple might be: <*co-purchase of camera case and screen protector*, *capableOf*, *protecting camera*>. This structured representation enables the knowledge to be integrated into downstream systems through standard graph-based methods.

## Evaluation Methodology

The team evaluated COSMO using the Shopping Queries Data Set created for KDD Cup 2022, which consists of queries and product listings with products rated according to their relevance to each query. This represents rigorous evaluation practice—using an external, competition-grade benchmark rather than internally curated test sets.

### Model Architectures Tested

Three model configurations were compared:

- **Bi-encoder (two-tower model)**: Separate encoders for query and product, with outputs concatenated and fed to a neural network for relevance scoring. This architecture is computationally efficient for large-scale retrieval.
- **Cross-encoder (unified model)**: All features of both query and product pass through a single encoder. Generally more accurate but computationally expensive.
- **COSMO-enhanced cross-encoder**: The cross-encoder architecture augmented with relevant triples from the COSMO knowledge graph as additional input.

### Results

The evaluation produced compelling results across two experimental conditions:

**Frozen Encoders**: With encoder weights fixed, the COSMO-enhanced model achieved a 60% improvement in macro F1 score over the best baseline. This dramatic improvement demonstrates the value of the knowledge graph when the underlying representations cannot be adapted.

**Fine-tuned Encoders**: When the encoders were fine-tuned on a subset of the dataset, all models improved significantly. However, the COSMO-enhanced model maintained a 28% edge in macro F1 and a 22% edge in micro F1 over the best baseline. This shows that the commonsense knowledge provides complementary information that even fine-tuning on task-specific data cannot fully capture.
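The case study describes the COSMO-enhanced cross-encoder only at a high level. A minimal sketch of the general idea, serializing retrieved knowledge-graph triples and appending them to the cross-encoder's product-side input, appears below; the base model name, the separator handling, and the `retrieve_triples` helper are all assumptions for illustration, not Amazon's implementation.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed base encoder; the paper does not name the model used in production.
MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()


def retrieve_triples(query: str, product_title: str) -> list[str]:
    """Hypothetical lookup of COSMO triples relevant to a query-product pair."""
    # A real system would query the knowledge graph; hard-coded here for the sketch.
    return ["slip-resistant shoes used_for_audience pregnant women"]


def score_relevance(query: str, product_title: str) -> float:
    """Cross-encoder relevance score with serialized KG triples as extra context."""
    knowledge = " ; ".join(retrieve_triples(query, product_title))
    # Query in the first segment, product title plus triples in the second.
    inputs = tokenizer(query, f"{product_title} [SEP] {knowledge}",
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()  # probability of "relevant"


print(score_relevance("shoes for pregnant women", "slip-resistant maternity shoes"))
```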
## Production Considerations

While the source material focuses primarily on the research aspects, several production-relevant insights can be extracted:

**Scalability**: The system is designed for Amazon's massive product catalog and query volume. The preprocessing, filtering, and ML classification stages are designed to reduce the computational load on the LLM while maintaining quality.

**Human-in-the-Loop Design**: The architecture explicitly incorporates human review at critical points, acknowledging that LLMs alone cannot guarantee the quality needed for production deployment. This hybrid approach balances automation with quality control.

**Modular Pipeline**: The separation of data extraction, LLM generation, filtering, and knowledge graph construction into distinct stages allows for independent optimization and monitoring of each component—a key principle in production ML systems.

**Relationship Canonicalization**: The creation of a standardized vocabulary of relationships (*used_for_function*, etc.) enables consistent knowledge representation and easier integration with downstream systems.

## Limitations and Considerations

The source material, while thorough, does not provide detailed information on several operationally important aspects such as latency characteristics, cost considerations for LLM inference at scale, refresh frequency for the knowledge graph, or handling of temporal dynamics in product-query relationships. Additionally, the evaluation is conducted on a specific benchmark, and real-world performance may vary based on query distribution and product catalog characteristics.

The 60% improvement figure, while impressive, is achieved under the frozen-encoder condition, which may not reflect typical production deployments where fine-tuning is common. The 22-28% improvement with fine-tuned encoders, while still substantial, represents a more realistic estimate of production impact.

## Conclusion

COSMO demonstrates a sophisticated approach to integrating LLMs into production recommendation systems. Rather than using LLMs for direct inference at query time (which would be prohibitively expensive at Amazon's scale), the framework uses LLMs to construct a knowledge asset that can be efficiently queried during production serving. The multi-stage pipeline with heuristic and human quality controls represents mature LLMOps practices that balance automation with reliability requirements.
