Building a Commonsense Knowledge Graph for E-commerce Product Recommendations

Amazon 2024
Amazon developed COSMO, a framework that uses LLMs to build a commonsense knowledge graph for improving product recommendations in e-commerce. The system uses LLMs to generate hypotheses about commonsense relationships from customer interaction data, validates them through human annotation and ML filtering, and feeds the resulting knowledge graph into product recommendation models. In tests, the COSMO-enhanced model improved macro F1 by up to 60% over the best baseline (with frozen encoders).

Industry

E-commerce

Overview

Amazon’s COSMO (COmmon Sense MOdeling) framework applies large language models to production e-commerce systems: it constructs a commonsense knowledge graph that is used to improve product recommendations. The work was presented at SIGMOD 2024, a leading database and data management conference.

The fundamental problem COSMO addresses is the gap between customer intent and literal product matching. When a customer searches for “shoes for pregnant women,” a traditional recommendation system might struggle to connect this query to “slip-resistant shoes” without explicit commonsense reasoning. COSMO bridges this gap by constructing knowledge graphs that encode relationships between products and human contexts—functions, audiences, locations, and similar semantic dimensions.

Technical Architecture and LLM Integration

The COSMO framework employs LLMs in a carefully orchestrated pipeline that balances automated generation with quality control mechanisms. This represents a mature approach to LLMOps where the model is not simply deployed end-to-end but is integrated into a larger system with multiple validation checkpoints.

Data Sources and Preprocessing

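The system begins with two primary data sources from customer behavior: search-buy pairs (a query followed by a purchase) and co-purchase pairs (products bought together).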

Before feeding this data to the LLM, COSMO applies preprocessing heuristics to reduce noise. For example, co-purchase pairs where the product categories are too distant in Amazon’s product taxonomy are removed. This preprocessing step is critical for production systems as it reduces the computational burden on the LLM and improves the signal-to-noise ratio of generated hypotheses.
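The taxonomy-distance heuristic described above can be sketched as follows. This is a minimal illustration, not Amazon's implementation: the category paths, the `taxonomy_distance` function, and the `max_distance` threshold are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical representation: a category is a path from the taxonomy root,
# and a product is a (name, category path) tuple.
CategoryPath = Tuple[str, ...]
Product = Tuple[str, CategoryPath]


def taxonomy_distance(a: CategoryPath, b: CategoryPath) -> int:
    """Count the edges between two categories via their deepest shared ancestor."""
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return (len(a) - shared) + (len(b) - shared)


def filter_copurchase_pairs(
    pairs: List[Tuple[Product, Product]], max_distance: int = 4
) -> List[Tuple[Product, Product]]:
    """Keep only pairs whose categories are close enough in the taxonomy.

    max_distance is an illustrative threshold, not a value from the source.
    """
    return [
        (a, b) for a, b in pairs
        if taxonomy_distance(a[1], b[1]) <= max_distance
    ]


camera_case = ("camera case", ("Electronics", "Camera", "Accessories", "Cases"))
screen_protector = ("screen protector", ("Electronics", "Camera", "Accessories", "Screen Protectors"))
garden_hose = ("garden hose", ("Garden", "Watering", "Hoses"))

kept = filter_copurchase_pairs(
    [(camera_case, screen_protector), (camera_case, garden_hose)]
)
```

Filtering before generation means the LLM is never asked to explain pairs that are unlikely to share a meaningful relationship.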

Iterative LLM Prompting Strategy

The LLM is used in a multi-stage, iterative process that exemplifies sophisticated prompt engineering practices:

In the first stage, the LLM receives data pairs and is asked to describe relationships using a small set of base relations: usedFor, capableOf, isA, and cause. From the outputs, the team extracts frequently recurring relationship patterns and codifies them into a finer-grained taxonomy with canonical formulations such as used_for_function, used_for_event, and used_for_audience.

This iterative refinement represents a key LLMOps pattern—using model outputs to inform better prompting strategies, creating a virtuous cycle of improvement. The team then repeats the process, prompting the LLM with the expanded relationship vocabulary.
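One way to picture the pattern-extraction step is a simple frequency count over the relation phrases the LLM produces. The sketch below assumes the phrases have already been pulled out of the raw outputs; the `frequent_patterns` function, the sample phrases, and the `min_count` cutoff are hypothetical and stand in for whatever analysis the team actually performed.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def frequent_patterns(
    relation_phrases: Iterable[str], min_count: int = 2
) -> List[Tuple[str, int]]:
    """Return normalized relation phrases that recur at least min_count times."""
    counts = Counter(phrase.strip().lower() for phrase in relation_phrases)
    return [(phrase, n) for phrase, n in counts.most_common() if n >= min_count]


# Hypothetical phrases extracted from first-round LLM outputs.
phrases = [
    "used for function",
    "Used for function ",
    "used for event",
    "used for audience",
    "used for event",
    "is a gift",
]

recurring = dict(frequent_patterns(phrases))
```

Phrases that clear the cutoff become candidates for canonical relation names such as used_for_function, which can then be included in the next round of prompts.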

Quality Filtering Mechanisms

A significant challenge in production LLM systems is handling low-quality or vacuous outputs. COSMO addresses this through multiple filtering layers:

Heuristic Filtering: The team developed automated heuristics to identify problematic LLM outputs. For instance, if the LLM’s answer is semantically too similar to the question itself (essentially paraphrasing the input), the question-answer pair is filtered out. This addresses the tendency of LLMs to generate “empty rationales” such as “customers bought them together because they like them.”
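A minimal sketch of such a paraphrase filter is shown below. The source describes semantic similarity; this example substitutes plain token overlap (Jaccard similarity) so that it runs without any model, and the `is_vacuous` function and its 0.6 threshold are assumptions for illustration only.

```python
import re
from typing import Set


def tokens(text: str) -> Set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity: |intersection| / |union|."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def is_vacuous(question: str, answer: str, threshold: float = 0.6) -> bool:
    """Flag answers that mostly restate the question (threshold is illustrative)."""
    return jaccard(question, answer) >= threshold


question = "Why did customers buy a camera case and a screen protector together?"
restated = "Customers buy a camera case and a screen protector together."
informative = "Both protect the camera from scratches"
```

Note that lexical overlap alone would not catch an empty rationale like “because they like them,” which shares few words with the question; a production filter would need an embedding-based similarity or additional rules for that case.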

Human Annotation: A representative subset of candidates that survive heuristic filtering is sent to human annotators for assessment on two dimensions: plausibility and typicality.

Machine Learning Classification: Using the annotated data, the team trains a classifier to predict plausibility and typicality scores for the remaining candidates. Only candidates exceeding defined thresholds are retained. This approach scales the human judgment across the full dataset, a common pattern in production ML systems where human annotation cannot cover all data.
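The thresholding step can be sketched as below, assuming a classifier has already assigned each candidate a plausibility and a typicality score. The `ScoredCandidate` type, the score values, and the 0.5 thresholds are hypothetical; the source does not state the actual cutoffs.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ScoredCandidate:
    text: str
    plausibility: float  # classifier score, assumed to lie in [0, 1]
    typicality: float    # classifier score, assumed to lie in [0, 1]


def retain(
    candidates: Iterable[ScoredCandidate],
    min_plausibility: float = 0.5,
    min_typicality: float = 0.5,
) -> List[ScoredCandidate]:
    """Keep candidates whose scores exceed both thresholds (illustrative values)."""
    return [
        c for c in candidates
        if c.plausibility > min_plausibility and c.typicality > min_typicality
    ]


scored = [
    ScoredCandidate("protecting camera", plausibility=0.9, typicality=0.8),
    ScoredCandidate("they like them", plausibility=0.7, typicality=0.2),
    ScoredCandidate("used for swimming", plausibility=0.1, typicality=0.6),
]

survivors = retain(scored)
```

Requiring both scores to pass means a plausible but atypical explanation is dropped just like an implausible one.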

Instruction Extraction and Refinement

From high-quality candidates, the team extracts syntactic and semantic patterns that can be encoded as LLM instructions. For example, an extracted instruction might be “generate explanations for the search-buy behavior in the domain d using the capableOf relation.” These instructions are then used to prompt the LLM in a final pass over all candidate pairs, improving consistency and quality of the generated relationships.
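The example instruction above has an obvious template shape, which the following sketch makes explicit. The `build_instruction` helper and its validation rule are assumptions; only the instruction wording and the four base relations come from the text.

```python
# Base relations named in the first prompting stage.
BASE_RELATIONS = {"usedFor", "capableOf", "isA", "cause"}

INSTRUCTION_TEMPLATE = (
    "generate explanations for the {behavior} behavior "
    "in the domain {domain} using the {relation} relation."
)


def build_instruction(behavior: str, domain: str, relation: str) -> str:
    """Fill the instruction template, rejecting relations outside the base set."""
    if relation not in BASE_RELATIONS:
        raise ValueError(f"unknown relation: {relation!r}")
    return INSTRUCTION_TEMPLATE.format(
        behavior=behavior, domain=domain, relation=relation
    )


instruction = build_instruction("search-buy", "d", "capableOf")
```

Generating instructions from a fixed template keeps the final prompting pass consistent across domains and relations.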

This instruction extraction process demonstrates a meta-learning approach to prompt engineering—rather than manually crafting prompts, the system learns effective prompting patterns from successful examples.

Knowledge Graph Construction

The output of the COSMO pipeline is a set of entity-relation-entity triples that form a knowledge graph. An example triple might be: <co-purchase of camera case and screen protector, capableOf, protecting camera>. This structured representation enables the knowledge to be integrated into downstream systems through standard graph-based methods.
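As a rough illustration of the triple representation, the sketch below stores triples and indexes them by head and relation. The `Triple` type and `build_index` helper are hypothetical names, and a production graph would likely live in a dedicated store rather than an in-memory dictionary.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


def build_index(triples: Iterable[Triple]) -> Dict[Tuple[str, str], List[str]]:
    """Map each (head, relation) pair to the list of tails it points to."""
    index: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for t in triples:
        index[(t.head, t.relation)].append(t.tail)
    return dict(index)


triples = [
    Triple(
        "co-purchase of camera case and screen protector",
        "capableOf",
        "protecting camera",
    ),
]

index = build_index(triples)
tails = index[("co-purchase of camera case and screen protector", "capableOf")]
```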

Evaluation Methodology

The team evaluated COSMO using the Shopping Queries Data Set created for KDD Cup 2022, which consists of queries and product listings with products rated according to their relevance to each query. This represents rigorous evaluation practices—using an external, competition-grade benchmark rather than internally-curated test sets.

Model Architectures Tested

Three model configurations were compared: two baselines without COSMO knowledge and a COSMO-enhanced model.

Results

The evaluation produced compelling results across two experimental conditions:

Frozen Encoders: With encoder weights fixed, the COSMO-enhanced model achieved a 60% improvement in macro F1 score over the best baseline. This dramatic improvement demonstrates the value of the knowledge graph when the underlying representations cannot be adapted.

Fine-tuned Encoders: When encoders were fine-tuned on a subset of the test dataset, all models improved significantly. However, the COSMO-enhanced model maintained a 28% edge in macro F1 and 22% edge in micro F1 over the best baseline. This shows that the commonsense knowledge provides complementary information that even fine-tuning on task-specific data cannot fully capture.
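Because the results above are reported in both macro and micro F1, it helps to see how the two differ. The sketch below computes both for single-label classification; the labels and predictions are made-up toy data, not values from the evaluation.

```python
from typing import Hashable, Sequence, Tuple


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> Tuple[float, float]:
    """Return (macro F1, micro F1) over all labels seen in either sequence."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    per_label = []
    total_tp = total_fp = total_fn = 0
    for label in set(y_true) | set(y_pred):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        per_label.append(f1_from_counts(tp, fp, fn))
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    macro = sum(per_label) / len(per_label) if per_label else 0.0
    micro = f1_from_counts(total_tp, total_fp, total_fn)
    return macro, micro


# Toy data: the rare label "b" is never predicted.
macro, micro = macro_micro_f1(["a", "a", "a", "b"], ["a", "a", "a", "a"])
```

Macro F1 averages per-class scores, so a class the model always misses drags it down sharply, while micro F1 pools all decisions and is dominated by frequent classes. A gain in macro F1 therefore suggests improvement across relevance classes rather than only on the most common one.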

Production Considerations

While the source material focuses primarily on the research aspects, several production-relevant insights can be extracted:

Scalability: The system is designed for Amazon’s massive product catalog and query volume. The preprocessing, filtering, and ML classification stages are designed to reduce the computational load on the LLM while maintaining quality.

Human-in-the-Loop Design: The architecture explicitly incorporates human review at critical points, acknowledging that LLMs alone cannot guarantee the quality needed for production deployment. This hybrid approach balances automation with quality control.

Modular Pipeline: The separation of data extraction, LLM generation, filtering, and knowledge graph construction into distinct stages allows for independent optimization and monitoring of each component—a key principle in production ML systems.

Relationship Canonicalization: The creation of a standardized vocabulary of relationships (used_for_function, etc.) enables consistent knowledge representation and easier integration with downstream systems.

Limitations and Considerations

The source material, while thorough, does not provide detailed information on several operationally important aspects such as latency characteristics, cost considerations for LLM inference at scale, refresh frequency for the knowledge graph, or handling of temporal dynamics in product-query relationships. Additionally, the evaluation is conducted on a specific benchmark, and real-world performance may vary based on query distribution and product catalog characteristics.

The 60% improvement figure, while impressive, is achieved under the frozen encoder condition which may not reflect typical production deployments where fine-tuning is common. The 22-28% improvement with fine-tuned encoders, while still substantial, represents a more realistic estimate of production impact.

Conclusion

COSMO demonstrates a sophisticated approach to integrating LLMs into production recommendation systems. Rather than using LLMs for direct inference at query time (which would be prohibitively expensive at Amazon’s scale), the framework uses LLMs to construct a knowledge asset that can be efficiently queried during production serving. The multi-stage pipeline with heuristic and human quality controls represents mature LLMOps practices that balance automation with reliability requirements.
