ZenML

Vision Language Models for Large-Scale Product Classification and Understanding

Shopify 2025

Shopify evolved their product classification system from basic categorization to an advanced AI-driven framework using Vision Language Models (VLMs) integrated with a comprehensive product taxonomy. The system processes over 30 million predictions daily, combining VLMs with structured taxonomy to provide accurate product categorization, attribute extraction, and metadata generation. This has resulted in an 85% merchant acceptance rate of predicted categories and doubled the hierarchical precision and recall compared to previous approaches.

Industry

E-commerce

Overview

Shopify, a leading e-commerce platform supporting millions of merchants, has developed a comprehensive product understanding system that demonstrates mature LLMOps practices at scale. The system processes over 30 million predictions daily, classifying products into a taxonomy of more than 10,000 categories while extracting over 1,000 product attributes. This case study illustrates how large-scale Vision Language Models (VLMs) can be operationalized for production workloads in a demanding, high-throughput environment.

The problem Shopify faced was simple to state but hard at scale: understanding the diverse array of products sold on their platform, from handcrafted jewelry to industrial equipment, to enable better search, discovery, recommendations, tax calculations, and content safety features. What makes this case study particularly valuable from an LLMOps perspective is the detailed technical architecture and optimization strategies employed to make VLM inference viable at massive scale.

Evolution of the System

Shopify’s journey through product classification offers a useful timeline for understanding how production ML systems evolve. In 2018, they started with basic logistic regression and TF-IDF classifiers—traditional machine learning methods that worked for simple cases but struggled with product diversity. By 2020, they moved to multi-modal approaches combining image and text data, improving classification especially for ambiguous products.

By 2023, they identified that category classification alone was insufficient. They needed granular product understanding, consistent taxonomy, meaningful attribute extraction, simplified descriptions, content tags, and trust and safety features. The emergence of Vision Language Models presented an opportunity to address these requirements comprehensively.

Technical Architecture

The current production system is built on two foundational pillars: the Shopify Product Taxonomy and Vision Language Models.

The Shopify Product Taxonomy provides a structured framework spanning more than 26 business verticals with over 10,000 product categories and 1,000+ associated attributes. It offers hierarchical classification (e.g., Furniture > Chairs > Kitchen & Dining Room Chairs), category-specific attributes, standardized values for consistency, and cross-channel compatibility through crosswalks to other platform taxonomies.
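The hierarchical structure described above can be sketched as a simple data model. This is an illustrative representation, not Shopify's actual schema; the class and attribute names are assumptions:

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """A node in a hierarchical product taxonomy (illustrative, not Shopify's schema)."""
    name: str
    parent: "Category | None" = None
    # Category-specific attributes mapped to standardized values.
    attributes: dict = field(default_factory=dict)

    def full_path(self) -> str:
        """Render the hierarchical path, e.g. 'Furniture > Chairs > ...'."""
        parts, node = [], self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return " > ".join(reversed(parts))


furniture = Category("Furniture")
chairs = Category("Chairs", parent=furniture)
dining = Category(
    "Kitchen & Dining Room Chairs",
    parent=chairs,
    attributes={"material": ["wood", "metal", "plastic"]},
)
print(dining.full_path())  # Furniture > Chairs > Kitchen & Dining Room Chairs
```

Keeping attributes on the category node is what makes category-conditioned attribute extraction possible: once a product's category is known, only that category's attribute set needs to be predicted.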

The Vision Language Models provide capabilities that were not possible with earlier approaches: joint reasoning over product images and text to produce granular categories, extracted attributes, simplified descriptions, and content tags.

Model Evolution and Selection

The case study reveals a pragmatic approach to model selection. Shopify has transitioned through several VLM generations: LLaVA 1.5 7B, LLaMA 3.2 11B, and currently Qwen2VL 7B. Each transition was evaluated against the existing pipeline considering both performance metrics and computational costs. This demonstrates a mature LLMOps practice of balancing prediction quality with operational efficiency rather than simply chasing the largest or newest models.

Inference Optimization Strategies

The production deployment employs several sophisticated optimization techniques that are critical for achieving the required throughput:

FP8 Quantization is used for the Qwen2VL model, reducing the GPU memory footprint with minimal impact on prediction accuracy and enabling more efficient in-flight batch processing thanks to the smaller model size. This represents a practical trade-off between model precision and operational efficiency that is essential for cost-effective production deployments.
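The memory saving is easy to quantify with back-of-envelope arithmetic (the figures below are generic weight-memory estimates for a 7B-parameter model, not numbers from the case study):

```python
def model_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory for a model (weights only; KV cache and
    activations are extra)."""
    return n_params * bytes_per_param / 1e9


fp16 = model_memory_gb(7e9, 2.0)  # 16-bit weights: ~14 GB
fp8 = model_memory_gb(7e9, 1.0)   # FP8 weights: ~7 GB
print(f"FP16: {fp16:.0f} GB, FP8: {fp8:.0f} GB, "
      f"freeing ~{fp16 - fp8:.0f} GB for larger in-flight batches")
```

Halving the weight footprint leaves correspondingly more GPU memory for KV cache and concurrent requests, which is what makes the batching strategy below more effective.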

In-Flight Batching through NVIDIA Dynamo improves throughput via dynamic request handling. Rather than pre-defining fixed batch sizes, the system groups incoming product requests based on real-time arrival patterns and adjusts batch composition on the fly: processing starts as soon as products arrive, additional products are accepted into new batch slots during processing, and GPU idle time is minimized to maximize utilization.
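The scheduling idea can be sketched with a toy continuous batcher. This is an illustration of the concept, not Dynamo's actual API; requests join the active batch as soon as a slot frees rather than waiting for a fixed batch to fill:

```python
from collections import deque


class InFlightBatcher:
    """Toy continuous batcher illustrating in-flight batching."""

    def __init__(self, max_batch: int):
        self.max_batch = max_batch
        self.queue = deque()   # waiting requests: (request_id, steps_remaining)
        self.active = {}       # request_id -> decode steps remaining

    def submit(self, request_id: str, steps: int) -> None:
        self.queue.append((request_id, steps))

    def step(self) -> list:
        """One decode iteration: admit waiting requests into free slots,
        advance every active request by one step, return finished ids."""
        while self.queue and len(self.active) < self.max_batch:
            rid, steps = self.queue.popleft()
            self.active[rid] = steps
        finished = []
        for rid in list(self.active):
            self.active[rid] -= 1
            if self.active[rid] == 0:
                del self.active[rid]
                finished.append(rid)
        return finished


batcher = InFlightBatcher(max_batch=2)
batcher.submit("a", steps=1)
batcher.submit("b", steps=3)
batcher.submit("c", steps=2)
# "c" is admitted as soon as "a" finishes, without waiting for "b" to complete.
```

With fixed batching, "c" would wait until the whole first batch drained; here the freed slot is reused immediately, which is the idle-time reduction the text describes.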

KV Cache Optimization stores and reuses previously computed attention patterns for improved LLM inference speed. This is particularly effective for their two-stage prediction process where both categories and attributes are generated sequentially.
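The benefit for the two-stage process can be sketched as prefix reuse: because the attribute prompt extends the category prompt, its shared prefix never needs a second prefill pass. Real serving stacks cache per-layer key/value tensors; this toy stands in a placeholder "state" and just counts full prefill passes:

```python
class PrefixKVCache:
    """Toy illustration of prefix KV caching across sequential prompts."""

    def __init__(self):
        self.cache = {}          # prompt -> placeholder attention state
        self.compute_calls = 0   # stands in for expensive full prefill passes

    def get_state(self, prompt: str):
        if prompt in self.cache:
            return self.cache[prompt]
        # Reuse the longest cached prefix; only the new suffix would need
        # a prefill pass (suffix cost omitted in this toy).
        longest = max((p for p in self.cache if prompt.startswith(p)),
                      key=len, default=None)
        if longest is None:
            self.compute_calls += 1  # no cached prefix: full prefill
        self.cache[prompt] = ("kv-state-for", prompt)
        return self.cache[prompt]


cache = PrefixKVCache()
p1 = "Product: oak dining chair...\nPredict the category."
p2 = p1 + "\nCategory: Chairs\nPredict the attributes."
cache.get_state(p1)  # stage 1: full prefill
cache.get_state(p2)  # stage 2: reuses stage 1's cached prefix
```

After both stages, only one full prefill has occurred, which is why sequential category-then-attribute generation benefits so much from this optimization.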

Pipeline Architecture

The near real-time system runs as a Dataflow pipeline that orchestrates the end-to-end process. It makes two separate calls to the Vision LM service: first for category prediction, then for attribute prediction based on the predicted category. The service runs on a Kubernetes cluster with NVIDIA GPUs, using Dynamo for model serving.

The pipeline includes several critical stages:

Input Processing handles dynamic request batching based on arrival patterns, preliminary validation of product data, and resource allocation based on current system load.

Two-Stage Prediction performs category prediction with simplified description generation first, followed by attribute prediction using category context. Both stages leverage the optimized inference stack.

Consistency Management implements transaction-like handling of predictions where both category and attribute predictions must succeed together. Automatic retry mechanisms handle partial failures, and monitoring and alerting systems track prediction quality.

Output Processing validates results against taxonomy rules, formats and stores results, and sends notifications for completed predictions.
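The consistency-management stage above can be sketched as a transaction-like wrapper in which both stages must succeed together or neither result is committed. The helper names (`predict_category`, `predict_attributes`) are hypothetical stand-ins for calls to the VLM service:

```python
def predict_with_consistency(product: dict, predict_category, predict_attributes,
                             max_retries: int = 3) -> dict:
    """Transaction-like two-stage prediction (illustrative sketch, not
    Shopify's implementation): on a partial failure, stage-1 output is
    discarded and the whole pair is retried."""
    for _attempt in range(max_retries):
        try:
            category = predict_category(product)                 # stage 1
            attributes = predict_attributes(product, category)   # stage 2 uses category context
            return {"category": category, "attributes": attributes}
        except RuntimeError:
            continue  # partial failure: retry both stages together
    raise RuntimeError("prediction failed after retries")


# Usage with a flaky stage-2 predictor that fails once, then succeeds.
calls = {"n": 0}

def flaky_attrs(product, category):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient service error")
    return {"material": "wood"}

result = predict_with_consistency({"title": "oak chair"},
                                  lambda p: "Chairs", flaky_attrs)
```

Discarding the stage-1 result on any failure is what prevents a product from ending up with a committed category but no attributes.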

Training Data Quality

A notable aspect of this case study is the attention to training data quality, which directly influences system reliability. Shopify developed a multi-stage annotation system with several components:

A Multi-LLM Annotation System where several large language models independently evaluate each product. Structured prompting maintains annotation quality, and products receive multiple independent annotations for robustness.

An Arbitration System employs specialized models acting as impartial judges to resolve conflicts when annotations disagree. This enforces careful ruling logic to address edge cases and ensures alignment with taxonomy standards.

A Human Validation Layer provides strategic manual review of complex edge cases and novel product types. This creates a continuous feedback loop for improvement and includes regular quality audits.
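The annotation flow described by these three components can be sketched as follows. All helpers here are hypothetical: `annotators` stands in for the independent LLM labelers and `judge` for the arbiter model, with hard cases in practice escalating further to human review:

```python
from collections import Counter


def annotate_with_arbitration(product: str, annotators, judge) -> str:
    """Toy multi-LLM annotation flow: several models label independently,
    unanimous labels pass through, and disagreements go to an arbiter."""
    labels = [annotate(product) for annotate in annotators]
    counts = Counter(labels)
    top_label, votes = counts.most_common(1)[0]
    if votes == len(labels):
        return top_label                        # unanimous: accept directly
    return judge(product, list(counts))         # conflict: arbiter decides


annotators = [
    lambda p: "Chairs",
    lambda p: "Chairs",
    lambda p: "Stools",   # one model disagrees
]
# Toy judge: defers to the first-seen candidate; a real arbiter would be
# another LLM applying taxonomy-aware ruling logic.
judge = lambda p, candidates: candidates[0]
label = annotate_with_arbitration("wooden seat, no backrest", annotators, judge)
```

Requiring multiple independent annotations before arbitration is what makes the resulting training labels robust to any single model's failure mode.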

Results and Impact

The system delivers substantial improvements across several dimensions. For merchants, there is an 85% acceptance rate of predicted categories, indicating high trust in system accuracy. Enhanced product discoverability, consistent catalog organization, better search relevance, precise tax calculations, and reduced manual effort through automated attribute tagging are all reported benefits.

For buyers, the system provides more accurate search results, relevant product recommendations, consistently organized browsing experiences, and structured attributes that clarify product information for informed decisions.

At the platform level, the system processes over 30 million predictions daily. Hierarchical precision and recall have doubled compared to the earlier neural network approach. The structured attribute system spans all product categories, and automated content screening has been enhanced for trust and safety.
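The case study does not define its hierarchical precision and recall metric, but a common formulation (an assumption here) scores the overlap between the ancestor sets of the predicted and true categories rather than comparing leaf labels only:

```python
def hierarchical_pr(pred_path: list, true_path: list) -> tuple:
    """One common definition of hierarchical precision/recall: compare the
    full root-to-leaf paths of predicted and true categories."""
    pred, true = set(pred_path), set(true_path)
    overlap = len(pred & true)
    return overlap / len(pred), overlap / len(true)


# Predicting the parent 'Chairs' when the truth is the deeper
# 'Kitchen & Dining Room Chairs' still earns partial credit.
p, r = hierarchical_pr(
    ["Furniture", "Chairs"],
    ["Furniture", "Chairs", "Kitchen & Dining Room Chairs"],
)
```

Under this kind of metric, a near-miss within the right branch of the taxonomy scores far better than a prediction in an unrelated vertical, which matches how merchants experience classification errors.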

Future Directions

Shopify plans to incorporate new VLM architectures as they become available, expand attribute prediction to specialized product categories, improve multi-lingual product description handling, and further optimize inference pipelines for greater throughput.

A significant architectural evolution is the planned migration from a tree-based taxonomy to a Directed Acyclic Graph (DAG) structure. This will allow multiple valid categorization paths per product, supporting more flexible relationships and cross-category products.
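The difference from a tree can be sketched by letting a category have multiple parents and enumerating every valid root-to-category path. The example product and category names are hypothetical:

```python
def all_paths(taxonomy: dict, category: str) -> list:
    """Enumerate every root-to-category path in a DAG taxonomy, where
    `taxonomy` maps each category to its list of parent categories."""
    parents = taxonomy.get(category, [])
    if not parents:
        return [[category]]  # a root: the path starts here
    return [path + [category]
            for parent in parents
            for path in all_paths(taxonomy, parent)]


# Hypothetical cross-category product: gaming chairs sit under both
# Furniture and Gaming Accessories, which a strict tree cannot express.
dag = {
    "Chairs": ["Furniture"],
    "Gaming Chairs": ["Chairs", "Gaming Accessories"],
}
paths = all_paths(dag, "Gaming Chairs")
```

In a tree, `all_paths` would always return exactly one path; the DAG's multiple paths are what enable the flexible cross-category relationships the migration targets.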

Critical Assessment

While the case study presents impressive results, some caveats are worth noting. The 85% acceptance rate, while high, still means that 15% of predictions require merchant intervention; at 30 million daily predictions, this represents millions of products needing review. The actual impact on merchant outcomes (sales, conversion rates) is not explicitly quantified, making it difficult to assess the full business value.

The model selection process emphasizing efficiency alongside performance is practical, but the specific accuracy metrics for each model generation are not disclosed. The doubled hierarchical precision and recall claim is compelling but lacks baseline numbers for context.

Overall, this case study demonstrates mature LLMOps practices including careful model selection, sophisticated inference optimization, robust pipeline architecture with error handling, and systematic approaches to training data quality—all essential elements for operating LLMs reliably at scale in production environments.
