
LLM Feature Extraction for Content Categorization and Search Query Understanding

Canva 2023

Canva implemented LLMs as a feature extraction method for two key use cases: search query categorization and content page categorization. By replacing traditional ML classifiers with LLM-based approaches, they achieved higher accuracy, reduced development time from weeks to days, and lowered operational costs from $100/month to under $5/month for query categorization. For content categorization, LLM embeddings outperformed traditional methods in terms of balance, completion, and coherence metrics while simplifying the feature extraction process.

Industry

Tech

Overview

This case study comes from a conference presentation by Sheen, a machine learning engineer at Canva working in the content and discovery area. Canva is an online design platform with the mission of enabling everyone to design everything. The presentation focuses on a less commonly discussed use of LLMs: using them as a middle layer for feature extraction rather than building applications directly on top of them. This approach was evaluated across two production use cases within Canva's content categorization systems.

The core thesis of the presentation is that LLMs can provide solutions with higher performance and accuracy, greater flexibility, and reduced cost when used as feature extraction methods for downstream tasks—a claim the speaker supports with concrete operational metrics from real production systems.

Case Study 1: User Search Query Categorization

Problem Context

Canva needed to understand user interests by categorizing aggregated search queries into their content’s tree structure (information architecture). This involved a multi-step funneling process where classification models at each node of the tree structure would route queries to different content categories. The first layer required an intent classification system to categorize queries into different content types (templates, features, etc.).
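The funneling process described above can be sketched as a walk down the content tree, with a classifier choosing a branch at each node. In the sketch below, the tree labels and the `toy_classify` stand-in are hypothetical; in production each node would be backed by a real classification model or LLM call.

```python
# Sketch of the multi-step funnel: a query walks down the content tree,
# with a classifier choosing a child category at each node.
# Tree labels and toy_classify are hypothetical stand-ins.

def route(query, node, classify):
    """Descend the tree; `classify` picks a child label at each node."""
    label = None
    while node:  # each node maps child label -> subtree ({} at leaves)
        label = classify(query, list(node))
        node = node[label]
    return label

# Hypothetical information architecture; the first layer is intent
# classification into content types (templates, features, ...).
tree = {
    "templates": {"presentations": {}, "social-media": {}},
    "features": {},
}

def toy_classify(query, options):
    # Stand-in for a per-node model or LLM call.
    return next((o for o in options if o.split("-")[0] in query), options[0])

print(route("social-media post templates", tree, toy_classify))  # -> social-media
```

The key point is that the routing logic is independent of what sits at each node, so a traditional classifier can be swapped for an LLM call without changing the funnel.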

Traditional ML Approach

The conventional approach required a full model development cycle: collecting and exploring data, training the model, developing inference, and deploying it. End to end, this process took approximately four weeks.

LLM-Based Approach

The LLM approach dramatically simplified the workflow.

This process eliminated the need for training and deployment infrastructure setup and required a much smaller annotated dataset.
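The talk does not show the actual prompt; purely as an illustration, the core of such a pipeline is a prompt builder plus a parser that constrains free-form output back onto a fixed label set. The labels, prompt wording, and the `call_llm` name below are all assumptions.

```python
INTENT_LABELS = ["template", "feature", "other"]  # hypothetical label set

def build_prompt(query):
    """Assemble a constrained classification prompt (wording is illustrative)."""
    return (
        "Classify the search query into exactly one category.\n"
        f"Categories: {', '.join(INTENT_LABELS)}\n"
        f"Query: {query}\n"
        "Answer with the category name only."
    )

def parse_label(response):
    """Map free-form LLM output back onto the known label set."""
    text = response.strip().lower()
    for label in INTENT_LABELS:
        if label in text:
            return label
    return "other"  # fallback keeps malformed outputs from breaking the pipeline

# In production this would wrap an LLM API call, e.g.:
#   label = parse_label(call_llm(build_prompt("birthday card")))
print(parse_label("Template"))  # -> template
```

Constraining the output this way is one common mitigation for the formatting errors LLM APIs occasionally produce.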

Operational Comparison

The presentation provided detailed operational metrics comparing both approaches: development time dropped from roughly four weeks to a matter of days, and the running cost of query categorization fell from about $100/month to under $5/month.

Key Learnings from Case Study 1

The speaker shared several practical takeaways for production LLM deployments:

When to Use LLM APIs:

Prompt Engineering Considerations:

Error Mitigation:

Fine-Tuning Insights:

Case Study 2: Content Page Categorization

Problem Context

Canva has content pages with vastly different characteristics—from template collections with short-form metadata to informational articles with long-form text. The goal was to group pages together based on semantic similarity into topic clusters that align with their information architecture.

Traditional NLP Approach

Due to the significant variation in text length and content structure across pages, the pre-LLM approach required multiple different text feature extraction methods.

This fragmented approach required different frameworks and libraries, creating a scattered development process rather than a unified solution.

LLM-Based Approach

The team experimented with different feature extraction methods combining open-source embeddings (sentence transformers, BERT embeddings) and LLM embeddings. They discovered that LLM embeddings on plain page text—without any text feature transformation—provided the best results.
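As an illustration of the downstream step, the sketch below runs a minimal k-means over stand-in vectors (random blobs in place of real LLM embeddings of page text). The clustering algorithm and cluster count here are assumptions for illustration, not details from the talk.

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Minimal k-means; centers start from k rows spread evenly across X."""
    idx = np.linspace(0, len(X) - 1, k).astype(int)
    centers = X[idx].astype(float)
    for _ in range(iters):
        # Assign each embedding to the nearest center.
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center from its assigned points.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Stand-in for LLM embeddings of plain page text (no feature transformation):
# two synthetic "topics" as well-separated blobs.
rng = np.random.default_rng(1)
pages = np.vstack([rng.normal(0, 0.1, (5, 8)) + 1,
                   rng.normal(0, 0.1, (5, 8)) - 1])
labels = kmeans(pages, k=2)
print(labels)  # pages from the same blob land in the same cluster
```

The appeal of the LLM-embedding route is that this single pipeline replaces the per-page-type feature extraction methods of the traditional approach.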

Evaluation Metrics

The team defined three specific metrics for evaluating content categorization performance: balance, completion, and coherence.
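The formal definitions of balance, completion, and coherence are not reproduced here; the sketch below is one plausible formalization, entirely an assumption for illustration: balance as normalized entropy of cluster sizes, completion as the fraction of pages assigned to any cluster, and coherence as mean within-cluster pairwise cosine similarity of embeddings.

```python
import numpy as np

def balance(labels):
    """Normalized entropy of cluster sizes: 1.0 means perfectly even clusters."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum() / np.log(len(p))) if len(p) > 1 else 0.0

def completion(labels):
    """Fraction of pages assigned to a cluster (label -1 = unassigned)."""
    labels = np.asarray(labels)
    return float((labels >= 0).mean())

def coherence(X, labels):
    """Mean pairwise cosine similarity of embeddings within each cluster."""
    labels = np.asarray(labels)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = []
    for c in np.unique(labels[labels >= 0]):
        members = X[labels == c]
        if len(members) > 1:
            s = members @ members.T
            sims.append(s[np.triu_indices(len(members), k=1)].mean())
    return float(np.mean(sims))

print(balance(np.array([0, 0, 1, 1])))  # -> 1.0 for two equal-sized clusters
```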

Results

LLM embeddings achieved superior performance across all three metrics.

Operational Comparison

Key Learnings from Case Study 2

Feature Variations:

Embeddings Insights:

Future Considerations:

Scaling Considerations

When asked about handling millions of samples daily, the speaker offered two strategies:

Aggregation and Preprocessing: For the search query use case, despite having millions of queries per day, the team performs aggregation and preprocessing to reduce volume before LLM processing. This allows them to still benefit from LLM simplicity while managing costs.
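A minimal sketch of that aggregation step, assuming raw queries arrive as a list of strings: normalizing and counting collapses the daily volume to distinct queries, so the LLM classifies each one once and the result is weighted by frequency. The function name and threshold are illustrative.

```python
from collections import Counter

def aggregate_queries(raw_queries, min_count=2):
    """Collapse raw query traffic to distinct, normalized queries with counts.

    Only queries seen at least `min_count` times go to the LLM; the long
    tail can be handled later or mapped to a default category.
    """
    counts = Counter(q.strip().lower() for q in raw_queries)
    return [(q, n) for q, n in counts.most_common() if n >= min_count]

raw = ["Birthday Card", "birthday card", "resume", "birthday card ", "logo"]
print(aggregate_queries(raw))  # -> [('birthday card', 3)]
```

Because popular queries repeat heavily, this kind of deduplication can shrink LLM call volume by orders of magnitude while still covering most traffic.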

Open-Source LLMs: For cases where aggregation isn’t viable, the recommendation is to explore open-source LLMs deployed within organizational infrastructure. This involves more upfront setup but can eliminate ongoing API costs at scale.

Production Operations Insights

The presentation offers several practical insights for running LLMs in production.

Balanced Assessment

The case study presents compelling metrics for LLM adoption, though several caveats should be noted.

Overall, this case study provides valuable evidence that LLMs can serve as effective feature extraction tools for downstream ML tasks, offering faster development cycles and potentially lower costs compared to building single-purpose models, particularly for organizations already operating at scale with established ML infrastructure.
