Superlinked: Production Vector Search and Retrieval System Optimization at Scale

LLMOps Database

Tech

Superlinked

Company

Superlinked

Title

Production Vector Search and Retrieval System Optimization at Scale

Industry

Tech

Link

https://www.youtube.com/watch?v=GnY4mdma7TU

Year

2025

Summary (short)

SuperLinked, a company focused on vector search infrastructure, shares production insights from deploying information retrieval systems for e-commerce and enterprise knowledge management with indexes up to 2 terabytes. The presentation addresses challenges in relevance, latency, and cost optimization when deploying vector search systems at scale. Key solutions include avoiding vector pooling/averaging, implementing late interaction models, fine-tuning embeddings for domain-specific needs, combining sparse and dense representations, leveraging graph embeddings, and using template-based query generation instead of unconstrained text-to-SQL. Results demonstrate 5%+ precision improvements through targeted fine-tuning, significant latency reductions through proper database selection and query optimization, and improved relevance through multi-encoder architectures that combine text, graph, and metadata signals.

## Overview This case study presents production insights from SuperLinked, a company building frameworks for vector search and information retrieval systems. The speaker, Daniel (CEO of SuperLinked), previously served as an ML tech lead at YouTube working on ad performance forecasting with representation learning. SuperLinked has deployed information retrieval systems across e-commerce and enterprise knowledge management use cases, with experience handling indexes up to 2 terabytes. The presentation focuses on practical lessons learned from getting these systems into production, with an emphasis on relevance optimization, latency management, and cost considerations. ## Establishing Clear Metrics and Business Objectives A critical first principle emphasized throughout is the importance of identifying the right optimization targets before building solutions. Daniel observes that many search teams optimize the wrong metrics or jump directly to technical solutions without understanding business objectives. He advocates for a framework that balances multiple factors including precision, recall, cost, latency, reliability, system complexity, and compliance requirements. The key insight is translating vague business objectives into measurable key results. For example, rather than "sell more products," a proper objective would be "increase AI search result click-through rate by 20% through improved relevance." This creates a direct link between the technical work (improving relevance) and the measurable outcome (CTR improvement), which then feeds into evaluation dataset construction. Without this clarity, teams struggle to prioritize their optimization efforts effectively. Daniel also shares data from vector search office hours with 79 responses, revealing that relevance of results is by far the biggest blocker to production deployment (more than cost, latency, or throughput combined). This validates the focus on relevance optimization as a primary concern for LLMOps practitioners. ## Evaluation Strategy and Dataset Construction The approach to evaluation splits into two components: a sampled, comprehensive eval set that covers broad ground and feels robust for iteration, and a separate collection of "observed crazy cases from production." This dual approach helps teams understand when their sampled eval diverges from real-world edge cases. The emphasis is on maintaining evals rather than completely abandoning them for production metrics alone, particularly for embeddings and retrieval systems where prompt optimization has less impact than with generative models. ## The Vector Pooling Problem One of the most technically detailed sections addresses the fundamental issue of vector pooling and averaging. Daniel demonstrates through concrete examples why averaging embeddings breaks down beyond trivial cases. Using a hotel amenity search example with the Qwen 3 600M parameter model (a top-performing model on MTEB benchmarks), he shows that when you average embeddings for "red fancy shoes," "cheap beach sandals," and "yellow high heels," the resulting vector is closer to "clothes" than to "footwear," and the nearest string to this average is actually "skirts show fleshy" - a nonsensical result. This illustrates that embedding spaces are not as smooth as intuition suggests. The problem occurs because models internally pool token embeddings, so even single-vector document representations involve this averaging behavior. The recommendation is to avoid making single vector representations of long documents, instead creating vectors of sentences or small passages while using late chunking techniques (making embeddings first through the full document context, then chunking and aggregating token vectors). Late interaction models like ColBERT work specifically because they avoid pooling - they keep token vectors separate for both queries and documents, performing multiple queries with each query token independently and aggregating results, rather than averaging vectors before comparison. This architectural choice is fundamental to why late interaction achieves better relevance. ## Fine-Tuning Embeddings for Production Daniel provides specific use cases where fine-tuning embeddings produces measurable improvements. The first example involves product reviews where adjective-noun coupling matters significantly. Using hotel amenity search, he demonstrates that pre-trained models (including Qwen 3) score "good Wi-Fi" most similarly to "good Wi-Fi" (correct), but then rank "bad Wi-Fi" and "no Wi-Fi" as second and third most similar - catastrophically wrong results since "bad Wi-Fi" represents the opposite of what users want. This is fixable through fine-tuning with triplet loss, where "good Wi-Fi" is the anchor, "good Wi-Fi" examples are positive, and "bad Wi-Fi" examples are negative. The optimizer learns to push negative examples away from the anchor in embedding space. This requires only small amounts of data and can be implemented in 15 lines of code with the sentence-transformers library, achieving 5%+ precision improvements. A second fine-tuning use case emerges when rerankers consistently find good results deep in candidate sets (e.g., at position 500+). This signals that the reranker knows something the embedding model doesn't. Fine-tuning the embedding model to approximate the cross-encoder's ranking using mean squared error on the domain-specific documents can make retrieval more efficient by reducing reranking requirements. The approach works because while cross-encoders have more capacity and resolution, embedding models can be tuned to approximate their behavior on specific domains (e-commerce products, internal knowledge bases, etc.). For cases where mining triplets is difficult, using just positive examples (identified by rerankers or LLM-as-judge) and random negatives with appropriate loss functions also works in practice. Daniel also notes that prompt optimization for embeddings (similar to LoRA for generative models) is worth exploring as a lighter-weight alternative to full fine-tuning. ## Context Engineering and Multi-Modal Signals A recurring theme is that not all context is strings, and treating everything as text tokens is inefficient. Numbers (like timestamps) should not be stringified and run through cross-encoders when they can be handled directly through proper scoring functions. Daniel criticizes the tendency to apply a single technique (like reranking) to all problems - teams that learn to add cross-encoder rerankers often just keep expanding candidate sets rather than fixing underlying retrieval issues. The recommendation is to bring signals down into retrieval itself rather than relying entirely on reranking. This includes properly handling metadata, timestamps, and structural information. He specifically warns about database implementations that force full scans when combining vector search with custom scoring (OpenSearch's behavior with timestamp boosting is cited as an example), which can make clusters 5x larger than necessary. ## Text-to-Query Generation and Template-Based Approaches For agent-based systems or natural language search, Daniel strongly advises against unconstrained text-to-SQL or text-to-query generation. These approaches are slow and unreliable. Instead, he recommends maintaining a finite set of query templates that represent specific tasks or patterns (e.g., "find recent items within radius" or "search by popularity and recency"). Language models then fill in template parameters using structured output libraries like Instructor. Breaking this into parallel LLM calls per intent (recency intent, popularity intent, location mention, etc.) enables using smaller models with focused prompts optimized via DSPy. This approach proved successful with Gemini Flash (the smallest hosted Gemini model) for query understanding tasks. The templates become tools that agents can discover and use, constraining them to proven query patterns rather than generating arbitrary database queries. This philosophy aligns with recent lessons from Vercel's MCP server work - tools should be more end-to-end task-oriented rather than providing fine-grained modular building blocks, because agents struggle to deeply inspect intermediate outputs and chain many small operations reliably. ## Graph Embeddings and Relationship Structure Beyond text and metadata, relationship structure provides valuable signal. Basic graph RAG approaches (finding nodes via vector search, then expanding to neighbors) only exploit local structure through a single join operation. To leverage global graph structure, graph embeddings compress the adjacency matrix in an unsupervised way, training models that place similar nodes (within the same structural neighborhood) near each other in embedding space. SuperLinked's framework combines graph embeddings with other encoder types, concatenating graph structure embeddings with content-based embeddings. This relates to the concept of semantic IDs - treating content properties (like product color or category) as tokens and user interactions as a "language" to train models that predict interactions or find similar entities. Graph embeddings are particularly valuable in e-commerce for recommendations and personalized search, and they work even for new users without interaction history when combined with content features. ## Sparse vs Dense Representations Daniel addresses the recent DeepMind paper on limits of dense representations, which some interpreted as evidence that dense vectors are inadequate. He clarifies that the paper constructs a worst-case scenario - a sparse, random dataset (people liking random fruits) where dense vectors naturally struggle because they exploit patterns and correlations in data. When data is random sparse noise, dense vectors can't compress it effectively, while BM25/sparse vectors directly represent the sparse matrix. However, real-world data has relationships and patterns where dense vectors excel. The practical recommendation is to use both sparse and dense representations together. Specific use cases like names, acronyms, and exact keyword matching benefit from sparse vectors, and trying to force these through dense embeddings requires excessive fine-tuning. The ideal implementation combines both in a single index pass that scores both distances and combines results, with rank fusion (separate queries combined via interleaving then reranking) as a fallback. ## Database Selection and Performance Considerations SuperLinked maintains a comparison table (vdbs.superlinked.com) of 48 different vector databases. Through extensive vendor work, Daniel identifies two fundamental database types: - **Full scan optimized databases** that don't build sophisticated ANNS indexes but excel at scanning with flexible scoring (allowing sparse+dense scoring in one pass). These require high compute costs, especially without selective filters. - **Traditional vector databases** with in-memory storage and ANNS indexes where memory dictates cost. These often implement metadata filtering or boosting before/after ANNS traversal, causing recall problems. Understanding these tradeoffs is critical for production deployment. Teams should benchmark databases on concurrent reads and writes (where most perform poorly) and read vendor documentation carefully. Many "features" like metadata boosting kick databases into full scan mode or reranking-only modes with significant performance implications. ## SuperLinked's Multi-Encoder Framework The SuperLinked framework addresses latency and relevance by combining multiple encoders (text, graph, metadata) into unified vector representations. This allows approximate nearest neighbor search indexes to incorporate diverse signals without metadata boosting or reranking. The approach uses simple mathematical combination rules and is open source (Apache 2.0 license). The company incentivizes adoption by offering $500 and publication for users who build something and share their experience. ## Production Deployment Considerations Throughout the presentation, Daniel emphasizes thinking about what information helps surface the right output versus how to describe outputs to LLMs. Since LLMs only understand text tokens, timestamps get tokenized and misunderstood - they can't reliably do timestamp math despite appearing to work on simple cases due to memorized addition tables. This creates weird errors in 5% of cases. For context assembly, the recommendation is to start with all available metadata about documents (creator, timestamp, organizational structure, access control, team/group location) and prune based on impact analysis. Often the most valuable signals aren't document content but metadata like who modified something last or which team created it, enabling workflows like emailing the right person or understanding usage patterns. The presentation concludes with practical recommendations to explore frameworks beyond early LangChain experiences, noting that while many dismissed frameworks after initial disappointing experiences, newer options like DSPy for prompt optimization, late interaction models, and specialized tools provide significant value in production systems. The overall message emphasizes that production vector search requires careful attention to multiple optimization dimensions simultaneously, with different techniques appropriate for different stages of the retrieval pipeline and different types of signals.

Start deploying reproducible AI workflows today

Enterprise-grade MLOps platform trusted by thousands of companies in production.

Book a Demo

Use Open Source