ZenML

Building a Scalable Retriever-Ranker Architecture: Malt's Journey with Vector Databases and LLM-Powered Freelancer Matching

Malt 2024

Malt implemented a retriever-ranker architecture for their freelancer recommendation system, leveraging a vector database (Qdrant) to improve matching speed and scalability. The case study highlights the importance of carefully selecting and integrating vector databases in LLM-powered systems, emphasizing performance benchmarking, filtering capabilities, and deployment considerations to achieve significant improvements in response times and recommendation quality.

Industry

Tech

Overview

Malt is a freelancer marketplace platform connecting freelancers with projects. Their data team continuously works on improving the recommendation system that powers these connections. In 2023, they undertook a significant architectural overhaul of their matching system to address fundamental limitations in their existing approach. This case study documents their journey from a slow, monolithic matching model to a modern retriever-ranker architecture backed by a vector database, demonstrating practical considerations for deploying transformer-based models at scale in production.

The Problem

Malt’s original matching system, released in 2021, relied on a single monolithic model. While they had incrementally added capabilities like multilingual support, the architecture had fundamental limitations:

These limitations prompted the team to explore alternative architectures in 2023 that could balance accuracy with production-grade performance.

Technical Approach: Retriever-Ranker Architecture

The team evaluated multiple neural architectures before settling on their solution. Understanding their decision-making process provides valuable insight into the tradeoffs inherent in deploying ML systems at scale.

Cross-Encoder Evaluation

Cross-encoders process two input texts jointly to produce a single relevance score. For Malt’s use case, this would mean encoding a freelancer profile and project description together. While cross-encoders achieve high precision because they can capture complex relationships between entities, they have a fundamental scalability problem: for each incoming project, the system would need to create and process over 700,000 pairs (one for each freelancer). This approach simply doesn’t scale for real-time production use.
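The pair-explosion problem can be made concrete with a toy sketch. The scoring function below is a deliberately trivial stand-in for a real transformer cross-encoder (which would jointly encode both texts); the point is the query-time cost structure, not the scoring logic:

```python
# Toy sketch of the cross-encoder bottleneck. The scorer is an illustrative
# stand-in for a jointly-encoded transformer relevance score.

def cross_encoder_score(project: str, profile: str) -> float:
    """Stand-in: fraction of profile words shared with the project text."""
    shared = set(project.lower().split()) & set(profile.lower().split())
    return len(shared) / max(len(profile.split()), 1)

freelancers = [
    "python backend developer",
    "react frontend engineer",
    "data scientist python ml",
]
project = "python ml pipeline development"

# A cross-encoder must score every (project, freelancer) pair at query time:
# with ~700,000 profiles that means ~700,000 forward passes per request.
scores = [(f, cross_encoder_score(project, f)) for f in freelancers]
best = max(scores, key=lambda s: s[1])
```

With three profiles this is instant; with 700,000 it is one expensive model invocation per profile per request, which is exactly why Malt ruled this architecture out for real-time serving.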

Bi-Encoder Approach

Bi-encoders encode each entity independently into embedding vectors in a shared semantic space. Similarity can then be measured using metrics like cosine similarity. The critical advantage is that freelancer embeddings can be precomputed and stored. At inference time, only the project embedding needs to be computed in real-time, followed by an Approximate Nearest Neighbor (ANN) search to find similar profiles. While this approach is much faster, it sacrifices some accuracy because the model cannot capture the complex interactions between specific freelancer-project pairs.
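A minimal pure-Python sketch of the bi-encoder flow (the embeddings and names below are made up, and a real system would use learned transformer embeddings plus an ANN index rather than brute-force cosine search):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Freelancer embeddings are precomputed offline and stored in a vector DB.
freelancer_embeddings = {
    "alice": [0.9, 0.1, 0.0],
    "bob":   [0.1, 0.9, 0.2],
    "carol": [0.8, 0.2, 0.1],
}

def retrieve(project_embedding, k=2):
    """Only the project is embedded at query time; profiles are looked up."""
    ranked = sorted(freelancer_embeddings.items(),
                    key=lambda kv: cosine(project_embedding, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```

The precompute-then-search split is what makes the approach fast: the per-request work is one project embedding plus an ANN lookup, independent of how the 700,000+ profile embeddings were produced.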

The Hybrid Solution

Malt implemented a two-stage retriever-ranker architecture that leverages both approaches:

The team made a pragmatic deployment decision: they rolled out the retriever first, placing it before their existing matching model (which served as the ranker). This allowed them to validate the scalability and performance of the retriever component independently before developing a more sophisticated ranker.
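The two-stage flow can be sketched as follows. Both scoring functions are illustrative stand-ins (the real retriever is a bi-encoder over precomputed embeddings, and Malt's initial ranker was their existing matching model), but the shape of the pipeline is the same: a cheap stage narrows the full pool to a small candidate set, and an expensive stage reorders only those candidates.

```python
# Two-stage retriever-ranker sketch with illustrative stand-in scorers.

def cheap_retriever_score(project_skills, profile):
    # stand-in for bi-encoder similarity over precomputed embeddings
    return len(project_skills & profile["skills"])

def expensive_ranker_score(project_skills, profile):
    # stand-in for a heavier model that inspects the full pair jointly
    return len(project_skills & profile["skills"]) * profile["rating"]

profiles = [
    {"name": "alice", "skills": {"python", "ml"}, "rating": 4.2},
    {"name": "bob",   "skills": {"react", "css"}, "rating": 4.9},
    {"name": "carol", "skills": {"python", "sql"}, "rating": 4.8},
]

def match(project_skills, retrieve_k=2):
    # Stage 1: cheap retrieval narrows the pool to retrieve_k candidates.
    candidates = sorted(profiles,
                        key=lambda p: cheap_retriever_score(project_skills, p),
                        reverse=True)[:retrieve_k]
    # Stage 2: the expensive ranker reorders only the candidates.
    return sorted(candidates,
                  key=lambda p: expensive_ranker_score(project_skills, p),
                  reverse=True)

top = match({"python", "ml"})
```

The design keeps the expensive model's input size bounded by `retrieve_k` regardless of total catalog size, which is what made the staged rollout (retriever first, existing model as ranker) safe to validate independently.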

Embedding Model Development

The team built custom transformer-based encoder models rather than using off-the-shelf solutions. Key characteristics include:

The team explicitly noted that training their own specialized model produced better results than generic embedding models such as those from Sentence-Transformers. This is an important operational insight: domain-specific training data often yields superior results for specialized use cases.

Vector Database Selection

The team identified three core requirements for their vector database:

Filtering Considerations

The team provided a thoughtful analysis of why filtering needs to happen within the vector database itself rather than as a pre- or post-processing step:
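The core failure mode of naive post-filtering is easy to demonstrate with a toy example (scalar "similarities" and made-up cities stand in for real embeddings and geo constraints): if the filter is applied after taking the global top-k, the matching results may all be filtered away, while pre-filtering the entire corpus before an exhaustive scan defeats the purpose of an ANN index at scale. Filtering inside the index search avoids both problems.

```python
# Post-filtering vs. in-search filtering, with illustrative toy data.
profiles = [("p1", 0.99, "Berlin"), ("p2", 0.98, "Berlin"),
            ("p3", 0.97, "Berlin"), ("p4", 0.60, "Paris"),
            ("p5", 0.55, "Paris")]

def top_k(items, k):
    """Brute-force stand-in for an ANN top-k search."""
    return sorted(items, key=lambda p: p[1], reverse=True)[:k]

# Post-filtering: take the global top-3, then keep only Paris -> 0 results,
# even though valid Paris profiles exist in the corpus.
post = [p for p in top_k(profiles, 3) if p[2] == "Paris"]

# Filter-aware search: restrict the candidate set first, then take top-3.
pre = top_k([p for p in profiles if p[2] == "Paris"], 3)
```

A production vector database applies the filter while traversing its index, getting the correctness of `pre` without a full scan of the corpus, which is why first-class filtering support was a hard requirement in Malt's selection.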

Benchmark Results

The team conducted rigorous benchmarking using the GIST1M Texmex corpus (1 million vectors, 960 dimensions each):

Qdrant emerged as the winner, offering the best balance of speed (30x faster than Elasticsearch), reasonable precision, and the geo-spatial filtering capabilities the team required. Pinecone, despite being a well-known player, had to be excluded because it lacked the required filtering features, a reminder that feature completeness matters as much as raw performance.

Production Deployment

Kubernetes-Based Infrastructure

The team self-hosted Qdrant on their existing Kubernetes clusters rather than using a managed service. Key deployment decisions included:

Sharding and Replication

The deployment used distributed data architecture:
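The mechanics of a sharded, replicated deployment can be sketched as below. The shard and replica counts are illustrative assumptions, not Malt's actual configuration, and real systems (Qdrant included) handle routing internally; the sketch only shows the general idea of hash-based shard routing with per-shard replicas.

```python
import hashlib

# Illustrative topology: NOT Malt's actual configuration.
NUM_SHARDS = 4
REPLICAS_PER_SHARD = 2

def shard_for(point_id: str) -> int:
    """Deterministically route a point to a shard by hashing its ID."""
    digest = hashlib.sha256(point_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def replicas_for(shard: int) -> list:
    """Each shard is served by several replicas for availability and
    read throughput; here replicas are just numbered slots."""
    return [(shard * REPLICAS_PER_SHARD) + r for r in range(REPLICAS_PER_SHARD)]

shard = shard_for("freelancer-12345")
```

Sharding bounds the index size each node must search, while replication lets reads fan out across copies and keeps a shard available if one replica's pod is rescheduled, both natural fits for a Kubernetes deployment.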

Monitoring

The team implemented comprehensive observability using:

Results

The production deployment achieved significant improvements:

Key Learnings and Operational Insights

The team shared valuable lessons from their implementation:

Future Work

The team outlined planned extensions:

This case study represents a practical example of scaling transformer-based models for production recommendation systems, with careful attention to the engineering tradeoffs between model accuracy and operational requirements.
