Malt implemented a retriever-ranker architecture for its freelancer recommendation system, leveraging a vector database (Qdrant) to improve matching speed and scalability. The case study highlights the importance of carefully selecting and integrating vector databases in LLM-powered systems, emphasizing performance benchmarking, filtering capabilities, and deployment considerations to achieve significant improvements in response times and recommendation quality.
Malt is a freelancer marketplace platform connecting freelancers with projects. Their data team continuously works on improving the recommendation system that powers these connections. In 2023, they undertook a significant architectural overhaul of their matching system to address fundamental limitations in their existing approach. This case study documents their journey from a slow, monolithic matching model to a modern retriever-ranker architecture backed by a vector database, demonstrating practical considerations for deploying transformer-based models at scale in production.
Malt’s original matching system, released in 2021, relied on a single monolithic model. While they had incrementally added capabilities like multilingual support, the architecture had fundamental limitations, most notably in matching speed and scalability.
These limitations prompted the team to explore alternative architectures in 2023 that could balance accuracy with production-grade performance.
The team evaluated multiple neural architectures before settling on their solution. Understanding their decision-making process provides valuable insight into the tradeoffs inherent in deploying ML systems at scale.
Cross-encoders process two input texts jointly to produce a single relevance score. For Malt’s use case, this would mean encoding a freelancer profile and project description together. While cross-encoders achieve high precision because they can capture complex relationships between entities, they have a fundamental scalability problem: for each incoming project, the system would need to create and process over 700,000 pairs (one for each freelancer). This approach simply doesn’t scale for real-time production use.
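To make the pairwise cost concrete, here is a minimal sketch using a generic public cross-encoder checkpoint as a stand-in (the model name and texts are illustrative, not Malt's):

```python
from sentence_transformers import CrossEncoder

# Generic public checkpoint as a stand-in for Malt's in-house model.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

project = "Build a React dashboard for logistics analytics"
profiles = [
    "Senior frontend developer, React/TypeScript, data visualization",
    "Backend engineer, Java and Spring, payments experience",
]

# Each (project, profile) pair goes through the full transformer jointly.
# With 700,000+ freelancers that is 700,000+ forward passes per incoming
# project -- the scalability problem described above.
scores = model.predict([(project, p) for p in profiles])
print(scores)  # one relevance score per pair
```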
Bi-encoders encode each entity independently into embedding vectors in a shared semantic space. Similarity can then be measured using metrics like cosine similarity. The critical advantage is that freelancer embeddings can be precomputed and stored. At inference time, only the project embedding needs to be computed in real-time, followed by an Approximate Nearest Neighbor (ANN) search to find similar profiles. While this approach is much faster, it sacrifices some accuracy because the model cannot capture the complex interactions between specific freelancer-project pairs.
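A corresponding bi-encoder sketch, again with a generic public checkpoint standing in for Malt's custom encoder, shows why precomputation works:

```python
from sentence_transformers import SentenceTransformer, util

# Generic public checkpoint as a stand-in for Malt's custom encoder.
model = SentenceTransformer("all-MiniLM-L6-v2")

profiles = [
    "Senior frontend developer, React/TypeScript, data visualization",
    "Backend engineer, Java and Spring, payments experience",
]

# Freelancer embeddings are computed once, offline; in production they
# live in a vector database rather than in memory.
profile_embeddings = model.encode(profiles, normalize_embeddings=True)

# At inference time only the incoming project needs encoding.
project_embedding = model.encode(
    "Build a React dashboard for logistics analytics",
    normalize_embeddings=True,
)

# Cosine similarity in the shared space ranks all candidates at once.
print(util.cos_sim(project_embedding, profile_embeddings))
```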
Malt implemented a two-stage retriever-ranker architecture that leverages both approaches (see the sketch after the two stages below):
Retriever Stage: Uses a bi-encoder to quickly narrow down the 700,000+ freelancers to a manageable subset (e.g., 1,000 candidates). This stage optimizes for recall, ensuring potentially relevant matches aren’t missed.
Ranker Stage: Applies a cross-encoder to the smaller candidate set to precisely score and rank matches. This stage optimizes for precision, ensuring the most suitable freelancers appear at the top of recommendations.
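Putting the two stages together, the flow might look roughly as follows; the checkpoints, the Qdrant endpoint, and the `profile_text` payload field are illustrative assumptions rather than Malt's production setup:

```python
from qdrant_client import QdrantClient
from sentence_transformers import CrossEncoder, SentenceTransformer

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")                  # stand-in retriever
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # stand-in ranker
client = QdrantClient(url="http://localhost:6333")                    # assumed local instance

def recommend(project_text: str, top_k: int = 10):
    # Stage 1 (retriever): ANN search over precomputed freelancer
    # embeddings narrows 700,000+ profiles to ~1,000 candidates.
    query_vector = bi_encoder.encode(project_text, normalize_embeddings=True)
    candidates = client.search(
        collection_name="freelancers",       # hypothetical collection
        query_vector=query_vector.tolist(),
        limit=1000,
    )

    # Stage 2 (ranker): the cross-encoder rescores only the candidates,
    # so its per-pair cost applies to 1,000 pairs instead of 700,000.
    pairs = [(project_text, hit.payload["profile_text"]) for hit in candidates]
    scores = cross_encoder.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]
```

The asymmetry is the point: the cheap ANN lookup absorbs the scale, while the expensive cross-encoder only ever sees a bounded candidate set.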
The team made a pragmatic deployment decision: they rolled out the retriever first, placing it before their existing matching model (which served as the ranker). This allowed them to validate the scalability and performance of the retriever component independently before developing a more sophisticated ranker.
The team built custom transformer-based encoder models, trained on their own data, rather than using off-the-shelf solutions.
The team explicitly noted that training their own specialized model provided better results than using generic embedding models like those from Sentence-Transformers. This is an important operational insight: domain-specific training data often leads to superior results for specialized use cases.
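The write-up does not detail Malt's training recipe, but a common approach to training a domain-specific bi-encoder from historical matched pairs, sketched here with a placeholder base model and hypothetical data, uses in-batch negatives:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Placeholder base model; Malt's actual architecture is not public.
model = SentenceTransformer("distilbert-base-multilingual-cased")

# Hypothetical training pairs: (project description, profile of a
# freelancer who was successfully matched to it).
train_examples = [
    InputExample(texts=[
        "Build a React dashboard for logistics analytics",
        "Senior frontend developer, React/TypeScript, data visualization",
    ]),
    # ... many more historical pairs
]
loader = DataLoader(train_examples, shuffle=True, batch_size=32)

# In-batch negatives: every other pair in a batch acts as a negative,
# a standard objective for retrieval-oriented bi-encoders.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```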
The team identified three core requirements for their vector database, with filtering support among the most consequential.
The team provided a thoughtful analysis of why filtering must happen inside the vector database rather than as a pre- or post-processing step (see the sketch after this list):
Pre-filtering requires sending binary masks of excluded freelancers, creating large payloads and potentially degrading ANN quality, especially with HNSW algorithms where deactivating too many nodes can create sparse, non-navigable graphs
Post-filtering requires retrieving many more candidates than needed to account for filtered results, impacting performance proportionally to candidate count
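For illustration, this is roughly what in-database filtering looks like with Qdrant's Python client, combining a geo-spatial constraint with a payload condition; the collection and field names are hypothetical:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    FieldCondition, Filter, GeoPoint, GeoRadius, MatchValue,
)

client = QdrantClient(url="http://localhost:6333")
project_vector = [0.1] * 384  # stand-in for the bi-encoder's project embedding

# The filter is evaluated inside the ANN search itself: Qdrant prunes
# the HNSW traversal with these conditions instead of masking candidates
# up front or over-fetching and discarding afterwards.
hits = client.search(
    collection_name="freelancers",
    query_vector=project_vector,
    query_filter=Filter(must=[
        FieldCondition(  # geo-spatial constraint: ~50 km around Paris
            key="location",
            geo_radius=GeoRadius(center=GeoPoint(lon=2.35, lat=48.86),
                                 radius=50_000.0),
        ),
        FieldCondition(key="available", match=MatchValue(value=True)),
    ]),
    limit=1000,
)
```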
The team conducted rigorous benchmarking of the candidate databases using the GIST1M Texmex corpus (1 million vectors, 960 dimensions each).
Qdrant emerged as the winner, offering the best balance of speed (30x faster than Elasticsearch), reasonable precision, and critical geo-spatial filtering capabilities. The team noted that Pinecone, despite being a well-known player, had to be excluded due to missing filtering requirements, a reminder that feature completeness matters as much as raw performance.
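The article does not publish the benchmark harness itself, but the core measurement reduces to comparing ANN results against exact brute-force neighbors while timing queries. A minimal sketch under that assumption, with a hypothetical `search_ann` call standing in for the engine under test:

```python
import time
import numpy as np

def exact_top_k(corpus: np.ndarray, query: np.ndarray, k: int) -> set:
    """Ground-truth neighbors by brute-force inner product."""
    return set(np.argsort(-(corpus @ query))[:k].tolist())

def recall_at_k(ann_ids, truth: set, k: int) -> float:
    return len(set(ann_ids[:k]) & truth) / k

# Stand-in data with GIST-like dimensionality (the real corpus has 1M vectors).
corpus = np.random.rand(10_000, 960).astype(np.float32)
query = np.random.rand(960).astype(np.float32)
truth = exact_top_k(corpus, query, k=100)

start = time.perf_counter()
ann_ids = search_ann(query, k=100)  # hypothetical call into the engine under test
elapsed_ms = (time.perf_counter() - start) * 1_000
print(f"recall@100 = {recall_at_k(ann_ids, truth, 100):.3f} in {elapsed_ms:.1f} ms")
```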
The team self-hosted Qdrant on their existing Kubernetes clusters rather than using a managed service, a choice that shaped several key deployment decisions.
The deployment used a distributed data architecture.
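As a hedged illustration of what such a setup can involve, Qdrant spreads a collection across cluster nodes via shards and replicas configured at creation time; the endpoint, counts, and vector size below are assumptions, not Malt's actual values:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://qdrant.internal:6333")  # assumed in-cluster endpoint

# Illustrative values: shards spread the collection across cluster
# nodes, while replicas keep search available through node failures
# and rolling Kubernetes updates.
client.create_collection(
    collection_name="freelancers",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),  # size is a stand-in
    shard_number=4,
    replication_factor=2,
)
```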
The team implemented comprehensive observability around the deployment.
The production deployment achieved significant improvements in both response times and recommendation quality.
The team shared valuable lessons from their implementation, including the value of rigorous benchmarking and of treating filtering capabilities as a first-class selection criterion.
The team also outlined planned extensions to the architecture.
This case study represents a practical example of scaling transformer-based models for production recommendation systems, with careful attention to the engineering tradeoffs between model accuracy and operational requirements.