Company
LinkedIn
Title
AI-Powered Semantic Job Search at Scale
Industry
Tech
Year
2025
Summary (short)
LinkedIn transformed its traditional keyword-based job search into an AI-powered semantic search system to serve 1.2 billion members. The company addressed limitations of exact keyword matching by implementing a multi-stage LLM architecture combining retrieval and ranking models, supported by synthetic data generation, GPU-optimized embedding-based retrieval, and cross-encoder ranking models. The solution enables natural language job queries like "Find software engineer jobs that are mostly remote with above median pay" while maintaining low latency and high relevance at massive scale through techniques like model distillation, KV caching, and exhaustive GPU-based nearest neighbor search.
## Overview

LinkedIn, serving 1.2 billion members globally, undertook a fundamental transformation of its job search functionality by moving from traditional keyword-based search to a semantic, AI-powered experience. The core problem with the legacy system was its inability to understand nuanced user intent: job seekers had to know the exact keywords to find relevant opportunities, which particularly disadvantaged newcomers to the workforce. The new system allows users to describe their ideal job in natural language, such as "Find software engineer jobs in Silicon Valley or Seattle that are mostly remote but not from sourcing companies, posted recently, with above median pay" or "I want to make a difference in the environment using my marketing background."

This case study is particularly notable from an LLMOps perspective because it demonstrates a comprehensive production ML system that addresses model capacity, infrastructure scaling, training data generation, model distillation, and serving optimization: all critical challenges when deploying LLMs at scale.

## Architecture and System Design

The fundamental architectural principle driving the system is the recognition that an "ideal" single powerful model comparing every query to every job is computationally infeasible at scale. LinkedIn's solution employs a multi-stage distillation approach: a powerful foundational "teacher" model is created first, and retrieval and ranking models are then fine-tuned to closely mimic this teacher through various distillation techniques. This architectural decision is significant from an LLMOps standpoint because it explicitly addresses the cost-compute-accuracy tradeoff that plagues production LLM deployments.

The team reduced what was previously a nine-stage pipeline (often duplicated across a dozen different job search and recommendation channels) to a simplified, aligned system. This reduction in complexity not only improved model performance and interpretability but also significantly enhanced developer velocity, a crucial but often overlooked aspect of LLMOps.

The system incorporates multiple optimization objectives through multi-objective optimization (MOO):

- **Semantic textual similarity**: Measuring alignment between queries and job postings
- **Engagement-based prediction**: Estimating the likelihood of user actions such as clicking or applying
- **Value-based prediction**: Estimating whether the user matches the qualifications and the probability of being hired

A critical design decision was ensuring that the retrieval and ranking stages use the same MOO framework, maintaining alignment throughout the pipeline.

## Training Data Generation

One of the most innovative aspects of this implementation is the approach to training data. The team recognized that existing click-log data would not capture the nuances of future natural language query patterns, so they developed a synthetic data generation strategy that uses advanced LLMs with carefully designed prompt templates for semantic textual similarity tasks. Defining the grading policy itself proved challenging: the team had to explicitly codify dozens of different cases and how to grade them, reasoning that if a human cannot explain why something is rated "good" versus "excellent," neither can an LLM. Initially, human evaluators used this product policy to grade query-job pairs, but that proved too time-consuming to scale.
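The post does not publish the grading policy itself, but the idea of explicitly codifying cases can be made concrete with a short sketch. The rubric labels, fields, and wording below are illustrative assumptions, not LinkedIn's actual policy:

```python
# Illustrative sketch: codifying a relevance-grading policy as a prompt template
# that either human annotators or an LLM judge could apply consistently.
# All labels and criteria are hypothetical.

GRADING_RUBRIC = """\
You are grading how well a job posting matches a job seeker's query and profile.
Apply the policy below and return exactly one label.

EXCELLENT: every explicit constraint in the query (title, location, remote
  preference, seniority) is satisfied and the seeker meets the core qualifications.
GOOD: the primary intent (role and location) is satisfied, but a secondary
  preference such as pay range or company type is missed.
FAIR: related occupation or adjacent skills, but at least one stated preference
  is violated.
POOR: different occupation, or a hard constraint (e.g. location or work
  authorization) is clearly violated.
"""

PROMPT_TEMPLATE = """{rubric}
Query: {query}
Seeker profile summary: {profile_summary}
Job posting: {job_text}

Label:"""


def build_grading_prompt(query: str, profile_summary: str, job_text: str) -> str:
    """Render one query-member-job record into a grading prompt for an LLM judge."""
    return PROMPT_TEMPLATE.format(
        rubric=GRADING_RUBRIC,
        query=query,
        profile_summary=profile_summary,
        job_text=job_text,
    )


if __name__ == "__main__":
    print(build_grading_prompt(
        query="mostly remote software engineer roles with above median pay",
        profile_summary="Backend engineer, 5 years of experience, based in Seattle",
        job_text="Senior Backend Engineer, hybrid (3 days onsite), Bellevue, WA ...",
    ))
```

A template along these lines is what human annotators applied first, and what the fine-tuned grading LLM described next learns to apply automatically.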
The solution was to build an LLM fine-tuned on human annotations that could apply the learned product policy to grade arbitrary query-member-job records automatically. This approach enables annotation at the scale of millions or tens of millions of grades per day, far beyond human capacity. The synthetic data pipeline serves dual purposes: continuously training models to improve over time, and acting as a safeguard to ensure relevance when testing new features.

## Query Engine and Semantic Understanding

The query engine is a sophisticated component that goes well beyond simple embedding generation. It handles:

- **Intent classification**: Understanding what the user is actually trying to accomplish
- **External data fetching**: Retrieving the profile information and preferences needed for effective search
- **Named entity recognition**: Tagging strict taxonomy elements for filtering

The system leverages a tool-calling pattern when querying the fine-tuned LLM, enabling it to resolve complex queries. For example, when a user searches for "jobs in New York Metro Area where I have a connection already," the query engine resolves "New York Metro Area" to a geographic ID and invokes LinkedIn's graph service to look up the company IDs where the user has connections. These IDs become strict filters, while non-strict criteria are captured in the query embedding.

The query engine also implements RAG (retrieval-augmented generation) to produce personalized search suggestions. Attributes mined from millions of job postings are stored in a vector database and passed to the query engine LLM to help members explore potential facets (for broad queries) or refine results (for precise filtering).

## Scaling and Latency Optimization

The team implemented multiple techniques to handle high query volumes while maintaining acceptable latency:

- **Differentiated caching**: Non-personalized queries are cached separately from personalized queries, which depend on individual profiles and network contexts
- **KV caching in LLM serving**: Significantly reduces computation overhead by eliminating duplicate work across requests
- **Token optimization**: Response schemas were converted from verbose XML/JSON to minimized equivalents
- **Model size reduction**: Achieved through distillation and fine-tuning

## GPU-First Retrieval Infrastructure

Perhaps the most technically interesting aspect of this case study is the approach to embedding-based retrieval. After years of using approximate nearest neighbor search, the team found it still struggled to meet their requirements: low latency, high index turnover (jobs are often live for only a few weeks), maximized liquidity, and support for complex hard filters.

Their counterintuitive solution was to return to exhaustive nearest neighbor search, a flat list scan with O(n) complexity. By exploiting GPU characteristics (no pointer chasing, no constant data transfer between GPU and CPU), investing in fused kernels, and lightly sub-partitioning the data layout, they achieved better latency than more complex index structures while retaining the simplicity of managing flat vectors. The key insight is that an O(n) approach can beat an O(log n) one when the constant factors differ sufficiently, as illustrated by the sketch below.
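The post describes fused GPU kernels rather than sharing code, but the core of a flat exhaustive scan is simple enough to sketch. The PyTorch example below is a minimal illustration, assuming cosine similarity, a 256-dimensional embedding space, and a boolean mask standing in for the strict filters; it is not LinkedIn's implementation:

```python
# Minimal sketch of exhaustive (flat) nearest-neighbor retrieval on a GPU.
# One dense matmul scores every live job against the query in a single pass,
# strict filters are applied as a mask, and top-k selects the candidates.
import torch
import torch.nn.functional as F


def exhaustive_topk(
    query_emb: torch.Tensor,    # (d,) query embedding from the query engine
    job_embs: torch.Tensor,     # (n, d) embeddings of all live jobs, resident on GPU
    hard_filter: torch.Tensor,  # (n,) bool mask from strict filters (geo, company IDs)
    k: int = 100,
):
    """O(n) scan: no index structure, no pointer chasing, one matmul plus top-k."""
    q = F.normalize(query_emb, dim=0)
    jobs = F.normalize(job_embs, dim=1)
    scores = jobs @ q                                          # (n,) cosine similarities
    scores = scores.masked_fill(~hard_filter, float("-inf"))   # drop filtered-out jobs
    top_scores, top_idx = torch.topk(scores, k)
    return top_idx, top_scores


# Usage sketch with random data; swap in real embeddings and filters in practice.
device = "cuda" if torch.cuda.is_available() else "cpu"
job_embs = torch.randn(200_000, 256, device=device)           # millions in production
query = torch.randn(256, device=device)
mask = torch.rand(200_000, device=device) > 0.2               # jobs passing strict filters
idx, scores = exhaustive_topk(query, job_embs, mask, k=100)
```

Because the whole scan is a single dense operation resident on the GPU, there is no pointer chasing and no host-device transfer in the hot path, which is the property the team exploited; their production version adds fused kernels and light sub-partitioning on top of this idea.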
## Retrieval Model Training

The retrieval model fine-tuning approach incorporates several techniques:

- **RL training loop**: During training, the model retrieves the top-K jobs per query, which are scored in real time by the teacher model acting as a reward model. This trains the model directly on what good retrieval looks like, rather than relying solely on static labeled data.
- **Composite loss function**: The system jointly optimizes pairwise (contrastive) accuracy, list-level ranking (ListNet), a KL divergence loss (to prevent catastrophic forgetting), and score regularization for well-calibrated outputs that separate Good, Fair, and Poor matches.
- **Infrastructure**: Training uses Fully Sharded Data Parallel (FSDP), BF16 precision, and cosine learning rate scheduling for throughput and stability.
- **Automated evaluation**: As models iterate, automated evaluations measure aggregate and per-query-type changes in relevance metrics.

## Cross-Encoder Ranking

For the ranking stage, the retrieval model alone does not meet the relevance bar. LinkedIn previously used a Deep and Cross Network architecture for engagement and relevance estimation, but it could not learn from the teacher model and lacked sufficient capacity. The solution is a "small" language model (SLM) that learns from the teacher while staying close to its performance. The model takes job text and query text as input and outputs a relevance score.

The training approach uses supervised distillation, which provides not just labels from the teacher but all of its logits during training, giving the student richer information to learn from (a minimal sketch of such an objective appears at the end of this case study). Combined with model pruning, intelligent KV caching, and sparse attention, the system meets relevance accuracy thresholds while reducing reliance on dozens of feature pipelines.

## Lessons for LLMOps Practitioners

This case study illustrates several important LLMOps patterns:

- **Multi-stage distillation for cost efficiency**: Using expensive teacher models offline to train smaller, faster student models for production
- **Synthetic data generation at scale**: Using LLMs to generate and label training data, with human annotation needed only for the initial policy definition
- **Hybrid retrieval strategies**: Combining semantic embeddings with strict taxonomy filters and external data lookups
- **GPU-optimized inference**: Choosing algorithms that match hardware characteristics rather than theoretical complexity
- **Aligned training objectives**: Ensuring retrieval and ranking models optimize for the same goals to avoid conflicts
- **Developer velocity as a metric**: Recognizing that system complexity affects iteration speed

The team notes that over 100 team members contributed across infrastructure, AI modeling, user experience, and data science, highlighting the cross-functional nature of production LLM systems at this scale.
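As a closing illustration of the supervised distillation pattern that recurs throughout this system, the sketch below shows a student cross-encoder learning from a teacher's full logit distribution in addition to hard grade labels. The temperature, loss weighting, and three-grade output scheme are assumptions for illustration, not details confirmed in the post:

```python
# Sketch of a supervised distillation objective: the student ranking model matches
# the teacher's softened output distribution (its logits) while also fitting the
# graded labels. Hyperparameters here are illustrative.
import torch
import torch.nn.functional as F


def distillation_loss(
    student_logits: torch.Tensor,  # (batch, num_grades) from the small cross-encoder
    teacher_logits: torch.Tensor,  # (batch, num_grades) from the frozen teacher model
    hard_labels: torch.Tensor,     # (batch,) grade indices, e.g. Poor/Fair/Good = 0/1/2
    temperature: float = 2.0,
    alpha: float = 0.5,
) -> torch.Tensor:
    # Soft targets: KL divergence between temperature-smoothed teacher and student.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature**2

    # Hard targets: ordinary cross-entropy against the graded labels.
    ce = F.cross_entropy(student_logits, hard_labels)

    return alpha * kl + (1.0 - alpha) * ce


# Usage sketch with random tensors standing in for model outputs.
student = torch.randn(8, 3)
teacher = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
loss = distillation_loss(student, teacher, labels)
```

The soft targets carry the teacher's relative confidence across grades, so each training example gives the student more signal than a single hard label would.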
