Company
LinkedIn
Title
Accelerating LLM Inference with Speculative Decoding for AI Agent Applications
Industry
HR
Year
2025
Summary (short)
LinkedIn's Hiring Assistant, an AI agent for recruiters, faced significant latency challenges when generating long structured outputs (1,000+ tokens) from thousands of input tokens including job descriptions and candidate profiles. To address this, LinkedIn implemented n-gram speculative decoding within their vLLM serving stack, a technique that drafts multiple tokens ahead and verifies them in parallel without compromising output quality. This approach proved ideal for their use case due to the structured, repetitive nature of their outputs (rubric-style summaries with ratings and evidence) and high lexical overlap with prompts. The implementation resulted in nearly 4× higher throughput at the same QPS and SLA ceiling, along with a 66% reduction in P90 end-to-end latency, all while maintaining identical output quality as verified by their evaluation pipelines.
## Overview

LinkedIn's engineering blog post details their production implementation of speculative decoding to optimize the performance of Hiring Assistant, their first AI agent designed for recruiters. This case study provides valuable insights into practical LLM inference optimization for production systems with stringent latency requirements and high-volume workloads. The Hiring Assistant processes complex hiring workflows by analyzing job descriptions against candidate profiles, generating structured assessments that include match strength classifications, reasoning, and grounded evidence.

The production context is critical here: Hiring Assistant must deliver conversational responses in seconds while processing thousands of input tokens and generating 1,000+ tokens of structured analysis per request. This creates a challenging operational environment where both time-per-output-token (TPOT) and throughput (QPS) must be optimized simultaneously without sacrificing response quality, a common tension in production LLM systems.

## Technical Problem and Context

The blog clearly articulates the fundamental challenge with LLM inference, breaking it down into two distinct stages that behave very differently from a performance perspective. The prefill stage, which encodes the input sequence, is compute-heavy but benefits from parallelization and can achieve high token throughput. The generation stage, however, is the primary bottleneck: it is memory-bound and inherently slow because it produces tokens one at a time, with each token requiring a complete forward pass through the model.

For Hiring Assistant specifically, this sequential generation becomes particularly problematic when producing the long, structured outputs required for comprehensive candidate assessments. When you're generating 1,000+ tokens sequentially, even small per-token latencies accumulate into user-perceptible delays that can degrade the recruiter experience. The challenge is compounded by the need to serve this at scale to a growing number of customers globally.

## Solution Architecture: Speculative Decoding

LinkedIn chose to implement speculative decoding, specifically the n-gram variant, as their optimization strategy. The fundamental insight behind speculative decoding is elegant: instead of generating tokens one at a time, the system drafts multiple tokens ahead speculatively and then verifies them in parallel. Because verification (checking multiple tokens simultaneously) is significantly cheaper than sequential generation, the system achieves time savings proportional to the number of tokens accepted.

The critical feature of speculative decoding that makes it production-ready is its lossless guarantee. When speculative tokens are incorrect, the system gracefully falls back by discarding the incorrect draft tokens and resuming generation from the last verified position. The verification step uses the target model's probability distributions to accept or reject proposed tokens through an acceptance/rejection sampling mechanism inspired by the Metropolis-Hastings algorithm. This ensures the final output distribution is mathematically identical to what the base model would have produced: there is no approximation or quality degradation.
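To make the lossless guarantee concrete, here is a minimal sketch of the standard speculative-sampling accept/reject rule described above. It is not LinkedIn's or vLLM's code; the function name, array layout, and use of NumPy are illustrative assumptions.

```python
import numpy as np

def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Return the accepted prefix of draft_tokens plus one corrective or bonus token.

    draft_probs[i] and target_probs[i] are full-vocabulary distributions at drafted
    position i; target_probs carries one extra row for the bonus-token position.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p = target_probs[i][tok]   # target model's probability of the drafted token
        q = draft_probs[i][tok]    # drafter's probability of the drafted token
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)   # accepted: statistically identical to target sampling
        else:
            # Rejected: resample from the residual max(0, p - q), renormalized.
            # This correction step is what makes the overall procedure lossless.
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return accepted        # everything after the mismatch is discarded
    # All drafted tokens accepted: take one "bonus" token from the target model.
    accepted.append(int(rng.choice(len(target_probs[-1]), p=target_probs[-1])))
    return accepted
```

For n-gram drafting the draft distribution is a point mass on the proposed token (q = 1), so the rule reduces to accepting each drafted token with the target model's own probability for it.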
## N-gram vs. Draft-Model Approaches

The blog distinguishes between two main implementations of speculative decoding, and LinkedIn's choice between them reveals important considerations for production LLM systems:

- **N-gram speculative decoding** (also known as prompt lookup decoding) is a model-agnostic, purely statistical approach that identifies patterns in the existing input to predict upcoming tokens. It has minimal drafting cost and works particularly well when outputs contain rephrasings or structured text with predictable patterns.
- **Draft-model speculation** uses a smaller auxiliary model to propose tokens that the main model then verifies. While this can accelerate less repetitive text, it introduces significant operational complexity: you are serving two models, managing their orchestration, dealing with potential latency tail risks from the draft model, and incurring additional infrastructure costs.

For Hiring Assistant's specific workload characteristics, LinkedIn determined that n-gram speculation was the optimal choice, a decision driven by several factors inherent to their use case.

## Workload Characteristics Favoring N-gram Speculation

LinkedIn provides excellent detail on why their specific workload was ideal for n-gram speculation, offering valuable guidance for others evaluating this technique:

- **Structured output format**: Hiring Assistant produces rubric-style summaries with consistent formatting patterns, including ratings, evidence sections, and rationale statements. This structural predictability creates stable phrasing patterns that n-gram matchers can exploit effectively.
- **High lexical overlap**: The generated assessments frequently quote directly from input materials; skill names, job titles, tools, certifications, and locations appear verbatim in both the prompt and the output. This creates excellent opportunities for n-gram matching, leading to high acceptance rates.
- **Long prompts with recurring schema**: The system processes multi-turn conversations, pre-screening Q&A, and ATS-connected workflows that all follow predictable patterns. Longer n-gram lookups become increasingly effective with this type of input.
- **Consistency requirements**: For a recruiting tool, consistency and transparency are critical; recruiters need explanations they can trust and that comply with policy requirements. The lossless verification guarantee of speculative decoding aligns perfectly with these requirements, ensuring outputs remain identical to the base model while improving performance.

## Implementation Details in vLLM

LinkedIn deployed their solution using vLLM, a popular open-source library for efficient LLM serving, and they share the specific configuration parameters they tuned:

- **num_speculative_tokens** controls how many tokens are drafted in a single speculation attempt before verification. Increasing this value can dramatically improve speed when the drafted tokens are accepted, as more tokens are validated in parallel. However, there is a trade-off: if any token in the batch is incorrect, all tokens after the mismatch must be discarded, wasting compute on those incorrect predictions. This parameter requires careful tuning based on your specific workload's acceptance-rate patterns.
- **prompt_lookup_max** sets the maximum n-gram length that the system searches for in the prompt history. Higher values allow the system to capture longer repetitive sequences such as structured templates or boilerplate text, which can yield substantial gains when matches occur. The blog notes that because n-gram lookups are computationally lightweight, setting this parameter generously introduces minimal overhead. For Hiring Assistant's structured output patterns, this occasionally enabled acceleration of large text chunks.
- **prompt_lookup_min** defines the minimum match length required to trigger speculation. Higher values make the system more conservative, only speculating when it finds strong matches. This achieves higher acceptance rates with fewer wasted attempts, though it may miss opportunities for shorter but still beneficial matches.

The tuning of these parameters represents a key LLMOps consideration: balancing speculation aggressiveness against acceptance rates to maximize overall throughput improvement. This requires empirical testing with representative production workloads.
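The snippet below sketches how these knobs are typically exposed when constructing a vLLM engine with n-gram speculation. It assumes a recent vLLM release that accepts a `speculative_config` dictionary; the model name and parameter values are illustrative placeholders, not LinkedIn's production settings.

```python
# Illustrative n-gram speculative decoding setup in vLLM. Values are placeholders,
# not LinkedIn's production configuration; assumes a recent vLLM version that
# accepts the `speculative_config` dict on the LLM constructor.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any served model; placeholder
    speculative_config={
        "method": "ngram",               # prompt-lookup drafting, no draft model
        "num_speculative_tokens": 5,     # tokens drafted per speculation step
        "prompt_lookup_max": 4,          # longest n-gram searched for in the context
        "prompt_lookup_min": 2,          # shortest match allowed to trigger speculation
    },
)

params = SamplingParams(temperature=0.0, max_tokens=1024)
outputs = llm.generate(["<job description and candidate profile here>"], params)
print(outputs[0].outputs[0].text)
```

Because drafting is just a lookup over tokens already present in the context, enabling it does not require serving or orchestrating a second model.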
## Production Results and Impact

LinkedIn reports impressive quantitative improvements from their implementation, though readers should note these are presented as internal measurements without external validation:

- **4× higher throughput** at the same QPS and SLA ceiling, a substantial increase in serving capacity achieved without relaxing latency guarantees.
- **66% reduction in P90 end-to-end latency**, a significant improvement in tail latency, i.e., the cases where users would experience the longest waits.
- **Zero quality degradation**, as verified by their internal evaluation pipelines. This is the lossless guarantee that makes speculative decoding production-ready.

The blog emphasizes that these improvements came "without any quality degradation," verified through their evaluation pipelines. This distinction between measured performance improvements and maintained quality standards reflects mature LLMOps practices where both speed and correctness must be validated.

From an operational perspective, the blog highlights that n-gram speculation delivered "operational simplicity at scale" by avoiding the need for a second model. This eliminated additional latency tail risks, infrastructure costs, and orchestration complexity, all critical concerns for global deployment. The solution enabled Hiring Assistant to operate globally with low cost and minimal tuning effort.

## Broader Applicability and Limitations

The blog provides useful guidance on when n-gram speculation is appropriate, demonstrating thoughtful engineering judgment about the technique's boundaries:

- **Ideal workloads** include summarization, document question answering, code editing, and multi-turn conversations: tasks where outputs naturally repeat phrases or follow structured patterns. The technique particularly excels with long prompts and predictable structures because the likelihood of finding matching sequences (and achieving high acceptance rates) is significantly higher.
- **Limited applicability** exists for highly variable, creative, or unstructured text generation where patterns are less predictable. In those scenarios, the blog suggests draft-model approaches might be more suitable, though they come with the operational complexity discussed earlier.

This honest assessment of applicability boundaries is valuable for practitioners considering the technique. The blog doesn't oversell n-gram speculation as a universal solution but positions it correctly as highly effective for specific workload characteristics.
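To make the applicability argument concrete: prompt-lookup drafting only pays off when the text being generated reappears in the existing context. The sketch below shows the core matching idea, i.e., find the most recent earlier occurrence of the current suffix and propose the tokens that followed it. It is a simplified illustration of the general technique, not vLLM's or LinkedIn's implementation; the function name and parameters are illustrative.

```python
# Simplified prompt-lookup (n-gram) drafting: propose the continuation that followed
# an earlier occurrence of the current suffix in the context. Illustrative only;
# real implementations operate on token IDs and are heavily optimized.
def propose_draft(context, lookup_max=4, lookup_min=2, num_speculative_tokens=5):
    """context: list of tokens (prompt plus tokens generated so far).
    Returns a list of drafted tokens (empty if no sufficiently long match exists)."""
    for n in range(min(lookup_max, len(context) - 1), lookup_min - 1, -1):
        suffix = context[-n:]
        # Search backwards for an earlier occurrence of this n-gram.
        for start in range(len(context) - n - 1, -1, -1):
            if context[start:start + n] == suffix:
                return context[start + n:start + n + num_speculative_tokens]
    return []  # no match: fall back to ordinary one-token-at-a-time decoding

# Toy example: a skill list from the prompt reappears in the structured output.
tokens = "skills : python , sql , spark . candidate lists python ,".split()
print(propose_draft(tokens))  # -> ['sql', ',', 'spark', '.', 'candidate']
```

On structured, high-overlap outputs such drafts are frequently accepted in full; on free-form creative text they rarely match, which is exactly the boundary the blog describes.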
## Integration with Broader LLMOps Stack

The blog explicitly positions speculative decoding as one optimization within a comprehensive LLMOps approach. LinkedIn mentions combining it with "a plethora of other techniques such as ideal model choice, robust fine-tuning pipeline, agentic architecture, continuous batching and prefix caching" to deliver their complete production system.

This contextualization is important for understanding real-world LLMOps: no single technique solves all challenges. Production LLM systems require careful orchestration of multiple optimization layers:

- **Model selection and fine-tuning** to get the right base capabilities and adapt to domain-specific requirements
- **Agentic architecture** to structure complex multi-step reasoning workflows
- **Continuous batching** to maximize GPU utilization by dynamically batching requests
- **Prefix caching** to avoid recomputing shared prompt prefixes across requests
- **Inference acceleration** like speculative decoding to reduce per-token generation costs

The combination of these techniques, properly tuned for the specific workload, enables production systems that meet both performance and quality requirements at scale; a sketch of how several of the serving-layer pieces combine follows below.
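As a rough illustration of how these serving-layer options stack in a single engine configuration, the sketch below enables prefix caching alongside n-gram speculation in vLLM; continuous batching is vLLM's default scheduling behavior and needs no flag. The model name and values are assumptions for illustration, not LinkedIn's configuration.

```python
# Hypothetical stacking of serving-layer optimizations in vLLM: continuous batching
# is built into the scheduler, prefix caching reuses KV-cache blocks for shared
# prompt prefixes, and n-gram speculation accelerates token generation.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",   # placeholder for a fine-tuned model
    enable_prefix_caching=True,                  # reuse KV cache across shared prefixes
    speculative_config={"method": "ngram", "num_speculative_tokens": 5,
                        "prompt_lookup_max": 4, "prompt_lookup_min": 2},
)

# Requests sharing the same rubric/schema prefix benefit from prefix caching,
# while the structured output benefits from n-gram speculation.
shared_prefix = "You are a recruiting assistant. Rate the candidate against the rubric.\n"
prompts = [shared_prefix + "Candidate A profile ...", shared_prefix + "Candidate B profile ..."]
outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=512))
```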
## Critical Assessment

While the blog provides valuable technical detail, readers should maintain some critical perspective:

- **Performance claims are unverified externally**: The 4× throughput improvement and 66% latency reduction are impressive but represent LinkedIn's internal measurements. Different workloads, model architectures, hardware configurations, and baseline optimization levels could yield different results.
- **Quality verification methodology unclear**: While LinkedIn states that quality was verified through "internal evaluation pipelines," the blog doesn't detail what those evaluations entailed, what metrics were used, or how comprehensive the testing was. The mathematical guarantee of lossless decoding should ensure identical outputs, but the operational verification process matters for confidence.
- **Configuration details incomplete**: While three vLLM parameters are mentioned, the specific values LinkedIn used aren't disclosed, nor is the tuning methodology explained. This limits reproducibility and makes it harder for others to apply these learnings directly.
- **Workload-specific success**: The blog honestly acknowledges that n-gram speculation works well for their structured output use case but may not generalize to all LLM applications. The reported improvements are conditional on having workloads with similar characteristics.
- **Absence of cost analysis**: While "low cost" is mentioned, the blog doesn't provide detailed cost metrics or ROI analysis that would help others evaluate the business case for implementation.

## Production Readiness and Operational Considerations

The case study demonstrates several markers of mature LLMOps practices:

- **Observability**: LinkedIn clearly monitors detailed performance metrics including P90 latency, throughput (QPS), and time-per-output-token (TPOT), enabling data-driven optimization decisions.
- **Evaluation infrastructure**: The mention of "internal evaluation pipelines" suggests systematic quality validation processes, though details are limited.
- **Global deployment considerations**: The blog explicitly mentions enabling "global deployment" and operating "globally," indicating the solution was validated across distributed infrastructure.
- **SLA-driven optimization**: The focus on maintaining "strict latency budgets" and an "SLA ceiling" shows production-systems thinking, where performance improvements must occur within defined service-level boundaries.
- **Risk mitigation through lossless guarantees**: The emphasis on the lossless nature of speculative decoding reflects an understanding that quality cannot be sacrificed for performance in production AI systems, particularly in sensitive domains like hiring.

## Conclusion and Broader Implications

This case study illustrates a thoughtful, engineering-driven approach to optimizing production LLM systems. Rather than pursuing bleeding-edge research techniques, LinkedIn identified a well-understood method (speculative decoding) that was particularly well-matched to their workload characteristics and implemented it pragmatically using existing infrastructure (vLLM).

The "low-risk, high-reward" characterization in the blog's conclusion seems accurate for workloads with appropriate characteristics: the technique provides mathematical guarantees against quality degradation while offering substantial performance improvements, and it can be implemented using existing serving infrastructure without architectural changes.

For the broader LLMOps community, this case study reinforces several important principles: understand your workload characteristics deeply before selecting optimization techniques, prioritize solutions that maintain quality guarantees in production, consider operational complexity as a key decision factor, and combine multiple complementary techniques rather than relying on any single optimization. The success of n-gram speculation for Hiring Assistant doesn't mean it's universally applicable, but for structured output generation with repetitive patterns, a common use case in enterprise AI applications, it represents a proven, production-ready approach to improving LLM inference performance.
