ZenML

Optimizing LLM Server Startup Times for Preemptable GPU Infrastructure

Replit 2023

Replit faced challenges with running LLM inference on expensive GPU infrastructure and implemented a solution using preemptable cloud GPUs to reduce costs by two-thirds. The key challenge was reducing server startup time from 18 minutes to under 2 minutes to handle preemption events, which they achieved through container optimization, GKE image streaming, and improved model loading processes.

Industry: Tech

Overview

This case study comes from a lightning talk by Bradley Halloran, an engineer at Replit (and notably, formerly employee number seven at YouTube). Replit is a web-based integrated development environment (IDE) that leverages LLMs extensively for features like code completion, code transformation, code explanation, and debugging assistance. The company has invested significantly in self-training and hosting their own models, with at least one model being open-sourced on Hugging Face.

The core challenge Replit faced was economic: serving large language models at low latency requires high-end GPUs like the NVIDIA A100 (with H100 testing mentioned), but these are expensive. On Google Cloud, an A100 costs approximately $3,000 per month at on-demand pricing, compared to about $1,000 per month for spot (preemptable) pricing. This represents a potential cost reduction of two-thirds—a significant savings at scale.

The Preemptable GPU Challenge

The fundamental tension in using preemptable instances for production LLM serving is that these instances come with significant reliability challenges: Google’s own documentation explicitly advises against running “highly available services” on spot nodes, and instances can be reclaimed with as little as 15 seconds of warning.

Despite these challenges, Replit pursued this approach and was able to maintain uptime while cutting costs significantly.
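The talk doesn’t show configuration, but for reference, opting a workload into GKE spot nodes typically takes a node selector plus a toleration for the spot taint. The sketch below assumes GKE’s standard spot-VM label and taint; the deployment name, image, and replica count are purely illustrative:

```yaml
# Sketch: pin an LLM-serving Deployment onto GKE spot nodes.
# Assumes GKE's standard spot label/taint; names and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-server
spec:
  replicas: 2
  selector:
    matchLabels: {app: llm-server}
  template:
    metadata:
      labels: {app: llm-server}
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"   # schedule onto spot nodes only
      tolerations:
      - key: cloud.google.com/gke-spot
        operator: Equal
        value: "true"
        effect: NoSchedule                  # tolerate the spot-node taint
      terminationGracePeriodSeconds: 15     # match the ~15s preemption notice
      containers:
      - name: server
        image: example.com/llm-server:latest  # placeholder image
```

Setting the grace period to match the preemption notice gives the server its best chance to drain in-flight requests before the node disappears.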

Strategy Overview

Replit addressed the preemptable instance challenges through three main strategies. The lightning talk focused primarily on the third: dramatically reducing server startup time.

The Startup Time Problem

When Bradley’s team analyzed their LLM serving infrastructure, they found that total server startup took approximately 18 minutes.

This 18-minute startup time was untenable for a system where nodes could disappear with only 15 seconds of warning. The goal was to dramatically reduce this time to enable rapid scaling and recovery.

Optimization 1: Container Size Reduction

The first optimization targeted the container images themselves. The team was able to shave approximately 10 gigabytes from the compressed container size through several techniques.

While these optimizations saved about 10GB of container size, the actual time savings were relatively modest—only about 1-2 minutes off the 18-minute total.
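The talk did not detail Replit’s exact slimming techniques, but the usual levers are multi-stage builds, slim base images, and keeping large artifacts (like model weights) out of the image entirely. A purely illustrative sketch:

```dockerfile
# Illustrative image-slimming sketch; dependencies and entrypoint are
# placeholders, not Replit's actual setup.

# Stage 1: build-time work stays out of the final image.
FROM python:3.11-slim AS build
RUN pip install --no-cache-dir --target=/opt/deps \
    torch transformers  # example dependencies only

# Stage 2: runtime image carries just the installed packages and code.
FROM python:3.11-slim
COPY --from=build /opt/deps /opt/deps
ENV PYTHONPATH=/opt/deps
COPY server.py /app/server.py   # hypothetical entrypoint
# Fetch model weights at startup rather than baking them into the image.
CMD ["python", "/app/server.py"]
```

Keeping weights out of the image is what makes the later model-loading optimizations possible: the image stays small and the weights come from fast storage instead.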

Optimization 2: GKE Image Streaming

The breakthrough for container startup time came from enabling GKE (Google Kubernetes Engine) Image Streaming. Google describes this feature as reducing “image pull time from minutes to seconds,” and that’s exactly what Replit experienced.

Image streaming works by streaming file contents in the background as they are read, rather than downloading the entire container image before starting. This approach is particularly effective when containers don’t need every file immediately at startup—which was the case for Replit’s LLM serving containers.

An additional benefit is that image streaming applies at the node level, so Kubernetes system containers also started booting faster, contributing to the overall startup time reduction.
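Enabling the feature is a single flag at cluster or node-pool creation time (names below are placeholders; image streaming requires images hosted in Artifact Registry):

```shell
# Sketch: create a node pool with GKE Image Streaming enabled.
gcloud container node-pools create gpu-pool \
    --cluster=llm-cluster \
    --image-type=COS_CONTAINERD \
    --enable-image-streaming
```

This is the kind of low-effort, high-leverage cloud feature the talk highlights: no application changes were needed to get pull times down from minutes to seconds.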

Optimization 3: Model Loading and Storage

The next major bottleneck was loading the actual model weights. For context, a 3-billion-parameter model stored at 32-bit precision (4 bytes per parameter) is approximately 12GB on disk. The team’s initial setup was fetching models from Google Cloud Storage (GCS) onto a remotely attached spinning disk.

The obvious first fix was to switch to locally attached NVMe SSDs—the fastest storage option available. Surprisingly, this change showed no improvement. With at least a gigabit network interface and far faster disks, they expected much better performance, but transfer speeds remained around 50 megabytes per second—well below even a single gigabit link’s ~125 MB/s ceiling, which pointed at the fetch tooling rather than the hardware.
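The scale of the bottleneck is easy to check with back-of-the-envelope arithmetic: at the observed 50 MB/s, a 12GB model takes about four minutes to transfer, which matches the model-loading time reported later in the talk.

```python
def transfer_time_seconds(size_gb: float, throughput_mb_per_s: float) -> float:
    """Time to move size_gb at a sustained throughput (1 GB = 1000 MB here)."""
    return size_gb * 1000 / throughput_mb_per_s

# Observed: ~50 MB/s regardless of disk type, so the fetch path (not the
# disk) was the bottleneck for a ~12 GB model.
slow = transfer_time_seconds(12, 50)    # 240 s, i.e. ~4 minutes
fast = transfer_time_seconds(12, 250)   # ~48 s after a 5x speedup
print(slow, fast)
```

The reported sub-30-second result suggests there was additional headroom beyond the 5x transfer speedup alone, but the order of magnitude lines up.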

After extensive investigation, they discovered the problem was in the container image they were using to run the gsutil tool (Google’s CLI for GCS, which provides an rsync-style sync command). Switching from an Alpine-based container image to a Debian slim-based image quintupled the transfer speed.

The root cause was a fascinating bug/feature: the gsutil code contained a comment explaining that multi-processing was disabled on Alpine because it would cause hangs. This was not documented anywhere except in the source code repository itself. The Alpine image was silently running in single-process mode, severely limiting download throughput.

With this fix, model loading time dropped from approximately 4 minutes to under 30 seconds.
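A minimal reproduction of the fix is just the base-image choice. gsutil’s `-m` flag requests parallel transfers, which silently fell back to single-process mode under Alpine. The image tags below assume Google’s published cloud-sdk images; the bucket and paths are placeholders:

```shell
# Slow: Alpine-based cloud-sdk image; gsutil disables multiprocessing on
# Alpine (noted only in a source-code comment), so -m has little effect.
docker run --rm -v /models:/models \
    gcr.io/google.com/cloudsdktool/cloud-sdk:alpine \
    gsutil -m rsync -r gs://example-bucket/model-3b /models/model-3b

# Fast: Debian slim-based image; -m runs parallel workers as intended.
docker run --rm -v /models:/models \
    gcr.io/google.com/cloudsdktool/cloud-sdk:slim \
    gsutil -m rsync -r gs://example-bucket/model-3b /models/model-3b
```

The commands are identical except for the image tag, which is what made the single-process fallback so hard to spot.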

Results

Through these combined optimizations, Replit reduced their LLM server startup time from 18 minutes to approximately 2 minutes (and sometimes well under that). This dramatic improvement enabled them to successfully run their LLM serving infrastructure on preemptable GPU instances, maintaining availability while cutting GPU costs by roughly two-thirds.

Technical Lessons and Observations

This case study offers several valuable lessons for LLMOps practitioners:

The first lesson is that infrastructure optimization for LLM serving often involves unglamorous but impactful work. Container size reduction, storage configuration, and tooling choices can have dramatic effects on operational efficiency.

The second lesson is the importance of understanding the entire stack. The gsutil multiprocessing bug was hidden deep in the source code and not documented. This kind of issue requires patience and willingness to dig into dependencies.

The third lesson is that cloud provider features like image streaming can provide substantial benefits with relatively low implementation effort. It’s worth staying current with cloud provider capabilities.

Finally, the case study demonstrates that running production workloads on preemptable instances is possible with the right engineering investment, despite cloud providers’ own warnings against it. The key is building systems that are resilient to frequent disruptions and can recover quickly.

Tools and Technologies Referenced

The talk mentions several specific tools and technologies: NVIDIA A100 and H100 GPUs, Google Kubernetes Engine (GKE) with image streaming, Google Cloud Storage and the gsutil CLI, locally attached NVMe SSDs, Alpine and Debian slim container base images, and Hugging Face for model hosting.

This case study provides a practical, real-world example of the infrastructure engineering required to operate LLMs cost-effectively at scale, with concrete numbers and specific technical solutions that other teams can learn from.
