
Kubernetes as a Platform for LLM Operations: Practical Experiences and Trade-offs

Various 2023

A panel discussion between experienced Kubernetes and ML practitioners exploring the challenges and opportunities of running LLMs on Kubernetes. The discussion covers key aspects including GPU management, cost optimization, training vs inference workloads, and architectural considerations. The panelists share insights from real-world implementations while highlighting both benefits (like workload orchestration and vendor agnosticism) and challenges (such as container sizes and startup times) of using Kubernetes for LLM operations.


Overview

This case study captures insights from a panel discussion featuring multiple industry experts discussing the intersection of Kubernetes and large language model operations. The panelists include Manjot (investor at Lightspeed with background as a product manager on the Kubernetes team at Google), Rahul (founder of AI Hero focused on LLMOps deployment), Patrick (tech lead at a startup focused on LLM planning with previous experience at Heptio/VMware), and Shri (engineer at Outerbounds building managed ML platforms). The discussion provides a practitioner’s perspective on the real-world challenges of running LLMs in production on Kubernetes infrastructure.

Historical Context and Kubernetes Origins

The panel establishes important context about Kubernetes’ origins and design philosophy. Manjot explains that Kubernetes was inspired by Google’s internal Borg system and was primarily designed to make workloads scalable and reliable through container orchestration. Critically, the original design never considered machine learning workloads specifically—it was focused on traditional workloads like APIs and microservices. This architectural heritage has significant implications for how well Kubernetes handles LLM workloads today.

Despite this, the core principles of Kubernetes—providing an abstraction layer that hides hardware complexity while delivering reliability, scalability, and efficiency—remain valuable for ML workloads. The past few years have seen an explosion of products and libraries that help run machine learning workloads on Kubernetes, building on these foundational principles.

Enterprise Drivers for Kubernetes-Based LLM Deployment

Rahul highlights a key business driver: enterprises are increasingly looking to deploy and train models in-house because they don’t want data leaving their VPC. This privacy and data sovereignty requirement makes Kubernetes an attractive option because it provides a cloud-agnostic platform for training and deploying machine learning models. Organizations can avoid vendor lock-in while maintaining control over their data and infrastructure.

The panel notes that cloud service providers and startups have developed Kubernetes-native solutions for ML deployment, giving enterprises multiple options for building their LLM infrastructure while maintaining the flexibility to move between cloud providers.

Batch vs. Streaming Workloads

An interesting debate emerges around whether Kubernetes is well-suited for batch workloads, which have traditionally dominated machine learning. Shri notes that while Kubernetes offers support for batch jobs through the Kubernetes Job resource and community projects like Argo, batch workloads have not been first-class citizens in the Kubernetes ecosystem, which was always more focused on long-running services.
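For orientation, this is what the batch primitive Shri refers to looks like in practice. The manifest below is an illustrative sketch, not from the panel: the image name is hypothetical, and the GPU request assumes the NVIDIA device plugin is installed on the cluster.

```yaml
# Illustrative batch fine-tuning Job; image and names are hypothetical.
apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-llm
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: registry.example.com/llm-finetune:latest
          resources:
            limits:
              nvidia.com/gpu: 1  # requires the NVIDIA device plugin
```

A Job runs pods to completion and retries on failure, which fits one-shot training runs; for multi-step pipelines, projects like Argo Workflows layer DAG semantics on top of this primitive.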

Patrick offers a counter-perspective, suggesting that generative AI workloads are actually more streaming-oriented than batch-oriented. He notes a shift away from DAGs (directed acyclic graphs) toward chains and agent-based architectures. Furthermore, he argues that as LLMs grow, they can be thought of as microservices themselves—with different model partitions that need to communicate with each other, requiring pod scheduling and networking. This actually aligns well with Kubernetes’ strengths.

GPU Challenges and Cost Management

The panel extensively discusses GPU-related challenges, which are central to LLM operations. Key points include:

Cost as the Primary Constraint: Patrick emphasizes that GPU costs are significant, especially for startups. Running LLM models is expensive, and keeping GPU nodes running continuously adds up quickly. The startup time for GPU nodes is also substantial—spinning up a node and loading model weights can take several minutes, which affects both cost and user experience.
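The "adds up quickly" point is easy to quantify. A minimal sketch, assuming an on-demand hourly rate of $3/hour (actual A100-class pricing varies by cloud and region):

```python
# Back-of-the-envelope GPU cost math. The hourly rate is an assumption;
# on-demand GPU pricing varies widely by provider and region.
HOURLY_RATE = 3.00      # USD per GPU-hour, assumed
HOURS_PER_MONTH = 730   # average hours in a month

def monthly_cost(nodes: int, utilization: float = 1.0) -> float:
    """Cost of keeping `nodes` GPU nodes running at a given duty cycle."""
    return nodes * HOURLY_RATE * HOURS_PER_MONTH * utilization

always_on = monthly_cost(1)                  # one node, 24/7: 2190.0 USD
scaled = monthly_cost(1, utilization=0.25)   # same node at 25% duty: 547.5 USD
```

The gap between the two numbers is exactly why cold-start latency matters: scale-to-zero only pays off if users tolerate the minutes-long node spin-up and weight-loading delay.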

GPU Scarcity: Rahul shares experience from a recent hackathon where the team attempted to fine-tune an LLM using RLHF with the DeepSpeed library on Kubernetes. The original goal of implementing auto-scaling had to be abandoned because GPUs are simply too scarce—even if you want to scale, you can’t get the resources. The practical advice is to negotiate with cloud providers to reserve GPU nodes in advance.

Future Market Dynamics: Rahul speculates that as H100 GPUs become generally available and adoption increases, A100 GPUs will become more available and cheaper. This market self-correction may help with GPU availability, though it remains to be seen.

Spot Instances: Manjot mentions that spot instances can help reduce costs, but notes that the details of how Kubernetes components like the scheduler and auto-scaler work require significant optimization for ML workloads.
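Steering an interruptible workload onto spot capacity is mostly a matter of labels and tolerations. The sketch below assumes a GKE-style spot node pool; other providers use different label and taint keys, and the image name is hypothetical:

```yaml
# Pinning an interruptible pod to spot nodes (GKE-style label/taint keys).
apiVersion: v1
kind: Pod
metadata:
  name: batch-inference
spec:
  nodeSelector:
    cloud.google.com/gke-spot: "true"
  tolerations:
    - key: cloud.google.com/gke-spot
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: worker
      image: registry.example.com/llm-batch:latest
      resources:
        limits:
          nvidia.com/gpu: 1
```

The catch Manjot alludes to: spot nodes can be reclaimed mid-run, so this only suits workloads that checkpoint and resume cleanly.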

Kubernetes Component Limitations

The panel identifies several Kubernetes components that need optimization for LLM workloads:

Scheduler: The way Kubernetes decides to provision and schedule nodes and pods can be significantly optimized for ML workloads, particularly around data transfer between nodes and pods, and location-aware scheduling.

Auto-scaler: Manjot notes that auto-scaling has always been a way to “shoot yourself in the foot” even for traditional workloads, and it’s even harder to make it work well for ML and LLM workloads. Given GPU scarcity, auto-scaling becomes almost moot—the focus should be on building a repeatable platform for rapid deployment and iteration rather than over-optimizing auto-scaling.
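The location-aware scheduling gap mentioned above does have a crude lever in stock Kubernetes: pod affinity can co-locate a training pod with the pod holding its dataset cache, avoiding cross-node data transfer. A sketch, with illustrative label names:

```yaml
# Co-locate a trainer with its dataset cache on the same node.
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: dataset-cache
          topologyKey: kubernetes.io/hostname
  containers:
    - name: trainer
      image: registry.example.com/llm-finetune:latest
```

This is still a far cry from a data-aware scheduler, which is precisely the panel's point: the primitives exist, but the optimization burden falls on the platform team.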

Container Size and Deployment Challenges

Rahul provides concrete technical details about the challenges of containerizing LLMs:

Large Container Sizes: LLM containers can be tens of gigabytes in size. This creates cascading problems including inconsistent pull times (some nodes pulling in 15 seconds, others taking a minute and a half), potential pod failures due to disk space limitations, and increased costs from repeatedly pulling large container images.
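The pull-time variance Rahul describes falls out of simple arithmetic: pull time is roughly image size over effective bandwidth, and effective bandwidth per node varies with registry proximity, network contention, and disk speed. A simplified model (it ignores layer parallelism and decompression overhead):

```python
# Simplified image-pull model: size over effective bandwidth.
def pull_seconds(image_gb: float, bandwidth_gbps: float) -> float:
    """Seconds to pull an image at a given effective bandwidth (gigabits/s)."""
    return image_gb * 8 / bandwidth_gbps

fast = pull_seconds(30, 16)    # a 30 GB image at 16 Gbit/s: 15.0 s
slow = pull_seconds(30, 2.67)  # the same image on a congested node: ~90 s
```

The same 30 GB image thus plausibly pulls in 15 seconds on one node and a minute and a half on another, matching the inconsistency observed in practice.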

Model Weight Loading: The startup time to load model weights into GPU memory adds to the overall deployment latency.

Potential Solutions: Patrick mentions exploring “hot swap LoRAs or QLoRAs” as a potential optimization—instead of spinning up entire nodes, running base models and swapping in fine-tuned adapters for specific capabilities.
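The hot-swap idea can be illustrated with a toy sketch: keep one base model resident in GPU memory and switch small adapter deltas per request, rather than spinning up a node per fine-tuned variant. The class below is purely conceptual (weights are plain floats standing in for low-rank weight matrices); real implementations would use an adapter-aware serving stack:

```python
# Toy sketch of adapter hot-swapping: one resident base model, cheap
# per-request switches between registered adapter deltas.
class AdapterServer:
    def __init__(self, base_weights: dict):
        self.base = base_weights
        self.adapters = {}
        self.active = None

    def register(self, name: str, delta: dict) -> None:
        self.adapters[name] = delta

    def activate(self, name: str) -> None:
        self.active = name  # cheap: no node spin-up, no base-weight reload

    def effective_weight(self, key: str) -> float:
        delta = self.adapters.get(self.active, {})
        return self.base[key] + delta.get(key, 0.0)

server = AdapterServer({"w": 1.0})
server.register("summarize", {"w": 0.5})
server.activate("summarize")
server.effective_weight("w")  # 1.5
```

The economics follow from the earlier cost discussion: adapters are megabytes rather than tens of gigabytes, so switching capabilities takes milliseconds instead of the minutes needed to provision a node and load full weights.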

LLM Workload Types and Kubernetes Fit

The panel discusses the main categories of LLM workloads and how Kubernetes serves each:

Training from Scratch: Building foundational models requires petabytes of data and months of compute time. This is the most resource-intensive category.

Fine-tuning: Using techniques like PEFT and LoRA to adapt models to specific use cases. While you’re only training a small portion of parameters, you still need the entire model in memory.

Prompt Engineering: This is less infrastructure-intensive but still requires reliable serving infrastructure.

Inference/Serving: Manjot notes that Kubernetes is currently more battle-tested for training and fine-tuning than for inference. There’s still work needed to optimize the inference path.

Service-Oriented Architectures: Patrick and Rahul emphasize that Kubernetes excels at orchestrating multiple services—like a vector database, a web application, and an LLM—that need to work together. This composability is a key strength.
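The fine-tuning trade-off above, training few parameters while holding the whole model in memory, can be made concrete with rough numbers. The shape below is an assumption loosely modeled on a 7B-class transformer (32 layers, hidden size 4096, LoRA rank 8 on the four attention projections):

```python
# LoRA adds r * (d_in + d_out) trainable parameters per adapted matrix.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

layers, hidden, rank = 32, 4096, 8          # assumed 7B-class shape
trainable = layers * 4 * lora_params(hidden, hidden, rank)
total = 7_000_000_000
fraction = trainable / total
# trainable == 8,388,608 -- on the order of 0.1% of the model, yet all
# 7B base weights must still sit in GPU memory for the forward pass.
```

This is why PEFT slashes optimizer-state and gradient memory but does not shrink the GPU footprint of the base model itself, and why fine-tuning still demands serious hardware.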

Developer Experience Considerations

The panel addresses the tension between data scientist productivity and Kubernetes complexity:

Abstraction Layers: Rahul suggests that one-line commands that containerize and deploy code to Kubernetes can be a good starting point. However, he cautions that this isn’t ideal for building repeatable, production-grade systems.

Organizational Structure: Manjot notes that organizations typically separate deployment teams from data science teams. In some cases, models are written in Python, converted to other code by a separate team, and then deployed—a process that seems highly inefficient.

Hidden Complexity: The consensus is that the best experience for data scientists is not having to deal with Kubernetes directly. Platforms like Metaflow (mentioned by Shri as a project he’s involved with) can abstract away Kubernetes complexity while providing the benefits of container orchestration.

Hardware Abstraction Challenges

Manjot raises an interesting tension: the entire point of Kubernetes and containers is to abstract away hardware, but LLM workloads often require specific libraries and drivers that work with accelerators or specific hardware components. This creates a situation where the abstraction isn’t complete—you still need to think about the underlying hardware, which somewhat undermines the containerization philosophy.
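The leak shows up concretely in manifests: a pod that needs a particular accelerator must name it, tying the "hardware-agnostic" spec to specific silicon. The label key below is GKE-style and the image is hypothetical; EKS and AKS express the same idea with different keys:

```yaml
# The hardware leaks into the spec: this pod runs only where a specific
# accelerator (and its matching driver stack) is present.
apiVersion: v1
kind: Pod
metadata:
  name: a100-only
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-a100
  containers:
    - name: inference
      image: registry.example.com/llm-serve:latest
      resources:
        limits:
          nvidia.com/gpu: 1
```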

Emerging Architecture Patterns

Rahul describes service-oriented LLM architectures that combine multiple cooperating services, such as a vector database for retrieval, a web application front end, and one or more model-serving endpoints, sometimes calling out to externally hosted model APIs.

The challenge is that these architectures involve data leaving the VPC to reach external services, which conflicts with the privacy requirements that drove organizations to Kubernetes-based solutions in the first place. This requires building robust, self-contained platforms that can support the full MLOps lifecycle including human feedback integration.

Startup vs. Enterprise Considerations

Patrick emphasizes that organizational size matters significantly. Startups face compounding costs—both Kubernetes overhead and LLM costs are high, making the total cost substantial. For smaller organizations, using something like a Cloudflare Worker with the OpenAI API might be significantly more cost-effective. Larger enterprises with capital available can invest in proper Kubernetes-based platforms that provide long-term benefits in terms of control and flexibility.

Current State and Future Outlook

The panel acknowledges that the space is moving so rapidly that definitive best practices don’t yet exist. New tools and platforms emerge weekly—from Colab to Replicate to various open-source solutions. This creates both opportunity for new products and services to fill gaps, and uncertainty for organizations trying to make technology choices.

Key problems that need solving include GPU availability and cost, large container images and cold-start latency, and tuning the scheduler and auto-scaler for ML workloads.

The panelists agree that Kubernetes will likely remain part of the answer, but significant work is needed to optimize Kubernetes components and build appropriate abstraction layers for LLM workloads. The opportunity exists for new cloud offerings and platforms that can address these challenges while hiding the underlying complexity from practitioners.
