**Company**: H2O.ai
**Title**: Optimizing Cloud Storage Infrastructure for Enterprise AI Platform Operations
**Industry**: Tech
**Year**: 2025
**Summary (short)**:
H2O.ai, an enterprise AI platform provider delivering both generative and predictive AI solutions, faced significant challenges with their AWS EBS storage infrastructure that supports model training and AI workloads running on Kubernetes. The company was managing over 2 petabytes of storage with poor utilization rates (around 25%), leading to substantial cloud costs and limited ability to scale efficiently. They implemented Datafi, an autonomous storage management solution that dynamically scales EBS volumes up and down based on actual usage without downtime. The solution integrated seamlessly with their existing Kubernetes, Terraform, and GitOps workflows, ultimately improving storage utilization to 80% and reducing their storage footprint from 2 petabytes to less than 1 petabyte while simultaneously improving performance for customers.
## Overview

This case study presents H2O.ai's infrastructure optimization journey, focusing on solving critical storage challenges that directly impact their ability to operate LLMs and AI models in production environments. H2O.ai positions itself as a global leader in enterprise AI, delivering both generative AI (including LLMs) and predictive AI capabilities through a comprehensive platform that can be deployed on-premises, in cloud environments, and in air-gapped scenarios. The company is noted as a leader on the GAIA benchmark, which suggests they're operating at the cutting edge of AI model performance.

While this case study is primarily framed as an infrastructure optimization story featuring Datafi as the solution provider, it offers valuable insights into the operational challenges that companies face when running AI platforms at scale, particularly at the intersection of storage infrastructure and ML/LLM operations. The presentation format suggests this is a joint case study or webinar between H2O.ai (represented by Ophira) and Datafi (represented by Devan/Ivan), which means the claims should be evaluated with appropriate skepticism given the commercial nature of the presentation.

## The Infrastructure Context for LLM Operations

H2O.ai's platform architecture provides important context for understanding their LLMOps challenges. Their entire technology stack runs on Kubernetes, specifically Amazon EKS (Elastic Kubernetes Service) in cloud deployments. This Kubernetes-native approach is increasingly common for organizations deploying LLMs in production, as it provides the orchestration, scaling, and resource management capabilities needed for complex AI workloads.

The heavy reliance on AWS EBS (Elastic Block Store) storage is particularly significant for their AI operations. As explained in the presentation, fast and reliable storage is critical when training models. This is especially true for large language models, which require substantial I/O throughput during training when processing massive datasets, managing checkpoints, and handling model weights. The choice of EBS over other storage solutions suggests they need the performance characteristics and persistence guarantees that block storage provides, which is typical for stateful AI workloads.

The company also uses Bottlerocket (rendered as "Bottleneck Rocket" in the transcript), AWS's minimal operating system designed for running containers, which indicates a focus on security and efficiency at the operating system level. This is complemented by their use of infrastructure-as-code tools, specifically Terraform for deployment and GitOps practices for managing the infrastructure lifecycle.

## The Core Problem: Storage Inefficiency at Scale

The primary challenge H2O.ai faced was storage overprovisioning and inefficiency, which is particularly problematic in LLMOps contexts. They were managing over 2 petabytes of EBS storage at a utilization rate of only 25%, meaning they were effectively paying for 4x more storage capacity than they were actually using. This overprovisioning pattern is common in ML/AI operations for several reasons that the case study alludes to.

First, model training workloads are often unpredictable in their storage needs. Training runs generate numerous artifacts, including model checkpoints, logs, intermediate representations, and experimental data. Teams tend to overprovision to ensure they don't run out of space mid-training, because running out of storage during a multi-day training run could be catastrophic and expensive.
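To ground the Kubernetes-level mechanics at play here, the sketch below shows how an EBS-backed persistent volume is typically requested, and later expanded, on an EKS cluster using the official Kubernetes Python client; the namespace, claim name, storage class, and sizes are illustrative assumptions, not H2O.ai's actual configuration.

```python
from kubernetes import client, config

# Assumes kubeconfig access to an EKS cluster running the EBS CSI driver, with a
# "gp3" StorageClass that has allowVolumeExpansion enabled (hypothetical names).
config.load_kube_config()
core = client.CoreV1Api()

# Request an EBS-backed volume for a training workload. Teams routinely size this
# for the worst case, since a full disk mid-training is costly.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="llm-train-scratch"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="gp3",
        resources=client.V1ResourceRequirements(requests={"storage": "2Ti"}),
    ),
)
core.create_namespaced_persistent_volume_claim(namespace="ml-training", body=pvc)

# Growing the claim later is a one-line patch, applied online by the CSI driver...
core.patch_namespaced_persistent_volume_claim(
    name="llm-train-scratch",
    namespace="ml-training",
    body={"spec": {"resources": {"requests": {"storage": "3Ti"}}}},
)
# ...but neither Kubernetes nor EBS supports shrinking a volume in place.
```

Capacity can therefore be added while workloads run, but space freed after a training run cannot be handed back without migrating data to a new, smaller volume.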
Second, the fear of not being able to scale storage quickly enough leads to defensive overprovisioning. AWS EBS volumes have traditionally required manual intervention to resize and, in some cases, downtime or data migration to new volumes. This creates a strong incentive to provision more capacity than needed upfront.

The growth trajectory was also concerning: the storage footprint was "growing very quickly," which suggests that without intervention, costs would continue to escalate unsustainably. For an enterprise AI platform that hosts customer data and models, this represents both a significant cost challenge and an operational complexity issue. The inability to scale down efficiently meant that even when storage was freed up (perhaps after completing training runs, cleaning up old experiments, or archiving models), H2O.ai couldn't reclaim that capacity and reduce costs. This one-way scaling problem is particularly acute in ML operations, where storage needs fluctuate significantly based on project lifecycles, training schedules, and data retention policies.

## The Solution Architecture

Datafi's autonomous storage solution addresses these challenges through several technical mechanisms. The system consists of a low-level agent that runs on EC2 instances or within Kubernetes clusters, coupled with a SaaS control plane that runs in Datafi's own VPC. The agent dynamically changes EBS volume capacity without impacting running applications, a significant technical achievement given that EBS volumes are typically treated as relatively static resources.

The architecture is designed to integrate seamlessly with existing infrastructure-as-code practices. The solution supports Terraform, CloudFormation, and native Kubernetes integrations, which means it can fit into established CI/CD pipelines and GitOps workflows without requiring fundamental changes to how teams deploy and manage infrastructure. This is particularly important for LLMOps environments, where deployment pipelines are often complex and sensitive, involving model versioning, experiment tracking, and careful orchestration of training and inference workloads.

A critical aspect of the solution is its ability to operate without downtime. The agents can be deployed in read-only mode initially, then activated with what the presenters describe as "flipping a switch." Once active, the system automatically grows volumes when they fill up and shrinks them when files are deleted. This automatic scaling addresses both sides of the storage lifecycle problem that H2O.ai was experiencing.

## Implementation Challenges and Security Considerations

The case study candidly discusses several challenges encountered during implementation, which provides valuable insight into the real-world complexities of production AI infrastructure.

**Bottlerocket Compatibility**: H2O.ai needed to ensure that Datafi's agent could run on their existing Bottlerocket infrastructure. Bottlerocket is a minimal, security-focused OS, so ensuring compatibility required collaborative work between the two companies. This highlights a common challenge in production environments: solutions that work in standard Linux environments may require adaptation for specialized or hardened operating systems.

**Security Requirements**: As an AI platform hosting customer data, H2O.ai had stringent security requirements. The key concern was ensuring that no customer data would leave the EKS clusters or be exposed to Datafi's control plane.
The solution architecture addresses this concern by having the SaaS control plane handle only management commands and metadata, while actual data remains entirely within H2O.ai's infrastructure. This separation of control plane and data plane is a best practice in cloud architectures, and it is particularly important when dealing with sensitive AI training data and proprietary models. For LLMOps specifically, this security model is crucial because training data often includes sensitive information, and model weights themselves may be considered intellectual property. The ability to use a third-party optimization solution without exposing these assets is essential for enterprise adoption.

**Backup Integration**: H2O.ai uses Veeam (referred to as "Vallevo" in the transcript due to transcription errors) for backing up persistent volumes in Kubernetes. Ensuring that the Datafi solution wouldn't break existing backup and restore processes was critical. This speaks to the broader challenge of integrating new infrastructure components into production systems that have existing, proven operational procedures. For AI workloads, reliable backups are essential for model reproducibility, disaster recovery, and compliance requirements.

## Results and Impact on LLMOps

The quantitative results are substantial. H2O.ai improved storage utilization from 25% to approximately 80%, which Datafi presents as its target "success criteria" for leaving appropriate buffer space. This improvement translated into reducing the actual storage footprint from 2 petabytes to less than 1 petabyte, even as the amount of data actually stored grew slightly during the observation period (from approximately 0.5 petabytes at the start to somewhat more by the end). From an LLMOps perspective, several impacts are noteworthy.

**Cost Optimization**: Going from paying for roughly 4x the capacity actually in use to an approximately 80% utilization rate translates directly into significant cost savings. For organizations running extensive model training pipelines, storage costs can represent a substantial portion of total infrastructure spend, particularly when considering the massive datasets used for training modern LLMs. This cost reduction could free up budget for additional compute resources, more experiments, or expanded model development efforts.

**Performance Improvements**: Interestingly, H2O.ai reports that the solution actually improved performance for their customers. While the case study doesn't elaborate on the mechanisms, this could be related to more efficient storage allocation, better I/O patterns, or the ability to use appropriately sized volumes that are better tuned for their workloads. In AI operations, storage performance can be a bottleneck, particularly during the data-loading phases of training or when serving models that require frequent access to large parameter files.

**Operational Simplicity**: The zero-downtime deployment and autonomous operation mean that the platform team doesn't need to manually manage storage scaling decisions. This reduces operational toil and allows engineering teams to focus on higher-value activities like model development, feature engineering, and platform enhancement. In LLMOps, where teams are already managing complex orchestration of training jobs, experiment tracking, model versioning, and deployment pipelines, reducing infrastructure management overhead is valuable.
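To make the autonomous behavior more concrete, the sketch below shows the kind of threshold-based grow/shrink policy such a system implies; it is not Datafi's actual algorithm, the names are hypothetical, and the watermark values are assumptions anchored only to the roughly 80% utilization target cited above. The hard part in practice is not this arithmetic but applying the resize online without disturbing running workloads.

```python
from dataclasses import dataclass


@dataclass
class VolumeState:
    provisioned_gib: int  # current size of the EBS volume
    used_gib: int         # space actually consumed on the filesystem


def desired_size_gib(state: VolumeState,
                     target_util: float = 0.80,     # utilization target from the case study
                     high_watermark: float = 0.90,  # assumed grow trigger
                     low_watermark: float = 0.50,   # assumed shrink trigger
                     min_gib: int = 100) -> int:
    """Size this policy would converge toward: grow when nearly full, shrink when
    utilization falls well below target, always keeping (1 - target_util) headroom."""
    util = state.used_gib / state.provisioned_gib
    if util >= high_watermark or util <= low_watermark:
        return max(min_gib, int(state.used_gib / target_util))
    return state.provisioned_gib  # within the acceptable band, leave the volume alone


# Demo with numbers shaped like the case study: a volume only 25% utilized would be
# resized so the same data sits at roughly 80% utilization.
before = VolumeState(provisioned_gib=4096, used_gib=1024)
print(desired_size_gib(before))  # 1280 GiB, i.e. roughly a 3.2x smaller footprint
```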
**Seamless Integration**: The fact that the solution integrated with their existing Terraform and GitOps workflows without requiring "substantial changes" to their infrastructure is significant. This suggests that the Datafi agents and controllers are designed with Kubernetes-native patterns and infrastructure-as-code best practices in mind. For production AI platforms, the ability to adopt new capabilities without disrupting existing workflows and tooling is often a determining factor in whether solutions can be successfully implemented.

## Critical Evaluation and Considerations

While the case study presents impressive results, several considerations merit attention.

**Vendor Presentation Context**: This is clearly a joint presentation between H2O.ai and Datafi, likely for marketing purposes. While there's no reason to doubt the fundamental accuracy of the reported metrics, the framing is uniformly positive, without discussion of limitations, edge cases, or situations where the solution might not be appropriate. The claims of "zero downtime" and automatic operation should be understood as representing the best-case scenario rather than guarantees for all possible deployment contexts.

**Specificity of Results**: The case study shows aggregate results across H2O.ai's infrastructure but doesn't break down performance by workload type. It would be valuable to know whether the benefits are uniform across training, inference, data preprocessing, and other workload categories. Different AI workload patterns may have different storage utilization characteristics, and what works well for one may be less optimal for another.

**Time Frame**: The case study doesn't specify the time period over which these results were achieved. The chart shown indicates a progression over time, but without timescales it's difficult to assess how quickly benefits accrued and whether the system has been operating stably at the improved utilization levels for an extended period.

**Complexity of Implementation**: While the presenters describe the challenges overcome, the case study makes the implementation sound relatively straightforward ("flip a switch"). In reality, validating a new storage management system for production AI workloads likely required extensive testing, careful rollout across different customer environments, and ongoing monitoring. The actual implementation complexity is probably understated.

**Cost-Benefit Analysis**: While the storage cost savings are clearly significant, the case study doesn't discuss the cost of the Datafi solution itself or provide a net ROI calculation. For organizations evaluating similar solutions, understanding the total cost of ownership and payback period would be important.

## Broader Implications for LLMOps

This case study illuminates several broader themes relevant to operating LLMs and AI models in production.

**Infrastructure Efficiency as a Priority**: As AI workloads scale, infrastructure efficiency becomes increasingly critical. The storage inefficiency H2O.ai experienced is likely replicated across many organizations running AI at scale. The case demonstrates that focusing on infrastructure optimization can yield substantial returns, both in cost reduction and in operational improvement.

**The Kubernetes Ecosystem for AI**: The case reinforces the trend of Kubernetes becoming the standard orchestration platform for production AI workloads.
The ability of solutions like Datafi to integrate natively with Kubernetes, EKS, and associated tooling (Terraform, GitOps workflows) reflects the maturation of the cloud-native AI ecosystem.

**Storage as a Critical Component**: While much LLMOps discussion focuses on compute resources (GPUs/TPUs), model architectures, and deployment strategies, this case highlights that storage infrastructure is equally critical. Poor storage design can impact training performance, inflate costs, and create operational headaches. Modern LLMOps platforms need to consider storage as a first-class concern alongside compute and networking.

**Automation and Autonomous Operations**: The autonomous nature of the Datafi solution aligns with broader trends in AIOps and self-managing infrastructure. As AI platforms grow more complex, manual management becomes untenable. Solutions that can automatically optimize resources, respond to changing conditions, and operate without constant intervention become increasingly valuable.

**Security and Data Residency**: The emphasis on ensuring that customer data never leaves H2O.ai's infrastructure reflects the critical importance of security and data governance in enterprise AI. Solutions that require data to move to external services for processing or optimization may be non-starters for many organizations, particularly those in regulated industries or handling sensitive data.

## Conclusion

This case study, while presented in a promotional context, offers a valuable look at the infrastructure challenges facing organizations operating AI platforms at scale. H2O.ai's experience with storage overprovisioning, and the subsequent optimization through autonomous storage management, demonstrates that LLMOps extends well beyond model training and deployment to encompass the entire infrastructure stack.

The successful integration of Datafi's solution into H2O.ai's Kubernetes-based platform, with substantial cost savings and operational improvements, suggests that focusing on infrastructure efficiency can yield significant benefits for organizations running production AI workloads. However, the commercial nature of the presentation means that organizations considering similar approaches should conduct thorough evaluations in their own environments, taking into account their specific workload patterns, security requirements, and operational constraints.

The case ultimately reinforces that effective LLMOps requires holistic thinking about the entire technology stack, from model architectures and training frameworks down to storage provisioning and resource management. As the field matures, we can expect to see increasing sophistication in infrastructure optimization specifically tailored to the unique characteristics of AI and LLM workloads.
