Exa.ai built a sophisticated GPU infrastructure combining a new cluster of 144 NVIDIA H200 GPUs with their existing cluster of 80 NVIDIA A100 GPUs to support their neural web search and retrieval models. They implemented a five-layer infrastructure stack using Pulumi, Ansible/Kubespray, NVIDIA operators, Alluxio for storage, and Flyte for orchestration, enabling efficient large-scale model training and inference while maintaining reproducibility and reliability.
Exa.ai presents a comprehensive case study of building and managing a large-scale GPU infrastructure for training neural web retrieval models. The company has made a significant investment in GPU computing capabilities, demonstrating their commitment to using neural approaches for web search and retrieval. This case study provides valuable insights into the challenges and solutions of operating LLMs and neural models at scale.
The setup reflects a sophisticated multi-layer approach to managing large-scale AI operations. At its core, the company operates two major GPU clusters:
* A legacy cluster with 80 NVIDIA A100-80GB GPUs
* A new cluster (Exacluster) with 144 NVIDIA H200-141GB GPUs
The combined infrastructure provides substantial aggregate capacity:
* Total of 224 GPUs across 28 nodes
* 26.4 TB of GPU RAM
* 46 TB of system RAM
* 350 TB of NVMe storage
* Theoretical 168 PFLOPs of FP16 compute
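As a rough sanity check on that last figure, assuming dense FP16 Tensor Core peak throughput of roughly 990 TFLOPS per H200 and 312 TFLOPS per A100: 144 × 0.99 PFLOPS + 80 × 0.312 PFLOPS ≈ 142.6 + 25.0 ≈ 167.5 PFLOPs, which lines up with the stated theoretical total.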
What makes this case study particularly interesting from an LLMOps perspective is the well-thought-out software stack that manages this hardware. The company has implemented a five-layer architecture that addresses key challenges in operating AI infrastructure at scale:
Infrastructure Management:
They use Pulumi for infrastructure as code, choosing Python over YAML for better maintainability and type safety. This allows them to manage both on-premises and cloud resources consistently, with the ability to roll back changes if needed. This approach demonstrates modern DevOps practices applied to AI infrastructure.
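To make the code-first approach concrete, here is a minimal, hypothetical Pulumi program (resource names and settings are illustrative, not Exa's actual code) showing how plain Python, rather than YAML, declares a cloud resource that could back the storage layer described below:

```python
"""Illustrative Pulumi sketch: declare the cloud side of a hybrid setup
(an S3 bucket for datasets/checkpoints) with type-checked resource args."""
import pulumi
import pulumi_aws as aws

config = pulumi.Config()
env = config.get("env") or "dev"  # per-stack setting, e.g. dev vs prod

# Durable object store that a cache layer can later sit in front of.
bucket = aws.s3.Bucket(
    f"training-data-{env}",
    versioning=aws.s3.BucketVersioningArgs(enabled=True),  # object-level rollback
)

# Stack outputs are recorded in Pulumi state, so other layers can look them up.
pulumi.export("bucket_name", bucket.id)
```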
Base Configuration:
Ansible and Kubespray handle the bare-metal configuration, automating the process of turning powered-on servers into Kubernetes nodes. This layer handles critical setup tasks including BIOS configuration, storage layout, OS installation, and Kubernetes deployment. The automation ensures consistency across the entire fleet.
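As a sketch of what this layer's entry point looks like in practice, the following hypothetical Python wrapper drives Kubespray's standard `cluster.yml` playbook against an inventory of bare-metal hosts (the inventory path is illustrative):

```python
"""Hypothetical wrapper around Kubespray's standard entry point. Kubespray
is itself a collection of Ansible playbooks; running cluster.yml against an
inventory turns powered-on hosts into Kubernetes nodes."""
import subprocess

def provision_cluster(inventory: str = "inventory/exacluster/hosts.yaml") -> None:
    # Standard Kubespray invocation: cluster.yml drives OS preparation,
    # container runtime install, and Kubernetes bootstrap on every host.
    subprocess.run(
        [
            "ansible-playbook",
            "-i", inventory,
            "--become", "--become-user=root",  # escalate for system-level setup
            "cluster.yml",
        ],
        check=True,  # fail loudly if any host fails to converge
    )

if __name__ == "__main__":
    provision_cluster()
```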
Hardware Integration:
NVIDIA GPU and Network Operators provide seamless integration of accelerators and networking hardware. This layer handles driver management, CUDA toolkit installation, and network configuration. The containerized approach ensures version compatibility and enables rolling updates without system-wide downtime.
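The GPU Operator is typically installed from NVIDIA's official Helm repository; a sketch of that install, expressed here through Pulumi's Kubernetes provider for consistency with the infrastructure-as-code layer (a plain `helm install` works equally well, and nothing here is Exa's actual configuration), might look like:

```python
"""Illustrative install of the NVIDIA GPU Operator via its Helm chart."""
import pulumi_kubernetes as k8s

gpu_operator = k8s.helm.v3.Release(
    "gpu-operator",
    chart="gpu-operator",
    namespace="gpu-operator",
    create_namespace=True,
    repository_opts=k8s.helm.v3.RepositoryOptsArgs(
        repo="https://helm.ngc.nvidia.com/nvidia",  # NVIDIA's official Helm repo
    ),
    # The operator then manages drivers, CUDA toolkit containers, and device
    # plugins as pods, so upgrades can roll node by node without downtime.
)
```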
Storage Management:
Alluxio creates a unified storage layer that combines local NVMe storage with S3 cloud storage. This hybrid approach provides high-throughput access to training data while maintaining data durability. The system intelligently caches hot data on local storage while using S3 as the source of truth, optimizing both performance and cost.
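From the training code's point of view, the unified layer can look like an ordinary filesystem. The sketch below assumes Alluxio's FUSE integration is mounted at a local path (the mount point and dataset names are hypothetical):

```python
"""Sketch of consuming the unified storage layer through a FUSE mount:
reads hit the local NVMe cache when data is hot and fall through to the
S3 source of truth otherwise."""
from pathlib import Path

ALLUXIO_MOUNT = Path("/mnt/alluxio-fuse")  # hypothetical FUSE mount point

def iter_shards(dataset: str):
    """Yield raw bytes of each shard; ordinary file I/O, no S3 SDK needed."""
    for shard in sorted((ALLUXIO_MOUNT / dataset).glob("*.tar")):
        yield shard.read_bytes()  # served from NVMe if cached, S3 if not

for blob in iter_shards("pretraining/webtext"):
    ...  # feed into the data loader
```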
Workload Orchestration:
Flyte serves as the scheduling and orchestration layer, handling complex workflows including multi-node training, cloud bursting, and development environments. Key features, illustrated in the sketch after this list, include:
* Code-first approach avoiding YAML configuration
* Support for distributed training with PyTorch DDP
* Automatic checkpointing and resume capabilities
* Dataset and artifact lineage tracking
* Priority queues and resource quotas
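A minimal, hypothetical sketch of this code-first style using flytekit and its kfpytorch plugin for distributed PyTorch (task names, resource sizes, and paths are illustrative, not Exa's actual pipeline):

```python
"""Hypothetical Flyte workflow: resource requests, retries, caching, and
distributed PyTorch configuration all live in Python instead of YAML."""
from flytekit import Resources, task, workflow
from flytekitplugins.kfpytorch import PyTorch

@task(
    task_config=PyTorch(num_workers=4),        # 4 worker pods for DDP
    requests=Resources(gpu="8", mem="400Gi"),  # 8 GPUs per pod
    retries=3,                                 # restart and resume from checkpoint
    cache=True,
    cache_version="v1",                        # versioned, lineage-aware caching
)
def pretrain(dataset: str, steps: int) -> str:
    # ... torch.distributed init and the DDP training loop would go here ...
    return "s3://checkpoints/run-001"          # illustrative artifact URI

@workflow
def pretraining_pipeline(dataset: str = "pretraining/webtext") -> str:
    return pretrain(dataset=dataset, steps=100_000)
```

Retries combined with checkpoint-aware training code provide the automatic resume behavior listed above, while cached, versioned task outputs are what makes dataset and artifact lineage trackable.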
The integration of these layers enables sophisticated workflows such as:
* Running large-scale pre-training jobs across multiple nodes
* Parallel hyperparameter optimization
* Seamless scaling to cloud resources when needed
* Complete cluster rebuild capability in under an hour
From an LLMOps perspective, this infrastructure demonstrates several best practices:
* Strong emphasis on reproducibility and version control
* Automated management of hardware and software dependencies
* Efficient resource utilization through sophisticated scheduling
* Hybrid storage approach balancing performance and cost
* Support for both batch training and interactive development
The system's architecture shows careful consideration of failure modes and operational requirements. The use of Kubernetes provides container orchestration and scheduling, while Flyte adds ML-specific workflow management. The storage layer with Alluxio demonstrates understanding of the unique I/O patterns in ML workloads.
Areas for consideration:
* The infrastructure represents a significant capital investment ($5 million for the new cluster alone)
* Operating costs (100 KW power consumption) need to be factored into total cost of ownership
* The complexity of the stack requires significant expertise to maintain
* While cloud bursting is supported, the primary focus is on on-premises infrastructure
This case study provides valuable insights for organizations building large-scale AI infrastructure, particularly those focusing on neural search and retrieval models. The layered architecture and choice of tools demonstrate a mature approach to LLMOps, balancing performance, reliability, and maintainability.