Salesforce's AI Platform Model Serving team represents a compelling case study in large-scale LLM production operations, demonstrating how enterprises can optimize resource utilization and cost efficiency while maintaining performance standards. The team develops and manages the services that power large language models and other AI workloads across Salesforce's ecosystem; its primary focus is model onboarding and providing robust hosting infrastructure for the ML models behind Agentforce and other applications that require inference.
The fundamental challenge Salesforce faced reflects a common dilemma in production LLM operations: efficiently deploying models of varying sizes and traffic patterns while optimizing compute costs and maintaining performance. Their initial architecture relied on single model endpoints (SMEs) distributed across multiple Amazon EC2 P4d instances, with plans to migrate to P5en instances equipped with NVIDIA H200 Tensor Core GPUs. This setup created two distinct optimization problems that are characteristic of enterprise LLM deployments.
Their larger models, ranging from 20 to 30 GB, received relatively low traffic but were deployed on high-performance multi-GPU instances, resulting in significant underutilization of expensive compute resources. This scenario is particularly problematic in production LLM operations because large models often require substantial memory allocation but may not consistently utilize the full computational capacity of high-end GPUs. Meanwhile, their medium-sized models (approximately 15 GB) handling high-traffic workloads demanded low-latency, high-throughput processing, often leading to over-provisioning on similar multi-GPU setups to ensure performance guarantees.
The solution architecture centered on Amazon SageMaker AI inference components, a sophisticated approach to multi-tenant model serving in production environments. Inference components enable the deployment of multiple foundation models on a single SageMaker endpoint while providing granular control over accelerator allocation and memory reservation per model. This architectural pattern addresses several critical LLMOps concerns, including resource efficiency, cost optimization, and operational complexity management.
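In practice, such an endpoint is created much like any other SageMaker endpoint, except that the endpoint configuration reserves a shared instance fleet for components rather than binding a single model to it. The boto3 sketch below illustrates the pattern; the endpoint name, role ARN, instance type, and scaling bounds are placeholders rather than Salesforce's actual configuration.

```python
import boto3

sm = boto3.client("sagemaker")

# Endpoint config for an inference-component-capable endpoint: the variant
# defines the instance fleet that components will share, and managed instance
# scaling lets SageMaker add or remove instances as components are placed.
sm.create_endpoint_config(
    EndpointConfigName="llm-shared-endpoint-config",  # illustrative name
    ExecutionRoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "InstanceType": "ml.p4d.24xlarge",  # multi-GPU instance type
            "InitialInstanceCount": 1,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": 1,
                "MaxInstanceCount": 4,
            },
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ],
)

sm.create_endpoint(
    EndpointName="llm-shared-endpoint",
    EndpointConfigName="llm-shared-endpoint-config",
)
```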
The technical implementation involved abstracting ML models through inference components that enable precise assignment of CPUs, GPUs, and scaling policies per model. This abstraction layer provides several operational benefits that are crucial for production LLM deployments. SageMaker AI optimally places and packs models onto ML instances to maximize utilization, leading to significant cost savings. Each model can scale independently based on custom configurations, providing optimal resource allocation tailored to specific application requirements. The platform dynamically scales instances to maintain availability while minimizing idle compute resources.
A particularly innovative aspect of their implementation is the capability to scale down to zero copies of a model to free up resources for other models, while simultaneously allowing critical models to remain always loaded and ready to serve traffic for mission-critical workloads. This flexibility is essential in production LLM environments where usage patterns can vary dramatically across different models and time periods.
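A hedged sketch of what this looks like operationally: dropping a low-traffic component's copy count to zero frees its accelerators while the component definition remains registered on the endpoint, and a mission-critical component simply keeps a nonzero copy count. The component names and counts below are illustrative.

```python
import boto3

sm = boto3.client("sagemaker")

# Free up accelerators held by a low-traffic model by dropping its copy count
# to zero; the component stays registered on the endpoint and can be scaled
# back up when traffic returns.
sm.update_inference_component(
    InferenceComponentName="blockgen-ic",   # illustrative component name
    RuntimeConfig={"CopyCount": 0},
)

# A mission-critical model keeps at least one copy resident at all times.
sm.update_inference_component(
    InferenceComponentName="inline-completion-ic",
    RuntimeConfig={"CopyCount": 2},
)
```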
Salesforce's specific implementation focused on their proprietary CodeGen model ecosystem, which represents a sophisticated ensemble approach to code generation and understanding. CodeGen is Salesforce's open-source LLM designed for translating natural language into programming languages, specifically tuned for the Apex programming language. The ensemble includes specialized variants: Inline for automatic code completion, BlockGen for code block generation, and FlowGPT for process flow generation. These models power ApexGuru, a solution within the Salesforce platform that helps developers identify and address critical anti-patterns and hotspots in their Apex code.
The deployment strategy employed a hybrid approach that leverages the strengths of both single model endpoints and inference components. SMEs continue to provide dedicated hosting for models requiring predictable performance and consistent traffic patterns, while inference components optimize resource utilization for variable workloads through dynamic scaling and efficient GPU sharing. This hybrid model demonstrates mature LLMOps thinking, recognizing that different models and use cases may require different deployment patterns within the same organization.
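From the client's perspective, the hybrid model changes little: a request to a shared endpoint simply names the inference component it wants, while a dedicated single model endpoint is invoked by endpoint name alone. The sketch below assumes illustrative endpoint and component names and a generic JSON payload whose shape depends on the serving container.

```python
import json
import boto3

smr = boto3.client("sagemaker-runtime")

# On a shared endpoint, the request is routed to a specific model by naming
# its inference component; a dedicated SME is invoked the same way, just
# without InferenceComponentName.
response = smr.invoke_endpoint(
    EndpointName="llm-shared-endpoint",
    InferenceComponentName="blockgen-ic",
    ContentType="application/json",
    Body=json.dumps({"inputs": "Write an Apex trigger that ..."}),
)
print(response["Body"].read().decode("utf-8"))
```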
The technical configuration process involves creating a SageMaker endpoint with the desired instance type and initial instance count to handle baseline inference requirements. Model packages are then attached dynamically, with individual containers spun up as needed. Each model, such as BlockGen or TextEval, is configured as an individual inference component with precise resource allocations, including accelerator count, memory requirements, model artifacts, container image specifications, and the number of model copies to deploy.
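The following sketch shows what attaching one such model as an inference component might look like with boto3; the container image URI, artifact location, and resource sizing are placeholders, not Salesforce's actual values.

```python
import boto3

sm = boto3.client("sagemaker")

# Register one model as an inference component on the shared endpoint, with
# explicit accelerator and memory reservations plus an initial copy count.
sm.create_inference_component(
    InferenceComponentName="blockgen-ic",
    EndpointName="llm-shared-endpoint",
    VariantName="AllTraffic",
    Specification={
        "Container": {
            "Image": "<account>.dkr.ecr.<region>.amazonaws.com/blockgen-serving:latest",
            "ArtifactUrl": "s3://<bucket>/models/blockgen/model.tar.gz",
        },
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,  # GPUs reserved per copy
            "NumberOfCpuCoresRequired": 8,
            "MinMemoryRequiredInMb": 16384,           # placeholder sizing
        },
    },
    RuntimeConfig={"CopyCount": 1},
)
```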
The auto scaling capabilities represent a sophisticated approach to handling variable traffic patterns in production LLM environments. Inference components can establish endpoints with multiple copies of models and automatically adjust GPU resources as traffic fluctuates. This allows each model to dynamically scale up or down within an endpoint based on configured GPU limits, enabling efficient handling of varying workloads without compromising performance.
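With Application Auto Scaling, each component registers its copy count as a scalable target and attaches its own policy, so scaling decisions are made per model rather than per endpoint. The resource ID, capacity bounds, and target value below are illustrative.

```python
import boto3

aas = boto3.client("application-autoscaling")

# Scale one inference component's copy count independently, driven by
# invocations per copy.
resource_id = "inference-component/blockgen-ic"

aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=1,   # keep at least one copy resident for this model
    MaxCapacity=4,
)

aas.put_scaling_policy(
    PolicyName="blockgen-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy"
        },
        "TargetValue": 10.0,      # illustrative invocations-per-copy target
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```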
The results achieved by Salesforce demonstrate the significant impact that thoughtful LLMOps architecture can have on both cost and performance metrics. The implementation achieved up to an eight-fold reduction in deployment and infrastructure costs while maintaining high performance standards. This dramatic cost reduction was accomplished through intelligent GPU resource management and dynamic scaling, eliminating idle compute resources that were previously necessary under the single model endpoint approach.
Enhanced performance for smaller models represents another significant benefit, as these models now utilize high-performance GPUs to meet latency and throughput requirements without incurring excessive costs. This optimization enables more efficient utilization of expensive hardware, allowing organizations to justify investments in high-end GPU instances such as P4d, P5, and P5en by maximizing the value extracted from every computing resource.
From an operational perspective, the solution significantly reduces the complexity of managing multiple endpoints while providing the flexibility to scale individual models based on their specific traffic patterns and performance requirements. This approach enables Salesforce to scale to hundreds of models while maintaining predictable performance characteristics and cost optimization.
Looking toward future developments, Salesforce is positioned to leverage Amazon SageMaker AI rolling updates capability for inference component endpoints, which will enable batch-by-batch model updates rather than traditional blue/green deployments. This advancement will provide greater flexibility and control over model updates while using minimal extra instances, representing a significant improvement over approaches that require doubled instances during deployment cycles.
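A sketch of how such a rolling update might be expressed, assuming the rolling update policy fields exposed for inference components; the image URI, sizing, and batch values are placeholders.

```python
import boto3

sm = boto3.client("sagemaker")

# Roll out a new container and artifact for one inference component batch by
# batch instead of a full blue/green fleet swap.
sm.update_inference_component(
    InferenceComponentName="blockgen-ic",
    Specification={
        "Container": {
            "Image": "<account>.dkr.ecr.<region>.amazonaws.com/blockgen-serving:v2",
            "ArtifactUrl": "s3://<bucket>/models/blockgen/v2/model.tar.gz",
        },
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "MinMemoryRequiredInMb": 16384,
        },
    },
    DeploymentConfig={
        "RollingUpdatePolicy": {
            # Update one copy at a time, waiting between batches.
            "MaximumBatchSize": {"Type": "COPY_COUNT", "Value": 1},
            "WaitIntervalInSeconds": 120,
        }
    },
)
```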
The case study illustrates several important principles for production LLM operations. Resource optimization through intelligent model packing and dynamic allocation is crucial for managing GPU utilization challenges across diverse model portfolios. The hybrid deployment approach recognizes that different models may require different hosting strategies within the same organization. Cost-performance optimization can be achieved without sacrificing reliability or performance through careful architectural design and implementation of appropriate scaling policies.
This implementation represents a mature approach to LLMOps that balances multiple competing priorities: cost optimization, performance maintenance, operational simplicity, and scalability. The solution demonstrates how enterprises can efficiently manage complex LLM portfolios in production environments while maintaining the flexibility to adapt to changing requirements and traffic patterns. The substantial cost savings achieved while maintaining performance standards provide a compelling example of how thoughtful LLMOps practices can deliver significant business value.