Company: Salesforce
Title: Streamlining Custom LLM Deployment with Serverless Infrastructure
Industry: Tech
Year: 2025
Summary (short):
Salesforce's AI platform team faced operational challenges deploying customized large language models (fine-tuned versions of Llama, Qwen, and Mistral) for their Agentforce agentic AI applications. The deployment process was time-consuming, requiring months of optimization for instance families, serving engines, and configurations, while also proving expensive due to GPU capacity reservations for peak usage. By adopting Amazon Bedrock Custom Model Import, Salesforce integrated a unified API for model deployment that minimized infrastructure management while maintaining backward compatibility with existing endpoints. The results included a 30% reduction in deployment time, up to 40% cost savings through pay-per-use pricing, and maintained scalability without sacrificing performance.
## Overview

Salesforce's AI platform team provides a compelling case study in modernizing LLM deployment infrastructure for production environments. The team operates customized large language models, specifically fine-tuned versions of open-source models including Llama, Qwen, and Mistral, to power their Agentforce agentic AI applications. Their migration from a traditional managed inference approach using Amazon SageMaker to a serverless model using Amazon Bedrock Custom Model Import demonstrates both the operational complexity of running LLMs at scale and the potential benefits of managed, serverless solutions.

The initial problem space was characterized by significant operational overhead. Teams were spending months optimizing infrastructure parameters, including instance family selection, serving engine choices (such as vLLM versus TensorRT-LLM), and configuration tuning. This optimization burden was compounded by the maintenance challenges associated with frequent model releases. The cost structure also proved problematic: reserving GPU capacity for peak usage meant that resources sat idle during lower-traffic periods, resulting in significant waste. These challenges are representative of broader industry pain points when operating LLMs in production at scale.

## Integration Architecture

The integration strategy Salesforce employed reveals thoughtful consideration of production constraints and backward compatibility requirements. Rather than undertaking a complete infrastructure replacement, the team implemented a hybrid approach that preserved existing investments while gaining serverless benefits. Their primary design goal centered on maintaining current API endpoints and model serving interfaces to achieve zero downtime and eliminate the need for changes to downstream applications.

The deployment flow integration added a single step to their existing CI/CD pipeline. After the continuous integration and continuous delivery process saves model artifacts to their model store (an Amazon S3 bucket), they now invoke the Amazon Bedrock Custom Model Import API to register the model (sketched below). This control plane operation is relatively lightweight, adding approximately 5-7 minutes to the deployment timeline depending on model size, while the overall model release process remains at approximately 1 hour. A key architectural advantage emerges here: Amazon Bedrock preloads the model, eliminating the container startup time that SageMaker previously required for downloading weights. The configuration changes were primarily permission-based, involving cross-account access grants for Amazon Bedrock to read their S3 model bucket and IAM policy updates for inference clients.

The inference flow architecture demonstrates pragmatic engineering tradeoffs. Client requests continue flowing through their established preprocessing layer, which handles business logic such as prompt formatting. To handle complex processing requirements while maintaining backward compatibility, Salesforce deployed lightweight SageMaker CPU containers that function as intelligent proxies: they run the team's custom model.py logic while forwarding the actual inference to Amazon Bedrock endpoints. This design preserves the existing tooling framework. The prediction service continues calling SageMaker endpoints without routing changes, and the team retains its mature SageMaker monitoring and logging infrastructure for preprocessing and postprocessing logic.
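As a concrete illustration of the registration step described above, the added pipeline step amounts to a single control-plane call plus a status poll. The following is a minimal sketch assuming boto3's Bedrock control-plane client; the bucket, role, job, and model names are hypothetical placeholders, not Salesforce's actual configuration.

```python
import time

import boto3

bedrock = boto3.client("bedrock")  # control-plane client

# Hypothetical values; in a CI/CD pipeline these would come from the build context.
MODEL_S3_URI = "s3://example-model-store/apexguru/model-artifacts/"
IMPORT_ROLE_ARN = "arn:aws:iam::123456789012:role/ExampleBedrockImportRole"

# Register the fine-tuned model artifacts with Bedrock Custom Model Import.
job = bedrock.create_model_import_job(
    jobName="apexguru-import-example",
    importedModelName="apexguru-finetune-example",
    roleArn=IMPORT_ROLE_ARN,
    modelDataSource={"s3DataSource": {"s3Uri": MODEL_S3_URI}},
)

# Poll the control-plane job until the model is registered.
while True:
    status = bedrock.get_model_import_job(jobIdentifier=job["jobArn"])
    if status["status"] in ("Completed", "Failed"):
        break
    time.sleep(30)

if status["status"] != "Completed":
    raise RuntimeError(f"Import failed: {status.get('failureMessage')}")

print("Imported model ARN:", status["importedModelArn"])
```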
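The proxy pattern itself can be sketched as a thin SageMaker inference handler that performs the custom pre- and post-processing locally and delegates generation to the imported Bedrock model. Handler conventions vary by container, and the request and payload schemas below are hypothetical; this is an illustration of the pattern, not Salesforce's model.py.

```python
import json
import os

import boto3

# Data-plane client the CPU proxy uses to forward inference to Bedrock.
bedrock_runtime = boto3.client("bedrock-runtime")

# Hypothetical: the imported model ARN, injected as container configuration.
IMPORTED_MODEL_ARN = os.environ["IMPORTED_MODEL_ARN"]


def model_fn(model_dir):
    """No weights are loaded in the proxy; it only needs the target model ARN."""
    return IMPORTED_MODEL_ARN


def predict_fn(request, model_arn):
    """Run custom pre/post-processing and forward the actual inference to Bedrock."""
    # Placeholder for business-specific prompt formatting.
    prompt = f"[INST] {request['inputs']} [/INST]"

    # Payload keys are illustrative; the real schema depends on the imported model.
    body = {"prompt": prompt, "max_tokens": request.get("max_tokens", 512)}

    response = bedrock_runtime.invoke_model(
        modelId=model_arn,
        body=json.dumps(body),
        contentType="application/json",
        accept="application/json",
    )
    result = json.loads(response["body"].read())

    # Placeholder post-processing before returning to the prediction service.
    return {"generated_text": result}
```

Because the SageMaker endpoint contract stays the same in this arrangement, the upstream prediction service and the existing monitoring and logging stack continue to work without modification.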
However, this hybrid approach involves clear tradeoffs. The additional network hop introduces 5-10 milliseconds of latency, and the always-on CPU instances incur ongoing costs. The team evaluated an alternative approach using Amazon API Gateway and AWS Lambda functions for pre- and post-processing, which would offer complete serverless scaling and pay-per-use pricing. They found this approach less backward-compatible with existing integrations and observed cold start impacts when using larger libraries in their processing logic, ultimately choosing the proxy container approach for production stability.

## Performance and Scalability Analysis

The scalability benchmarking conducted by Salesforce provides valuable insights into the production characteristics of Amazon Bedrock Custom Model Import. Their testing methodology focused on measuring how Amazon Bedrock's transparent auto-scaling behavior (automatically spinning up model copies on demand and scaling out under heavy load) would perform under various concurrency scenarios. Each test sent standardized payloads containing model IDs and input data through their proxy containers to Amazon Bedrock endpoints.

The benchmark results show how performance shifts across load patterns:

| Concurrent requests | P95 latency (s) | Throughput (requests/min) |
|---|---|---|
| 1 | 7.2 | 11 |
| 4 | 7.96 | 41 |
| 16 | 9.35 | 133 |
| 32 | 10.44 | 232 |

At low concurrency (a single concurrent request), the 7.2-second P95 latency was 44% lower than their ml.g6e.xlarge baseline running in bf16 precision. As concurrency increased, throughput scaled roughly linearly while P95 latency rose only modestly. During the highest-load scenario (32 concurrent requests), Amazon Bedrock automatically scaled from one to three model copies, with each copy consuming 1 model unit. The consistent throughput scaling with acceptable latency increases (P95 staying near 10 seconds even at the highest tested load) demonstrates the serverless architecture's ability to handle production workloads without manual intervention.

It's worth noting that these benchmarks were conducted on the ApexGuru model, a fine-tuned version of Qwen 2.5 13B. Different model architectures, sizes, and serving configurations would naturally produce different performance profiles, so these results should be interpreted as representative rather than universal.

## Operational and Cost Outcomes

The business impact metrics reported by Salesforce span two critical dimensions: operational efficiency and cost optimization. On the operational efficiency front, the team achieved a 30% reduction in the time to iterate and deploy models to production. This improvement stems from eliminating the complex infrastructure decision-making that previously consumed significant engineering time. Rather than evaluating instance types, tuning serving engine parameters, and choosing between competing serving frameworks, developers could focus on model performance and application logic.

The cost optimization results proved even more substantial, with Salesforce reporting up to 40% cost reduction. These savings derived primarily from their diverse traffic patterns across generative AI applications.
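The economics here come down to utilization. The following back-of-envelope sketch (all hourly rates and utilization fractions are hypothetical placeholders, not Salesforce or AWS pricing) illustrates why consumption-based pricing favors spiky or development-heavy traffic and why the advantage narrows for consistently busy endpoints.

```python
# Back-of-envelope comparison of reserved vs. pay-per-use serving cost.
# All numbers are hypothetical placeholders, not Salesforce or AWS pricing.

HOURS_PER_MONTH = 730


def reserved_cost(gpu_hourly_rate: float) -> float:
    """Reserved capacity is billed around the clock, regardless of traffic."""
    return gpu_hourly_rate * HOURS_PER_MONTH


def pay_per_use_cost(unit_hourly_rate: float, busy_fraction: float) -> float:
    """Consumption-based pricing only accrues while model copies are active."""
    return unit_hourly_rate * HOURS_PER_MONTH * busy_fraction


if __name__ == "__main__":
    gpu_rate = 8.00    # hypothetical $/hour for a reserved GPU instance
    unit_rate = 10.00  # hypothetical $/hour for an active serverless model unit
    for busy in (0.10, 0.40, 0.80):  # fraction of the month the model is busy
        print(
            f"busy={busy:.0%}  "
            f"reserved=${reserved_cost(gpu_rate):,.0f}  "
            f"pay-per-use=${pay_per_use_cost(unit_rate, busy):,.0f}"
        )
```

With placeholder rates like these, pay-per-use wins decisively at low utilization and converges toward reserved cost as utilization climbs, which is consistent with the caveats that follow.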
Previously, they had to reserve GPU capacity based on peak workloads, resulting in idle resources during lower-traffic periods. The pay-per-use model proved especially beneficial for non-production environments (development, performance testing, and staging) that only required GPU resources during active development cycles rather than continuously. This represents a fundamental shift in cost structure from capacity-based to consumption-based pricing.

That said, because the case study was published on the AWS blog, its claims should be read with appropriate skepticism. The 40% cost reduction likely represents a best-case scenario for specific workload patterns rather than universal savings. Organizations with consistently high traffic might see smaller benefits, while those with spiky or development-heavy workloads might see even greater savings. The actual cost comparison depends heavily on factors such as existing reserved instance commitments, traffic patterns, model sizes, and the specific SageMaker configurations being replaced.

## Technical Considerations and Lessons Learned

Several important technical considerations emerged from Salesforce's implementation that offer practical guidance for other organizations. Model architecture compatibility is a primary concern: while Amazon Bedrock Custom Model Import supports popular open-source architectures including Qwen, Mistral, and Llama, teams working with cutting-edge or custom architectures may need to wait for support to be added. Organizations planning to fine-tune with the latest model architectures should verify compatibility before committing to deployment timelines.

Cold start latency emerged as a critical consideration, particularly for larger models. Salesforce observed cold start delays of several minutes with their 26B-parameter models, with latency varying based on model size. For latency-sensitive applications that cannot tolerate such delays, they recommend keeping at least one model copy active through health check invocations every 14 minutes (a minimal keep-warm sketch appears below). This creates a tradeoff between cost efficiency (pure pay-per-use) and performance requirements (keeping endpoints warm). The 14-minute interval appears to be chosen to keep the model loaded while minimizing unnecessary inference costs.

The preprocessing and postprocessing architecture decisions highlight broader questions about where to place custom logic in serverless LLM deployments. While Salesforce chose SageMaker CPU containers for backward compatibility and library support, alternative approaches using API Gateway and Lambda might prove more cost-effective for simpler processing requirements or for new applications without legacy integration constraints. The cold start impacts they observed with Lambda when using larger libraries suggest that the complexity of preprocessing logic should inform architecture decisions.

## Hybrid Deployment Strategy

An important aspect of this case study is what it reveals about hybrid deployment strategies. Salesforce explicitly notes that for highly customized models or unsupported architectures, they continue using SageMaker as a managed ML solution. This suggests a pragmatic approach in which different infrastructure choices serve different use cases based on their specific requirements. Maintaining both deployment pathways provides flexibility and avoids being locked into a single approach that may not fit all scenarios.
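Returning to the cold-start mitigation described above: the keep-warm recommendation can be implemented as a small scheduled job, for example an EventBridge rule firing every 14 minutes into a Lambda function. The sketch below assumes a hypothetical imported-model ARN and an illustrative ping payload; the real request should be as cheap as the model's schema allows.

```python
import json
import os

import boto3

# Hypothetical imported-model ARN; in practice this would be configuration.
IMPORTED_MODEL_ARN = os.environ["IMPORTED_MODEL_ARN"]

bedrock_runtime = boto3.client("bedrock-runtime")


def handler(event, context):
    """Scheduled health-check invocation (e.g., every 14 minutes) that keeps at
    least one model copy loaded, trading a small inference cost for predictable
    latency on real traffic."""
    ping = {"prompt": "ping", "max_tokens": 1}  # minimal, illustrative payload
    response = bedrock_runtime.invoke_model(
        modelId=IMPORTED_MODEL_ARN,
        body=json.dumps(ping),
        contentType="application/json",
        accept="application/json",
    )
    # Reading the body confirms the round trip completed.
    _ = response["body"].read()
    return {"status": "warm"}
```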
The gradual migration strategy, starting with non-critical workloads, represents sound engineering practice for infrastructure transitions. This approach allows teams to validate performance, cost, and operational characteristics in production before migrating business-critical services. The successful production deployment of ApexGuru (their Apex code analysis tool) on the new infrastructure validates the approach for production workloads.

## Broader Context and Applicability

This case study provides a blueprint for organizations managing LLMs at scale, particularly those with variable traffic patterns or multiple deployment environments. The benefits appear most pronounced for teams currently managing their own inference infrastructure and carrying the associated optimization burden. Organizations already deeply invested in optimized SageMaker deployments with consistent traffic patterns might see smaller benefits and should carefully evaluate their specific cost structures and operational requirements.

The architectural patterns demonstrated here (maintaining backward compatibility through proxy layers, gradual migration strategies, and hybrid deployment approaches) are broadly applicable beyond the specific AWS services involved. The fundamental challenge of balancing operational simplicity, cost efficiency, and performance while maintaining production stability applies across cloud providers and deployment platforms.

The technical approach of separating control plane operations (model registration and deployment) from data plane operations (inference serving) through well-defined APIs is what enables this kind of flexibility. Being able to swap underlying infrastructure while keeping application interfaces stable is mature LLMOps practice that reduces coupling between model serving infrastructure and application logic.
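That separation can be made explicit in application code by hiding both serving paths behind a common inference contract. The sketch below is purely illustrative: the interface and class names are hypothetical, and the request payloads depend on the specific models being served. It shows how an application can depend on a stable interface while the backend, a Bedrock imported model or a SageMaker endpoint, remains a configuration detail.

```python
import json
from typing import Protocol

import boto3


class InferenceBackend(Protocol):
    """Stable data-plane contract the application codes against."""

    def generate(self, prompt: str, max_tokens: int = 512) -> str: ...


class BedrockImportedModel:
    """Serves a model registered via Custom Model Import (payload schema is illustrative)."""

    def __init__(self, model_arn: str):
        self._client = boto3.client("bedrock-runtime")
        self._model_arn = model_arn

    def generate(self, prompt: str, max_tokens: int = 512) -> str:
        response = self._client.invoke_model(
            modelId=self._model_arn,
            body=json.dumps({"prompt": prompt, "max_tokens": max_tokens}),
            contentType="application/json",
            accept="application/json",
        )
        return response["body"].read().decode("utf-8")


class SageMakerEndpoint:
    """Serves highly customized or unsupported architectures on SageMaker."""

    def __init__(self, endpoint_name: str):
        self._client = boto3.client("sagemaker-runtime")
        self._endpoint_name = endpoint_name

    def generate(self, prompt: str, max_tokens: int = 512) -> str:
        response = self._client.invoke_endpoint(
            EndpointName=self._endpoint_name,
            ContentType="application/json",
            Body=json.dumps({"inputs": prompt, "max_tokens": max_tokens}),
        )
        return response["Body"].read().decode("utf-8")


def answer(backend: InferenceBackend, question: str) -> str:
    """Application logic depends only on the interface, not the serving platform."""
    return backend.generate(f"Question: {question}\nAnswer:")
```

In this style, which backend gets constructed is a deployment-time decision, which is what makes a gradual, per-workload migration possible without touching application code.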
