
Fine-tuned LLM Deployment for Automotive Customer Engagement

Impel 2025

Impel, an automotive retail AI company, migrated from a third-party LLM to a fine-tuned Meta Llama model deployed on Amazon SageMaker to power their Sales AI product, which provides 24/7 personalized customer engagement for dealerships. The transition addressed cost predictability concerns and customization limitations, resulting in 20% improved accuracy across core features including response personalization, conversation summarization, and follow-up generation, while achieving better security and operational control.

Industry

Automotive

Company and Use Case Overview

Impel is an automotive retail technology company that specializes in AI-powered customer lifecycle management solutions for automotive dealerships. Their flagship product, Sales AI, serves as a digital concierge that provides personalized customer engagement throughout the vehicle purchasing journey, from initial research to purchase, service, and repeat business. The system operates around the clock, handling vehicle-specific inquiries, automotive trade-in questions, and financing discussions through email and text communications.

The Sales AI platform encompasses three core functional areas that demonstrate sophisticated LLMOps implementation. The summarization feature processes past customer engagements to derive customer intent and preferences. The follow-up generation capability ensures consistent communication with engaged customers to prevent stalled purchasing journeys. The response personalization feature aligns responses with retailer messaging while accommodating individual customer purchasing specifications.

Technical Architecture and Implementation

Impel’s LLMOps implementation centers on Amazon SageMaker AI as the primary platform for model training, deployment, and inference. The company chose to fine-tune a Meta Llama model, recognizing its strong instruction-following capabilities, support for extended context windows, and efficient handling of domain-specific knowledge. This decision represents a strategic shift from general-purpose LLMs to domain-specific models tailored for automotive retail applications.

The fine-tuning process leverages Low-Rank Adaptation (LoRA) techniques, which provide an efficient and cost-effective method for adapting large language models to specialized applications. Impel conducted this training using Amazon SageMaker Studio notebooks running on ml.p4de.24xlarge instances, which provided the necessary computational resources for training large models. The managed environment enabled seamless integration with popular open-source tools including PyTorch and torchtune for model training workflows.
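The article names PyTorch and torchtune as the training stack but does not share the actual recipe. As a rough illustration of the LoRA pattern, the sketch below uses Hugging Face PEFT as a stand-in; the base model ID, target modules, and hyperparameters are assumptions rather than Impel's settings.

```python
# Minimal sketch of LoRA adaptation with Hugging Face PEFT (an illustrative stand-in
# for the torchtune recipes mentioned above). Model ID and hyperparameters are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed base model, not confirmed by the article
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.bfloat16, device_map="auto"
)

# LoRA trains small low-rank update matrices on selected projection layers instead of
# updating all model weights, which keeps GPU memory and training cost manageable.
lora_config = LoraConfig(
    r=16,                                                     # rank of the update matrices
    lora_alpha=32,                                            # scaling factor for the updates
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters

# A standard supervised fine-tuning loop (transformers Trainer, trl SFTTrainer, or a
# torchtune recipe) would then train only the LoRA parameters on the dealership
# conversation data before merging the adapter or serving it alongside the base model.
```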

For model optimization, Impel implemented Activation-Aware Weight Quantization (AWQ) techniques to reduce model size and improve inference performance. This optimization step is crucial for production deployment, as it directly impacts both latency and computational costs while maintaining model quality. The quantization process helps balance the trade-off between model accuracy and inference efficiency that is fundamental to successful LLMOps implementations.
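The article does not name the quantization tooling. A minimal sketch of post-training AWQ quantization, assuming the commonly used AutoAWQ library and hypothetical checkpoint paths, might look like this:

```python
# Illustrative post-training AWQ quantization with the AutoAWQ library
# (the article does not name the tool Impel used; paths and settings are assumptions).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "llama-sales-ai-merged"   # hypothetical fine-tuned, merged checkpoint
quant_path = "llama-sales-ai-awq"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# AWQ uses activation statistics from calibration data to identify the most important
# weight channels, protects them, and quantizes the rest to 4-bit, reducing memory
# footprint and inference latency with limited accuracy loss.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```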

Production Deployment and Scaling

The production deployment utilizes SageMaker Large Model Inference (LMI) containers, which are purpose-built Docker containers optimized for serving large language models like Meta Llama. These containers provide native support for LoRA fine-tuned models and AWQ quantization, streamlining the deployment process. The inference endpoints run on ml.g6e.12xlarge instances, each equipped with four NVIDIA L40S GPUs with 48 GB of memory apiece, providing the computational resources necessary for efficient large model serving.
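A hedged sketch of what hosting such a model with the SageMaker Python SDK and an LMI container could look like follows; the container image tag, environment variables, S3 path, and endpoint name are illustrative assumptions rather than Impel's actual configuration.

```python
# Sketch of hosting the quantized model behind a SageMaker endpoint with an LMI container.
# Image URI, environment variables, S3 locations, and names are illustrative assumptions.
import sagemaker
from sagemaker.model import Model

role = sagemaker.get_execution_role()
session = sagemaker.Session()

# Region-specific LMI (Large Model Inference) image; the exact tag varies by release,
# see the AWS-published list of available LMI container images.
lmi_image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124"

model = Model(
    image_uri=lmi_image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "s3://my-bucket/llama-sales-ai-awq/",  # hypothetical S3 location of model artifacts
        "OPTION_QUANTIZE": "awq",                             # tell the serving engine to load AWQ weights
        "OPTION_MAX_MODEL_LEN": "8192",                       # assumed context window setting
    },
    sagemaker_session=session,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6e.12xlarge",            # 4x NVIDIA L40S, as described above
    endpoint_name="sales-ai-llama-endpoint",    # hypothetical endpoint name
    container_startup_health_check_timeout=900, # large models need time to load
)
```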

A critical aspect of Impel’s LLMOps implementation is the automatic scaling capability provided by SageMaker. The system automatically scales serving containers based on concurrent request volumes, enabling the platform to handle variable production traffic demands while optimizing costs. This elastic scaling approach is essential for customer-facing applications where demand can fluctuate significantly throughout the day and across different business cycles.
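SageMaker endpoint auto scaling is configured through Application Auto Scaling. A minimal sketch, assuming a hypothetical endpoint name, capacity bounds, and target value:

```python
# Sketch of registering the endpoint with Application Auto Scaling so SageMaker adds
# or removes instances as request volume changes. Names and thresholds are assumptions.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/sales-ai-llama-endpoint/variant/AllTraffic"  # hypothetical endpoint/variant

# Allow the production variant to scale between 1 and 4 instances.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Target tracking on invocations per instance approximates scaling on concurrency:
# when sustained traffic per instance exceeds the target, additional instances launch.
autoscaling.put_scaling_policy(
    PolicyName="sales-ai-invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,  # assumed invocations-per-minute-per-instance target
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
```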

The deployment architecture incorporates comprehensive monitoring and performance tracking, including latency and throughput measurements validated with awscurl, which issues SigV4-signed HTTP requests directly against the endpoint. This monitoring infrastructure ensures that the model maintains optimal performance in real-world production environments and provides the visibility necessary for ongoing optimization efforts.
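awscurl is a command-line tool; the sketch below is an equivalent in-Python latency check using boto3, which also signs requests with SigV4. The payload shape, endpoint name, and sample count are assumptions.

```python
# Hedged sketch of a simple latency check against the endpoint. Payload shape and
# endpoint name are assumptions; real load tests would cover throughput as well.
import json
import statistics
import time

import boto3

runtime = boto3.client("sagemaker-runtime")
payload = {"inputs": "Summarize this customer conversation: ...", "parameters": {"max_new_tokens": 256}}

latencies = []
for _ in range(20):
    start = time.perf_counter()
    response = runtime.invoke_endpoint(
        EndpointName="sales-ai-llama-endpoint",  # hypothetical endpoint name
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    response["Body"].read()  # drain the response before stopping the clock
    latencies.append(time.perf_counter() - start)

print(f"p50 latency: {statistics.median(latencies):.2f}s")
print(f"p95 latency: {statistics.quantiles(latencies, n=20)[18]:.2f}s")
```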

Model Evaluation and Performance Metrics

Impel implemented a structured evaluation process that demonstrates best practices in LLMOps model assessment. The evaluation encompassed both automated metrics and human evaluation across the three core functional areas. For personalized replies, accuracy improved from 73% to 86%, representing a significant enhancement in the model’s ability to generate contextually appropriate responses. Conversation summarization showed improvement from 70% to 83% accuracy, indicating better comprehension of multi-turn dialogues and customer interaction patterns.

The most dramatic improvement occurred in follow-up message generation, which increased from 59% to 92% accuracy. This substantial gain demonstrates the effectiveness of domain-specific fine-tuning for specialized automotive retail tasks. The evaluation process involved Impel’s research and development team conducting comparative assessments between their incumbent LLM provider and the fine-tuned models across various use cases.

Beyond accuracy metrics, the evaluation included comprehensive performance testing covering latency, throughput, and resource utilization. These operational metrics are crucial for production readiness assessment and ensure that improved accuracy doesn’t come at the cost of user experience degradation. The evaluation framework represents a mature approach to LLMOps that balances multiple dimensions of model performance.

Cost Optimization and Operational Benefits

One of the primary drivers for Impel’s transition was cost optimization at scale. Their previous solution operated on a per-token pricing model that became cost-prohibitive as transaction volumes grew. The migration to SageMaker provided cost predictability through hosted pricing models, enabling better financial planning and budget management. This cost structure change is particularly important for applications with high transaction volumes and variable usage patterns.
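The article does not disclose prices or volumes, but the shape of the trade-off can be illustrated with a back-of-the-envelope comparison in which every number is hypothetical:

```python
# Back-of-the-envelope comparison of per-token API pricing vs. a hosted endpoint.
# Every number below is a hypothetical illustration, not Impel's actual costs.
tokens_per_conversation = 4_000       # assumed input + output tokens per exchange
conversations_per_month = 500_000     # assumed monthly volume across dealerships

api_price_per_1k_tokens = 0.01        # assumed blended per-token API price (USD)
api_monthly_cost = (tokens_per_conversation / 1_000) * api_price_per_1k_tokens * conversations_per_month

endpoint_price_per_hour = 10.0        # assumed hourly rate for a GPU inference instance (USD)
endpoint_instances = 2                # assumed average instance count under auto scaling
hosted_monthly_cost = endpoint_price_per_hour * endpoint_instances * 24 * 30

print(f"Per-token API:   ${api_monthly_cost:>10,.0f}/month (grows linearly with volume)")
print(f"Hosted endpoint: ${hosted_monthly_cost:>10,.0f}/month (roughly flat until scale-out)")
```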

The transition also delivered enhanced security benefits through in-house processing of proprietary data within Impel’s AWS accounts. This approach reduces dependency on external APIs and third-party providers while maintaining stricter control over sensitive customer data. The security improvements align with growing regulatory requirements and customer expectations regarding data privacy in automotive retail applications.

Operational control represents another significant benefit, enabling Impel to customize model behavior, implement specialized monitoring, and optimize performance based on their specific use case requirements. This level of control is difficult to achieve with third-party LLM providers and becomes increasingly important as applications mature and require more sophisticated customization.

Collaboration and Partnership Approach

The implementation involved extensive collaboration between Impel’s R&D team and various AWS teams, including account management, GenAI strategy, and SageMaker service teams. This partnership approach spanned multiple development sprints leading up to the fine-tuned Sales AI launch, encompassing technical sessions, strategic alignment meetings, and cost optimization discussions.

The collaborative approach included comprehensive model evaluation reviews, SageMaker performance benchmarking, scaling strategy optimization, and instance selection guidance. This level of partnership support is characteristic of enterprise LLMOps implementations where technical complexity and business criticality require deep expertise across multiple domains.

Future Roadmap and Expansion Plans

Impel’s success with fine-tuned models on SageMaker has established a foundation for expanding AI capabilities across their broader product suite. The company plans to transition additional components of their Customer Engagement Product suite to in-house, domain-specific models, leveraging the operational patterns and technical capabilities developed through the Sales AI implementation.

The future roadmap includes incorporating Retrieval Augmented Generation (RAG) workflows, which will enable the integration of real-time data sources and external knowledge bases into the model’s responses. Advanced function calling capabilities are planned to enable more sophisticated interaction patterns and integration with external systems. The development of agentic workflows represents an evolution toward more autonomous AI systems capable of complex reasoning and multi-step task execution.
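As a rough illustration of the retrieve-then-generate pattern behind the planned RAG workflows, the following sketch wires a placeholder retrieval step to the endpoint deployed earlier; the knowledge base, prompt format, and response schema are all assumptions.

```python
# Minimal sketch of the retrieve-then-generate pattern. The retrieval step, prompt
# format, and response schema are hypothetical, not Impel's planned design.
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

def retrieve_context(query: str, top_k: int = 3) -> list[str]:
    """Placeholder retrieval step: in practice this would query a vector store
    populated with dealership inventory, policy, and financing documents."""
    knowledge_base = {
        "trade-in": "Trade-in valuations are based on condition, mileage, and current market data.",
        "financing": "Financing terms range from 36 to 72 months subject to credit approval.",
    }
    return [text for key, text in knowledge_base.items() if key in query.lower()][:top_k]

def answer(query: str) -> str:
    context = "\n".join(retrieve_context(query))
    prompt = (
        "Use the dealership context below to answer the customer.\n\n"
        f"Context:\n{context}\n\nCustomer: {query}\nAssistant:"
    )
    response = runtime.invoke_endpoint(
        EndpointName="sales-ai-llama-endpoint",  # hypothetical endpoint from the deployment sketch
        ContentType="application/json",
        Body=json.dumps({"inputs": prompt, "parameters": {"max_new_tokens": 256}}),
    )
    return json.loads(response["Body"].read())["generated_text"]  # response schema assumed

print(answer("What financing options do you offer on a trade-in?"))
```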

Technical Considerations and Trade-offs

While the case study presents significant improvements, it’s important to consider the technical trade-offs inherent in fine-tuning approaches. Domain-specific fine-tuning can potentially reduce model generalization capabilities, making it less effective for tasks outside the training domain. The 20% accuracy improvement, while substantial, should be evaluated in the context of the specific evaluation criteria and may not generalize to all automotive retail scenarios.

The infrastructure requirements for hosting large language models represent ongoing operational overhead that must be balanced against the benefits of model customization and cost predictability. The choice of ml.g6e.12xlarge instances reflects significant computational resource allocation that may not be cost-effective for all use cases or traffic volumes.

The success of this implementation appears to be closely tied to Impel’s access to substantial domain-specific training data and the resources to conduct proper evaluation and optimization. Organizations considering similar approaches should carefully assess their data assets, technical capabilities, and long-term commitment to model maintenance and improvement.

This case study represents a mature approach to LLMOps implementation that successfully balances multiple objectives including cost optimization, performance improvement, security enhancement, and operational control. The comprehensive evaluation methodology and collaborative implementation approach provide valuable insights for organizations considering similar transitions from third-party LLM services to in-house fine-tuned models.
