ZenML

Enterprise Challenges and Opportunities in Large-Scale LLM Deployment

Barclays 2024

A senior leader in industry discusses the key challenges and opportunities in deploying LLMs at enterprise scale, highlighting the differences between traditional MLOps and LLMOps. The presentation covers critical aspects including cost management, infrastructure needs, team structures, and organizational adaptation required for successful LLM deployment, while emphasizing the importance of leveraging existing MLOps practices rather than completely reinventing the wheel.

Industry: Tech

Overview

This case study is derived from a conference presentation by Andy, a senior industry leader at West Group (a large enterprise with 500 data scientists and engineers) and author of “Machine Learning Engineering with Python.” The talk focuses on the practical challenges and strategic opportunities when deploying LLMs and generative AI at enterprise scale. This is not a specific implementation case study but rather a practitioner’s perspective on the state of LLMOps in large organizations, offering valuable insights into what makes enterprise LLM deployment difficult and how organizations can navigate these challenges.

The speaker makes a critical observation that resonates across the industry: while many organizations are actively using generative AI, very few have successfully deployed it at production scale, especially in larger enterprises. Most organizations, according to the speaker, are getting stuck at the “develop” stage of the machine learning lifecycle, unable to make the transition to actual production deployment.

The Four-Stage ML Lifecycle and Where Organizations Struggle

The presentation references a four-stage machine learning lifecycle framework: Discover (understanding the problem), Play (building a proof of concept), Develop, and Deploy. The key insight is that the generative AI revolution has created a bottleneck at the development stage, where organizations struggle to transition from experimentation to production-ready systems.

Key Differences Between MLOps and LLMOps

The speaker emphasizes that traditional MLOps and LLMOps are fundamentally different, which creates challenges for organizations that have built significant muscle memory around classical machine learning operations. The talk walks through several critical differences between the two disciplines.

This transition is particularly challenging for enterprises that have invested heavily in building classical MLOps capabilities over the years.

Foundation Model Selection Criteria

An important perspective shared is how enterprise leaders should think about foundation model selection. Rather than focusing on which models top the Hugging Face leaderboard, the speaker advocates a pragmatic evaluation framework grounded in enterprise realities.

This practical approach contrasts with the hype-driven model selection that often occurs in early experimentation phases.

The Emerging GenAI Stack

The presentation references the a16z (Andreessen Horowitz) diagram of the emerging GenAI/LLM stack, which includes both familiar components (orchestration, monitoring, logging, caching) and new elements.

The speaker notes that large organizations often struggle to adapt quickly to these changes due to their inherent bureaucratic processes around budget approval and infrastructure provisioning.

Enterprise-Scale Considerations

At enterprise scale, several factors become particularly challenging, among them cost management, infrastructure provisioning, and maintaining consistent standards across large teams.

Strategic Recommendations

The speaker offers practical guidance for enterprises navigating this transition.

The New Data Layer Challenge

One of the most significant challenges highlighted is the evolution of the enterprise data layer. Organizations that have built data lakes, lakehouses, experiment metadata trackers, and model registries must now augment these with new components, most prominently vector stores to support retrieval-augmented generation.

This represents a fundamental shift in how data and analytics teams structure their data infrastructure.
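
To make this shift concrete, the sketch below shows a toy, dependency-free vector store of the kind that typically joins the enterprise data layer alongside existing registries. It is illustrative only: the class name, document IDs, and three-dimensional vectors are invented here, and production systems would use a managed vector database with learned embeddings.

```python
import math


class TinyVectorStore:
    """Minimal in-memory vector store: add embeddings, query by cosine similarity."""

    def __init__(self):
        self._items = []  # list of (doc_id, vector) pairs

    def add(self, doc_id, vector):
        self._items.append((doc_id, list(vector)))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def query(self, vector, top_k=1):
        # Brute-force scan: score every stored vector, return the best matches.
        scored = [(self._cosine(vector, v), doc_id) for doc_id, v in self._items]
        scored.sort(reverse=True)
        return [doc_id for _, doc_id in scored[:top_k]]


store = TinyVectorStore()
store.add("risk-policy", [0.9, 0.1, 0.0])
store.add("onboarding-faq", [0.1, 0.9, 0.2])
print(store.query([0.8, 0.2, 0.0], top_k=1))  # → ['risk-policy']
```

The brute-force scan is O(n) per query; real vector databases replace it with approximate nearest-neighbor indexes to stay fast at enterprise document volumes.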

Monitoring and Evaluation

The speaker emphasizes that monitoring in LLMOps is critically important but substantially more complex than in traditional MLOps. The challenge lies in building workflows that effectively combine multiple forms of evaluation, from automated checks to human review.

This multi-faceted approach to evaluation is still evolving, and best practices are not yet well established.
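
As one hedged illustration of what the automated layer of such a workflow might look like, the function below combines a few deterministic checks into a single evaluation record. The check names and banned-term list are invented for this sketch; in practice these cheap checks would sit alongside model-based (LLM-as-judge) scoring and human review.

```python
def evaluate_response(response: str, banned_terms=("guaranteed returns",)) -> dict:
    """Run cheap deterministic checks on an LLM response and combine the results.

    Returns a record with per-check outcomes plus an overall pass/fail flag,
    the kind of structured result a monitoring pipeline can aggregate over time.
    """
    checks = {
        "non_empty": bool(response.strip()),                      # model said something
        "within_length": len(response) <= 2000,                   # no runaway generations
        "no_banned_terms": not any(t in response.lower() for t in banned_terms),
    }
    return {"passed": all(checks.values()), "checks": checks}


ok = evaluate_response("Refunds are processed within 5 business days.")
bad = evaluate_response("This product offers guaranteed returns.")
print(ok["passed"], bad["passed"])  # → True False
```

Structuring each evaluation as a record rather than a single score makes it easy to track which specific check regresses when a prompt or model version changes.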

Guardrails Implementation

For safety and control, the speaker recommends tools like NVIDIA’s NeMo Guardrails, which are described as easy to configure and build. However, a significant open question remains: how do you standardize guardrails implementations across very large teams? At West Group, with 500 data scientists and engineers, ensuring everyone works to the same standards is a substantial organizational challenge.
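
For flavor, a minimal rail in the Colang style that NeMo Guardrails configurations use might look like the following; the intent, flow, and message names here are illustrative and not taken from the talk.

```colang
define user ask about competitor products
  "What does the other bank charge for this?"
  "Is their rival's product better?"

define bot refuse out of scope
  "I can only help with questions about our own products and services."

define flow out of scope
  user ask about competitor products
  bot refuse out of scope
```

Standardizing a shared library of such rails, rather than letting each team write its own, is one plausible answer to the consistency question the speaker raises.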

Organizational and Process Challenges

The dynamic, rapidly evolving architecture of the GenAI space poses particular challenges for large organizations. The speaker notes that even RAG, which only recently arrived on the scene, is already the subject of articles arguing it is becoming obsolete. This pace of change is difficult for enterprises to absorb.

Strategic Organizational Responses

The speaker recommends a set of organizational responses for keeping pace with this rate of change.

Team Structure Evolution

The composition of ML/AI teams is changing with the advent of generative AI.

Historical Context and Maturity Perspective

The speaker provides valuable historical perspective, noting that when data science first took off, 80-90% of organizations couldn’t put solutions in production. There was similar confusion and hype. MLOps emerged to help mature these practices, and the speaker predicts the same will happen with LLMOps and generative AI.

Winning Strategy for Organizations

The presentation concludes with a clear thesis: the organizations that will succeed in this space are those that can adapt quickly while building on what they already have.

The key message is that enterprises should not reinvent the wheel. Much of what has been built in MLOps over recent years still applies; organizations are simply adding new pieces to an existing foundation.

Cost Estimation Discussion

During the Q&A, an important practical question arose about estimating cost per query. The speaker offered practical advice on how to approach such estimates.
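
A common way to approach such an estimate is a back-of-envelope calculation from token counts and per-token prices. The sketch below is an assumption-laden illustration, not the speaker's method: the prices, token counts, and query volume are made up, and real pricing varies by provider, model, and contract.

```python
def cost_per_query(prompt_tokens: int, completion_tokens: int,
                   input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Back-of-envelope cost of one LLM API call, in the currency of the prices.

    Providers typically price input (prompt) and output (completion) tokens
    separately, so the two sides are computed independently and summed.
    """
    return (prompt_tokens / 1000) * input_price_per_1k \
         + (completion_tokens / 1000) * output_price_per_1k


# Illustrative numbers only: a RAG-style prompt with retrieved context tends to
# have far more input tokens than output tokens.
unit = cost_per_query(prompt_tokens=1500, completion_tokens=400,
                      input_price_per_1k=0.003, output_price_per_1k=0.006)
monthly = unit * 50_000  # hypothetical 50k queries per month
print(f"{unit:.4f} per query, {monthly:.2f} per month")  # → 0.0069 per query, 345.00 per month
```

Even a rough model like this exposes the levers that matter at enterprise scale: prompt length (often dominated by retrieved context), output caps, and query volume.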

Critical Assessment

It’s worth noting that this presentation is a practitioner’s perspective rather than a documented case study with measurable outcomes. The insights are valuable for understanding enterprise challenges, but claims about best practices should be evaluated in context. The speaker’s experience is primarily from one large organization, and approaches may vary across industries and organizational cultures. The recommendation to avoid pre-training, while generally sound, may not apply to all situations, particularly as costs decrease and specialized use cases emerge.
