A comprehensive overview of the current state and challenges of production machine learning and LLMOps, covering key areas including motivations, industry trends, technological developments, and organizational changes. The presentation highlights the evolution from model-centric to data-centric approaches, the importance of metadata management, and the growing focus on security and monitoring in ML systems.
This case study is derived from a keynote presentation by Alejandro Saucedo, who serves as Director of Engineering Science and Product at Zalando, scientific adviser at The Institute for Ethical AI, and member-at-large of the ACM. The presentation, titled “The State of Production Machine Learning in 2024,” provides a broad industry-level perspective on how organizations are tackling the challenges of deploying and operating machine learning systems in production, with particular attention to how LLMs fit into this landscape.
The talk is notable for its emphasis that LLMs have not fundamentally changed the challenges of production ML but rather have accelerated and made more visible the existing complexities. This is a valuable perspective that counters some of the hype around LLMs requiring entirely new operational paradigms. The speaker draws from his experience at Zalando and involvement in regulatory efforts (including contributions to UK AI regulation) to provide both practical and policy-oriented insights.
The presentation begins by emphasizing a fundamental principle that resonates throughout: a machine learning model's lifecycle begins, rather than ends, when training completes. This perspective shapes everything that follows about production considerations.
The key challenges identified include specialized hardware requirements (particularly GPUs, which have become increasingly prominent), complex data flows that span both training and inference pipelines, compliance requirements that vary by use case, and the critical need for reproducibility of components. What makes production ML particularly challenging compared to traditional software infrastructure includes considerations around bias, outages triggered by ML-specific failure modes, handling of personal or sensitive data (including data that might be exposed through models), and cybersecurity concerns that span the entire ML pipeline.
A crucial point made is that “the impact of a bad solution can be worse than no solution at all.” This principle should guide practitioners when deciding whether to deploy ML systems and how much overhead is appropriate for different use cases.
One of the most insightful observations in the presentation is that LLMs provide excellent intuition for understanding the challenges of production ML more broadly. The speaker argues that LLMs make visible the complex architectures, multiple component interactions, and sophisticated data flows that have always been part of production ML systems but were perhaps less obvious with simpler models.
The example given is Facebook’s search retrieval system, which contains both offline and online components with multiple different parts, each requiring its own monitoring considerations. This complexity is not unique to LLMs but is now more apparent because LLM-based systems often require similar architectural sophistication from the start.
The presentation acknowledges the explosion of MLOps tools available today, noting that there are now more than a dozen tools to choose from in any given area. To navigate this complexity, the speaker recommends resources like the “awesome MLOps” list, which has been curated for five years and covers tools across privacy engineering, feature engineering, visualization, and more.
The anatomy of production ML is described as comprising training data, model artifacts, and inference data, with experimentation connecting training data to artifacts, and deployment/serving connecting artifacts to inference. Critical additions include monitoring (drift detection, explainability), and the feedback loop that connects inference data back to training.
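This anatomy can be sketched as a minimal pipeline. The class and stage names below are illustrative assumptions (the talk names the stages, not an implementation); the “model” is deliberately trivial so that the data flow between stages is the only thing on display:

```python
from dataclasses import dataclass, field

@dataclass
class ModelArtifact:
    """Output of the experimentation stage (illustrative)."""
    name: str
    version: int
    weights: list[float]

@dataclass
class ProductionML:
    """Sketch of the anatomy described in the talk: training data
    -> (experimentation) -> artifact -> (serving) -> inference data,
    with a feedback loop closing the cycle."""
    training_data: list[float] = field(default_factory=list)
    inference_data: list[float] = field(default_factory=list)

    def experiment(self) -> ModelArtifact:
        # Stand-in for training: the "model" is just the training mean.
        mean = sum(self.training_data) / len(self.training_data)
        return ModelArtifact("toy-model", 1, [mean])

    def serve(self, artifact: ModelArtifact, x: float) -> float:
        self.inference_data.append(x)   # captured for monitoring
        return artifact.weights[0]      # constant predictor

    def feedback_loop(self) -> None:
        # Inference data flows back into the training set.
        self.training_data.extend(self.inference_data)
        self.inference_data.clear()

system = ProductionML(training_data=[1.0, 2.0, 3.0])
model = system.experiment()
system.serve(model, 4.0)
system.feedback_loop()
print(system.training_data)  # [1.0, 2.0, 3.0, 4.0]
```

The point of the sketch is structural: experimentation and serving are distinct stages connected only through the artifact, and the feedback loop is an explicit step rather than an accident of shared storage.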
A significant trend discussed is the transition from model-centric to data-centric thinking. The presentation introduces the concept of metadata interoperability as a key architectural challenge: as organizations move from productionizing individual models to productionizing ML systems (pipelines of multiple models deployed across different environments), traditional artifact stores become insufficient.
The speaker references the “MyMLOps” project as a useful visualization tool that helps organizations reason about architectural blueprints by showing how different tools can be combined. Organizations must decide between heterogeneous best-of-breed open-source combinations versus end-to-end single-provider solutions.
The relationship mapping has evolved from simple data-to-model relationships to complex multi-dimensional relationships where the same model artifact might be deployed to multiple environments, and systems might combine models from different contexts. This introduces a fundamentally new paradigm that requires thinking beyond traditional artifact stores.
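A hypothetical metadata layer makes this concrete: the structures and function names below are illustrative assumptions, not an actual artifact-store API, but they show why one-to-one data-to-model bookkeeping breaks down once artifacts fan out to multiple environments and systems compose multiple models:

```python
from collections import defaultdict

# Illustrative metadata layer (not a real artifact-store API).
deployments = defaultdict(set)  # artifact -> environments it runs in
systems = defaultdict(set)      # ML system -> artifacts it composes

def deploy(artifact: str, environment: str) -> None:
    deployments[artifact].add(environment)

def compose(system: str, artifact: str) -> None:
    systems[system].add(artifact)

# The same artifact can serve several environments...
deploy("ranker-v3", "eu-prod")
deploy("ranker-v3", "us-prod")
# ...and one system can combine models from different contexts.
compose("search-pipeline", "ranker-v3")
compose("search-pipeline", "query-embedder-v1")

def blast_radius(artifact: str) -> set[str]:
    """Which systems are affected if this artifact is rolled back?"""
    return {s for s, parts in systems.items() if artifact in parts}

print(sorted(deployments["ranker-v3"]))  # ['eu-prod', 'us-prod']
print(blast_radius("ranker-v3"))         # {'search-pipeline'}
```

Queries like `blast_radius` are exactly what a flat artifact store cannot answer: they require the many-to-many relationships between artifacts, environments, and systems to be first-class metadata.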
The monitoring section emphasizes that production ML monitoring goes beyond traditional software metrics, covering ML-specific concerns such as drift detection and explainability alongside conventional system health.
The emphasis on “observability by design” rather than reactive dashboard monitoring is particularly relevant for LLM deployments, where understanding model behavior is crucial but challenging.
Security is identified as a growing area of importance, with vulnerabilities potentially existing throughout the entire ML pipeline: data processing, model training, model serving, and the metadata layer. The speaker mentions chairing a working group at the Linux Foundation focused on machine learning and MLOps security, indicating this is an area requiring active community attention.
A key organizational insight is that traditional software development lifecycle approaches cannot be directly copied to ML. While traditional SDLC has rigid steps (code, test, ops approval, deploy, monitor), ML requires more flexibility because different use cases have different requirements. Some may require risk assessments or ethics board approvals, while others may need rapid iteration for experiments. The governance and operations of production ML must be adapted rather than transplanted from software practices.
The presentation identifies an important trend: MLOps and DataOps are converging. DataOps, with concepts like data mesh architectures, enables business units to make nimble use of their datasets. This is now colliding with ML operationalization requirements, creating new frameworks needed for compliance at scale, not just for personal data but for tracing usage of different datasets across ML systems.
A significant mindset shift is occurring from treating ML as projects (with defined endpoints) to treating ML as products (with ongoing lifecycles). Since models need maintenance, new versions, and capability extensions, product thinking methodologies are becoming more relevant. This includes adopting team structures like Spotify’s squad model, bringing together not just ML practitioners but also UX researchers, full-stack engineers, and domain experts into cross-functional teams.
Since the introduction of LLMs, the speaker has observed accelerated adoption of product team thinking, with increasing numbers of designers and UX researchers joining what were previously purely technical ML teams. This creative and design dimension is seen as driving innovation that marries cutting-edge technology with domain expertise.
The presentation also provides a framework for how team structure and skill sets should evolve as ML maturity increases.
Similarly, automation, standardization, control, security, and observability should increase proportionally with MLOps requirements. Organizations should not front-load all complexity but should scale governance with actual needs.
The Q&A portion reveals practical guidance on deployment decisions. The speaker emphasizes “proportionate risk” assessment: high-risk applications affecting users’ finances, livelihoods, or organizational reputation require closer alignment with domain experts and more stringent SLO setting. Lower-risk applications can be sandboxed to smaller user groups or mitigated through human-in-the-loop approaches.
Importantly, if the required overhead seems excessive for a particular use case, the conclusion may be that advanced AI simply isn’t appropriate for that context. The speaker cites examples of banks and hedge funds still using linear models because explainability requirements or compliance demands make more complex models infeasible.
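The “proportionate risk” guidance can be expressed as a toy policy function. The tiers and control names below are illustrative assumptions mapped from the talk's examples (domain-expert alignment and stricter SLOs for high-risk use cases; sandboxing and human-in-the-loop for lower-risk ones), not a prescribed framework:

```python
from enum import Enum

class Risk(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

# Illustrative mapping: controls scale with potential impact.
CONTROLS = {
    Risk.LOW:    {"sandbox_rollout", "basic_monitoring"},
    Risk.MEDIUM: {"sandbox_rollout", "basic_monitoring",
                  "human_in_the_loop"},
    Risk.HIGH:   {"domain_expert_review", "strict_slos",
                  "human_in_the_loop", "basic_monitoring"},
}

def required_controls(affects_finances: bool, affects_livelihood: bool,
                      reputational_exposure: bool) -> set[str]:
    """Assign a risk tier from impact flags, then look up controls."""
    if affects_finances or affects_livelihood:
        risk = Risk.HIGH
    elif reputational_exposure:
        risk = Risk.MEDIUM
    else:
        risk = Risk.LOW
    return CONTROLS[risk]

controls = required_controls(affects_finances=True,
                             affects_livelihood=False,
                             reputational_exposure=False)
print("strict_slos" in controls)  # True
```

If the controls a use case demands cost more than the value it delivers, that is the signal, per the talk, that advanced AI may simply not be appropriate there.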
The speaker traces an evolution from AI ethics discussions (2018) to regulatory frameworks (2020 onward) to software frameworks and libraries that can actually enforce higher-level principles. The EU AI Act is mentioned as entering enforcement. A key insight is that without underlying infrastructure set up to enforce principles, high-level ethical discussions remain ineffective.
The accountability framework extends from individual practitioners (using best practices and relevant tools) to team level (cross-functional skill sets, domain experts at touch points) to organizational level (governing structures, aligned objectives). Open-source frameworks like those from Hugging Face are becoming critical infrastructure for responsible ML development.
The presentation concludes with a reminder that not everything needs to be solved with AI, and that practitioners have growing responsibility as critical infrastructure increasingly depends on ML systems. Regardless of abstractions or LLM capabilities, the impact of these systems is ultimately human, and this should remain central to how practitioners approach production ML challenges.
Predibase, a fine-tuning and model serving platform, announced its acquisition by Rubrik, a data security and governance company, with the goal of combining Predibase's generative AI capabilities with Rubrik's secure data infrastructure. The integration aims to address the critical challenge that over 50% of AI pilots never reach production due to issues with security, model quality, latency, and cost. By combining Predibase's post-training and inference capabilities with Rubrik's data security posture management, the merged platform seeks to provide an end-to-end solution that enables enterprises to deploy generative AI applications securely and efficiently at scale.
New Relic, a major observability platform processing 7 petabytes of data daily, implemented GenAI both internally for developer productivity and externally in their product offerings. They achieved a 15% increase in developer productivity through targeted GenAI implementations, while also developing sophisticated AI monitoring capabilities and natural language interfaces for their customers. Their approach balanced cost, accuracy, and performance through a mix of RAG, multi-model routing, and classical ML techniques.
This podcast discussion between Galileo and Crew AI leadership explores the challenges and solutions for deploying AI agents in production environments at enterprise scale. The conversation covers the technical complexities of multi-agent systems, the need for robust evaluation and observability frameworks, and the emergence of new LLMOps practices specifically designed for non-deterministic agent workflows. Key topics include authentication protocols, custom evaluation metrics, governance frameworks for regulated industries, and the democratization of agent development through no-code platforms.