ZenML
Blog
Hamza Tahir

Hamza Tahir

133 posts
Newsletter Edition #16 - The future of LLMOps @ ZenML (Your Voice Needed)

Newsletter Edition #16 - The future of LLMOps @ ZenML (Your Voice Needed)

We're expanding ZenML beyond its original MLOps focus into the LLMOps space, recognizing the same fragmentation patterns that once plagued traditional machine learning operations. We're developing three core capabilities: native LLM components that provide unified APIs and management across providers like OpenAI and Anthropic, along with standardized prompt versioning and evaluation tools; applying established MLOps principles to agent development to bring systematic versioning, evaluation, and observability to what's currently a "build it and pray" approach; and enhancing orchestration to support both LLM framework integration and direct LLM calls within workflows. Central to our philosophy is the principle of starting simple before going autonomous, emphasizing controlled workflows over fully autonomous agents for enterprise production environments, and we're actively seeking community input through a survey to guide our development priorities, recognizing that today's infrastructure decisions will determine which organizations can successfully scale AI deployment versus remaining stuck in pilot phases.

Banking on AI: Implementing Compliant MLOps for Financial Institutions

Banking on AI: Implementing Compliant MLOps for Financial Institutions

Traditional banks face growing pressure to deploy machine learning rapidly while meeting strict regulatory requirements. This blog post explores how modern MLOps practices, like automated data lineage, validation testing, and model observability can help financial institutions bridge the gap. Featuring real-world insights from NatWest and an open-source ZenML pipeline, it offers a practical roadmap for compliant, scalable AI deployment.

Why Retail MLOps Is Harder Than You Think

Why Retail MLOps Is Harder Than You Think

An in-depth analysis of retail MLOps challenges, covering data complexity, edge computing, seasonality, and multi-cloud deployment, with real-world examples from major retailers like Wayfair and Starbucks, and practical solutions including ZenML's impact in reducing deployment time from 8.5 to 2 weeks at Adeo Leroy Merlin.

Managing MLOps at Scale on Kubernetes: When Your 8×H100 Server Needs to Serve Everyone

Managing MLOps at Scale on Kubernetes: When Your 8×H100 Server Needs to Serve Everyone

Kubernetes powers 96% of enterprise ML workloads but often creates more friction than function—forcing data scientists to wrestle with infrastructure instead of building models while wasting expensive GPU resources. Our latest post shows how ZenML combined with NVIDIA's KAI Scheduler enables financial institutions to implement fractional GPU sharing, create team-specific ML stacks, and streamline compliance—accelerating innovation while cutting costs through intelligent resource orchestration.

Unified MLOps for Defense: Bridging Cloud, On-Premises, and Tactical Edge AI

Unified MLOps for Defense: Bridging Cloud, On-Premises, and Tactical Edge AI

Learn how ZenML unified MLOps across AWS, Azure, on-premises, and tactical edge environments for defense contractors like the German Bundeswehr and French aerospace manufacturers. Overcome hybrid infrastructure complexity, maintain security compliance, and accelerate AI deployment from development to battlefield. Essential guide for defense AI teams managing multi-classification environments and $1.5B+ military AI initiatives.

10 Databricks Alternatives You Must Try

10 Databricks Alternatives You Must Try

Discover the top 10 Databricks alternatives designed to eliminate the pain points you might face when using Databricks. This article will walk you through these alternatives and educate you about what the platform is all about - features, pricing, pros, and cons.

Scaling ML Workflows Across Multiple AWS Accounts (and Beyond): Best Practices for Enterprise MLOps

Scaling ML Workflows Across Multiple AWS Accounts (and Beyond): Best Practices for Enterprise MLOps

Enterprises struggle with ML model management across multiple AWS accounts (development, staging, and production), which creates operational bottlenecks despite providing security benefits. This post dives into ten critical MLOps challenges in multi-account AWS environments, including complex pipeline languages, lack of centralized visibility, and configuration management issues. Learn how organizations can leverage ZenML's solutions to achieve faster, more reliable model deployment across Dev, QA, and Prod environments while maintaining security and compliance requirements.

ZenML 0.80.0: Workspace Hierarchy for Pro, Performance Gains for All

ZenML 0.80.0: Workspace Hierarchy for Pro, Performance Gains for All

ZenML 0.80.0 transforms tenant structures into workspace/project hierarchies with advanced RBAC for Pro users, while enhancing tagging, resource filtering, and dashboard design. Open-source improvements include Kubernetes security upgrades, SkyPilot integration, and significantly faster CLI operations. Both Pro and OSS users benefit from dramatic performance optimizations, GitLab improvements, and enhanced build tracking.

AI Engineering vs ML Engineering: Evolving Roles in the GenAI Era

AI Engineering vs ML Engineering: Evolving Roles in the GenAI Era

The rise of Generative AI has shifted the roles of AI Engineering and ML Engineering, with AI Engineers integrating generative AI into software products. This shift requires clear ownership boundaries and specialized expertise. A proposed solution is layer separation, separating concerns into two distinct layers: Application (AI Engineers/Software Engineers), Frontend development, Backend APIs, Business logic, User experience, and ML (ML Engineers). This allows AI Engineers to focus on user experience while ML Engineers optimize AI systems.

Automating Lightning Studio ML Pipelines For Fine Tuning LLM (s)

Automating Lightning Studio ML Pipelines For Fine Tuning LLM (s)

In the AI world, fine-tuning Large Language Models (LLMs) for specific tasks is becoming a critical competitive advantage. Combining Lightning AI Studios with ZenML can streamline and automate the LLM fine-tuning process, enabling rapid iteration and deployment of task-specific models. This approach allows for the creation and serving of multiple fine-tuned variants of a model, with minimal computational resources. However, scaling the process requires resource management, data preparation, hyperparameter optimization, version control, deployment and serving, and cost management. This blog post explores the growing complexity of LLM fine-tuning at scale and introduces a solution that combines the flexibility of Lightning Studios with the automation capabilities of ZenML.

Navigating the MLOps Galaxy: ZenML meets Neptune for advanced Experiment Tracking

Navigating the MLOps Galaxy: ZenML meets Neptune for advanced Experiment Tracking

The combination of ZenML and Neptune can streamline machine learning workflows and provide unprecedented visibility into experiments. ZenML is an extensible framework for creating production-ready pipelines, while Neptune is a metadata store for MLOps. When combined, these tools offer a robust solution for managing the entire ML lifecycle, from experimentation to production. The combination of these tools can significantly accelerate the development process, especially when working with complex tasks like language model fine-tuning. This integration offers the ability to focus more on innovating and less on managing the intricacies of your ML pipelines.

Boost Your MLOps Efficiency: Integrate ZenML and Comet for Better Experiment Tracking

Boost Your MLOps Efficiency: Integrate ZenML and Comet for Better Experiment Tracking

This blog post discusses the integration of ZenML and Comet, an open-source machine learning pipeline management platform, to enhance the experimentation process. ZenML is an extensible framework for creating portable, production-ready pipelines, while Comet is a platform for tracking, comparing, explaining, and optimizing experiments and models. The combination offers seamless experiment tracking, enhanced visibility, simplified workflow, improved collaboration, and flexible configuration. The process involves installing ZenML and enabling Comet integration, registering the Comet experiment tracker in the ZenML stack, and customizing experiment settings.

Huggingface Model to Sagemaker Endpoint: Automating MLOps with ZenML

Huggingface Model to Sagemaker Endpoint: Automating MLOps with ZenML

Deploying Huggingface models to AWS Sagemaker endpoints typically only requires a few lines of code. However, there's a growing demand to not just deploy, but to seamlessly automate the entire flow from training to production with comprehensive lineage tracking. ZenML adeptly fills this niche, providing an end-to-end MLOps solution for Huggingface users wishing to deploy to Sagemaker.

Using ZenML with LLMs to Analyze Your Databases: A Case Study with you-tldr.com and Supabase/GPT-4

Using ZenML with LLMs to Analyze Your Databases: A Case Study with you-tldr.com and Supabase/GPT-4

Explore how ZenML, an MLOps framework, can be used with large language models (LLMs) like GPT-4 to analyze and version data from databases like Supabase. In this case study, we examine the you-tldr.com website, showcasing ZenML pipelines asynchronously processing video data and generating summaries with GPT-4. Understand how to tackle large language model limitations by versioning data and comparing summaries to unlock your data's potential. Learn how this approach can be easily adapted to work with other databases and LLMs, providing flexibility and versatility for your specific needs.

10 mins read
Run your steps on the cloud with Sagemaker, Vertex AI, and AzureML

Run your steps on the cloud with Sagemaker, Vertex AI, and AzureML

With ZenML 0.6.3, you can now run your ZenML steps on Sagemaker, Vertex AI, and AzureML! It’s normal to have certain steps that require specific infrastructure (e.g. a GPU-enabled environment) on which to run model training, and Step Operators give you the power to switch out infrastructure for individual steps to support this.

6 Mins Read