ZenML

Building and Scaling Enterprise LLMOps Platforms: From Team Topology to Production

Various 2023

A comprehensive overview of how enterprises are implementing LLMOps platforms, drawing from DevOps principles and experiences. The case study explores the evolution from initial AI adoption to scaling across teams, emphasizing the importance of platform teams, enablement, and governance. It highlights the challenges of testing, model management, and developer experience while providing practical insights into building robust AI infrastructure that can support multiple teams within an organization.

Industry

Tech

Summary

This presentation was delivered by Patrick Debois, widely known as the founder of DevOps Days (the 2009 conference where the term “DevOps” was coined). Drawing on 15 years of experience watching enterprise transformations and two years of hands-on experience building generative AI applications, Debois presents a comprehensive framework for how enterprises should approach scaling LLM-based applications. The talk is notable for its practitioner perspective, applying lessons learned from the DevOps movement to the emerging discipline of AI engineering and LLMOps.

Debois openly acknowledges his bias, noting that he’s applying old paradigms (DevOps, platform engineering, team topologies) to the new world of GenAI, which could either be correct or completely wrong. This self-awareness lends credibility to his observations while appropriately cautioning listeners that the field is still evolving.

Organizational Patterns and Team Dynamics

One of the most valuable aspects of this presentation is its focus on the organizational challenges of scaling GenAI, not just the technical ones. Debois describes a pattern he witnessed firsthand: when generative AI first emerged, it naturally landed with data science teams. However, these teams immediately faced friction because they weren’t accustomed to running things in production. This created a gap that needed bridging.

The solution his organization adopted involved gradually moving engineers into data science teams. Over time, the composition shifted—as GenAI work turned out to be more about integration than pure data science, the ratio of traditional engineers to data scientists increased. Eventually, this capability was scaled out to feature teams and abstracted into a platform.

Debois references the Team Topologies framework (originally called “DevOps Topologies”) to explain how organizations can structure this work. The model involves platform teams that abstract away complexity and provide services to feature teams (the “you build it, you run it” teams). The platform team can interact with feature teams in different modes: pure API/service provision, collaborative product development, or facilitation when teams need help. This is a known pattern for dealing with new technology abstractions in enterprises.

The key insight is that GenAI shouldn’t remain siloed in data science—it needs to be brought to traditional application developers so they can integrate it into their domains. This requires both platform infrastructure and enablement programs.

AI Platform Components

Debois provides a comprehensive overview of what services should compose an enterprise AI platform. While he explicitly notes he has no vendor affiliations, he outlines several critical components:

Model Access: Rather than having every team independently figure out which models to use and how to access them, a central team curates and provides access to appropriate models for the company, typically building on existing cloud vendor relationships.

Vector Databases and RAG Infrastructure: While vector databases were initially specialized products, Debois notes they’re now being incorporated into traditional database vendors, making them “not that special anymore.” However, teams still need to understand embeddings, vectors, and how to use them effectively. Some organizations are calling this “RAG Ops”—yet another ops discipline to manage.
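
The retrieval core behind RAG is simple enough to sketch. The following is a toy illustration, not any particular vector database's API: documents and queries are embedded as vectors, and retrieval ranks stored documents by cosine similarity. The three-dimensional vectors here are placeholders for real embeddings, which an embedding model would produce.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class InMemoryVectorStore:
    """Toy stand-in for a vector database: rows of (id, embedding, text)."""
    def __init__(self):
        self.rows = []

    def add(self, doc_id, embedding, text):
        self.rows.append((doc_id, embedding, text))

    def search(self, query_embedding, k=1):
        # Rank stored documents by similarity to the query embedding.
        ranked = sorted(self.rows,
                        key=lambda r: cosine(query_embedding, r[1]),
                        reverse=True)
        return [(doc_id, text) for doc_id, _, text in ranked[:k]]

store = InMemoryVectorStore()
store.add("doc1", [1.0, 0.0, 0.0], "Expense policy")
store.add("doc2", [0.0, 1.0, 0.0], "Vacation policy")
print(store.search([0.9, 0.1, 0.0], k=1))  # doc1 is closest
```

Understanding this much, as Debois suggests, is what teams still need even once vector search ships inside their existing database vendor.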

Data Connectors: The platform team builds and exposes connectors to various data sources across the company, so individual teams don’t have to figure this out repeatedly. This enables secure experimentation with company data.

Model Registry and Version Control: Similar to code version control, organizations need version control for models. A centralized repository allows teams to reuse models and makes them visible like a library.
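
At its core, such a registry is a versioned lookup table teams can query like a library index. A minimal sketch; the field names, version tuples, and `s3://` URIs are illustrative, not a specific registry product's schema.

```python
# In-memory model registry: name -> version -> entry.
registry = {}

def register(name, version, uri, metadata=None):
    """Record a model version and where its artifact lives."""
    registry.setdefault(name, {})[version] = {
        "uri": uri,
        "metadata": metadata or {},
    }

def latest(name):
    # Assumes sortable version keys such as (major, minor, patch) tuples.
    return max(registry[name])

register("support-summarizer", (1, 0, 0), "s3://models/support-summarizer/1.0.0")
register("support-summarizer", (1, 1, 0), "s3://models/support-summarizer/1.1.0")
print(latest("support-summarizer"))  # (1, 1, 0)
```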

Model Provider Abstraction: Larger enterprises often want to be model-provider agnostic, similar to how they’ve sought cloud agnosticism. Debois speculates that the OpenAI protocol might become the S3-equivalent standard for model access. This layer also handles access control through proxies, preventing unrestricted model usage.
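
The proxy idea can be sketched as a small gateway that routes requests to interchangeable provider adapters and enforces a per-team allow-list. The provider classes, team names, and model names below are hypothetical stand-ins for real vendor SDK calls.

```python
class ModelProvider:
    """Common interface each vendor adapter implements."""
    def complete(self, prompt: str) -> str:
        raise NotImplementedError

class FakeVendorA(ModelProvider):
    def complete(self, prompt):
        return f"[vendor-a] {prompt}"

class FakeVendorB(ModelProvider):
    def complete(self, prompt):
        return f"[vendor-b] {prompt}"

class ModelGateway:
    """Central proxy: routes by model name and enforces an allow-list."""
    def __init__(self, providers, allowed_models):
        self.providers = providers          # model name -> adapter
        self.allowed = allowed_models       # team -> set of model names

    def complete(self, team, model, prompt):
        if model not in self.allowed.get(team, set()):
            raise PermissionError(f"{team} may not use {model}")
        return self.providers[model].complete(prompt)

gateway = ModelGateway(
    providers={"model-a": FakeVendorA(), "model-b": FakeVendorB()},
    allowed_models={"payments-team": {"model-a"}},
)
print(gateway.complete("payments-team", "model-a", "hello"))
```

Because feature teams only ever see the gateway interface, swapping or adding vendors behind it stays a platform-team concern, which is the agnosticism Debois describes.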

Observability and Tracing: The platform needs to capture prompts running in production, similar to traditional observability but with important differences. One user prompt often leads to five or six iterations/sub-calls, requiring different tracing approaches. This shouldn’t be something every team builds independently.
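
The difference from flat request logging is that spans nest: one user prompt becomes a tree of sub-calls. A minimal sketch of such a tracer, assuming a real deployment would use an established tracing stack rather than hand-rolling this:

```python
import time
from contextlib import contextmanager

class Tracer:
    """Collects nested spans so one user prompt's sub-calls form a tree."""
    def __init__(self):
        self.spans = []    # root spans
        self._stack = []   # currently open spans

    @contextmanager
    def span(self, name):
        record = {"name": name, "children": [], "start": time.time()}
        if self._stack:
            self._stack[-1]["children"].append(record)
        else:
            self.spans.append(record)
        self._stack.append(record)
        try:
            yield record
        finally:
            record["duration"] = time.time() - record["start"]
            self._stack.pop()

tracer = Tracer()
with tracer.span("user_prompt"):
    for step in ("rewrite_query", "retrieve", "generate"):
        with tracer.span(step):
            pass  # each sub-call would hit a model or data source here

root = tracer.spans[0]
print(root["name"], [c["name"] for c in root["children"]])
```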

Production Monitoring and Evals: Traditional monitoring and metrics aren’t sufficient for LLM applications. Instead of simple health checks for API calls, organizations need “health checks for evals running all the time in production.” This enables detection of model changes, unusual end-user behavior, and data quality issues.
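
A sketch of what "evals as health checks" might look like: a small suite of prompt/assertion pairs run on a schedule against the production endpoint, with the pass rate treated like a health metric. The prompts, checks, and stubbed model below are illustrative.

```python
def eval_contains(expected):
    # Simple assertion-style eval: did the output include the expected text?
    return lambda output: expected.lower() in output.lower()

EVAL_SUITE = [
    # (prompt, check) pairs run continuously against production.
    ("What is our refund window?", eval_contains("30 days")),
    ("Summarize the expense policy", eval_contains("policy")),
]

def run_eval_healthcheck(model_fn):
    """Run the suite; a drop in pass rate signals a model or data change."""
    passed = sum(1 for prompt, check in EVAL_SUITE if check(model_fn(prompt)))
    return passed / len(EVAL_SUITE)

def fake_model(prompt):
    # Stand-in for the production model endpoint.
    return "Refunds are accepted within 30 days, per policy."

print(run_eval_healthcheck(fake_model))  # 1.0 when all evals pass
```

Alerting on this pass rate, rather than on a plain HTTP 200, is what distinguishes it from a traditional health check.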

Caching and Feedback Services: Centralized caching reduces costs, while feedback services go beyond simple thumbs up/down to include inline editing of responses, providing richer training signals. These are expensive to build and should be centrally managed.
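
An exact-match prompt cache is the simplest version of the caching idea. The sketch below uses a stubbed model call to show that repeated prompts skip the expensive request; production systems would add expiry and often semantic (similarity-based) matching.

```python
import hashlib

class PromptCache:
    """Exact-match cache keyed on a hash of (model, prompt) to cut repeat cost."""
    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, model, prompt):
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get_or_compute(self, model, prompt, compute):
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = compute(prompt)
        return self._store[key]

cache = PromptCache()
calls = []

def expensive_llm_call(prompt):
    calls.append(prompt)  # track how often the model is actually hit
    return f"answer to: {prompt}"

cache.get_or_compute("model-a", "hello", expensive_llm_call)
cache.get_or_compute("model-a", "hello", expensive_llm_call)
print(len(calls), cache.hits)  # model called once, second lookup was a hit
```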

Debois compares this emerging ecosystem to the Kubernetes ecosystem landscape, predicting rapid expansion in the number of solutions and vendors.

Enablement and Developer Experience

Infrastructure alone isn’t sufficient—teams need enablement to use it effectively. The enablement function includes:

Prototyping Tools: Making experimentation easy, including for product owners who want to explore use cases. Simple tools get people excited and help identify the right applications.

Frameworks: Tools like LangChain, or their Microsoft-stack equivalents for teams with that preference, help teams learn how things work. Debois notes he learned a lot through frameworks, though he also observes that many teams get “bitten” by frameworks that change too fast and retreat to lower-level APIs. He expresses hope that the ecosystem will mature rather than everyone doing DIY implementations.

Local Development Environments: Engineers value being able to work locally, especially when traveling. With quality models increasingly runnable on laptops, this supports faster iteration.

Education and Documentation: Frameworks serve as learning tools as much as production tools.

Common Pitfalls

Debois identifies several anti-patterns he's observed, foremost among them the difficulty of testing and evaluating LLM outputs.

Testing and Evaluation Challenges

One of the most honest and valuable sections of the talk addresses the challenges of testing LLM outputs. Debois describes the typical developer reaction: “How do I write tests for this?”

He outlines a spectrum of testing approaches and acknowledges this isn't a solved problem, but notes “it's the best thing we got.”

A specific pain point Debois experienced: over two years, his organization went through eight different LLM models. Each model change required reworking all application prompts—and this was done manually without guarantee it would work. This underscores the need for evaluation frameworks before undertaking such refactoring.
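
The lesson generalizes to a simple regression harness: score the same eval cases against the current and the candidate model before touching any prompts, so behavioral regressions surface up front rather than after a manual rewrite. The cases and model stubs below are purely illustrative.

```python
# Hypothetical eval cases: (prompt, substring expected in the answer).
EVAL_CASES = [
    ("Classify: 'great product'", "positive"),
    ("Classify: 'terrible support'", "negative"),
]

def score(model_fn):
    """Fraction of eval cases a model gets right."""
    hits = sum(1 for prompt, expected in EVAL_CASES
               if expected in model_fn(prompt).lower())
    return hits / len(EVAL_CASES)

def old_model(prompt):
    return "positive" if "great" in prompt else "negative"

def candidate_model(prompt):
    # A new model may behave differently on the exact same prompts.
    return "positive"

old_score, new_score = score(old_model), score(candidate_model)
print(old_score, new_score)
if new_score < old_score:
    print("regression: rework prompts before switching models")
```

With eight model swaps in two years, running something like this per swap replaces "change the prompts and hope" with a measurable gate.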

The Ironies of Automation

Debois references “Ironies of Automation,” Lisanne Bainbridge's classic human-factors paper that became influential in DevOps circles, and notes there's now a corresponding paper applying the same ironies to generative AI. The core insight is that automation shifts the human role from producing work to managing and reviewing it.

He shares data showing that while coding copilots increased code output, review times went up and PR sizes increased. The efficiency gains may be partially offset by increased review burden. Additionally, as humans produce less and review more, they risk losing situational awareness and domain expertise, potentially leading to uncritical acceptance of AI suggestions. He mentions anecdotal evidence of higher AI suggestion acceptance rates on weekends “because developers are like whatever.”

This connects to broader DevOps lessons: the 20% automation promise spawned an entire industry focused on preparing for failure—CI/CD, monitoring, observability, chaos engineering. The same evolution may happen with GenAI.

Governance and Security

Governance here spans both central policy-setting and team-level controls.

On guardrails specifically, an audience question revealed a nuanced approach: central governance teams set generic rules, while individual teams layer use-case-specific rules on top in a self-service model.
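
That layering can be sketched as rule composition: central rules always run, and each team appends its own on top. The rule names and string checks below are illustrative, not a real guardrails library.

```python
# Central governance team ships generic rules that apply to every team.
GENERIC_RULES = [
    ("no_pii", lambda text: "ssn" not in text.lower()),
]

def check(text, team_rules):
    """Central rules always apply; team rules layer on top, self-service."""
    for name, rule in GENERIC_RULES + team_rules:
        if not rule(text):
            return (False, name)   # blocked, with the rule that fired
    return (True, None)

# A feature team adds its own use-case-specific rule.
support_team_rules = [
    ("no_refund_promises", lambda text: "guaranteed refund" not in text.lower()),
]

print(check("Your SSN is required", support_team_rules))
print(check("We offer a guaranteed refund", support_team_rules))
print(check("Happy to help with your order", support_team_rules))
```

The central team owns `GENERIC_RULES`; each feature team owns only its own list, which is what makes the model self-service.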

Organizational Structure Recommendations

Debois proposes consolidating related platform functions, such as SecOps and CloudOps alongside the AI platform itself, into a single organization.

He references the “Unfix Model” and its concept of an “Experience Crew” that ensures AI looks consistent across products by working with feature teams on UX.

This consolidated structure enables cross-collaboration: SecOps helps with access control and governance, CloudOps knows the cloud vendors and how to provision resources. He cautions against premature optimization for small companies—this pattern makes sense when scaling to 10+ teams.

Critical Assessment

While Debois provides valuable practical insights, some caveats are warranted: he speaks from a single practitioner's vantage point, and by his own admission he is mapping established paradigms onto a fast-moving field that may yet diverge from them.

Nonetheless, the presentation offers a thoughtful, practitioner-grounded perspective on LLMOps that acknowledges uncertainty while providing actionable frameworks for enterprise adoption.
