Company
Various
Title
Blueprint for Scalable and Reliable Enterprise LLM Systems
Industry
Tech
Year
2023
Summary (short)
A panel discussion featuring leaders from Bank of America, NVIDIA, Microsoft, and IBM discussing best practices for deploying and scaling LLM systems in enterprise environments. The discussion covers key aspects of LLMOps including business alignment, production deployment, data management, monitoring, and responsible AI considerations. The panelists share insights on the evolution from traditional ML deployments to LLM systems, highlighting unique challenges around testing, governance, and the increasing importance of retrieval and agent-based architectures.
## Overview

This panel discussion, held at an AI quality conference in San Francisco, brought together enterprise AI leaders from four major organizations: Bank of America, Nvidia, Microsoft, and IBM. The discussion centered on creating a "blueprint for scalable and reliable Enterprise AI/ML systems," covering the full lifecycle from strategy alignment through production deployment and monitoring. The panelists shared practical insights from their experiences deploying both traditional ML models and generative AI systems at scale within large enterprises.

The panelists included Hera Dongo from Bank of America (leading AI/ML and automation platform development), Rama Akaru from Nvidia (leading AI and automation under the CIO, focusing on enterprise transformation), Natin Agal from Microsoft, previously at Google (leading the GenAI team for marketing, building copilot systems for content creation), and Steven Eluk from IBM (seven years building corporate data and AI services, governance, and platforms for 300,000+ employees across 150+ countries).

## Aligning AI Initiatives with Business Objectives

Rama from Nvidia emphasized that AI technology must always be "a means to an end," with clear business metrics established before any project begins. She provided concrete examples from IT operations transformation, where baseline metrics include mean time to detect incidents, mean time to resolve, mean time to diagnose, and business impact measurements. As AI capabilities like incident summarization, better detection, chatbots for SRE productivity, alert noise reduction, and anomaly detection are deployed, teams continuously measure progress against these baselines.

Another example from supply chain demonstrated the same principle: if planners take 3-5 hours for what-if scenario analysis, GPU-optimized planning can reduce this to 3-5 minutes, providing a clear productivity metric. This iterative measurement approach helps teams avoid getting "carried away with technology": sometimes a one-click solution is better than a multi-turn conversational interface, and automated (zero-click) solutions may be optimal for certain use cases.

## Data Management and Governance Challenges

Steven from IBM stressed the importance of consistent data practices and standards across large organizations. He noted that while everyone has justifications for siloed data (sometimes rightfully so, such as customer data restrictions), consistent data standards enable cross-organizational search and integration. Key considerations include privacy-related data standards, regional and sovereignty constraints, and the sustainability of data decisions, avoiding choices today that will be hard to implement tomorrow.

Rama added that generative AI introduces entirely new data-related risks. She described a phenomenon where enterprise documents that were "not properly protected" become suddenly exposed when powerful LLMs and vector databases make them searchable and findable. This has led Nvidia to apply generative AI technology itself to classify sensitive documents and automatically remediate improper access controls before ingesting data into RAG systems. This represents a new category of risk unique to the generative AI era, requiring new guardrails and controls around data sensitivity.
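Nvidia's actual tooling was not described in detail, but the pattern (classify documents for sensitivity and hold back over-exposed ones before they become searchable) can be sketched in a few lines. Everything below, including the `classify_sensitivity` stub, the label set, and the group names, is an illustrative assumption rather than Nvidia's pipeline.

```python
from dataclasses import dataclass

# Hypothetical labels and group names; a real deployment would use the
# organization's own classification taxonomy and directory groups.
SENSITIVE_LABELS = {"confidential", "restricted"}
BROAD_ACCESS_GROUPS = {"all-employees", "everyone"}

@dataclass
class Document:
    doc_id: str
    text: str
    acl: set  # groups currently allowed to read the document

def classify_sensitivity(text: str) -> str:
    """Stand-in for an LLM-backed classifier. A trivial keyword check here;
    in practice this would call a classification prompt or endpoint."""
    keywords = ("salary", "ssn", "merger", "confidential")
    return "confidential" if any(k in text.lower() for k in keywords) else "internal"

def safe_to_ingest(doc: Document) -> bool:
    """Gate documents before RAG indexing: block sensitive documents whose
    access controls are broader than intended and route them to remediation
    instead of making them searchable through a chatbot."""
    label = classify_sensitivity(doc.text)
    overexposed = doc.acl & BROAD_ACCESS_GROUPS
    if label in SENSITIVE_LABELS and overexposed:
        print(f"Hold {doc.doc_id}: {label} content readable by {overexposed}")
        return False
    return True

doc = Document("hr-0042", "Confidential: 2024 salary bands", {"all-employees"})
print(safe_to_ingest(doc))  # False -> fix ACLs before ingestion
```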
## LLMOps vs Traditional MLOps

Natin from Microsoft provided one of the most technically nuanced perspectives on how LLMOps differs from traditional MLOps. He argued that fundamental concepts have changed with LLMs. The traditional CI/CD model doesn't apply the same way: there is no traditional "model retraining" step; instead, organizations choose between RAG-based implementations and fine-tuning approaches. The classic workflow of training, comparing metrics, model management, and versioning has been fundamentally altered.

Traditional metrics like precision, recall, F1 score, and accuracy are being replaced (or supplemented) by metrics like BLEU and ROUGE for evaluating generated text. System-level metrics remain critical: latency and throughput are still essential, but expectations have increased dramatically. Users who once waited minutes for AI model responses now expect real-time responses as they type, and system availability requirements may be moving from "five nines" to "ten nines."

Natin cited a Stanford HAI study showing that LLM systems hallucinate more on legal documents than on retail or marketing data, highlighting how the lack of domain-specific training data affects different verticals differently. This raises fundamental questions about how continuous integration and continuous delivery should work when feedback mechanisms and ground truth are unclear.

## Production Deployment Challenges and Timelines

The panelists discussed the rigorous testing and deployment processes required for enterprise AI. Steven emphasized the importance of understanding use case implications, specifically what happens when models get things wrong. False positives and negatives propagate through systems, making continuous monitoring critical. He recommended:

- Having an outside group (not the model creators) define evaluation criteria, since creators tend to showcase what models do well rather than test them comprehensively
- Building automation and tooling to validate that models continue performing as expected in production
- Using consistent platform approaches across the organization so improvements benefit all projects, not just individual ones
- Implementing audit committees and red teaming practices

Rama provided concrete deployment timelines from Nvidia's experience. While building the first version of a chatbot might take 6-8 weeks, the path to production is significantly longer: potentially 3+ months even for straightforward cases. For chatbots dealing with sensitive data, deployment was delayed because sensitive documents had improper access controls. The bot had to remain in "Early Access" testing while the organization fixed access control problems across potentially hundreds of thousands of documents, a process with no quick solution.

The key difference between traditional ML and generative AI in testing lies in ground truth evaluation. Traditional models have clear true/false positive/negative measurements with representative test datasets and accuracy thresholds. Generative AI chatbots produce natural language responses that are often subjective. While LLM-as-judge approaches exist, human-in-the-loop evaluation remains necessary for many use cases, significantly extending deployment timelines.
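To make the shift in metrics and the LLM-as-judge plus human-review pattern concrete, here is a minimal evaluation sketch. It assumes the open-source `rouge-score` package for ROUGE-L; the `call_judge_llm` placeholder, rubric, and thresholds are illustrative assumptions, not anything the panelists prescribed.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

ROUGE = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def call_judge_llm(question: str, answer: str, reference: str) -> float:
    """Placeholder for an LLM-as-judge call that scores an answer 0.0-1.0
    against a rubric (faithfulness, relevance). Returns a fixed value here."""
    return 0.8

def evaluate_response(question: str, answer: str, reference: str,
                      rouge_threshold: float = 0.3,
                      judge_threshold: float = 0.7) -> dict:
    """Combine a lexical ROUGE-L check with an LLM-as-judge score, and flag
    low-confidence cases for human review rather than auto-approving them."""
    rouge_l = ROUGE.score(reference, answer)["rougeL"].fmeasure
    judge = call_judge_llm(question, answer, reference)
    return {
        "rouge_l": round(rouge_l, 3),
        "judge_score": judge,
        "needs_human_review": rouge_l < rouge_threshold or judge < judge_threshold,
    }

print(evaluate_response(
    question="What is our travel expense policy?",
    answer="Employees may book economy flights and claim meals up to a daily cap.",
    reference="Travel policy: economy class flights; meal reimbursement capped per day.",
))
```

The `needs_human_review` flag is where the extended timelines the panel described come from: anything the automated checks cannot confidently score still goes to a person.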
## Responsible AI and Governance Monitoring

Natin addressed the challenges of quantifying responsible AI metrics for generative AI systems. He described working with human labelers who would give "thumbs up" or "thumbs down" feedback but, when asked why, could only articulate subjective preferences like "I like it" or "this is more relatable." Translating such subjective assessments into quantifiable signals that models can learn from remains an unsolved challenge. Current governance approaches flag issues like toxicity, unfriendliness, and adult content, but comprehensive measurement remains difficult.

Natin pointed out that major systems from Google (Gemini), OpenAI (ChatGPT), Meta (LLaMA), and Microsoft (Sydney) have all failed in various ways despite presumably rigorous testing processes, indicating that the fundamental challenge is the subjective nature of the underlying technology. Multiple frameworks and tools are emerging, including Nexus Flow, LangGraph, and others building evaluation metrics, but no "Holy Grail" or universally applicable framework currently exists. The panelists expressed hope that stronger, more robust, and more quantified responsible AI metrics will emerge in the future.

Steven added an important sustainability consideration: even if a perfect evaluation framework existed today, it would need to change tomorrow. The criteria, evaluations, RAG data, and information feeding into models constantly evolve. Organizations should factor the frequency of required model updates into their ROI calculations to understand true operational costs.

## Emerging Trends: Agentic Workflows

The panel concluded with observations about emerging trends. Rama described an evolution in the generative AI space: initial excitement focused on LLMs alone, then the industry realized that for chatbots and similar applications, retrieval accuracy is the most critical factor. Now the field is moving toward agentic workflows, combining "good old software engineering" with agents, orchestration, retrieval, and LLMs working together as a complete system.

Natin characterized LLMs and generative AI as orchestrators and automation tools that can make AI accessible to non-technical users, not just data scientists and engineers. Steven emphasized the agentic view as important, with agents performing real tasks within companies. He noted that the ability to access and understand data has never been greater, though he expressed hope that original human thought isn't lost in the process.
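The panel described agentic workflows only at the architectural level. As a rough sketch of the pattern (retrieval for context, an LLM deciding on actions, tools executing them in a loop), consider the following; the `search_docs`, `call_llm`, and `create_ticket` functions are stand-ins invented for illustration, not any panelist's actual stack.

```python
import json

def search_docs(query: str) -> list:
    """Stand-in for a vector-store retrieval call."""
    return [f"(retrieved passage relevant to: {query})"]

def call_llm(prompt: str) -> str:
    """Stand-in for an LLM completion call; returns a JSON 'action' so the
    loop can be exercised without a real model."""
    return json.dumps({"action": "final_answer", "input": "example answer"})

def create_ticket(summary: str) -> str:
    """Example tool: open an IT ticket and return its id."""
    return f"TICKET-001 created for: {summary}"

TOOLS = {"create_ticket": create_ticket}

def run_agent(user_request: str, max_steps: int = 5) -> str:
    """Minimal agentic loop: retrieve context, ask the LLM to pick an action,
    execute tools, and feed results back until a final answer is produced."""
    context = "\n".join(search_docs(user_request))
    transcript = f"Context:\n{context}\n\nRequest: {user_request}"
    for _ in range(max_steps):
        decision = json.loads(call_llm(transcript))
        if decision["action"] == "final_answer":
            return decision["input"]
        tool = TOOLS.get(decision["action"])
        result = tool(decision["input"]) if tool else "unknown tool"
        transcript += f"\n{decision['action']} -> {result}"
    return "Stopped: step limit reached."

print(run_agent("Summarize today's open incidents"))
```

The "good old software engineering" the panel mentioned lives in the loop itself: step limits, tool registries, and error handling around the model calls.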
## Key Takeaways for LLMOps Practitioners

The discussion surfaced several practical insights for organizations deploying LLMs in production.

The path from proof-of-concept to production is significantly longer for generative AI than for traditional ML, primarily due to the subjective nature of natural language evaluation and the need for human-in-the-loop testing. Organizations should plan for 3+ month deployment timelines even for relatively simple use cases, with additional time needed when sensitive data is involved.

Data governance becomes more critical, and more complex, with generative AI. Traditional access control gaps become exploitable when powerful search and summarization capabilities make previously obscure documents findable. Organizations should consider using AI itself to identify and remediate these gaps before deploying RAG-based systems.

Consistent platforms, standards, and governance frameworks across the organization are essential for scale. Siloed approaches don't scale, and improvements made to shared platforms benefit all projects rather than individual ones.

The fundamental concepts of CI/CD and model retraining are changing with LLMs. Organizations need to adapt their MLOps practices for RAG and fine-tuning paradigms rather than traditional retraining workflows.

Finally, sustainability must be considered: models and their evaluation criteria will need continuous updating as data, requirements, and the technology landscape evolve. This ongoing maintenance cost should be factored into ROI calculations from the beginning.
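The panelists gave no formula, but the point about recurring maintenance can be made concrete with a back-of-the-envelope calculation; every number below is invented purely for illustration.

```python
# Illustrative-only figures: annual benefit, one-time build cost, and a
# recurring cost for each re-evaluation / model refresh cycle.
annual_benefit = 500_000          # e.g. productivity hours saved, in dollars
build_cost = 250_000              # initial development and deployment
refresh_cost_per_cycle = 40_000   # re-evaluation, data refresh, re-testing
refreshes_per_year = 4            # how often models/eval criteria must be updated

annual_maintenance = refresh_cost_per_cycle * refreshes_per_year

# Naive ROI ignores maintenance; a sustainable ROI factors in the update cadence.
naive_first_year_roi = (annual_benefit - build_cost) / build_cost
sustainable_first_year_roi = (
    annual_benefit - build_cost - annual_maintenance
) / (build_cost + annual_maintenance)

print(f"Naive first-year ROI:       {naive_first_year_roi:.0%}")
print(f"Sustainable first-year ROI: {sustainable_first_year_roi:.0%}")
```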
