Company
IBM
Title
Enterprise LLMOps Platform with Focus on Model Customization and API Optimization
Industry
Tech
Year
2025
Summary (short)
IBM's WatsonX platform addresses enterprise LLMOps challenges by providing a comprehensive solution for model access, deployment, and customization. The platform offers both open-source and proprietary models, focusing on specialized use cases like banking and insurance, while emphasizing API optimization for LLM interactions and robust evaluation capabilities. The case study highlights how enterprises are implementing LLMOps at scale with particular attention to data security, model evaluation, and efficient API design for LLM consumption.
## Overview

This podcast episode from "Agents at Work" features a conversation between JY (the host) and Roy Derks, who works on developer experience for IBM's WatsonX platform. The discussion provides valuable insights into how large enterprises are approaching AI agent deployment, the challenges they face in production, and the evolving best practices in the LLMOps space. Roy joined IBM about two years ago when his previous startup (StepZen, a GraphQL-as-a-service company) was acquired, bringing a startup perspective to enterprise AI development.

## IBM WatsonX Platform

WatsonX is IBM's platform for building AI applications, serving a wide range of customers from self-service developers to Fortune 500 enterprises and government agencies. The platform offers several key capabilities relevant to LLMOps:

- **Model Access**: APIs to access both open-source models (like Llama and Mistral from Hugging Face) and IBM's own Granite model series
- **Agent Deployment**: Support for deploying agents built with popular open-source frameworks including LangChain, CrewAI, and LlamaIndex, with Autogen support planned
- **Model Customization and Fine-tuning**: Helping enterprises adapt models to their specific use cases rather than relying on general-purpose models trained on irrelevant public data
- **RAG and Data Integration**: Tools for working with enterprise data sources, though Roy notes the terminology has evolved beyond "big data"

## The Granite Model Series

IBM develops its own open-source models called Granite, currently at version 3.2 or 3.3. These are small models: the largest is around 8 billion parameters, and most are around 3 billion. The series includes reasoning models, vision models, and general-purpose models, as well as models customized for specific verticals like banking and insurance.

The strategic positioning of smaller models makes sense for several LLMOps considerations:

- **Cost Efficiency**: Smaller models are cheaper to run at scale
- **Routing Architectures**: Multi-agent systems can use smaller specialized models for specific tasks while larger models handle planning and orchestration
- **Edge Deployment**: Potential for mobile and edge device deployment as hardware improves
- **Scale Economics**: When processing millions of requests daily (like invoice extraction for large enterprises), even small efficiency gains compound significantly

Roy candidly notes that he personally mostly uses larger models, but sees the value proposition of smaller models for specific production use cases. The Granite models are available open-source on Hugging Face, which helps with adoption and trust-building even if IBM would prefer customers use their paid APIs.
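Since the Granite checkpoints are published openly on Hugging Face, a quick way to try one locally is through the standard `transformers` API. The sketch below is illustrative only: the exact model ID should be confirmed on the ibm-granite Hugging Face organization, and the prompt is a made-up example of the kind of vertical task (insurance documents) mentioned in the episode.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model ID is an assumption; check the ibm-granite org on Hugging Face
# for the current Granite 3.x instruct releases.
model_id = "ibm-granite/granite-3.3-8b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" places the weights on available hardware (needs accelerate).
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Illustrative prompt for an insurance-style document question.
prompt = "List the exclusions mentioned in the following policy excerpt:\n..."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```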
## Types of Enterprise Customers

Roy identifies two distinct customer segments with different LLMOps needs:

**Group 1 - Exploratory Stage**: Companies interested in building agents but unsure where to start. IBM serves these with low-code or no-code tools to build initial agents. This represents the education and enablement phase.

**Group 2 - Production-Ready**: Companies already building agents, testing different vendors and frameworks, and creating internal POCs. These customers have more sophisticated needs, including:

- Reliable hosting and scaling for agents
- Observability and tracing capabilities
- Continuous evaluation to ensure models perform as expected, especially when model versions change
- Cost optimization strategies

The second group presents more interesting LLMOps challenges because they've moved beyond experimentation and face real production concerns.

## Agent Architecture Patterns

The discussion reveals several patterns in how enterprises are building agents:

**Single-Purpose vs. Multi-Agent**: Most current deployments on WatsonX are single-purpose agents focused on specific tasks like ArXiv paper research, market research, or data analysis. IBM is exploring multi-agent workflows through other products.

**Tool Design Philosophy**: There's an interesting discussion about tool granularity. Some developers create individual tools for each API endpoint (create repository, delete repository, etc.), while others are finding success with higher-level "agent tools" that accept natural language instructions. The latter approach uses an agent that knows how to interact with GitHub rather than exposing 200 individual endpoints.

**Model Dependency**: The choice of tool architecture heavily depends on available models. Larger models can compose arbitrary tool calls effectively, while smaller models may need more deterministic workflows baked into the tools themselves. Roy notes that "the layer of making LLMs more deterministic is probably happening inside your tools."

## The API Problem for LLMs

A significant insight from the discussion is that APIs designed for web applications are often poorly suited for LLM consumption:

**The Core Issue**: Traditional REST APIs return fixed data structures with all fields, including UI-specific elements (dropdown visibility, alert states, aggregate fields). LLMs don't need this information, and paying for tokens to process irrelevant data becomes expensive at scale.

**The Solution - Dynamic APIs**: Roy advocates for GraphQL as a solution, allowing LLMs to request only the specific fields they need. The comparison to SQL is apt: you give the model a schema, and it generates the precise query needed rather than receiving a dump of all available data.

**Precision in Retrieval**: The key insight is "recall and precision for APIs" - models need exactly the information required to answer questions, nothing more and nothing less. This reduces token costs and prevents context window pollution.
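To make the contrast concrete, here is a minimal Python sketch of the dynamic-API idea, assuming a hypothetical GraphQL endpoint and schema (the URL, the `customer` type, and the selected fields are invented for illustration and are not part of WatsonX). Rather than pulling a full customer record and discarding most of it, the caller asks for exactly the fields needed to answer the question.

```python
import requests

# Hypothetical GraphQL endpoint, used only for illustration.
GRAPHQL_URL = "https://example.com/graphql"

# A REST endpoint would typically return the whole customer object,
# including UI flags and aggregates the model never needs. With GraphQL,
# the query names only the fields required to answer the question.
query = """
query CustomerPolicy($id: ID!) {
  customer(id: $id) {
    name
    policy {
      renewalDate
    }
  }
}
"""

response = requests.post(
    GRAPHQL_URL,
    json={"query": query, "variables": {"id": "42"}},
    timeout=30,
)
response.raise_for_status()

# Only the requested fields come back, which keeps the tokens fed into
# the model's context window to a minimum.
print(response.json()["data"])
```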
## MCP (Model Context Protocol)

The conversation extensively covers MCP, Anthropic's protocol for connecting LLMs to external tools and data:

**Adoption Trajectory**: Roy notes that MCP started slowly after its late 2024 release but "exploded" around the AI Engineer conference in early 2025. The adoption is heavily driven by coding agents (Cursor, Bolt, Replit, Lovable), which have effectively become distribution channels for developer tools.

**Value Proposition**: The main value for companies is in exposing MCP servers based on their data, not in building clients. Clients are becoming commoditized, but data access remains the differentiator.

**Technical Evolution**: There's ongoing development around transport mechanisms - from stdio to SSE to the newly proposed "streamable HTTP" (which Roy wryly notes seems like regular HTTP with optional SSE). Remote MCP support is particularly important for enterprise adoption.

**GraphQL + MCP Synergy**: Roy sees these as complementary - MCP provides the connection protocol while GraphQL provides the dynamic query capability that makes tool calling more efficient.
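As a rough sketch of the "expose an MCP server over your data" idea, and of the MCP-plus-GraphQL pairing Roy describes, the example below uses the FastMCP helper from the open-source `mcp` Python SDK. The server name, the single tool, and the GraphQL endpoint it wraps are assumptions made for illustration rather than anything IBM or the episode prescribes.

```python
import requests
from mcp.server.fastmcp import FastMCP

# Hypothetical internal GraphQL endpoint the server wraps.
GRAPHQL_URL = "https://example.com/graphql"

# Server name and tool are illustrative.
mcp = FastMCP("enterprise-data")

@mcp.tool()
def run_graphql_query(query: str) -> dict:
    """Execute a GraphQL query against the internal data endpoint and
    return only the fields the caller asked for."""
    response = requests.post(GRAPHQL_URL, json={"query": query}, timeout=30)
    response.raise_for_status()
    return response.json()["data"]

if __name__ == "__main__":
    # stdio works for local clients; a remote deployment would use one of
    # the HTTP-based transports discussed above.
    mcp.run(transport="stdio")
```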
## Evaluation and Observability

Continuous evaluation emerges as a critical LLMOps concern:

**Model Version Changes**: When Granite (or any model) releases a new version, all system prompts and optimizations may break. Enterprises need reliable evaluation pipelines to detect regressions.

**The Evaluation Paradox**: Roy points out an interesting dynamic - better evaluation tools make it easier to switch models and reduce vendor lock-in, which is good for builders but challenging for model providers. This suggests pressure toward more specialized, fine-tuned models that perform uniquely well on specific evaluation suites.

**Observability Stack**: Tracing and monitoring are critical for understanding agent behavior in production, especially given the non-deterministic nature of LLM-based systems.

## Enterprise Adoption Dynamics

The discussion touches on the organizational challenges of AI adoption in large enterprises:

**Legal vs. Developer Tension**: Initial responses to AI typically split between excited developers wanting to build and legal teams advising caution until compliance is understood. The outcome depends on organizational culture and who "wins" this initial battle.

**Data Privacy Concerns**: Most enterprises are reluctant to use public model APIs for internal data, despite provider assurances about data usage. Roy expresses healthy skepticism: "if history taught us one thing, they might say they won't use your data for something but in the end they might very well be using your data for something."

**Model Customization over RAG**: For enterprises with specialized data (like insurance policies varying by country, package, and customer tenure), fine-tuning models once is more cost-effective than running RAG for every query.

## Production Use Cases

The discussion reveals common patterns in enterprise agent deployment:

**Internal Focus**: Most production agents are for internal processes rather than customer-facing applications. This includes:

- Connecting to internal knowledge bases (SharePoint, wikis, Dropbox-style systems)
- Reviewing and analyzing historical documentation
- Speeding up development with coding agents
- HR system integration (with careful guardrails against leaking sensitive data)

**Customer Service**: External agents are primarily in customer service, where modern LLM-based chatbots significantly outperform the rule-based systems from 2016-2017.

**Guardrails are Essential**: Any agent connecting to systems with sensitive data (HR, customer records) requires guardrails to prevent data leakage.

## Best Practices and Recommendations

Roy offers several practical recommendations for LLMOps practitioners:

**Treat LLMs as Team Members**: Use proper software engineering practices with LLM-generated code - version control, testing, code review. Don't just copy-paste model outputs.

**Design APIs for LLMs**: Build dynamic data retrieval mechanisms (GraphQL, generated SQL) rather than force-fitting web APIs into LLM workflows.

**Plan for Model Changes**: Build evaluation pipelines that can detect when model updates break existing functionality.

**Start with Single-Purpose Agents**: Complex multi-agent orchestration adds significant operational overhead; single-purpose agents are easier to deploy and maintain.

## Future Outlook

Looking ahead 6-9 months, Roy expresses excitement about:

- MCP maturation with robust remote server support
- Models with better built-in tool calling capabilities
- Potential for models to execute code artifacts directly
- More developers adopting proper software engineering practices with AI tools

The conversation provides a grounded perspective on enterprise LLMOps, balancing optimism about AI capabilities with realistic assessments of operational challenges in production environments.
