Company: Barclays
Title: MLOps Evolution and LLM Integration at a Major Bank
Industry: Finance
Year: 2024

Summary: Discussion of MLOps practices and the evolution towards LLM integration at Barclays, focusing on the transition from traditional ML to GenAI workflows while maintaining production stability. The case study highlights the importance of balancing innovation with regulatory requirements in financial services, emphasizing ROI-driven development and the creation of reusable infrastructure components.
## Overview

This case study emerges from a podcast conversation with Andy McMahon, Principal AI and MLOps Engineer at Barclays, one of the UK's largest banks. The discussion provides valuable insight into how a major financial institution approaches MLOps and the emerging field of LLMOps within a heavily regulated environment. McMahon, author of "Machine Learning Engineering with Python" (now part of Oxford's AI curriculum), brings a practitioner's perspective on scaling ML and AI operations in an enterprise context.

## The MLOps Philosophy at Barclays

McMahon offers a definition of MLOps that has gained traction in the community: the practice of going from "n to n+1 models" in production. This framing, borrowed from mathematical proof by induction, emphasizes that getting a single model into production is merely the beginning. The real challenge, and where MLOps truly becomes valuable, is when organizations need to systematically replicate that success across multiple models, products, and services.

The conversation emphasizes that MLOps is, at its core, the software development lifecycle with additional components specific to machine learning. Organizations still go through requirements gathering, sprints, development, testing, and deployment, but with extra considerations around data provenance, model validation, and monitoring. McMahon cautions against the tendency to reinvent the wheel when new technologies emerge, noting that many organizations raced to build entirely new processes for ML only to discover that existing DevOps practices could be leveraged with targeted additions.

## Adapting to Generative AI

A central theme of the discussion is how traditional ML infrastructure and practices translate to the generative AI era. McMahon advocates viewing LLMOps as an extension of MLOps rather than an entirely separate discipline. He uses the term "LLMOps" out of necessity but maintains that MLOps as a concept encompasses all of these practices, with infrastructure being just one component among many.

The key difference with generative AI workloads is that organizations typically don't own or train the underlying models; they consume them as commodities or services. This fundamentally shifts the focus of the ML lifecycle. Traditional MLOps was heavily focused on training workflows: experiment tracking, hyperparameter optimization, model validation across training runs, and reproducibility. With LLMs, the emphasis shifts to building retrieval-augmented generation (RAG) pipelines, chunking and indexing data in vector databases, and crafting effective interaction patterns with the models.

McMahon envisions an interesting future challenge when organizations want to build truly hybrid solutions that combine traditional ML with LLM capabilities. For example, agentic workflows might use an LLM as the orchestration backbone while calling proprietary models trained in-house for specific analytical tasks. This raises important questions about exposing internal models as services that can be consumed across the organization, and about validating these complex, multi-model workflows.

## The Ecosystem Approach to Platform Building

Rather than thinking in terms of rigid, monolithic platforms, McMahon advocates building an ecosystem of capabilities that teams can tap into. The metaphor is tools on a shelf: various capabilities that can be selected and combined based on specific needs. This approach recognizes that use cases vary widely and that forcing everything through a single platform architecture may not serve all needs equally well.

The ecosystem includes traditional ML tools alongside new GenAI components. Vector databases (pgvector, Chroma, and similar tools) are now part of the standard toolkit, even though they weren't commonly used by data science teams before the RAG era. Different cloud providers offer their own model endpoints, agent frameworks, guardrails, and supporting infrastructure. The key decision point is how far up the abstraction ladder an organization wants to go: from consuming SaaS products like Microsoft 365 Copilot at the highest level, down to compiling Llama.cpp for bare-metal deployments at the lowest.

McMahon emphasizes a stratified strategy where different tiers of applications exist simultaneously. At one level, LLMs are baked into existing vendor offerings: copilots appearing in email clients, word processors, and IDEs. At another level, low-code/no-code tools allow business users to build their own simple applications. At the deepest level, ML engineers and AI engineers maintain full control over bespoke implementations. All of these approaches are valid; the key is being clear about which level is appropriate for each use case.
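To make the RAG-centric part of this toolkit concrete, the following is a minimal sketch of the chunk, index, and retrieve workflow described above, using the Chroma Python client (one of the vector databases mentioned in the discussion). It is illustrative only: the document contents, chunk sizes, and collection name are assumptions, not details from the Barclays conversation.

```python
# Minimal RAG-style indexing and retrieval sketch using Chroma's Python client.
# Illustrative only: documents, chunking parameters, and names are assumed.
import chromadb


def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping character-based chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


client = chromadb.Client()  # in-memory; a persistent or hosted instance in practice
collection = client.get_or_create_collection(name="internal_docs")

# Index some (hypothetical) internal documents; Chroma embeds them with its
# default embedding function unless one is supplied explicitly.
documents = {
    "onboarding_policy": "New customer accounts require identity verification ...",
    "fraud_playbook": "Unusual transaction patterns should be escalated to ...",
}
for doc_id, text in documents.items():
    pieces = chunk(text)
    collection.add(
        ids=[f"{doc_id}-{i}" for i in range(len(pieces))],
        documents=pieces,
        metadatas=[{"source": doc_id}] * len(pieces),
    )

# Retrieve the most relevant chunks for a question; in a full pipeline these
# would be passed to the LLM as grounding context for the generation step.
results = collection.query(query_texts=["What checks apply to new accounts?"], n_results=2)
print(results["documents"][0])
```

The same pattern applies whichever vector store sits on the shelf; only the client calls change, which is part of why the workflow matters more than the specific tool.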
## Evaluation and Monitoring Challenges

The conversation delves into the significant challenges of evaluating and monitoring LLM-based systems. In traditional supervised ML, ground truth exists and can eventually be obtained, enabling performance metrics to be calculated either on a schedule or in an event-driven manner. This made monitoring relatively straightforward: comparing predictions against actual outcomes.

With generative AI applications, ground truth becomes a slippery concept. If an organization is providing a chatbot experience, there's no such thing as a single correct response. Some aspects can be measured (retrieval precision for RAG systems, or standard NLP metrics like BLEU and ROUGE), but comprehensive evaluation requires different approaches. Human evaluation remains important but raises scalability concerns. LLM-as-a-judge approaches, using LLMs to evaluate other LLMs, have emerged as a practical solution, with specialized models like Llama Guard trained to detect toxicity or policy violations.

McMahon emphasizes the importance of proxy metrics that align with business outcomes rather than vanity metrics. A compelling example is containment rate for chatbots: whether, and how quickly, users end up demanding to speak to a human. Combined with sentiment analysis, this provides meaningful insight into chatbot performance. The warning is against optimizing for metrics like time-to-first-token while ignoring whether the actual content is useful. An optimization from 0.1 seconds to 0.09 seconds means little if users abandon the conversation after three exchanges because the responses are unhelpful.
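As an illustration of the proxy-metric idea, here is a small sketch of how a containment-style metric could be computed from chat session logs. The session schema, escalation phrases, and example data are invented for the example and are not Barclays' actual monitoring setup.

```python
# Sketch of a containment-style proxy metric: the share of chatbot sessions
# resolved without the user asking for a human, and how quickly escalations
# happen. All field names and phrases are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Session:
    turns: list[str]                     # user messages in order
    escalated: bool = False              # did the user end up asking for a human?
    escalation_turn: int | None = None   # 1-based turn index of the escalation


ESCALATION_PHRASES = ("speak to a human", "talk to an agent", "real person")


def tag_escalation(session: Session) -> Session:
    """Mark the first turn where the user asks for a human."""
    for i, msg in enumerate(session.turns, start=1):
        if any(phrase in msg.lower() for phrase in ESCALATION_PHRASES):
            session.escalated, session.escalation_turn = True, i
            break
    return session


def containment_rate(sessions: list[Session]) -> float:
    """Fraction of sessions the bot handled end to end."""
    return sum(not s.escalated for s in sessions) / len(sessions)


sessions = [tag_escalation(Session(turns=t)) for t in [
    ["how do I reset my card PIN?", "thanks, that worked"],
    ["my transfer failed", "this is useless, let me speak to a human"],
]]
print(f"containment rate: {containment_rate(sessions):.0%}")
```

A metric like this pairs naturally with sentiment scoring on the same transcripts, giving a business-facing signal that latency numbers alone cannot provide.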
## Operating in a Regulated Environment

Banking presents unique challenges for ML and AI operations. Barclays, as critical infrastructure for the British economy, cannot embrace a "fail fast and break things" mentality. Stability, trust, and data protection are paramount: the bank is a custodian of highly personal financial information. This environment naturally produces more conservative approaches to technology adoption. McMahon describes this as a "dovish" rather than "hawkish" culture, with more hesitation about wholesale technology replacement.

Some core banking infrastructure has been around for decades and continues to work reliably. While this might sound concerning to technologists eager to modernize everything, there is wisdom in the approach: these systems are well maintained, well governed, and stable. The challenge becomes balancing innovation with this conservative posture. Rather than scaling vertically (doing things faster end to end), large regulated organizations often scale horizontally by running many initiatives in parallel. Different parts of the bank may be at different stages of cloud migration while simultaneously building ML and AI capabilities. The value proposition in these environments is also different: catching fraud, protecting livelihoods, even disrupting modern slavery rings. The impact, when it lands, is massive and tangible.

## ROI and Value-Driven Development

A recurring theme throughout the conversation is the importance of focusing on return on investment and business value rather than technology for its own sake. McMahon cautions against "silver bullet thinking": the belief that finding the perfect tool or architecture will solve all problems. He maintains that organizations could accomplish MLOps objectives using Excel if they had excellent processes and people, though he later humorously qualifies this statement.

The recommendation is to push the boundaries of open source first before becoming a buyer of commercial solutions. This makes organizations more informed buyers who understand which features actually matter for their use cases. Playing with LangChain, Ollama, and other open-source tools helps practitioners develop a nose for what is realistic and what is overhyped. When vendors present slick demos, experienced practitioners can better evaluate whether the claimed capabilities match reality.

McMahon cites the DevOps Research and Assessment (DORA) metrics as valuable indicators: change failure rate, deployment frequency, and lead time from change to production. He has applied these in MLOps contexts, noting that organizations can play vanity-metric games, such as celebrating a zero change failure rate while deploying only once a year. The focus should be on time to value and actual business outcomes.
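As a rough illustration of those DORA-style measures, the sketch below computes deployment frequency, change failure rate, and lead time from a simple deployment log. The log structure and the figures in it are invented for the example.

```python
# Sketch of DORA-style metrics computed from a hypothetical deployment log.
# Record fields, window length, and data are illustrative assumptions.
from datetime import datetime, timedelta

deployments = [
    # (change merged, deployed to production, caused an incident?)
    (datetime(2024, 5, 1, 9), datetime(2024, 5, 1, 15), False),
    (datetime(2024, 5, 3, 10), datetime(2024, 5, 6, 11), True),
    (datetime(2024, 5, 7, 14), datetime(2024, 5, 8, 9), False),
]

window_days = 30
deployment_frequency = len(deployments) / window_days                    # deploys per day
change_failure_rate = sum(failed for *_, failed in deployments) / len(deployments)
lead_times = [deployed - merged for merged, deployed, _ in deployments]
mean_lead_time = sum(lead_times, timedelta()) / len(lead_times)          # change -> production

print(f"deployment frequency: {deployment_frequency:.2f}/day")
print(f"change failure rate: {change_failure_rate:.0%}")
print(f"mean lead time: {mean_lead_time}")
```

The vanity-metric trap mentioned above is visible in the numbers themselves: a team that deploys once a year can report a perfect change failure rate while delivering almost no value.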
## Leadership and Team Culture

The conversation touches on what makes ML and AI teams effective, with McMahon attributing success primarily to great leadership. He defines this as providing clarity, stability, psychological safety, and an environment where innovation can happen within appropriate bounds. Good leaders provide air cover for their teams and genuinely listen to ideas regardless of seniority. McMahon warns against "HiPPO" (Highest Paid Person's Opinion) dynamics, noting that the newest data scientist might have the best idea anyone has ever heard. His personal approach is to assume everyone in the room is smarter than he is; perhaps rooted in imposter syndrome, but valuable as a tool for genuine listening and collective problem-solving.

Mission alignment emerges as critical. Teams should be clear on why they exist and what value they're driving. McMahon notes that people in leadership roles often think they communicate the mission frequently, only to discover that team members have never heard it. Repeatedly connecting daily work to the larger purpose (stopping financial crime, protecting customer livelihoods) helps maintain focus and motivation.

## Practical Recommendations

The discussion yields several practical takeaways for organizations building ML and AI platforms. First, focus on workflows and processes before tools: the specific technologies are often interchangeable, but the underlying workflows provide stability. Second, educate stakeholders about when traditional ML is preferable to generative AI; a logistic regression might be cheaper, faster, and more controlled than an LLM for a classification task. Third, think like a product manager when prioritizing use cases, avoiding the temptation to use generative AI just because it is trendy. Fourth, in regulated industries, accept that some initiatives will move slowly and focus on the massive impact when they land. Finally, build an ecosystem of capabilities rather than a rigid platform, enabling teams to compose solutions from available tools based on their specific needs.
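To ground the second recommendation, here is a small sketch of the kind of classical baseline that can stand in for an LLM on a routine classification task: a TF-IDF plus logistic regression pipeline in scikit-learn. The intents and example messages are invented for illustration.

```python
# Sketch of a traditional ML baseline for a text classification task that
# might otherwise be routed to an LLM. Dataset and labels are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "card payment declined at point of sale",
    "how do I increase my overdraft limit",
    "suspicious login from an unknown device",
    "set up a standing order to my landlord",
]
labels = ["cards", "lending", "fraud", "payments"]

# TF-IDF features + logistic regression: cheap to train, fast to serve,
# and fully inspectable compared with calling a hosted LLM per request.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
print(clf.predict(["my card was declined again this morning"]))
```

A baseline like this is also easier to explain to model-risk reviewers, which in a regulated environment is often the deciding factor rather than raw accuracy.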
