BlackRock implemented Aladdin Copilot, an AI-powered assistant embedded across their proprietary investment management platform that serves over 11 trillion in assets under management. The system uses a supervised agentic architecture built on LangChain and LangGraph, with GPT-4 function calling for orchestration, to help users navigate complex financial workflows and democratize access to investment insights. The solution addresses the challenge of making hundreds of domain-specific APIs accessible through natural language queries while maintaining strict guardrails for responsible AI use in financial services, resulting in increased productivity and more intuitive user experiences across their global client base.
Blackrock, one of the world’s leading asset managers with over $11 trillion in assets under management, presented their approach to building and operating Aladdin Copilot—an AI-powered assistant integrated into their proprietary Aladdin investment management platform. This presentation was delivered by Brennan Rosales (AI Engineering Lead) and Pedro Vicente Valdez (Principal AI Engineer), providing insight into both the architectural decisions and the operational practices that enable them to run a large-scale agentic AI system in production serving financial services clients globally.
The Aladdin platform itself is a comprehensive investment management solution used internally by Blackrock and sold to hundreds of clients across 70 countries. The platform comprises approximately 100 front-end applications maintained by around 4,000 engineers within the 7,000-person Aladdin organization. Aladdin Copilot is embedded across all of these applications as “connective tissue,” aiming to increase user productivity, drive alpha generation, and provide personalized experiences.
The architecture follows a supervisor-based agentic pattern, which the presenters acknowledged is a common approach across the industry due to its simplicity in building, releasing, and testing. They noted that while autonomous agent-to-agent communication might be the future direction, the current supervisor pattern provides the reliability and testability required for production financial systems.
A critical component of the architecture is the plugin registry, which enables a federated development model across the organization. With 50-60 specialized engineering teams owning different domains (such as trading, portfolio management, etc.), the AI team’s role is to make it easy for these domain experts to plug their existing functionality into the copilot system.
The registry supports two onboarding paths for development teams:
This approach is particularly notable from an LLMOps perspective as it distributes the responsibility for AI functionality across the organization while maintaining central orchestration and quality control. The AI team doesn’t need to be domain experts in finance—they focus on the infrastructure that enables other teams to contribute.
The presenters mentioned that when they started designing the system approximately two and a half years ago (around 2022-2023), they had to develop their own standardized agentic communication protocol. They are now actively evaluating more established protocols like LangChain’s agent protocol and the A2A (agent-to-agent) protocol as these mature.
The system processes user queries through a carefully designed orchestration graph built on LangGraph. The lifecycle of a query includes:
Context Collection: When a user submits a query, the system captures extensive contextual information including which Aladdin application they’re using, what’s displayed on their screen (portfolios, assets, widgets), and predefined global settings/preferences. This context-awareness is crucial for providing relevant responses.
Input Guardrails: The first node in the orchestration graph handles responsible AI moderation, including detection and handling of off-topic content, toxic content, and PII identification. This represents a critical compliance layer for a financial services platform.
Filtering and Access Control: With potentially thousands of tools and agents registered in the plugin registry, this node reduces the searchable universe to a manageable set (typically 20-30 tools). The filtering considers which environments plugins are enabled in, which user groups have access, and which applications can access specific plugins. This is important both for security/compliance and for LLM performance—the presenters noted that sending more than 40-50 tools to the planning step would degrade performance.
Orchestration/Planning: The system relies heavily on GPT-4 function calling, iterating through planning and action nodes until either an answer is found or the model determines it cannot answer the query. This represents a classic ReAct-style agent loop.
Output Guardrails: Before returning responses to users, the system runs hallucination detection checks and domain-specific output moderation.
A significant portion of the presentation focused on evaluation practices, which the presenters emphasized is critical for operating LLM systems in production—particularly in regulated financial services environments. They explicitly compared this to test-driven development in traditional software engineering, coining the term “evaluation-driven development.”
Given the sensitivity of financial applications, Blackrock takes a paranoid approach to system prompt validation. For every intended behavior encoded in system prompts, they generate comprehensive test cases. For example, if a system prompt states “you must never provide investment advice,” they work with subject matter experts to:
The presenters explicitly referenced wanting to avoid incidents like the infamous Chevrolet chatbot case, where an AI provided inappropriate responses that made headlines.
Evaluation pipelines are fully integrated into CI/CD processes:
This automation is essential because the system is under constant development by many engineers. Without automated evaluation, it would be impossible to know whether changes are improving or degrading the system. The presenters noted that “it’s very easy to chase your own tail with LLMs,” and automated evaluation provides the statistical grounding needed to make confident changes.
Beyond system prompt testing, Blackrock has built an end-to-end testing framework that validates the full orchestration pipeline. This framework is exposed to all the development teams contributing plugins, enabling them to ensure their integrations work correctly within the broader system.
The testing configuration allows developers to specify:
The solution layer requires teams to provide ground truth data, specifying exactly how queries should be solved. This can include multi-threaded execution paths—for example, a query about buying shares might require parallel checks for compliance limits and available cash in a portfolio.
The system provides performance reporting broken down by plugin/team, allowing each contributing team to understand how their components are performing within the overall system. This federated visibility is important for maintaining accountability across a large organization.
Several production-focused insights emerged from the presentation:
Scalability through federation: By enabling 50-60 teams to contribute tools and agents independently, the system can scale its capabilities without requiring the central AI team to be experts in every domain.
Access control at multiple levels: The plugin registry supports fine-grained access control by environment, user group, and application—essential for enterprise compliance requirements.
Context utilization: The system makes extensive use of application context to improve relevance, showing mature thinking about how AI assistants should integrate with existing workflows rather than operating in isolation.
Honest assessment of limitations: The presenters were candid that they’re using a supervisor pattern (not more autonomous multi-agent systems) because it’s easier to build, release, and test—showing pragmatic engineering judgment over hype.
Protocol evolution: The acknowledgment that they’re evaluating newer agent communication protocols (LangChain agent protocol, A2A) shows awareness that the tooling landscape is rapidly evolving and their architecture needs to remain adaptable.
The system is built primarily on:
The presenters didn’t specify their infrastructure (cloud provider, container orchestration, etc.), but the emphasis on CI/CD integration suggests mature DevOps practices.
While the presentation provides valuable insight into enterprise-scale LLMOps practices, a few areas warrant balanced consideration:
The system’s reliance on GPT-4 function calling creates dependency on a single model provider, which could present risks for a platform serving financial clients. The presenters didn’t discuss fallback strategies or model diversification.
The evaluation approach, while comprehensive, relies heavily on synthetic data and LLM-as-judge patterns, which have known limitations in detecting novel failure modes. The ground truth requirement from plugin teams also introduces potential for gaps if teams don’t provide comprehensive test cases.
The federated development model is powerful but introduces coordination challenges—ensuring consistent quality across 50-60 contributing teams requires significant governance that wasn’t fully addressed in the presentation.
Overall, this case study represents a mature, production-focused approach to building agentic AI systems in a highly regulated industry, with particular strengths in evaluation practices and federated development at scale.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Digits, a company providing automated accounting services for startups and small businesses, implemented production-scale LLM agents to handle complex workflows including vendor hydration, client onboarding, and natural language queries about financial books. The company evolved from a simple 200-line agent implementation to a sophisticated production system incorporating LLM proxies, memory services, guardrails, observability tooling (Phoenix from Arize), and API-based tool integration using Kotlin and Golang backends. Their agents achieve a 96% acceptance rate on classification tasks with only 3% requiring human review, handling approximately 90% of requests asynchronously and 10% synchronously through a chat interface.
Nubank, one of Brazil's largest banks serving 120 million users, implemented large-scale LLM systems to create an AI private banker for their customers. They deployed two main applications: a customer service chatbot handling 8.5 million monthly contacts with 60% first-contact resolution through LLMs, and an agentic money transfer system that reduced transaction time from 70 seconds across nine screens to under 30 seconds with over 90% accuracy and less than 0.5% error rate. The implementation leveraged LangChain, LangGraph, and LangSmith for development and evaluation, with a comprehensive four-layer ecosystem including core engines, testing tools, and developer experience platforms. Their evaluation strategy combined offline and online testing with LLM-as-a-judge systems that achieved 79% F1 score compared to 80% human accuracy through iterative prompt engineering and fine-tuning.