Company
BlackRock
Title
Agentic AI Architecture for Investment Management Platform
Industry
Finance
Year
2025
Summary (short)
BlackRock implemented Aladdin Copilot, an AI-powered assistant embedded across their proprietary investment management platform, which serves over $11 trillion in assets under management. The system uses a supervisor-based agentic architecture built on LangChain and LangGraph, with GPT-4 function calling for orchestration, to help users navigate complex financial workflows and democratize access to investment insights. The solution addresses the challenge of making hundreds of domain-specific APIs accessible through natural language queries while maintaining strict guardrails for responsible AI use in financial services, resulting in increased productivity and more intuitive user experiences across their global client base.
## Overview

BlackRock, one of the world's leading asset managers with over $11 trillion in assets under management, presented their approach to building and operating Aladdin Copilot, an AI-powered assistant integrated into their proprietary Aladdin investment management platform. The presentation was delivered by Brennan Rosales (AI Engineering Lead) and Pedro Vicente Valdez (Principal AI Engineer), offering insight into both the architectural decisions and the operational practices that enable them to run a large-scale agentic AI system in production serving financial services clients globally.

The Aladdin platform itself is a comprehensive investment management solution used internally by BlackRock and sold to hundreds of clients across 70 countries. The platform comprises approximately 100 front-end applications maintained by around 4,000 engineers within the 7,000-person Aladdin organization. Aladdin Copilot is embedded across all of these applications as "connective tissue," aiming to increase user productivity, drive alpha generation, and provide personalized experiences.

## Architectural Design

The architecture follows a supervisor-based agentic pattern, which the presenters acknowledged is a common approach across the industry because it is comparatively simple to build, release, and test. They noted that while autonomous agent-to-agent communication may be the future direction, the supervisor pattern currently provides the reliability and testability required for production financial systems.

### Plugin Registry and Federated Development

A critical component of the architecture is the plugin registry, which enables a federated development model across the organization. With 50-60 specialized engineering teams owning different domains (trading, portfolio management, and so on), the AI team's role is to make it easy for these domain experts to plug their existing functionality into the copilot system.
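A minimal sketch of what such a registry might look like, with each entry carrying the access-control metadata used later to scope what a given user can see. All names and fields here are illustrative assumptions, not BlackRock's actual API:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Plugin:
    """One registry entry: either a tool mapped to an existing Aladdin-style
    API, or a custom agent owned by a domain team (names are hypothetical)."""
    name: str
    description: str                    # used by the planner to select tools
    kind: str                           # "tool" or "agent"
    handler: Callable                   # the API wrapper or agent entry point
    environments: set = field(default_factory=set)
    user_groups: set = field(default_factory=set)
    applications: set = field(default_factory=set)

class PluginRegistry:
    """Central registry that domain teams register into; the AI team owns
    the registry infrastructure, not the plugins themselves."""
    def __init__(self):
        self._plugins = {}

    def register(self, plugin: Plugin) -> None:
        if plugin.name in self._plugins:
            raise ValueError(f"plugin {plugin.name!r} already registered")
        self._plugins[plugin.name] = plugin

    def all(self) -> list:
        return list(self._plugins.values())

# A trading team registering an existing API as a tool:
registry = PluginRegistry()
registry.register(Plugin(
    name="get_portfolio_positions",
    description="Return current positions for a portfolio ID.",
    kind="tool",
    handler=lambda portfolio_id: {"portfolio": portfolio_id, "positions": []},
    environments={"prod"},
    user_groups={"traders"},
    applications={"trading-ui"},
))
```

The key design point is that the registry stores metadata alongside the callable, so orchestration and access control can be handled centrally while the handler itself stays with the owning team.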
The registry supports two onboarding paths for development teams:

- **Tool registration**: Mapping directly to existing Aladdin APIs already in production, allowing quick integration of existing capabilities
- **Custom agents**: For complex workflows, teams can spin up their own agents and register them in the system

This approach is particularly notable from an LLMOps perspective: it distributes responsibility for AI functionality across the organization while maintaining central orchestration and quality control. The AI team doesn't need to be domain experts in finance; they focus on the infrastructure that enables other teams to contribute.

The presenters mentioned that when they started designing the system approximately two and a half years ago (around 2022-2023), they had to develop their own standardized agentic communication protocol. They are now actively evaluating more established protocols, such as LangChain's agent protocol and the A2A (agent-to-agent) protocol, as these mature.

### Query Lifecycle and Orchestration

The system processes user queries through a carefully designed orchestration graph built on LangGraph. The lifecycle of a query includes:

**Context Collection**: When a user submits a query, the system captures extensive contextual information, including which Aladdin application they're using, what's displayed on their screen (portfolios, assets, widgets), and predefined global settings and preferences. This context awareness is crucial for providing relevant responses.

**Input Guardrails**: The first node in the orchestration graph handles responsible AI moderation, including detection and handling of off-topic content, toxic content, and PII. This represents a critical compliance layer for a financial services platform.

**Filtering and Access Control**: With potentially thousands of tools and agents registered in the plugin registry, this node reduces the searchable universe to a manageable set (typically 20-30 tools).
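A filtering node of this kind can be sketched as a simple predicate over plugin metadata. The plugin entries and field names below are illustrative assumptions, not the actual registry schema:

```python
# Hypothetical plugin metadata, shaped the way a registry entry
# might expose it (names and fields are illustrative).
PLUGINS = [
    {"name": "get_positions", "environments": {"prod"},
     "user_groups": {"traders", "pm"}, "applications": {"trading-ui"}},
    {"name": "run_compliance_check", "environments": {"prod", "uat"},
     "user_groups": {"pm"}, "applications": {"portfolio-ui"}},
    {"name": "debug_echo", "environments": {"dev"},
     "user_groups": {"engineers"}, "applications": {"sandbox"}},
]

def filter_plugins(plugins, *, environment, user_group, application, limit=30):
    """Shrink the plugin universe to a small candidate set before planning;
    passing hundreds of tools to the LLM would degrade tool selection."""
    candidates = [
        p for p in plugins
        if environment in p["environments"]
        and user_group in p["user_groups"]
        and application in p["applications"]
    ]
    return candidates[:limit]

# A portfolio manager in the production portfolio app sees only the
# plugins enabled for that environment, user group, and application:
visible = filter_plugins(PLUGINS, environment="prod",
                         user_group="pm", application="portfolio-ui")
```

Capping the candidate set (`limit`) reflects the practical constraint discussed next: the planner's accuracy drops as the tool list grows.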
The filtering considers which environments plugins are enabled in, which user groups have access, and which applications can access specific plugins. This matters both for security and compliance and for LLM performance: the presenters noted that sending more than 40-50 tools to the planning step would degrade performance.

**Orchestration/Planning**: The system relies heavily on GPT-4 function calling, iterating through planning and action nodes until either an answer is found or the model determines it cannot answer the query. This is a classic ReAct-style agent loop.

**Output Guardrails**: Before returning responses to users, the system runs hallucination detection checks and domain-specific output moderation.

## Evaluation-Driven Development

A significant portion of the presentation focused on evaluation practices, which the presenters emphasized are critical for operating LLM systems in production, particularly in regulated financial services environments. They explicitly compared this to test-driven development in traditional software engineering, coining the term "evaluation-driven development."

### System Prompt Testing

Given the sensitivity of financial applications, BlackRock takes a deliberately paranoid approach to system prompt validation. For every intended behavior encoded in system prompts, they generate comprehensive test cases. For example, if a system prompt states "you must never provide investment advice," they work with subject matter experts to:

- Define what constitutes investment advice across various scenarios
- Generate extensive synthetic data covering edge cases
- Build evaluation pipelines using LLM-as-judge approaches
- Run these evaluations continuously

The presenters explicitly referenced wanting to avoid incidents like the widely reported Chevrolet dealership chatbot case, where an AI assistant produced inappropriate responses that made headlines.
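An LLM-as-judge check for the "never provide investment advice" rule can be sketched as follows. The prompt wording, function names, and the stub judge (which stands in for a real judge-model call so the example runs offline) are all assumptions for illustration:

```python
JUDGE_PROMPT = """You are a compliance reviewer for a financial assistant.
The assistant must never provide investment advice.

Question: {question}
Assistant response: {response}

Answer with exactly one word: PASS if the response avoids investment
advice, FAIL otherwise."""

def build_judge_prompt(question: str, response: str) -> str:
    return JUDGE_PROMPT.format(question=question, response=response)

def parse_verdict(raw: str) -> bool:
    """Judge outputs must be machine-parseable; reject anything else."""
    verdict = raw.strip().upper()
    if verdict not in {"PASS", "FAIL"}:
        raise ValueError(f"unparseable judge output: {raw!r}")
    return verdict == "PASS"

def evaluate(cases, call_judge):
    """cases: (question, candidate_response) pairs, typically synthetic
    edge cases written with subject matter experts. call_judge: callable
    that sends a prompt to the judge model and returns its raw text.
    Returns the cases the judge flagged as violations."""
    return [
        (q, r) for q, r in cases
        if not parse_verdict(call_judge(build_judge_prompt(q, r)))
    ]

# Stand-in for a real judge-model call, so the sketch runs offline:
def stub_judge(prompt: str) -> str:
    return "FAIL" if "you should buy" in prompt.lower() else "PASS"

cases = [
    ("Should I buy AAPL?", "I can't give investment advice, but here is the data."),
    ("Should I buy AAPL?", "Yes, you should buy 100 shares now."),
]
failures = evaluate(cases, stub_judge)  # only the second case is flagged
```

In practice, `call_judge` would wrap a real model call, and the returned failure list would feed the regression reports described below.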
### CI/CD Integration

Evaluation pipelines are fully integrated into CI/CD processes:

- They run daily on development environments
- They run on every pull request
- They produce reports on system performance and any regressions

This automation is essential because the system is under constant development by many engineers. Without automated evaluation, it would be impossible to know whether changes are improving or degrading the system. The presenters noted that "it's very easy to chase your own tail with LLMs," and automated evaluation provides the statistical grounding needed to make changes with confidence.

### End-to-End Testing Framework

Beyond system prompt testing, BlackRock has built an end-to-end testing framework that validates the full orchestration pipeline. This framework is exposed to all of the development teams contributing plugins, enabling them to ensure their integrations work correctly within the broader system.

The testing configuration allows developers to specify:

- **Application context**: Which Aladdin application is open, what the user sees on screen, loaded portfolios and assets, enabled widgets
- **System settings**: User preferences and configurations that should be respected during planning and API calls
- **Multi-turn scenarios**: Chat history, query, and expected response for complex conversational flows

The solution layer requires teams to provide ground truth data specifying exactly how queries should be solved. This can include multi-threaded execution paths; for example, a query about buying shares might require parallel checks of compliance limits and available cash in a portfolio.

### Performance Monitoring by Plugin Owner

The system provides performance reporting broken down by plugin and owning team, allowing each contributing team to understand how their components are performing within the overall system. This federated visibility is important for maintaining accountability across a large organization.
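The test-case shape described above might be expressed as a configuration object along these lines. Field names and values are illustrative assumptions; the framework's actual schema was not shown:

```python
from dataclasses import dataclass

@dataclass
class E2ETestCase:
    """Hypothetical shape of one end-to-end test case; field names are
    illustrative, not the framework's actual schema."""
    application: str        # which Aladdin app is open
    screen_context: dict    # portfolios, assets, widgets on screen
    settings: dict          # user preferences to respect during planning
    chat_history: list      # prior (role, message) turns
    query: str
    expected_plan: list     # ground truth: parallel branches of tool calls
    expected_response: str

# The buy-shares example from the text: compliance limits and available
# cash can be checked in parallel before answering.
case = E2ETestCase(
    application="trading-ui",
    screen_context={"portfolio": "GROWTH-01", "widgets": ["positions"]},
    settings={"currency": "USD"},
    chat_history=[],
    query="Can I buy 10,000 shares of ACME?",
    expected_plan=[
        ["check_compliance_limits"],   # branch 1
        ["get_available_cash"],        # branch 2
    ],
    expected_response="The order passes compliance and cash checks.",
)
```

Representing the ground-truth plan as a list of branches lets the framework verify not just the final answer but also that the orchestrator chose the expected (possibly parallel) execution paths.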
## Production Considerations

Several production-focused insights emerged from the presentation:

**Scalability through federation**: By enabling 50-60 teams to contribute tools and agents independently, the system can scale its capabilities without requiring the central AI team to be experts in every domain.

**Access control at multiple levels**: The plugin registry supports fine-grained access control by environment, user group, and application, which is essential for enterprise compliance requirements.

**Context utilization**: The system makes extensive use of application context to improve relevance, showing mature thinking about how AI assistants should integrate with existing workflows rather than operate in isolation.

**Honest assessment of limitations**: The presenters were candid that they use a supervisor pattern (rather than more autonomous multi-agent systems) because it is easier to build, release, and test, showing pragmatic engineering judgment over hype.

**Protocol evolution**: Their acknowledgment that they are evaluating newer agent communication protocols (LangChain's agent protocol, A2A) shows awareness that the tooling landscape is rapidly evolving and that their architecture needs to remain adaptable.

## Technology Stack

The system is built primarily on:

- **LangChain/LangGraph**: Core orchestration framework
- **GPT-4**: Primary LLM for function calling and orchestration
- **LLM-as-judge**: For evaluation pipelines

The presenters didn't specify their infrastructure (cloud provider, container orchestration, and so on), but the emphasis on CI/CD integration suggests mature DevOps practices.

## Critical Assessment

While the presentation provides valuable insight into enterprise-scale LLMOps practices, a few areas warrant balanced consideration:

The system's reliance on GPT-4 function calling creates dependency on a single model provider, which could present risks for a platform serving financial clients.
The presenters didn't discuss fallback strategies or model diversification. The evaluation approach, while comprehensive, relies heavily on synthetic data and LLM-as-judge patterns, which have known limitations in detecting novel failure modes. The ground truth requirement from plugin teams also introduces potential for gaps if teams don't provide comprehensive test cases. The federated development model is powerful but introduces coordination challenges—ensuring consistent quality across 50-60 contributing teams requires significant governance that wasn't fully addressed in the presentation. Overall, this case study represents a mature, production-focused approach to building agentic AI systems in a highly regulated industry, with particular strengths in evaluation practices and federated development at scale.
