Multi-Agent Investment Research Assistant with RAG and Human-in-the-Loop

J.P. Morgan Chase 2025

J.P. Morgan Chase's Private Bank investment research team developed "Ask David," a multi-agent AI system to automate investment research processes that previously required manual database searches and analysis. The system combines structured data querying, RAG for unstructured documents, and proprietary analytics through specialized agents orchestrated by a supervisor agent. While the team claims significant efficiency gains and real-time decision-making capabilities, they acknowledge accuracy limitations requiring human oversight, especially for high-stakes financial decisions involving billions in assets.

Industry

Finance

Technologies

Overview

J.P. Morgan Chase’s Private Bank investment research team presented their journey building “Ask David” at the Interrupt conference, sharing insights into deploying a multi-agent LLM system for automating investment research. The team manages thousands of investment products backed by years of valuable data, and prior to this initiative, answering questions about these products required manual research across databases, files, and materials—a time-consuming process that limited the team’s ability to scale and provide timely insights. The acronym DAVID stands for “Data Analytics Visualization Insights and Decision-making assistant.”

The presentation, delivered by David (representing the business/research side) and Jane (covering technical implementation), candidly addresses both the potential of the system and its limitations. Importantly, the team frames this not as a replacement for human expertise but as an augmentation tool, explicitly acknowledging that “Ask David still consults with real David whenever needed” given the high stakes involved with billions of dollars of client assets.

Technical Architecture

The system is built as a multi-agent architecture with several key components working in coordination:

Supervisor Agent: Acts as the primary orchestrator, interfacing with end users to understand their intent and delegating tasks to specialized sub-agents. The supervisor maintains both short-term and long-term memory to enable personalized user experiences and knows when to invoke human-in-the-loop processes to ensure accuracy and reliability.
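The talk does not show code, but the supervisor behavior described above can be sketched in plain Python. Everything here is an assumption for illustration: the class name, the keyword-based intent routing (a production system would use an LLM classifier), and the escalation sentinel.

```python
# Hypothetical sketch of the supervisor pattern: per-user memory,
# delegation to sub-agents, and a human-in-the-loop fallback.

class Supervisor:
    def __init__(self, sub_agents):
        self.sub_agents = sub_agents       # name -> callable(question) -> answer
        self.long_term_memory = {}         # user_id -> list of past interactions

    def handle(self, user_id, question):
        history = self.long_term_memory.setdefault(user_id, [])
        agent_name = self._classify(question)
        if agent_name is None:
            answer = "escalated-to-human"  # invoke human-in-the-loop
        else:
            answer = self.sub_agents[agent_name](question)
        history.append((question, answer))  # memory enables personalization
        return answer

    def _classify(self, question):
        # Toy intent routing; stands in for an LLM-based intent classifier.
        q = question.lower()
        if "fund" in q:
            return "document_search"
        if "invest" in q:
            return "general_qa"
        return None

sup = Supervisor({
    "document_search": lambda q: f"[docs] {q}",
    "general_qa": lambda q: f"[qa] {q}",
})
print(sup.handle("advisor-1", "Why was this fund terminated?"))
```

The key design point is that escalation to a human is just another routing outcome, so it can be logged and evaluated like any sub-agent call.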

Specialized Sub-Agents: The system employs three main categories of specialized agents, mirroring the capabilities described above: agents that query structured data, RAG agents that retrieve from unstructured documents, and agents that wrap the team's proprietary analytics.

Workflow Graph Structure: The end-to-end workflow begins with a planning node that routes queries to one of two main subgraphs—a general QA flow for broad questions (e.g., “how do I invest in gold?”) and a specific fund flow for product-specific inquiries. Each subgraph contains its own supervisor agent and team of specialized agents.
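A minimal sketch of the two-subgraph routing described above, in plain Python rather than any particular orchestration framework; all function names are assumptions, and each subgraph stub stands in for its own supervisor and agent team:

```python
# Illustrative routing: a planning node dispatches to one of two subgraphs.

def plan(query: str) -> str:
    """Planning node: classify intent. A real system would use an LLM here."""
    return "specific_fund" if "fund" in query.lower() else "general_qa"

def general_qa_subgraph(query: str) -> str:
    # Subgraph for broad questions, e.g. "how do I invest in gold?"
    return f"general answer for: {query}"

def specific_fund_subgraph(query: str) -> str:
    # Subgraph for product-specific inquiries about a particular fund.
    return f"fund-specific answer for: {query}"

SUBGRAPHS = {"general_qa": general_qa_subgraph,
             "specific_fund": specific_fund_subgraph}

def ask(query: str) -> str:
    return SUBGRAPHS[plan(query)](query)

print(ask("how do I invest in gold?"))
print(ask("Why was fund XYZ terminated?"))
```

Keeping the planner's output a small closed set of labels makes the routing easy to test and to extend with new subgraphs later.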

Post-Processing Nodes: After retrieving answers, the system includes a personalization node that tailors the response to whoever is asking, a reflection node that validates the answer, and a summarization node that appends reference links.
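The post-processing described in the fund-termination walkthrough later in this piece (personalize, validate through reflection, summarize with reference links) composes naturally as a function chain. This is a hedged sketch; every function body is an invented placeholder:

```python
# Sketch of a post-processing chain: personalize -> reflect -> summarize.

def personalize(answer: str, role: str) -> str:
    return f"[for {role}] {answer}"

def reflect(answer: str, source_docs: list) -> str:
    # Reflection node: accept the answer only if it is backed by a source.
    return answer if source_docs else "insufficient evidence - escalate"

def summarize(answer: str, links: list) -> str:
    return answer + " | references: " + ", ".join(links)

def post_process(answer, role, source_docs, links):
    return summarize(reflect(personalize(answer, role), source_docs), links)

result = post_process("Fund closed after manager departure.",
                      role="financial advisor",
                      source_docs=["status_history.pdf"],
                      links=["doc://status_history"])
```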

Development Methodology

The team emphasized an iterative development approach with several key principles:

Start Simple and Refactor Often: Rather than building the complex multi-agent system from day one, the team followed an evolutionary path. They started with a simple ReAct agent to understand fundamentals, then built specialized agents (particularly the RAG agent), then integrated these into a multi-agent flow with a supervisor, and finally developed the current subgraph architecture that can scale to handle different intent types.
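The "simple ReAct agent" starting point can be reduced to a short loop: the model alternates between proposing an action and reading the tool's observation until it emits a final answer. The stubbed model, tool names, and prompt format below are all illustrative assumptions:

```python
# Minimal ReAct-style loop. The "LLM" is a stub for demonstration.

def llm(prompt: str) -> str:
    # Stand-in for a model call: after seeing an observation, answer.
    if "Observation:" in prompt:
        return "Final Answer: fund was terminated in 2023"
    return "Action: lookup_fund[XYZ]"

TOOLS = {"lookup_fund": lambda arg: f"{arg} status history: terminated 2023"}

def react(question: str, max_steps: int = 5) -> str:
    prompt = f"Question: {question}"
    for _ in range(max_steps):
        step = llm(prompt)
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer: ").strip()
        # Parse "Action: tool[arg]" and run the tool.
        tool, arg = step.removeprefix("Action: ").rstrip("]").split("[")
        prompt += f"\n{step}\nObservation: {TOOLS[tool](arg)}"
    return "no answer"

print(react("Why was fund XYZ terminated?"))
```

Starting from a loop this small makes it clear what the later refactors add: specialized agents replace the flat tool list, and the supervisor replaces the single loop.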

Evaluation-Driven Development: The team strongly advocates for starting evaluation early, noting that compared to traditional AI projects, GenAI projects have shorter development phases but longer evaluation phases. They recommend defining metrics and goals early, with accuracy being paramount in financial services, and using continuous evaluation to build confidence in improvements.
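Starting evaluation early typically means maintaining a golden question set and scoring every system revision against it. The data, matching rule, and target threshold below are invented for illustration; a real harness would use richer metrics such as LLM-as-judge or retrieval grading:

```python
# Sketch of evaluation-driven development: score a system against a
# golden set on every change, and gate releases on a target accuracy.

GOLDEN_SET = [
    {"question": "Why was fund XYZ terminated?", "expected": "manager departure"},
    {"question": "How do I invest in gold?", "expected": "gold etf"},
]

def evaluate(system, golden_set, target=0.9):
    hits = sum(1 for case in golden_set
               if case["expected"] in system(case["question"]).lower())
    accuracy = hits / len(golden_set)
    return accuracy, accuracy >= target

def toy_system(question: str) -> str:
    return ("The fund was closed after a manager departure."
            if "fund" in question.lower() else "Consider a gold ETF.")

accuracy, meets_target = evaluate(toy_system, GOLDEN_SET)
```

Tracking this number across revisions is what lets a team claim, with evidence, that a refactor improved the system rather than just changed it.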

The team shared specific evaluation practices:

Accuracy Progression and Human-in-the-Loop

The team presented a realistic framework for accuracy improvement when applying general models to specific domains:

For the last mile, human-in-the-loop is essential. The team explicitly states that with billions of dollars at stake, they cannot afford inaccuracy, so the AI system still consults with human experts when needed. This is a refreshingly honest acknowledgment of LLM limitations in high-stakes domains.
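One common way to implement this last-mile gate, sketched here with an invented confidence threshold and a stubbed model, is to defer any low-confidence answer to a human analyst rather than returning it directly:

```python
# Sketch of a confidence-gated human-in-the-loop check.

def answer_with_hitl(question, model, confidence_threshold=0.85):
    answer, confidence = model(question)
    if confidence < confidence_threshold:
        # Below threshold: route to a human expert before the client sees it.
        return {"answer": answer, "status": "pending_human_review"}
    return {"answer": answer, "status": "auto_approved"}

def toy_model(question: str):
    # Stand-in for an LLM returning an answer plus a calibrated confidence.
    return ("Fund terminated due to strategy change.",
            0.6 if "terminated" in question else 0.95)

low = answer_with_hitl("Why was the fund terminated?", toy_model)
high = answer_with_hitl("List gold funds", toy_model)
```

The hard part in practice is not the gate itself but producing a confidence signal calibrated enough to trust, which is one reason the team keeps humans in the loop regardless.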

Use Case Example

The presentation included a walkthrough of a real scenario: a client asking “Why was this fund terminated?” during a meeting with their financial advisor. Previously, this would require the advisor to contact the research team, work with human analysts to understand status change history, research the fund, identify similar alternatives, and manually prepare a client-appropriate presentation.

With Ask David, the workflow proceeds as follows: the planning node identifies this as a fund-specific query, routes to the appropriate subgraph, the supervisor agent extracts fund context and delegates to the document search agent, which retrieves data from MongoDB. The answer is then personalized based on who is asking, validated through reflection, and summarized with reference links for the advisor to explore further.

Key Takeaways and Lessons Learned

The team concluded with three main takeaways for practitioners building similar systems:

Critical Assessment

While the presentation provides valuable insights into building production multi-agent systems in finance, several aspects warrant balanced consideration. The team does not share specific accuracy numbers or quantitative results, making it difficult to assess actual production performance. The system appears to still be in development or early deployment phases given the forward-looking language used (“aiming to provide,” “we are making our vision a reality”).

The honest acknowledgment that 100% accuracy may not be achievable and that human oversight remains necessary is commendable and realistic. However, the actual production deployment status and client usage metrics are not detailed. The architecture and methodology shared are valuable for practitioners, but the case study would benefit from concrete performance metrics and lessons from real production operation.

The technical approach—using a supervisor pattern with specialized sub-agents, combining structured data access, RAG, and code generation—represents a sophisticated but increasingly common pattern for enterprise LLM applications. The emphasis on evaluation-driven development and the practical progression framework for accuracy improvement provide actionable guidance for similar projects.
