Multi-Agent Investment Research Assistant with RAG and Human-in-the-Loop

J.P. Morgan Chase 2025

J.P. Morgan Chase's Private Bank investment research team developed "Ask David," a multi-agent AI system to automate investment research processes that previously required manual database searches and analysis. The system combines structured data querying, RAG for unstructured documents, and proprietary analytics through specialized agents orchestrated by a supervisor agent. While the team claims significant efficiency gains and real-time decision-making capabilities, they acknowledge accuracy limitations requiring human oversight, especially for high-stakes financial decisions involving billions in assets.

Industry

Finance

Technologies

Overview

J.P. Morgan Chase’s Private Bank investment research team presented their journey building “Ask David” at the Interrupt conference, sharing insights into deploying a multi-agent LLM system for automating investment research. The team manages thousands of investment products backed by years of valuable data, and prior to this initiative, answering questions about these products required manual research across databases, files, and materials—a time-consuming process that limited the team’s ability to scale and provide timely insights. The acronym DAVID stands for “Data Analytics Visualization Insights and Decision-making assistant.”

The presentation, delivered by David (representing the business/research side) and Jane (covering technical implementation), candidly addresses both the potential of the system and its limitations. Importantly, the team frames this not as a replacement for human expertise but as an augmentation tool, explicitly acknowledging that “Ask David still consults with real David whenever needed” given the high stakes involved with billions of dollars of client assets.

Technical Architecture

The system is built as a multi-agent architecture with several key components working in coordination:

Supervisor Agent: Acts as the primary orchestrator, interfacing with end users to understand their intent and delegating tasks to specialized sub-agents. The supervisor maintains both short-term and long-term memory to enable personalized user experiences and knows when to invoke human-in-the-loop processes to ensure accuracy and reliability.
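The talk does not show code, but the supervisor behavior described above can be sketched in plain Python. Everything here is an assumption for illustration: the class name, the keyword-based intent routing (a production system would use an LLM classifier), and the escalation sentinel.

```python
# Hypothetical sketch of the supervisor pattern: per-user memory,
# delegation to sub-agents, and a human-in-the-loop fallback.

class Supervisor:
    def __init__(self, sub_agents):
        self.sub_agents = sub_agents       # name -> callable(question) -> answer
        self.long_term_memory = {}         # user_id -> list of past interactions

    def handle(self, user_id, question):
        history = self.long_term_memory.setdefault(user_id, [])
        agent_name = self._classify(question)
        if agent_name is None:
            answer = "escalated-to-human"  # invoke human-in-the-loop
        else:
            answer = self.sub_agents[agent_name](question)
        history.append((question, answer))  # memory enables personalization
        return answer

    def _classify(self, question):
        # Toy intent routing; stands in for an LLM-based intent classifier.
        q = question.lower()
        if "fund" in q:
            return "document_search"
        if "invest" in q:
            return "general_qa"
        return None

sup = Supervisor({
    "document_search": lambda q: f"[docs] {q}",
    "general_qa": lambda q: f"[qa] {q}",
})
print(sup.handle("advisor-1", "Why was this fund terminated?"))
```

The key design point is that escalation to a human is just another routing outcome, so it can be logged and evaluated like any sub-agent call.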

Specialized Sub-Agents: The system employs three main categories of specialized agents, mirroring the capabilities described above: agents that query structured data, RAG agents that retrieve from unstructured documents, and agents that wrap the team's proprietary analytics.

Workflow Graph Structure: The end-to-end workflow begins with a planning node that routes queries to one of two main subgraphs—a general QA flow for broad questions (e.g., “how do I invest in gold?”) and a specific fund flow for product-specific inquiries. Each subgraph contains its own supervisor agent and team of specialized agents.
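A minimal sketch of the two-subgraph routing described above, in plain Python rather than any particular orchestration framework; all function names are assumptions, and each subgraph stub stands in for its own supervisor and agent team:

```python
# Illustrative routing: a planning node dispatches to one of two subgraphs.

def plan(query: str) -> str:
    """Planning node: classify intent. A real system would use an LLM here."""
    return "specific_fund" if "fund" in query.lower() else "general_qa"

def general_qa_subgraph(query: str) -> str:
    # Subgraph for broad questions, e.g. "how do I invest in gold?"
    return f"general answer for: {query}"

def specific_fund_subgraph(query: str) -> str:
    # Subgraph for product-specific inquiries about a particular fund.
    return f"fund-specific answer for: {query}"

SUBGRAPHS = {"general_qa": general_qa_subgraph,
             "specific_fund": specific_fund_subgraph}

def ask(query: str) -> str:
    return SUBGRAPHS[plan(query)](query)

print(ask("how do I invest in gold?"))
print(ask("Why was fund XYZ terminated?"))
```

Keeping the planner's output a small closed set of labels makes the routing easy to test and to extend with new subgraphs later.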

Post-Processing Nodes: After retrieving answers, the system includes a personalization node that tailors the response to whoever is asking, a reflection node that validates the answer, and a summarization node that appends reference links.
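The post-processing described in the fund-termination walkthrough later in this piece (personalize, validate through reflection, summarize with reference links) composes naturally as a function chain. This is a hedged sketch; every function body is an invented placeholder:

```python
# Sketch of a post-processing chain: personalize -> reflect -> summarize.

def personalize(answer: str, role: str) -> str:
    return f"[for {role}] {answer}"

def reflect(answer: str, source_docs: list) -> str:
    # Reflection node: accept the answer only if it is backed by a source.
    return answer if source_docs else "insufficient evidence - escalate"

def summarize(answer: str, links: list) -> str:
    return answer + " | references: " + ", ".join(links)

def post_process(answer, role, source_docs, links):
    return summarize(reflect(personalize(answer, role), source_docs), links)

result = post_process("Fund closed after manager departure.",
                      role="financial advisor",
                      source_docs=["status_history.pdf"],
                      links=["doc://status_history"])
```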

Development Methodology

The team emphasized an iterative development approach with several key principles:

Start Simple and Refactor Often: Rather than building the complex multi-agent system from day one, the team followed an evolutionary path. They started with a simple ReAct agent to understand fundamentals, then built specialized agents (particularly the RAG agent), then integrated these into a multi-agent flow with a supervisor, and finally developed the current subgraph architecture that can scale to handle different intent types.
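The "simple ReAct agent" starting point can be reduced to a short loop: the model alternates between proposing an action and reading the tool's observation until it emits a final answer. The stubbed model, tool names, and prompt format below are all illustrative assumptions:

```python
# Minimal ReAct-style loop. The "LLM" is a stub for demonstration.

def llm(prompt: str) -> str:
    # Stand-in for a model call: after seeing an observation, answer.
    if "Observation:" in prompt:
        return "Final Answer: fund was terminated in 2023"
    return "Action: lookup_fund[XYZ]"

TOOLS = {"lookup_fund": lambda arg: f"{arg} status history: terminated 2023"}

def react(question: str, max_steps: int = 5) -> str:
    prompt = f"Question: {question}"
    for _ in range(max_steps):
        step = llm(prompt)
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer: ").strip()
        # Parse "Action: tool[arg]" and run the tool.
        tool, arg = step.removeprefix("Action: ").rstrip("]").split("[")
        prompt += f"\n{step}\nObservation: {TOOLS[tool](arg)}"
    return "no answer"

print(react("Why was fund XYZ terminated?"))
```

Starting from a loop this small makes it clear what the later refactors add: specialized agents replace the flat tool list, and the supervisor replaces the single loop.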

Evaluation-Driven Development: The team strongly advocates for starting evaluation early, noting that compared to traditional AI projects, GenAI projects have shorter development phases but longer evaluation phases. They recommend defining metrics and goals early, with accuracy being paramount in financial services, and using continuous evaluation to build confidence in improvements.
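Starting evaluation early typically means maintaining a golden question set and scoring every system revision against it. The data, matching rule, and target threshold below are invented for illustration; a real harness would use richer metrics such as LLM-as-judge or retrieval grading:

```python
# Sketch of evaluation-driven development: score a system against a
# golden set on every change, and gate releases on a target accuracy.

GOLDEN_SET = [
    {"question": "Why was fund XYZ terminated?", "expected": "manager departure"},
    {"question": "How do I invest in gold?", "expected": "gold etf"},
]

def evaluate(system, golden_set, target=0.9):
    hits = sum(1 for case in golden_set
               if case["expected"] in system(case["question"]).lower())
    accuracy = hits / len(golden_set)
    return accuracy, accuracy >= target

def toy_system(question: str) -> str:
    return ("The fund was closed after a manager departure."
            if "fund" in question.lower() else "Consider a gold ETF.")

accuracy, meets_target = evaluate(toy_system, GOLDEN_SET)
```

Tracking this number across revisions is what lets a team claim, with evidence, that a refactor improved the system rather than just changed it.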

The team shared specific evaluation practices:

Accuracy Progression and Human-in-the-Loop

The team presented a realistic framework for accuracy improvement when applying general models to specific domains:

For the last mile, human-in-the-loop is essential. The team explicitly states that with billions of dollars at stake, they cannot afford inaccuracy, so the AI system still consults with human experts when needed. This is a refreshingly honest acknowledgment of LLM limitations in high-stakes domains.
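One common way to implement this last-mile gate, sketched here with an invented confidence threshold and a stubbed model, is to defer any low-confidence answer to a human analyst rather than returning it directly:

```python
# Sketch of a confidence-gated human-in-the-loop check.

def answer_with_hitl(question, model, confidence_threshold=0.85):
    answer, confidence = model(question)
    if confidence < confidence_threshold:
        # Below threshold: route to a human expert before the client sees it.
        return {"answer": answer, "status": "pending_human_review"}
    return {"answer": answer, "status": "auto_approved"}

def toy_model(question: str):
    # Stand-in for an LLM returning an answer plus a calibrated confidence.
    return ("Fund terminated due to strategy change.",
            0.6 if "terminated" in question else 0.95)

low = answer_with_hitl("Why was the fund terminated?", toy_model)
high = answer_with_hitl("List gold funds", toy_model)
```

The hard part in practice is not the gate itself but producing a confidence signal calibrated enough to trust, which is one reason the team keeps humans in the loop regardless.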

Use Case Example

The presentation included a walkthrough of a real scenario: a client asking “Why was this fund terminated?” during a meeting with their financial advisor. Previously, this would require the advisor to contact the research team, work with human analysts to understand status change history, research the fund, identify similar alternatives, and manually prepare a client-appropriate presentation.

With Ask David, the workflow proceeds as follows: the planning node identifies this as a fund-specific query, routes to the appropriate subgraph, the supervisor agent extracts fund context and delegates to the document search agent, which retrieves data from MongoDB. The answer is then personalized based on who is asking, validated through reflection, and summarized with reference links for the advisor to explore further.

Key Takeaways and Lessons Learned

The team concluded with three main takeaways for practitioners building similar systems:

Critical Assessment

While the presentation provides valuable insights into building production multi-agent systems in finance, several aspects warrant balanced consideration. The team does not share specific accuracy numbers or quantitative results, making it difficult to assess actual production performance. The system appears to still be in development or early deployment phases given the forward-looking language used (“aiming to provide,” “we are making our vision a reality”).

The honest acknowledgment that 100% accuracy may not be achievable and that human oversight remains necessary is commendable and realistic. However, the actual production deployment status and client usage metrics are not detailed. The architecture and methodology shared are valuable for practitioners, but the case study would benefit from concrete performance metrics and lessons from real production operation.

The technical approach—using a supervisor pattern with specialized sub-agents, combining structured data access, RAG, and code generation—represents a sophisticated but increasingly common pattern for enterprise LLM applications. The emphasis on evaluation-driven development and the practical progression framework for accuracy improvement provide actionable guidance for similar projects.
