Multi-Agent System for Prediction Market Resolution Using LangChain and LangGraph

Chaos Labs 2024

Chaos Labs developed Edge AI Oracle, a decentralized multi-agent system built on LangChain and LangGraph for resolving queries in prediction markets. The system uses LLMs from multiple providers, including OpenAI, Anthropic, and Meta, to ensure objective and accurate resolutions. Through a workflow of specialized agents, including research analysts, web scrapers, and bias analysts, the system processes queries and produces transparent, traceable results with configurable consensus requirements.

Industry

Finance


Edge AI Oracle: Multi-Agent System for Prediction Market Resolution

Overview and Context

Chaos Labs announced the alpha release of Edge AI Oracle, a multi-agent system designed to resolve queries in prediction markets. Prediction markets operate by having an “oracle” determine outcomes and resolve bets, and Edge AI Oracle represents an LLM-based approach to this traditionally human-driven or rule-based process. The system is built on LangChain and LangGraph frameworks and aims to provide objective, transparent, and efficient query resolution for questions ranging from election outcomes to sports statistics and award winners.

It’s worth noting that this case study comes from a guest blog post on the LangChain blog, which means there’s an inherent promotional aspect both for Chaos Labs’ product and for the LangChain/LangGraph frameworks. The system is described as being in alpha release, meaning it’s still in early stages and real-world performance data is limited.

Problem Statement

The case study identifies the fundamental challenges that Edge AI Oracle aims to address:

Traditional oracles in prediction markets face challenges around objectivity and transparency. The Chaos Labs team positions their multi-model, multi-agent approach as a way to sidestep these limitations through what they describe as a “decentralized network of agents.”

Technical Architecture and Multi-Agent Design

The core innovation of Edge AI Oracle is its “AI Oracle Council” architecture, which orchestrates multiple agents in a sequential workflow. Each agent has a specific role in the resolution pipeline, and the system leverages models from multiple providers including OpenAI, Anthropic, and Meta to provide diverse perspectives.

Agent Workflow

The multi-agent orchestration follows a directed, sequential flow with six distinct agent roles:

Research Analyst: This agent serves as the entry point for the workflow. It receives the query and performs initial parsing, identifying key data points and required sources for resolution. This step is crucial for understanding what information needs to be gathered to answer questions like “Who won the election?” or “How many goals did Messi score?”

Web Scraper: After the research analyst identifies requirements, the web scraper agent retrieves data from external sources and databases. The system claims to prioritize reputable, verified information sources, though the specific criteria for determining source reliability are not detailed in the case study.

Document Bias Analyst: This agent applies filters to the gathered data and checks for potential bias, aiming to ensure the data pool remains neutral and credible. The inclusion of a dedicated bias-checking step is notable, though the methodology for detecting and filtering bias is not elaborated upon.

Report Writer: The report writer synthesizes all the research and filtered data into a cohesive report, presenting an initial answer based on the analysis conducted by previous agents.

Summarizer: This agent condenses the full report into a concise form, distilling key insights and findings for final processing.

Classifier: The final agent evaluates the summarized output, categorizing and validating it against preset criteria before the workflow concludes.
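As a minimal illustration (not Chaos Labs' actual code), the six-stage sequence can be modeled as functions that each read and extend a shared state dict before handing off to the next stage. All function names, state keys, and placeholder logic below are assumptions for the sketch:

```python
# Illustrative sketch of the six-agent sequential pipeline described above.
# Each "agent" is a stub function over a shared state dict; in the real
# system each stage would call an LLM.

def research_analyst(state):
    # Parse the query and identify the sources needed to resolve it.
    state["required_sources"] = ["official_results_site", "news_wire"]
    return state

def web_scraper(state):
    # Retrieve raw documents from the sources the analyst identified.
    state["documents"] = [f"content from {s}" for s in state["required_sources"]]
    return state

def bias_analyst(state):
    # Filter the document pool; here, a placeholder pass-through filter.
    state["filtered_documents"] = [d for d in state["documents"] if d]
    return state

def report_writer(state):
    # Synthesize the filtered research into a report with an initial answer.
    state["report"] = f"Synthesized report over {len(state['filtered_documents'])} documents"
    return state

def summarizer(state):
    # Condense the full report into a short form for final processing.
    state["summary"] = state["report"][:80]
    return state

def classifier(state):
    # Validate the summarized output against preset criteria.
    state["resolution"] = {"answer": state["summary"], "valid": bool(state["summary"])}
    return state

PIPELINE = [research_analyst, web_scraper, bias_analyst,
            report_writer, summarizer, classifier]

def resolve(query):
    state = {"query": query}
    for agent in PIPELINE:
        state = agent(state)
    return state["resolution"]
```

The directed, sequential handoff is the essential property: each stage consumes only what earlier stages wrote into the state.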

LangChain and LangGraph Integration

The technical foundation of the system relies heavily on the LangChain and LangGraph frameworks. LangChain provides the essential building blocks for each agent, including prompt templates, retrieval tools, and output parsers. A key advantage highlighted is LangChain’s role as a “flexible gateway to multiple frontier models” through a unified API, which simplifies the process of incorporating diverse LLMs into the Oracle Council.

LangGraph enables the graph-based, stateful workflow orchestration that connects all agents. The framework’s support for directed, cyclical workflows allows each agent to build on the work of others in a coordinated manner. The edge-based orchestration ensures smooth handoffs between tasks, creating what the team describes as a “cohesive and logical resolution process.”
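The "unified API" point can be sketched as a provider-agnostic interface: agents depend only on a common `invoke` contract, so models from different providers are interchangeable council members. This is a stdlib sketch; LangChain's own chat-model abstractions differ in detail, and the class names below are stand-ins:

```python
from typing import Protocol

class ChatModel(Protocol):
    """Minimal contract every provider-specific model must satisfy."""
    def invoke(self, prompt: str) -> str: ...

# Stand-ins for provider integrations; in LangChain these would be
# concrete chat-model classes exposing the same interface.
class OpenAIModel:
    def invoke(self, prompt: str) -> str:
        return f"openai:{prompt}"

class AnthropicModel:
    def invoke(self, prompt: str) -> str:
        return f"anthropic:{prompt}"

def run_council(models: list, prompt: str) -> list:
    # Each council member answers the same prompt independently;
    # the calling code never branches on the provider.
    return [m.invoke(prompt) for m in models]
```

Because the orchestration layer only sees the `invoke` contract, swapping a Meta model in for an OpenAI model requires no change to the agent workflow itself.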

Consensus Mechanism and Confidence Requirements

One of the distinctive LLMOps considerations in this system is the configurable consensus mechanism. For the Wintermute Election market deployment, the Oracle Council was configured to require unanimous agreement with over 95% confidence from each Oracle AI Agent. This is a notably high bar that suggests the system prioritizes precision over recall in high-stakes decision-making contexts.

The consensus requirements are described as “fully configurable on a per-market basis,” indicating that different prediction markets can have different thresholds based on their specific needs. The upcoming beta release is expected to give developers and market creators autonomous control over these settings, suggesting a move toward a more self-service platform model.
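The described policy, unanimous agreement with each agent above a 95% confidence bar, configurable per market, reduces to a simple aggregation rule. The sketch below uses assumed names and is not the vendor's implementation:

```python
from dataclasses import dataclass

@dataclass
class ConsensusConfig:
    # Per-market settings; the case study says thresholds are
    # "fully configurable on a per-market basis".
    min_confidence: float = 0.95
    require_unanimous: bool = True

def reach_consensus(votes, config):
    """votes: list of (answer, confidence) pairs from council agents.
    Returns the agreed answer, or None if the market stays unresolved."""
    confident = [(a, c) for a, c in votes if c > config.min_confidence]
    if config.require_unanimous and len(confident) != len(votes):
        return None  # some agent fell below the confidence bar
    answers = {a for a, _ in confident}
    if len(answers) != 1:
        return None  # agents disagree on the answer itself
    return answers.pop()
```

With this rule, a single under-confident or dissenting agent blocks resolution, which matches the precision-over-recall posture described for the Wintermute Election market.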

Production Considerations and LLMOps Challenges

While the case study presents an ambitious architecture, several LLMOps considerations emerge from the described system:

Model Diversity and Provider Management: Running a “decentralized network” of agents across multiple LLM providers (OpenAI, Anthropic, Meta) introduces operational complexity. This includes managing multiple API integrations, handling different rate limits and pricing models, ensuring consistent behavior across models, and dealing with potential availability issues from any single provider.

Stateful Workflow Management: LangGraph’s stateful interactions enable the multi-agent orchestration, but managing state across a complex agent pipeline introduces challenges around error handling, retry logic, and maintaining consistency when individual agents fail or produce unexpected outputs.
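To make the retry concern concrete: one common pattern is to wrap each stage in bounded retries with an output-validation check before the handoff. This is an illustrative sketch, not part of the described system:

```python
import time

class StageFailure(Exception):
    """Raised when a pipeline stage exhausts its retries."""

def run_stage(agent, state, validate, max_retries=2, backoff=0.0):
    """Run one pipeline stage with bounded retries.
    `agent` transforms the state; `validate` checks the output before
    the handoff to the next stage. Names here are assumptions."""
    for attempt in range(max_retries + 1):
        try:
            new_state = agent(state)
            if validate(new_state):
                return new_state
        except Exception:
            pass  # transient failure; fall through to retry
        time.sleep(backoff * attempt)
    raise StageFailure(f"stage {agent.__name__} failed after {max_retries + 1} attempts")
```

Validating between stages is what keeps one agent's malformed output from silently corrupting the state consumed downstream.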

Bias Detection at Scale: The document bias analyst represents an attempt to address bias in retrieved information, but the effectiveness of automated bias detection remains a significant challenge in the LLM space. The case study does not provide details on how bias is detected or what types of bias are filtered.

Consensus and Confidence Calibration: Requiring 95% confidence from each agent raises questions about how confidence is measured and whether LLM confidence scores are well-calibrated for this use case. LLM confidence does not always correlate with factual accuracy, and this is an active area of research.
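One way to make "well-calibrated" testable is expected calibration error: bucket predictions by stated confidence and compare average confidence against observed accuracy in each bucket. This is a standard diagnostic, not something the case study describes:

```python
def expected_calibration_error(predictions, n_bins=10):
    """predictions: list of (confidence, was_correct) pairs.
    Returns the confidence-vs-accuracy gap, weighted by bin size.
    A well-calibrated agent that says 95% should be right ~95% of the time."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in predictions:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    ece, total = 0.0, len(predictions)
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece
```

An oracle gating resolutions on ">95% confidence" would want this gap to be small in the top bins specifically, since that is where all of its resolution decisions live.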

Transparency and Explainability: The case study emphasizes that resolutions are “fully explainable,” which is valuable for prediction markets where users need to understand and trust outcomes. The sequential agent workflow with distinct stages likely supports this by creating an audit trail through the resolution process.

Limitations and Balanced Assessment

The case study, being a promotional announcement, presents the technology optimistically. Several aspects warrant more cautious evaluation:

The system is in alpha release, meaning production-scale performance and reliability have not been demonstrated. No quantitative metrics on accuracy, latency, or cost are provided. The claim of a "decentralized" architecture is also somewhat ambiguous: while multiple LLM providers are used, the orchestration itself appears to be centralized on the Edge Oracle Network.

The effectiveness of multi-model approaches for reducing bias is theoretically sound but depends heavily on implementation details not provided in the case study. Simply using multiple models does not guarantee bias reduction if the models share common training data biases or if the aggregation method is flawed.

The application to prediction markets is interesting because it represents a high-stakes use case where incorrect resolutions have financial consequences. This creates strong incentives for accuracy and transparency, which could drive meaningful innovation in LLMOps practices around validation and verification.

Future Directions

The beta release is expected to provide developers and market creators with autonomous control over consensus settings, suggesting a platform evolution toward greater configurability. The team positions Edge AI Oracle as applicable beyond prediction markets to “blockchain security” and “decentralized data applications,” indicating broader ambitions for the multi-agent oracle architecture.

The integration of research agent patterns with structured consensus mechanisms represents an interesting approach to production LLM systems where reliability and explainability are paramount. As the system matures from alpha to production, watching how Chaos Labs addresses the operational challenges of multi-model, multi-agent orchestration will be valuable for the broader LLMOps community.
