## Overview
Ripple operates the XRP Ledger (XRPL), a decentralized layer-1 blockchain that has been running since 2012 with over 900 nodes distributed globally. The platform team faced a significant operational challenge: monitoring and troubleshooting this decentralized peer-to-peer network required deep C++ expertise and manual analysis of massive log files (30-50GB per node, totaling 2-2.5 petabytes across their infrastructure). A single incident investigation could take 2-3 days as engineers manually correlated debug logs with the C++ codebase. This created a critical bidirectional dependency between platform engineers and core C++ experts, limiting operational efficiency and feature development velocity.
To address this challenge, Ripple built an AI-powered multi-agent operations platform on AWS that automates the correlation between code and logs, transforming what was a multi-day manual process into a conversational interface that delivers insights in minutes. The solution represents a sophisticated production LLMOps implementation that evolved over approximately one year, moving from initial machine learning concepts through prototyping with AWS's PACE (Prototyping and Cloud Engineering) team to production deployment with AWS ProServe.
## Architecture and Technical Implementation
The system consists of three main components: a multi-agent platform, a log processing pipeline, and a code analysis pipeline. The multi-agent platform serves as the orchestration layer, featuring four specialized AI agents built using the Strands SDK, an open-source framework from AWS designed for multi-agent coordination.
### Multi-Agent Architecture
The **orchestrator agent** serves as the entry point and coordination hub. When users submit queries through a web interface (backed by Amazon API Gateway and Cognito for authentication), the orchestrator performs intent classification to determine which specialist agents to invoke and in what sequence. Given API Gateway's 29-second timeout limitation, the orchestrator immediately creates a task entry in DynamoDB for state management and updates progress asynchronously. This design pattern ensures the system can handle long-running analytical tasks that may exceed API timeout constraints. The orchestrator uses Claude 3.5 Sonnet via Amazon Bedrock and employs two Strands tools for invoking downstream agents via Lambda and HTTP with JWT authentication.
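This hand-off pattern can be sketched in a few lines of Python. The table name, item fields, and the way the background work is kicked off are assumptions for illustration; only the pattern (record the task, return immediately, update status asynchronously) comes from the presentation.

```python
import time
import uuid

import boto3

dynamodb = boto3.resource("dynamodb")
# Hypothetical table name; the real schema is not described in the talk.
tasks = dynamodb.Table("xrpl-agent-tasks")


def create_task(user_query: str) -> str:
    """Record a long-running analysis task so the UI can poll for progress."""
    task_id = str(uuid.uuid4())
    tasks.put_item(
        Item={
            "task_id": task_id,
            "query": user_query,
            "status": "PENDING",
            "created_at": int(time.time()),
        }
    )
    return task_id


def update_task(task_id: str, status: str, result: str | None = None) -> None:
    """Called by the orchestrator as downstream agents report back."""
    tasks.update_item(
        Key={"task_id": task_id},
        UpdateExpression="SET #s = :s, #r = :r",
        ExpressionAttributeNames={"#s": "status", "#r": "result"},
        ExpressionAttributeValues={":s": status, ":r": result or ""},
    )


def handle_api_request(user_query: str) -> dict:
    """API Gateway handler: return the task ID immediately, well under 29 s."""
    task_id = create_task(user_query)
    # The actual analysis is kicked off asynchronously (e.g., an async Lambda
    # invocation or an AgentCore session); omitted here.
    return {"statusCode": 202, "body": {"task_id": task_id}}
```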
The **code analysis agent** is responsible for deriving insights from the XRPL codebase, which is written in C++ and hosted on GitHub. This agent uses a Knowledge Base powered by Amazon Bedrock as its primary tool, querying the graph-based RAG system to retrieve relevant code snippets, function definitions, and log message patterns. The agent also has access to Git-based sync actions that can retrieve recent commits and commit details, enabling it to understand code evolution over time.
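The lookup itself is the standard Bedrock Knowledge Bases Retrieve API. A minimal sketch, assuming a placeholder knowledge base ID and a plain vector-search configuration (Ripple's exact retrieval settings are not disclosed):

```python
import boto3

# Bedrock Knowledge Bases runtime client (Retrieve API).
kb_runtime = boto3.client("bedrock-agent-runtime")

KNOWLEDGE_BASE_ID = "XXXXXXXXXX"  # placeholder; the real ID is not public


def retrieve_code_context(question: str, max_results: int = 25) -> list[dict]:
    """Pull candidate code/doc chunks for a question about the rippled codebase."""
    response = kb_runtime.retrieve(
        knowledgeBaseId=KNOWLEDGE_BASE_ID,
        retrievalQuery={"text": question},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": max_results}
        },
    )
    return [
        {
            "text": r["content"]["text"],
            "score": r.get("score"),
            "source": r.get("location", {}),
        }
        for r in response["retrievalResults"]
    ]


if __name__ == "__main__":
    for chunk in retrieve_code_context("Which log messages are emitted during consensus?"):
        print(chunk["score"], chunk["text"][:80])
```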
The **log analysis agent** performs operational analytics on CloudWatch log groups where all node logs are aggregated. This agent works closely with the query generator agent to formulate accurate CloudWatch Insights queries based on the log patterns and code context provided by the code analysis agent.
The **CloudWatch query generator agent** has a specialized responsibility: generating syntactically accurate CloudWatch Insights queries. It uses a static JSON file stored in S3 as a tool that contains log patterns and estimated pattern counts, helping it form optimal queries with appropriate limits. The agent provides detailed instructions back to the log analysis agent on how to execute queries, including whether they can run in parallel and what time ranges to use.
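A rough sketch of what such a tool and query plan might look like follows; the bucket, key, JSON schema, and thresholds are illustrative assumptions rather than Ripple's actual artifacts:

```python
import json

import boto3

s3 = boto3.client("s3")

# Illustrative location and shape of the pattern-statistics file; the real
# bucket, key, and schema are not described in the presentation.
PATTERN_BUCKET = "xrpl-log-metadata"
PATTERN_KEY = "log-patterns.json"


def load_pattern_stats() -> dict:
    """Static tool data: known log patterns and their estimated counts."""
    body = s3.get_object(Bucket=PATTERN_BUCKET, Key=PATTERN_KEY)["Body"].read()
    return json.loads(body)  # e.g. {"Proposal received": {"est_count_per_hour": 40000}, ...}


def plan_query(pattern: str, start: str, end: str) -> dict:
    """Return a CloudWatch Logs Insights query plus execution hints
    (limits, parallelism, time range) for the log analysis agent."""
    stats = load_pattern_stats().get(pattern, {})
    est = stats.get("est_count_per_hour", 0)
    return {
        "query": (
            f"fields @timestamp, @message | filter @message like /{pattern}/ "
            "| stats count() as occurrences by bin(1h)"
        ),
        "time_range": {"start": start, "end": end},
        # High-volume patterns are aggregated rather than listed row by row.
        "limit": 10000 if est > 10000 else 1000,
        "parallelizable": True,
    }
```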
The orchestrator and query generator agents were initially deployed on AWS Lambda; as the platform matured, Ripple began migrating to Amazon Bedrock AgentCore, a purpose-built serverless runtime environment for AI agents that became generally available during their development cycle. This migration reduces infrastructure management overhead and provides built-in capabilities for agent hosting at production scale.
### Log Processing Pipeline
The log processing pipeline brings operational data from distributed validator nodes, hubs, and client handlers into the cloud for analysis. Raw logs are first ingested into S3 using GitHub workflows orchestrated via AWS Systems Manager (SSM). When logs land in S3, an event trigger invokes a Lambda function that analyzes each file to determine optimal chunking boundaries—respecting log line integrity while adhering to configured chunk sizes. These chunk metadata messages are placed into SQS for distributed processing.
Consumer Lambda functions read from SQS, retrieve only the relevant chunk ranges from S3 based on that metadata, parse individual log lines to extract structured fields (timestamps, severity levels, node identifiers, and so on), and write structured log entries to CloudWatch Logs. This architecture enables parallel processing of massive log volumes while maintaining cost efficiency by only loading necessary data segments.
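A simplified consumer is sketched below. The chunk-message fields, log-line format, and log group naming are assumptions for illustration, and a production version would also handle CloudWatch's PutLogEvents batch limits and log stream creation:

```python
import json
import re
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
logs = boto3.client("logs")

# Illustrative rippled-style line: "2024-Jan-15 10:03:22.123 Consensus:DBG ..."
# The real parsing rules cover more formats than this single regex.
LINE_RE = re.compile(r"^(?P<ts>\S+ \S+) (?P<module>\w+):(?P<level>\w+) (?P<msg>.*)$")


def handler(event, context):
    """SQS-triggered Lambda: fetch one chunk of a log file and emit parsed lines."""
    for record in event["Records"]:
        chunk = json.loads(record["body"])  # assumed fields: bucket, key, start, end, node_id
        byte_range = f"bytes={chunk['start']}-{chunk['end']}"
        obj = s3.get_object(Bucket=chunk["bucket"], Key=chunk["key"], Range=byte_range)
        text = obj["Body"].read().decode("utf-8", errors="replace")

        events = []
        for line in text.splitlines():
            m = LINE_RE.match(line)
            if not m:
                continue
            ts = datetime.strptime(m["ts"], "%Y-%b-%d %H:%M:%S.%f").replace(tzinfo=timezone.utc)
            events.append({
                "timestamp": int(ts.timestamp() * 1000),
                "message": json.dumps({
                    "node": chunk["node_id"],
                    "module": m["module"],
                    "level": m["level"],
                    "message": m["msg"],
                }),
            })

        if events:
            # Assumes the log group/stream already exist and the batch fits
            # PutLogEvents limits; production code would handle both.
            logs.put_log_events(
                logGroupName=f"/xrpl/nodes/{chunk['node_id']}",  # assumed naming
                logStreamName=chunk["key"].replace("/", "-"),
                logEvents=events,
            )
```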
### Code Analysis Pipeline and Graph RAG Implementation
The code analysis pipeline represents one of the most sophisticated aspects of this LLMOps implementation. Rather than using a standard vector database like OpenSearch, Ripple implemented a graph-based RAG approach using Amazon Neptune Analytics. This design choice stems from the need to understand relationships within a large, complex C++ codebase where function calls, module dependencies, and cross-file relationships are critical for accurate code-to-log correlation.
The pipeline monitors two GitHub repositories: the rippled repository (containing the XRPL server software) and the standards repository (containing XRPL specifications and standards). Amazon EventBridge Scheduler triggers periodic synchronization jobs that pull the latest code and documentation changes. A Git repository processor versions these changes and stores them in S3.
The Knowledge Base ingestion job then performs several sophisticated operations. First, it chunks the code and documentation using fixed-size chunking, which the team found works well with structured content like code. These chunks are then processed by the Amazon Titan Text Embeddings V2 model to generate semantic embeddings. Critically, an entity extraction step analyzes each chunk with Claude 3.5 Sonnet to identify domain-specific entities such as function names, class definitions, module references, and other code identifiers.
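Conceptually, the extraction step is a structured-output call per chunk, roughly as sketched below; the prompt wording and output schema are assumptions, while the model identifier is the standard Bedrock ID for Claude 3.5 Sonnet:

```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"

# Hypothetical extraction prompt; the actual prompt and schema are not published.
EXTRACTION_PROMPT = """You are indexing the rippled C++ codebase.
From the code chunk below, list every function name, class name, module,
and log message string it defines or references, plus the relationships
between them (e.g. CALLS, DEFINES, LOGS). Respond as JSON with keys
"entities" and "relationships".

Code chunk:
{chunk}
"""


def extract_entities(chunk: str) -> dict:
    """Ask Claude for the entities and relationships found in one code chunk."""
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": EXTRACTION_PROMPT.format(chunk=chunk)}]}],
        inferenceConfig={"temperature": 0.0, "maxTokens": 2000},
    )
    text = response["output"]["message"]["content"][0]["text"]
    return json.loads(text)  # production code would validate/repair the JSON
```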
These entities and their relationships form a lexical graph stored in Neptune Analytics. The graph structure captures not just semantic similarity (via embeddings) but also explicit relationships like "function A calls function B" or "log message X is defined in file Y." This graph becomes the retrieval layer that enables efficient context retrieval with minimal token usage when agents query the knowledge base.
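Once the lexical graph exists, a retrieval against it can follow explicit edges rather than pure similarity. The sketch below shows what such a lookup might look like in openCypher via Neptune Analytics' ExecuteQuery data-plane API; the graph identifier, node labels, edge types, and response parsing are all assumptions (in practice, Bedrock Knowledge Bases issues these queries on the caller's behalf):

```python
import json

import boto3

# Neptune Analytics data-plane client; ExecuteQuery accepts openCypher.
graph = boto3.client("neptune-graph")

GRAPH_ID = "g-xxxxxxxxxx"  # placeholder graph identifier

# Assumed labels/edges for the lexical graph: (:Function), (:LogMessage),
# and DEFINES/CALLS relationships. The real schema is managed by the
# Knowledge Base and is not documented in the presentation.
CYPHER = """
MATCH (f:Function {name: $fn})-[:DEFINES]->(m:LogMessage)
OPTIONAL MATCH (f)-[:CALLS]->(callee:Function)
RETURN f.name AS function,
       collect(DISTINCT m.text) AS log_messages,
       collect(DISTINCT callee.name) AS callees
"""


def related_log_messages(function_name: str) -> list:
    response = graph.execute_query(
        graphIdentifier=GRAPH_ID,
        queryString=CYPHER,
        language="OPEN_CYPHER",
        parameters={"fn": function_name},
    )
    # Response shape assumed: a JSON payload with a "results" list.
    return json.loads(response["payload"].read())["results"]
```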
### Re-ranking for Improved Retrieval Quality
Ripple implemented a re-ranking layer on top of the graph RAG workflow to further improve retrieval quality. The system first retrieves a broad set of candidate chunks from the Knowledge Base using vector similarity search. These candidates, along with the user query, are then passed to a Cohere Rerank model. The reranker evaluates each candidate document in the context of the specific query, assigning relevance scores from 0 to 1. The system then returns only the top-ranked results (typically top 10) to the LLM, ensuring high-quality context while preserving token budget for generation.
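Isolated from the rest of the pipeline, the reranking step might look like the sketch below. It uses the Cohere Python SDK directly for clarity; Ripple presumably invokes the rerank model through Bedrock, and the model name and top-N value here are illustrative:

```python
import os

import cohere  # pip install cohere

co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])


def rerank_chunks(query: str, chunks: list[str], top_n: int = 10) -> list[dict]:
    """Re-score candidate chunks against the query and keep only the best ones."""
    response = co.rerank(
        model="rerank-english-v3.0",  # illustrative model choice
        query=query,
        documents=chunks,
        top_n=top_n,
    )
    # Each result carries the original index and a 0-1 relevance score.
    return [
        {"chunk": chunks[r.index], "score": r.relevance_score}
        for r in response.results
    ]
```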
In their demonstration, they showed how a chunk originally ranked #4 by vector search was correctly promoted to #1 by the reranker when the query asked "what log messages are defined inside a function," demonstrating the reranker's ability to understand nuanced relationships between query intent and document content.
### Model Context Protocol Integration
The log analysis agent integrates with CloudWatch using the Model Context Protocol (MCP), an open standard developed by Anthropic that enables AI agents to interact with external systems through standardized interfaces. The agent uses two MCP tools: one to execute queries on CloudWatch log groups (returning a query ID) and another to retrieve actual log results using that query ID. This abstraction allows the agent to work with CloudWatch programmatically while maintaining clean separation between agent logic and infrastructure APIs.
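Those two tools presumably wrap the standard CloudWatch Logs Insights API pair, roughly as sketched below (log group names and the polling strategy are illustrative):

```python
import time

import boto3

logs = boto3.client("logs")


def start_insights_query(log_group: str, query: str, start_ts: int, end_ts: int) -> str:
    """Tool 1: kick off a CloudWatch Logs Insights query and return its ID."""
    response = logs.start_query(
        logGroupName=log_group,
        startTime=start_ts,   # epoch seconds
        endTime=end_ts,
        queryString=query,
        limit=10000,
    )
    return response["queryId"]


def get_insights_results(query_id: str, poll_seconds: float = 2.0) -> list[dict]:
    """Tool 2: poll for results until the query completes."""
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] in ("Complete", "Failed", "Cancelled"):
            break
        time.sleep(poll_seconds)
    # Each row is a list of {"field": ..., "value": ...} pairs.
    return [{f["field"]: f["value"] for f in row} for row in response["results"]]
```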
## Prompt Engineering and System Design
Ripple emphasized the critical importance of "prompt hygiene" and what they call "context engineering." Each agent has a carefully crafted system prompt that defines its role, responsibilities, explicit tasks, and strong guardrails about what it should not do. The team noted that while LLMs have significant capabilities, proper context is essential to prevent hallucinations and ensure reliable outputs.
The orchestrator's system prompt focuses on task delegation and coordination. The log analysis agent's prompt positions it as "an expert in analyzing XRPL logs stored in Amazon CloudWatch." The code analysis agent is instructed to understand code dependencies, the XRPL codebase structure, and Git commit relationships. The query generator's prompt emphasizes precision in generating CloudWatch Insights queries. These carefully differentiated prompts enable each agent to operate within its domain of expertise while the Strands framework handles communication, message passing, context management, and multi-agent coordination.
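With the Strands Agents SDK, this differentiation largely comes down to the system prompt and tool set each agent is constructed with. A minimal sketch, assuming the SDK's high-level `Agent`/`@tool` interface and paraphrasing the prompts described in the talk:

```python
from strands import Agent, tool  # pip install strands-agents


@tool
def query_knowledge_base(question: str) -> str:
    """Search the graph RAG knowledge base over the rippled codebase."""
    # Placeholder: in Ripple's system this calls the Bedrock Knowledge Base.
    return "(retrieved code chunks would be returned here)"


code_analysis_agent = Agent(
    system_prompt=(
        "You are an expert in the XRPL (rippled) C++ codebase. Identify code "
        "dependencies, log message definitions, and relevant Git commits. "
        "Do not attempt to analyze runtime logs yourself."
    ),
    tools=[query_knowledge_base],
)

log_analysis_agent = Agent(
    system_prompt=(
        "You are an expert in analyzing XRPL logs stored in Amazon CloudWatch. "
        "Only run queries supplied by the query generator agent; never invent "
        "log patterns that were not provided to you."
    ),
    # In production this agent would carry the MCP-backed CloudWatch tools.
    tools=[],
)

if __name__ == "__main__":
    print(code_analysis_agent("Which function logs proposal messages?"))
```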
## Query Flow and Agent Coordination
A typical end-to-end query demonstrates the sophisticated coordination between agents. When a user asks a question like "for the given time range, how many proposals did a validator see from other peers?" (a proposal being a validator's suggested view of the next ledger), the flow proceeds as follows:
- The query reaches the orchestrator via API Gateway
- The orchestrator classifies the intent and determines it needs both code and log analysis
- It invokes the code analysis agent, which queries the graph RAG knowledge base to find relevant log message patterns in the C++ code, specifically identifying the log line in the consensus.h file that records proposal messages
- The code analysis agent passes this context to the log analysis agent
- The log analysis agent invokes the query generator agent, providing it with the log patterns and the user's question
- The query generator retrieves pattern statistics from S3, generates appropriate CloudWatch Insights queries with proper time ranges and limits, and provides execution instructions (e.g., whether queries can run in parallel); a sketch of this hand-off appears after this list
- The log analysis agent uses MCP tools to execute the queries on CloudWatch and retrieve results
- Results flow back through the agent chain: log analysis agent synthesizes findings, code analysis agent provides code context, and the orchestrator produces a final coherent response
- The UI displays both a summary response and detailed breakdowns from each agent, with full observability into the agent execution chain
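For the proposal-count question, the query generator's hand-off referenced in the list above might carry a payload shaped like the following; the query text, log group, field names, and time range are illustrative rather than taken from the demo:

```python
# Illustrative shape of the query generator's output for the proposal question.
# The actual prompt/response format between Ripple's agents is not published.
query_plan = {
    "log_group": "/xrpl/nodes/validator-01",
    "queries": [
        {
            "purpose": "total proposals received from peers",
            "query": (
                "fields @timestamp, @message "
                "| filter @message like /Processing peer proposal/ "
                "| stats count() as proposals"
            ),
        },
        {
            "purpose": "hourly distribution and sending peers",
            "query": (
                "fields @timestamp, @message "
                "| filter @message like /Processing peer proposal/ "
                "| parse @message /proposal (?<peer_id>\\S+)/ "
                "| stats count() as proposals by bin(1h), peer_id"
            ),
        },
    ],
    "time_range": {"start": "2024-11-01T00:00:00Z", "end": "2024-11-01T06:00:00Z"},
    "run_in_parallel": True,
}
```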
In the demonstration, the system correctly identified that 267,000 proposals were received from other peers in the specified time range, provided hourly distributions, and listed all peer node IDs that sent proposals—going beyond the specific question to provide operational context.
A more complex example demonstrated the system's ability to handle multi-step reasoning. When asked to identify what happened between two specific consensus events (canonical transaction set formation and ledger building), the system needed to understand XRPL's consensus rounds (which occur every 3-5 seconds) and correlate multiple log messages across time. The code analysis agent identified the relevant log patterns, the query generator created queries that would capture the sequence, and the log analysis agent successfully retrieved and summarized all intermediate events with precise timestamps.
## Model Selection and Flexibility
The system primarily uses Claude 3.5 Sonnet for most agents, leveraging its strong performance on code understanding and complex reasoning tasks. However, the architecture provides model flexibility through Amazon Bedrock's unified API. The orchestrator could potentially use a lighter, faster model for intent classification, while specialist agents requiring deep reasoning could use more capable models. The embedding layer uses Amazon Titan Text Embeddings V2, chosen for its balance of performance and cost-effectiveness.
## Observability and Operations
Ripple implemented comprehensive observability using Amazon Bedrock AgentCore's built-in monitoring capabilities. The dashboard tracks total sessions, latency distributions, token usage across agents, and error rates. The team noted this observability was "really helpful for us to improve our agent performance," enabling them to identify bottlenecks, optimize prompt designs, and tune chunking strategies based on actual production usage patterns.
The DynamoDB-based state management provides full auditability of agent decisions, reasoning chains, and tool invocations. This audit trail is essential for debugging unexpected behaviors, understanding why particular answers were generated, and continuously improving system performance.
## Evolution and Development Journey
The project evolved significantly over approximately one year. In Q1, the team developed the vision and initially considered traditional machine learning approaches with model training. In Q2, as agentic AI emerged as an industry pattern, they engaged AWS's PACE prototyping team, who built a working prototype in six weeks that validated the feasibility of the approach. Q3 saw the introduction of Amazon Bedrock AgentCore, and Ripple worked with the preview version, eventually becoming an early adopter of the GA release. In Q4, they partnered with AWS ProServe to productionize the solution with proper VPC configurations, security guardrails, compliance controls, and preparation for release to the XRPL open-source community.
This evolution highlights a key LLMOps principle: willingness to adapt architecture as the ecosystem matures. The team didn't lock into initial technical choices but continuously evaluated new AWS capabilities and migrated when it made sense.
## Operational Impact and Benefits
The platform delivered substantial operational improvements. Tasks that previously required 2-3 days of manual work by C++ experts—parsing gigabytes of noisy peer-to-peer logs, cross-referencing with code, and synthesizing findings—now complete in minutes through the conversational interface. The team specifically highlighted the removal of "bidirectional dependency" between platform engineers and C++ experts: platform engineers no longer wait for expert availability to understand logs, and C++ experts no longer spend time on routine operational queries.
The system proved valuable beyond incident response. Core engineers building new features now use the platform to analyze development and test network logs, comparing them against mainnet behavior to catch potential issues early. During the presentation, Vijay mentioned that for an upcoming standalone release, platform engineers were using the chatbot daily to review logs and provide "thumbs up" confirmations that systems looked healthy, implementing a form of AI-assisted operational validation during release windows.
## Technical Tradeoffs and Considerations
While the presentation naturally emphasized successes, several technical tradeoffs emerge from the architecture:
**Graph RAG vs. Vector Search**: Neptune Analytics adds operational complexity compared to simpler vector stores but provides superior retrieval quality for code relationship queries. This tradeoff makes sense given their specific use case but might not generalize to all log analysis scenarios.
**Multi-Agent Complexity**: The four-agent architecture with orchestration, specialized domains, and inter-agent communication adds latency and potential failure points compared to a single-agent approach. However, it provides better separation of concerns and allows independent optimization of each agent's prompts and models.
**Re-ranking Overhead**: The re-ranking step adds latency and cost (Cohere API calls) but demonstrably improves relevance. The team's decision to retrieve a broad set and then rerank represents a classic precision/recall tradeoff.
**Chunking Strategy**: Fixed-size chunking works well for their structured code but might miss semantic boundaries. They likely chose this over LLM-based chunking for cost and latency reasons, accepting some loss of semantic coherence for operational efficiency.
**API Gateway Timeout**: The 29-second timeout forced an asynchronous architecture with DynamoDB state management. While this adds complexity, it's actually a better pattern for production systems handling variable-latency AI workloads.
## Future Directions and Planned Enhancements
Ripple outlined several planned enhancements. They intend to leverage Amazon Bedrock AgentCore's memory capabilities to maintain conversation context across sessions, eliminating the need for users to repeatedly provide context. They're also interested in AgentCore's built-in identity and role-based access control features for security and compliance.
Beyond operational monitoring, they're exploring two significant use cases. First, blockchain forensics for anti-money laundering: enabling users to input wallet addresses and trace fund flows across transactions to identify the ultimate destination of stolen funds, facilitating faster law enforcement engagement. Second, network-level spam detection: identifying accounts sending "dust transactions" (very small transfers) that create operational burden on validators, allowing proactive community response.
The team also plans to expand agent capabilities, mentioning they're "working on 3 more agents" beyond the current four, though specific domains weren't detailed in this presentation.
## Lessons Learned and Recommendations
The team emphasized several key lessons. "Context engineering is the total key"—while LLMs have significant capabilities, providing proper context prevents hallucinations and ensures reliable outputs. They stressed the importance of capturing agent decisions, reasoning chains, and tool calls in an auditable workflow for debugging and continuous improvement.
They also highlighted the value of engaging cloud provider specialist teams (PACE, ProServe) at appropriate stages: prototyping to validate feasibility quickly, then productionization to ensure security and compliance for community release. This staged engagement model helped them avoid over-engineering early while ensuring production readiness later.
The evolution from initial ML concepts through agentic AI to agent-specific runtime infrastructure demonstrates the importance of architectural flexibility in the rapidly evolving LLMOps landscape.
## Critical Assessment
While this is clearly a well-executed LLMOps implementation, some areas warrant balanced consideration. The presentation comes from an AWS re:Invent session with AWS partnership acknowledged throughout, so there's natural bias toward AWS services. Other cloud providers offer comparable capabilities, and the degree to which this architecture is AWS-specific versus portable isn't addressed.
The operational impact claims (2-3 days to minutes) are dramatic but based on anecdotal evidence rather than rigorous benchmarking. We don't know accuracy rates, false positive/negative rates, or how often the system requires human intervention. The team's emphasis on removing the dependency on C++ experts is positive for operational efficiency but raises questions about edge cases or complex scenarios where expert judgment remains essential.
The graph RAG implementation is sophisticated, but the cost implications aren't discussed. Neptune Analytics, Bedrock API calls, Cohere reranking, and continuous log processing at petabyte scale represent significant ongoing operational costs. The ROI calculation (saved engineering time vs. infrastructure costs) would be valuable context.
Security and privacy considerations for blockchain operational data aren't deeply explored. While they mention working with ProServe on VPCs and guardrails, the specifics of how they protect sensitive node information or prevent potential prompt injection attacks aren't detailed.
Nevertheless, this represents a mature, production-grade LLMOps implementation that addresses a genuine operational pain point with measurable business impact. The multi-agent architecture, graph RAG approach, and integration patterns offer valuable lessons for organizations building similar AI-powered operational tooling.