## Overview and Business Context
PayPay, a financial technology company launched in 2018, has experienced substantial growth over seven years, creating significant challenges in maintaining code quality across an expanding codebase with dozens of repositories. The company developed GBB RiskBot as an automated code review assistant that applies historical incident learnings systematically across the organization. The core insight driving this solution is that postmortem data represents highly valuable organizational knowledge that traditionally remains siloed, manually shared, and inconsistently applied across teams.
The business problem PayPay identified centers on four key limitations in traditional code review processes: knowledge silos where incident context remains locked within specific projects and is lost with team member turnover; manual and inconsistent knowledge sharing across teams; recurring issues manifesting across different services due to lack of centralized incident awareness; and variable historical context among reviewers leading to inconsistent risk assessment. These challenges are particularly acute in fast-growing organizations where scaling review practices becomes increasingly difficult.
## Technical Architecture and Implementation
The GBB RiskBot system operates through two primary subsystems working in concert to deliver automated, context-aware code review capabilities. The architecture demonstrates practical LLMOps considerations around cost, performance, and system design.
The **Knowledge Base Ingestion system** runs as a scheduled GitHub Actions cron job that continuously monitors for newly created incident data from multiple sources across the organization. When new incidents are detected, the system performs data preprocessing to extract meaningful information and normalize it into a consistent database format. The normalized incident data is then processed through OpenAI's text-embedding-ada-002 model, wrapped via LangChain, to generate vector embeddings that are stored in ChromaDB, a vector database selected for its low cost and ease of proof-of-concept setup. PayPay explicitly notes they evaluated multiple vector database options including pgvector, Weaviate, Pinecone, and FAISS before settling on ChromaDB for their specific requirements.
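A minimal sketch of what this ingestion step could look like, assuming a simple incident schema and using LangChain's OpenAI embeddings wrapper with a local ChromaDB collection (the field names, collection name, and storage path are illustrative, not PayPay's actual implementation):

```python
# Hypothetical ingestion job: normalize new incidents, embed them, and store in ChromaDB.
# Field names and the "incidents" collection are assumptions for illustration only.
import chromadb
from langchain_openai import OpenAIEmbeddings

embedder = OpenAIEmbeddings(model="text-embedding-ada-002")
chroma = chromadb.PersistentClient(path="./riskbot_db")
# Configure the collection for cosine similarity, matching the retrieval described below.
collection = chroma.get_or_create_collection(
    name="incidents", metadata={"hnsw:space": "cosine"}
)

def ingest_incidents(incidents: list[dict]) -> None:
    """Normalize incident records into plain text and index them as vectors."""
    docs = [
        f"{i['title']}\nRoot cause: {i['root_cause']}\nLesson: {i['lesson']}"
        for i in incidents
    ]
    vectors = embedder.embed_documents(docs)  # one embedding per incident
    collection.add(
        ids=[i["incident_id"] for i in incidents],
        embeddings=vectors,
        documents=docs,
        metadatas=[{"service": i["service"], "date": i["date"]} for i in incidents],
    )
```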
The **Contextual Code Analysis system** activates when developers open pull requests. The system extracts context from the PR including title, description, and all code changes, then converts each element into vector representations using OpenAI embeddings. A critical design decision here involves enforcing a 1000-character limit per item to balance performance, analysis quality, and cost control. Each generated vector is used to query the ChromaDB vector database using cosine similarity metrics to retrieve the top K most similar historical incidents. The retrieved incidents serve as factual grounding for RAG-based response generation, where GPT-4o-mini receives both the code changes and relevant historical incidents through a prompt template to generate contextual GitHub comments highlighting potential risks.
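The per-PR flow could look roughly like the sketch below, reusing the `embedder` and `collection` objects from the ingestion sketch above. The prompt wording, top-K value, and helper names are assumptions; only the 1000-character limit, cosine-similarity retrieval, and GPT-4o-mini synthesis are described in the case study.

```python
# Hypothetical per-PR analysis: truncate each PR item, retrieve similar incidents,
# and ask GPT-4o-mini to summarize the risks for a GitHub comment.
from openai import OpenAI

openai_client = OpenAI()
MAX_CHARS = 1000  # per-item limit described in the case study
TOP_K = 3         # assumed value; the write-up only says "top K"

def analyze_pr(title: str, description: str, diffs: list[str]) -> str:
    # Title, description, and each code change become separate query items.
    items = [text[:MAX_CHARS] for text in [title, description, *diffs]]
    vectors = embedder.embed_documents(items)
    hits = collection.query(query_embeddings=vectors, n_results=TOP_K)

    pr_text = "\n".join(items)
    incidents = "\n---\n".join(doc for docs in hits["documents"] for doc in docs)
    prompt = (
        "You are a code review assistant. Given the PR changes and similar past "
        "incidents below, point out concrete risks, citing the incidents.\n\n"
        f"PR changes:\n{pr_text}\n\nPast incidents:\n{incidents}"
    )
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content  # posted back as a GitHub PR comment
```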
## Model Selection and Architectural Rationale
PayPay's choice of GPT-4o-mini rather than more powerful models represents a thoughtful architectural decision grounded in understanding where intelligence is actually required in the system. The team explicitly justifies this choice by noting that semantic search via vector similarity handles the heavy lifting of identifying relevant patterns and matching them to past incidents. The LLM's role is primarily synthesis and presentation of existing facts rather than complex reasoning or creative problem-solving. Since the system already has concrete examples of what went wrong in similar contexts through retrieval, the model doesn't need advanced reasoning capabilities—it simply needs to articulate findings in readable format. This represents sound LLMOps practice: using appropriately sized models for specific tasks rather than defaulting to the most capable (and expensive) options.
## Cost Structure and Economics
The operational economics of GBB RiskBot demonstrate how RAG-based systems can operate at remarkably low cost when properly architected. The system's costs break down into two primary components: embedding generation for knowledge base indexing and chat model inference for code analysis. Using OpenAI's pricing at the time (text-embedding-ada-002 at $0.10 per 1M tokens; GPT-4o-mini at $0.15 per 1M input tokens and $0.60 per 1M output tokens), spending is driven by a one-time knowledge base initialization plus incremental updates, and by ongoing per-PR analysis.
Per-PR analysis costs vary with the size of the PR context (description length, number of files changed, lines of code modified) and with analysis time, since each item must be compared against the entire vector database. PayPay provides concrete examples: initializing the database with 47 historical incidents cost approximately $0.001852, while analyzing a single PR with one file change cost around $0.000350. Most impressively, running the system across 12 repositories with 380+ bot executions in a single month cost just $0.59 USD, which the team correctly positions as extremely cost-effective compared to the potential cost of production incidents. However, it's worth noting that as the system scales to more repositories, higher PR volumes, or larger codebases, these costs would increase, though likely remaining modest compared to incident costs.
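A back-of-the-envelope cost model makes these figures easy to sanity-check. The prices below come from the case study; the per-incident and per-PR token counts are assumed round numbers for illustration, not measured values.

```python
# Back-of-the-envelope cost model using the pricing quoted above.
# Token counts per incident and per PR are assumptions for illustration.
EMBED_PRICE  = 0.10 / 1_000_000   # USD per token, text-embedding-ada-002
INPUT_PRICE  = 0.15 / 1_000_000   # USD per input token, GPT-4o-mini
OUTPUT_PRICE = 0.60 / 1_000_000   # USD per output token, GPT-4o-mini

def init_cost(num_incidents: int, tokens_per_incident: int = 400) -> float:
    """Embedding cost of indexing the knowledge base."""
    return num_incidents * tokens_per_incident * EMBED_PRICE

def pr_cost(embed_tokens: int = 600, prompt_tokens: int = 1200, output_tokens: int = 300) -> float:
    """Embedding plus chat-completion cost of analyzing one PR."""
    return (embed_tokens * EMBED_PRICE
            + prompt_tokens * INPUT_PRICE
            + output_tokens * OUTPUT_PRICE)

print(f"init 47 incidents: ${init_cost(47):.6f}")  # ~$0.0019, close to the quoted $0.001852
print(f"single-file PR:    ${pr_cost():.6f}")      # ~$0.0004 with these assumed token counts
```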
## Monitoring and Evaluation Framework
PayPay implements a three-tier metrics framework that balances leading indicators for system health with lagging indicators for business value, demonstrating mature thinking about LLM system evaluation. This multi-tiered approach recognizes that different stakeholders need different metrics at different time horizons.
**Tier 1 Core Operational Metrics** serve as real-time leading indicators providing immediate insights into system performance. These include issue detection rate (percentage of analyzed PRs where risks are identified), with explicit recognition that a rate that is too high suggests false positives and leads to "bot fatigue" where developers ignore alerts, while a rate that is too low may indicate insufficient training data or false negatives. The team also tracks distinct incident coverage (number of unique historical incidents referenced) to ensure analyses aren't skewed toward specific incidents, repository coverage across the organization, and knowledge base growth rate to ensure the system's learning capacity improves over time.
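These Tier 1 metrics are straightforward to compute from a log of bot runs. A minimal sketch, assuming a hypothetical per-run record (the field names are not from the case study):

```python
# Illustrative Tier 1 metric computation from a hypothetical log of bot runs.
# Each run record: {"repo": str, "risks_found": bool, "incidents_cited": list[str]}
def tier1_metrics(runs: list[dict]) -> dict:
    analyzed = len(runs)
    flagged = sum(r["risks_found"] for r in runs)
    cited = {incident for r in runs for incident in r["incidents_cited"]}
    return {
        "issue_detection_rate": flagged / analyzed if analyzed else 0.0,
        "distinct_incident_coverage": len(cited),
        "repository_coverage": len({r["repo"] for r in runs}),
    }
```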
**Tier 2 Developer Feedback Metrics** leverage GitHub emoji reactions as a lightweight feedback mechanism. An automated daily workflow collects reactions from the past seven days and stores detailed metrics in an analytics database for trend analysis. This represents a pragmatic approach to gathering human feedback without imposing heavy burdens on developers.
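A daily collector for this feedback could be as simple as the sketch below, which lists recent comments and tallies the reaction rollup that the public GitHub REST API returns on issue comments. The bot account name, repository argument, and the omission of pagination and error handling are all simplifying assumptions.

```python
# Illustrative Tier 2 collector: pull the bot's PR comments from the last 7 days
# and tally emoji reactions. Endpoint and fields follow the public GitHub REST API.
import datetime
import requests

GITHUB_TOKEN = "..."           # placeholder token
BOT_LOGIN = "gbb-riskbot"      # hypothetical bot account name
HEADERS = {"Authorization": f"Bearer {GITHUB_TOKEN}",
           "Accept": "application/vnd.github+json"}

def collect_reactions(repo: str) -> dict[str, int]:
    since = (datetime.datetime.now(datetime.timezone.utc)
             - datetime.timedelta(days=7)).isoformat()
    url = f"https://api.github.com/repos/{repo}/issues/comments"
    comments = requests.get(url, headers=HEADERS, params={"since": since}).json()

    totals: dict[str, int] = {}
    for comment in comments:
        if comment["user"]["login"] != BOT_LOGIN:
            continue  # only count reactions on the bot's own comments
        for emoji, count in comment.get("reactions", {}).items():
            if emoji not in ("url", "total_count") and isinstance(count, int):
                totals[emoji] = totals.get(emoji, 0) + count
    return totals  # e.g. {"+1": 12, "-1": 1, "eyes": 3}; stored in the analytics DB
```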
**Tier 3 Business Impact Metrics** focus on long-term lagging indicators demonstrating ROI and organizational value, primarily tracking whether incident rates decrease over time across repositories using the bot. While this is the ultimate success metric, the team appropriately recognizes it requires longer time horizons to measure meaningfully.
## Production Deployment and Operational Considerations
The system operates as an integrated part of PayPay's development workflow, automatically triggering on pull request creation within GitHub. The use of GitHub Actions for the knowledge base ingestion cron job represents a pragmatic choice leveraging existing CI/CD infrastructure rather than introducing new deployment complexity. The system is deployed across 12 repositories as mentioned in the cost analysis, though the text doesn't specify whether this represents a pilot rollout or full production deployment across the engineering organization.
The integration pattern—posting comments directly on pull requests—minimizes friction in developer workflows by surfacing insights where developers are already working rather than requiring them to consult separate systems. However, the text doesn't discuss important production considerations such as error handling when API calls fail, handling rate limits from OpenAI's API, managing ChromaDB reliability and backup, or how the system handles PRs that might exceed token limits even with the 1000-character chunking strategy.
## Critical Assessment and Limitations
While PayPay presents GBB RiskBot as successful, several aspects warrant balanced consideration. The cost figures presented ($0.59 monthly) are impressively low but may not fully account for infrastructure costs beyond API calls, such as GitHub Actions compute time, ChromaDB hosting, engineering time for development and maintenance, and the human effort required to initially create well-structured postmortem data. The remarkably low cost also raises questions about usage patterns—380+ runs across 12 repositories suggests either low PR volume or selective application, which might limit the system's impact.
The choice of ChromaDB "for low cost and easy POC setup" suggests this may still be somewhat experimental rather than a fully hardened production system. ChromaDB is generally considered appropriate for prototypes and smaller-scale deployments but may face challenges at very large scale compared to enterprise-focused alternatives. The text doesn't discuss performance characteristics, query latency, or whether the system has been stress-tested at significantly higher volumes.
The evaluation framework, while thoughtful, lacks concrete baselines or targets. What issue detection rate is considered optimal? How many developer reactions constitute meaningful signal? Most importantly, the Tier 3 incident rate reduction metric is acknowledged as a lagging indicator but no preliminary results are shared, making it difficult to assess actual business impact. The system appears relatively new given it was posted in October 2025, so long-term effectiveness data may not yet be available.
The text doesn't address several important LLMOps concerns: How is the quality of generated comments evaluated beyond developer reactions? Is there a human-in-the-loop review for high-stakes changes? How does the system handle evolving incident patterns or outdated historical data? What happens when similar historical incidents have contradictory lessons? How is prompt engineering managed and versioned?
## Future Directions and System Evolution
PayPay outlines several planned improvements that reveal current limitations. They intend to upgrade from the ada embedding model to text-embedding-3-large, suggesting current embedding quality may be limiting retrieval performance. The mention of experimenting with "models other than RAG, such as CAG and mem0" indicates exploration of alternative architectures; "CAG" most likely refers to cache-augmented generation, where knowledge is preloaded into the model's context rather than retrieved per query, though the text does not define the term. The reference to mem0 suggests interest in systems with more persistent memory capabilities beyond simple vector retrieval.
The plan to add a "rerank" step after cosine similarity search represents a common pattern in production RAG systems where initial retrieval casts a wide net and reranking provides more sophisticated relevance assessment. This two-stage approach typically improves precision by applying more computationally expensive models only to a smaller candidate set. This future direction implicitly acknowledges that current false positive rates may be higher than desired, supporting earlier observations about the experimental nature of the current deployment.
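One common way to implement such a rerank stage is to apply a cross-encoder to the initial candidate set; the sketch below uses the sentence-transformers library and an off-the-shelf reranking model as an assumption, since the post does not say which reranker PayPay plans to use. It reuses the `embedder` and `collection` objects from the earlier sketches.

```python
# Illustrative two-stage retrieval: wide cosine-similarity recall from ChromaDB,
# then cross-encoder reranking of the candidates. Model choice is an assumption.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_with_rerank(query_text: str, recall_k: int = 20, final_k: int = 3) -> list[str]:
    # Stage 1: cheap vector search casts a wide net.
    query_vec = embedder.embed_query(query_text)
    hits = collection.query(query_embeddings=[query_vec], n_results=recall_k)
    candidates = hits["documents"][0]

    # Stage 2: score each (query, incident) pair with the more expensive cross-encoder,
    # then keep only the highest-scoring incidents for the prompt.
    scores = reranker.predict([(query_text, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:final_k]]
```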
## Knowledge Management and Organizational Learning
A particularly interesting aspect of GBB RiskBot is its role in democratizing organizational knowledge. By converting incident data into a searchable, automatically-applied knowledge base, the system addresses the fundamental challenge that valuable lessons learned from incidents typically remain locked in the minds of specific team members or buried in postmortem documents that few people read. The system effectively makes every code reviewer as knowledgeable as the collective organization's incident history, at least within the scope of the indexed incidents.
However, the quality of this system is fundamentally constrained by the quality of input data. The system's effectiveness depends entirely on having well-documented, consistently-structured incident reports that contain the information necessary to identify similar patterns in code. The text doesn't discuss data quality requirements, how incidents are selected for inclusion, whether there's curation of the incident database, or how the system handles poorly documented incidents. This represents a significant operational consideration: maintaining GBB RiskBot likely requires ongoing investment in incident response practices and postmortem documentation quality, not just the technical system itself.
## Broader LLMOps Lessons
PayPay's GBB RiskBot case study offers several valuable lessons for LLMOps practitioners. First, it demonstrates that RAG architectures can operate at very low cost when appropriately scoped, with semantic search handling the heavy lifting and smaller language models handling synthesis. Second, it shows the value of multi-tiered evaluation frameworks that balance immediate operational metrics with longer-term business impact measures. Third, it illustrates pragmatic technology choices—ChromaDB for ease of prototyping, GitHub Actions for familiar infrastructure, GPT-4o-mini for cost efficiency—that prioritize delivering value quickly over architectural perfection.
The case study also highlights an important application pattern: using LLMs not to generate novel insights but to make existing organizational knowledge more accessible and actionable. This represents a lower-risk, higher-reliability application of LLMs in production compared to systems that rely on models to reason about entirely novel situations. By grounding the LLM's outputs in retrieved factual incident data, PayPay reduces hallucination risk and increases trustworthiness of the system's suggestions.
That said, the case study would benefit from more rigorous evaluation data, clearer discussion of limitations and failure modes, and more details about production operational practices. The presentation focuses heavily on architecture and costs while leaving important questions about effectiveness, reliability, and scale unanswered. This is common in vendor or company blog posts that serve partly as marketing, but it means practitioners should approach the claims with appropriate skepticism and recognize that successful deployment in their own context would require careful adaptation and testing.