ZenML

AI Agent for Automated Quality Assurance Testing in Cryptocurrency Platform

Coinbase 2025

Coinbase developed an AI-powered quality assurance agent (qa-ai-agent) to scale their testing efforts for their cryptocurrency platform while reducing costs. The agent processes natural language testing requests and uses visual and textual data to autonomously navigate and test the Coinbase website, eliminating the need for traditional coded test automation. In comparative testing against human QA testers, the AI agent demonstrated 75% accuracy (compared to 80% for humans), detected 300% more bugs in the same timeframe, reduced costs by 86%, and enabled new test creation in 15 minutes to 1.5 hours versus the hours required for human training. The system now executes 40 test scenarios covering localization, UI/UX, compliance, and functional testing, identifying approximately 10 issues weekly, with the goal of replacing 75% of manual testing.

Industry

Finance

Overview

Coinbase, a major cryptocurrency trading platform, developed an AI-powered quality assurance agent called “qa-ai-agent” to fundamentally transform their software testing operations. The initiative emerged from a strategic question about leveraging AI agents to improve product quality while adhering to the principle of “doing more with less.” The company positioned product quality and trust as foundational to their mission of building financial infrastructure for billions of users, recognizing that users gravitate toward the most trusted platforms over time.

The project aimed to achieve a 10x improvement in testing effort at one-tenth the cost of traditional manual testing. This represents a significant LLMOps deployment where an AI agent operates autonomously in a production testing environment, making critical quality decisions that directly impact product releases and customer experience.

Technical Architecture and Implementation

The qa-ai-agent is built on several key technical components that work together to create a production-grade LLM testing system. At its core, the system leverages an open-source LLM browser agent called browser-use, which enables the AI to control browser sessions autonomously. This is a critical architectural decision that differentiates this approach from traditional coded test automation frameworks.

The service architecture includes both gRPC endpoints and WebSocket-based connections to initiate and manage test runs. This dual-protocol approach likely provides flexibility for different integration patterns—gRPC for synchronous, structured communication and WebSockets for real-time streaming of test execution status and results. For data persistence, the system uses MongoDB to store test executions, session history, and issue tracking data. The choice of a document database like MongoDB is well-suited for the variable and evolving structure of test execution data, particularly when dealing with AI-generated results that may not fit rigid schemas.
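To make the persistence layer concrete, here is a minimal sketch of what a test-execution document might look like before being written to MongoDB. The case study confirms MongoDB stores test executions, session history, and issue tracking data, but does not publish a schema; every field name below is a hypothetical illustration.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

# Hypothetical shape of a test-execution record; field names are assumptions,
# not Coinbase's actual schema.
@dataclass
class TestExecution:
    scenario: str                 # the natural-language test prompt
    status: str                   # e.g. "running" | "passed" | "failed"
    started_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    actions: list = field(default_factory=list)   # steps the agent took
    issues: list = field(default_factory=list)    # candidate bugs with scores

run = TestExecution(
    scenario="log into coinbase test account in Brazil, and buy 10 BRL worth of BTC",
    status="running",
)
run.issues.append({"summary": "price widget overlaps CTA", "confidence": 0.62})

# asdict() yields a plain dict, the shape a document store expects
# (e.g. collection.insert_one(doc) in pymongo).
doc = asdict(run)
```

The schemaless document shape is a good fit here precisely because AI-generated traces vary in length and structure from run to run.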

Browser automation capabilities are provided by BrowserStack, enabling remote browser testing across different environments. This is an important production consideration, as it allows the AI agent to test across multiple browsers, operating systems, and device configurations without maintaining physical infrastructure.

A distinctive aspect of the architecture is that the agent processes testing requests in natural language rather than relying on traditional test scripts. For example, a prompt such as “log into coinbase test account in Brazil, and buy 10 BRL worth of BTC” is sufficient to initiate a complete test scenario. The agent directly uses visual and textual data from coinbase.com to determine the next logical action, eliminating the need for text-to-code intermediary steps. This represents a significant shift in how test automation is authored and maintained.
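The loop such an agent runs can be sketched as observe, decide, act, repeat. The stub below is not the browser-use implementation; `decide_next_action` stands in for the LLM call that, in production, would receive a screenshot plus extracted page text, and the page strings are invented for illustration.

```python
# Minimal observe-decide-act loop, assuming a policy function in place of
# the real LLM call. All page contents and targets are hypothetical.

def decide_next_action(task: str, page_text: str) -> dict:
    """Stub for the LLM policy: map current page state to the next step."""
    if "Sign in" in page_text:
        return {"action": "click", "target": "Sign in"}
    if "Buy" in page_text:
        return {"action": "click", "target": "Buy"}
    return {"action": "done"}

def run_agent(task: str, pages: list[str]) -> list[dict]:
    trace = []
    for page_text in pages:
        step = decide_next_action(task, page_text)
        trace.append(step)
        if step["action"] == "done":
            break
    return trace

trace = run_agent(
    "buy 10 BRL worth of BTC",
    ["Welcome — Sign in", "Portfolio — Buy / Sell", "Order placed"],
)
```

Because the decision is made from page content rather than a hard-coded selector path, the same loop keeps working when the layout shifts but the semantics do not.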

Visual and Multimodal Reasoning Capabilities

The qa-ai-agent’s ability to work directly with visual and textual data from the website is a key innovation. Traditional test automation relies on element locators (CSS selectors, XPath, etc.) that break when UI layouts change, leading to the “flakiness” problem that plagues many test automation suites. By contrast, the AI agent appears to use visual understanding and reasoning to identify elements and determine appropriate actions, much like a human tester would.

This approach brings substantial maintenance advantages. The case study notes that minor layout adjustments that would cause traditional tests to fail and require hours of debugging do not impact the AI agent, as long as the underlying feature remains functional. This represents a significant productivity gain in test maintenance, which often consumes more effort than test creation in traditional automation frameworks.

Prompt Engineering and Test Authoring

Creating new test cases involves describing them in natural language, which is positioned as significantly faster and easier to maintain than code. This democratizes test creation, potentially allowing product managers, designers, or other non-technical team members to contribute test scenarios. However, the case study reveals that prompt engineering is not trivial—new tests can be integrated within 15 minutes if the prompt has already been tested, but approximately 1.5 hours is needed if prompt testing is required.

This suggests that while natural language test authoring lowers the barrier compared to coding, there is still a craft and refinement process involved in writing effective prompts that reliably guide the AI agent to execute the intended test scenario. The distinction between “tested” and “untested” prompts implies that there is a library or repository of validated prompt patterns that have been proven to work reliably.
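A prompt library of this kind could be as simple as a registry that flags which prompts have already been validated, matching the 15-minute versus 1.5-hour integration split described above. The structure and names below are hypothetical.

```python
# Hypothetical prompt registry; the case study implies a distinction between
# tested and untested prompts but does not describe its storage format.

PROMPT_LIBRARY = {
    "buy_crypto_brazil": {
        "prompt": "log into coinbase test account in Brazil, and buy 10 BRL worth of BTC",
        "validated": True,   # ~15 minutes to integrate
    },
    "new_staking_flow": {
        "prompt": "stake 0.1 ETH from the test account and confirm the reward schedule",
        "validated": False,  # needs ~1.5 hours of prompt testing first
    },
}

def integration_estimate_minutes(name: str) -> int:
    """Rough planning estimate based on whether the prompt is validated."""
    return 15 if PROMPT_LIBRARY[name]["validated"] else 90

est = integration_estimate_minutes("new_staking_flow")
```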

Evaluation Framework and Performance Metrics

From the project’s inception, Coinbase treated AI performance evaluation as a first-class citizen, establishing the guiding principle that qa-ai-agent should perform at or above the level of human testers. This is exemplary LLMOps practice—establishing clear success criteria before deployment and measuring against them rigorously.

The evaluation framework defined four key metrics:

- Correctness, measured as the share of reported issues accepted as valid by development teams
- The number of bugs detected within a fixed timeframe
- Cost per testing effort, including LLM token spend
- Time required to create and integrate new test scenarios

These metrics provide a balanced view of AI performance that goes beyond simple accuracy measures. The inclusion of “issues accepted as valid by dev teams” as the correctness measure is particularly important, as it grounds the evaluation in real-world utility rather than synthetic benchmarks.

A/B Testing Methodology

To assess performance against human testers, Coinbase employed an A/B testing approach mirroring their standard feature launch process. Both human testers and the AI agent conducted test runs under identical parameters, providing a fair comparison. This methodology demonstrates mature LLMOps practices—treating AI system evaluation with the same rigor as product feature evaluation.

The comparative results reveal important tradeoffs:

- Accuracy: 75% for the AI agent versus 80% for human testers
- Bug detection: 300% more bugs found by the agent in the same timeframe
- Cost: an 86% reduction relative to manual testing
- Test authoring: 15 minutes to 1.5 hours per new scenario, versus hours of training for human testers

The accuracy gap (75% vs 80%) is notable and represents an honest assessment of current limitations. The AI agent produces more false positives than human testers, which could create noise for development teams. However, the dramatic efficiency gain (300% more bugs detected) suggests that even with lower precision, the recall improvement may provide net value—finding more real issues overall, despite also finding more false positives.
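A back-of-envelope calculation makes the recall-versus-precision tradeoff concrete. Assume humans report 100 candidate issues and that "300% more" means the agent reports 400; the precision figures are the study's 80% and 75%. The report counts are assumptions for illustration only.

```python
# Illustrative arithmetic, not figures from the case study:
# does higher recall outweigh lower precision?

human_reports, human_precision = 100, 0.80
agent_reports, agent_precision = 400, 0.75   # "300% more" read as 4x

human_valid = human_reports * human_precision        # real issues from humans
agent_valid = agent_reports * agent_precision        # real issues from the agent
agent_false = agent_reports * (1 - agent_precision)  # false positives to triage
```

Under these assumptions the agent surfaces 300 real issues against the humans' 80, at the cost of roughly 100 false positives consuming developer triage time, which is exactly the noise problem the LLM-as-a-judge stage is meant to address.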

The cost analysis showing 86% reduction is impressive, though it’s important to note this compares token costs to manual testing expenses. The comparison likely includes human tester salaries, training costs, and overhead, making it a comprehensive economic analysis rather than just infrastructure costs.

LLM-as-a-Judge for Quality Control

To address the correctness gap and reduce false positives, Coinbase introduced an innovative solution: using another LLM as a judge to evaluate the quality of identified bugs. Based on artifacts such as screenshots and issue descriptions, a separate LLM evaluates whether the issue is genuine or potentially a false positive, producing a confidence score.

This confidence score is used for filtering out low-confidence issues, effectively creating a two-stage LLM pipeline where the first LLM identifies potential issues and the second LLM validates them. This is a sophisticated LLMOps pattern that addresses a common challenge in AI systems—balancing recall (finding all real issues) with precision (avoiding false alarms).
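The two-stage pattern can be sketched as follows: stage one proposes candidate issues, stage two scores them, and anything below a threshold is dropped. Here `judge_issue` is a stub for the second LLM call, and the 0.7 threshold is an assumed value, not one from the case study.

```python
# Sketch of the two-stage LLM pipeline under stated assumptions.

def judge_issue(issue: dict) -> float:
    """Stub judge: in production this would be a second LLM scoring the
    screenshot and issue description; here it echoes a precomputed score."""
    return issue["raw_confidence"]

def filter_issues(candidates: list[dict], threshold: float = 0.7) -> list[dict]:
    """Keep only issues the judge scores at or above the threshold."""
    accepted = []
    for issue in candidates:
        score = judge_issue(issue)
        if score >= threshold:
            accepted.append({**issue, "judge_score": score})
    return accepted

candidates = [
    {"summary": "layout overlap on pt-BR locale", "raw_confidence": 0.91},
    {"summary": "button label flicker", "raw_confidence": 0.40},
]
kept = filter_issues(candidates)
```

Tuning the threshold is the precision/recall dial: raising it cuts false positives but risks discarding genuine bugs.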

The LLM-as-a-judge approach raises interesting questions about the architecture. Are both stages using the same underlying model or different models? Is the judge LLM prompted differently or fine-tuned specifically for issue validation? The case study doesn’t provide these implementation details, but the pattern itself is valuable for other LLMOps practitioners to consider.

Production Integration and Workflow

The qa-ai-agent is fully integrated into the developer workflow, including Slack and JIRA integrations. This is critical for adoption—AI systems that exist in isolation provide limited value, but integrating test results into existing communication and issue-tracking systems ensures they become part of the natural workflow.

The Slack integration likely provides real-time notifications when tests complete or issues are found, enabling rapid feedback. The JIRA integration presumably creates tickets automatically for identified issues, potentially including screenshots, reproduction steps generated by the AI, and confidence scores from the LLM judge.
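A ticket payload for such an integration might bundle the scenario, the judge's confidence, and a screenshot link. The case study confirms the Slack and JIRA integrations exist but not their payload formats, so every field below is a guess at a plausible shape.

```python
# Hypothetical JIRA ticket builder; field names and labels are assumptions.

def build_jira_ticket(issue: dict) -> dict:
    return {
        "summary": f"[qa-ai-agent] {issue['summary']}",
        "description": "\n".join([
            f"Scenario: {issue['scenario']}",
            f"Judge confidence: {issue['confidence']:.2f}",
            f"Screenshot: {issue['screenshot_url']}",
        ]),
        "labels": ["qa-ai-agent", "auto-filed"],
    }

ticket = build_jira_ticket({
    "summary": "BRL price formatting wrong on buy screen",
    "scenario": "buy 10 BRL worth of BTC",
    "confidence": 0.88,
    "screenshot_url": "https://example.internal/shots/123.png",
})
```

Labeling auto-filed tickets distinctly lets teams filter or audit agent-reported issues separately from human reports.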

Current Production Status and Results

As of the publication date (October 23, 2025), the qa-ai-agent executes 40 test scenarios encompassing localization, UI/UX, compliance, and functional aspects of the Coinbase product experience. The system identifies approximately 10 issues weekly. With two months of accumulated test results at the time of writing, Coinbase had already begun deprecating manual tests that could be entirely supplanted by AI.

The company anticipates that at least 75% of current manual testing will eventually be replaced by AI agents, a goal they describe as “rapidly approaching.” This is an aggressive automation target that suggests high confidence in the system’s capabilities, though it’s worth noting that the case study acknowledges human testers retain advantages in certain areas.

Limitations and Areas Where Humans Excel

The case study candidly acknowledges that human testers retain an advantage in areas challenging for test automation systems, specifically mentioning the user onboarding flow, which requires real human ID and liveness tests (e.g., selfie verification). This is an important limitation—the AI agent cannot complete workflows that require actual identity verification or biometric authentication.

This limitation highlights a broader principle in LLMOps: AI systems excel in certain domains but have clear boundaries. Understanding and respecting these boundaries is crucial for successful deployment. The 75% automation target implicitly acknowledges that 25% of testing will remain manual, at least with current technology.

Critical Assessment and Balanced Perspective

While the case study presents impressive results, it's important to view the claims with appropriate skepticism and context. The 300% increase in bugs detected could reflect several different phenomena:

- Genuinely higher recall, with the agent surfacing real issues that humans miss
- Higher throughput, since the agent can simply execute more test runs in the same timeframe
- A looser severity threshold, with the agent reporting minor issues a human would triage away
- Inflation of the raw count by false positives, consistent with the 75% correctness figure

The 86% cost reduction is substantial, but the comparison baseline matters. If the baseline reflects expensive contractor rates for manual testers working limited hours, the savings would look different against an in-house QA team. Additionally, the token costs represent only the inference costs, not the engineering effort to build, maintain, and improve the system, nor the cost of false positives that consume developer time.

The accuracy gap (75% vs 80%) is presented straightforwardly, which is commendable. However, 75% correctness means one in four identified issues is a false positive, which could strain developer trust over time. The LLM-as-a-judge mechanism helps, but its effectiveness isn’t quantified in the case study—it would be valuable to know how much the confidence scoring improves precision.
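As an illustration of the effect being asked about, one can model a judge that catches a fraction f of false positives while wrongly rejecting a fraction g of true positives. The formula is standard arithmetic; the specific values of f and g below are assumed, since the case study does not quantify the judge's effectiveness.

```python
# Illustrative only: how a judge stage could lift precision.
# p = initial precision, f = share of false positives caught,
# g = share of true positives wrongly rejected (all assumed values).

def precision_after_judge(p: float, f: float, g: float) -> float:
    tp = p * (1 - g)           # true positives surviving the judge
    fp = (1 - p) * (1 - f)     # false positives slipping through
    return tp / (tp + fp)

# Starting at 75% precision, a judge catching 60% of false positives
# while dropping 5% of real issues:
lifted = precision_after_judge(0.75, 0.60, 0.05)
```

Under these assumed values precision rises to roughly 88%, which shows why even an imperfect judge can meaningfully reduce the noise reaching developers.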

The claim that new tests can be created in “15 minutes to 1.5 hours” is faster than coding traditional automation, but it’s not instant. Prompt engineering clearly requires skill and iteration, and there may be a learning curve for team members unfamiliar with prompting AI agents effectively.

LLMOps Maturity and Best Practices Demonstrated

This case study demonstrates several LLMOps best practices that are worth highlighting:

- Defining success criteria (parity with human testers) before deployment
- Benchmarking against a human baseline using the same A/B methodology applied to feature launches
- Grounding correctness in issues accepted as valid by development teams, not synthetic benchmarks
- Adding an LLM-as-a-judge stage to manage the precision/recall tradeoff
- Integrating results into existing Slack and JIRA workflows
- Acknowledging limitations openly and keeping a share of testing manual

The use of A/B testing methodology to evaluate AI performance is particularly noteworthy and represents mature LLMOps practice that many organizations overlook.

Technical Debt and Maintenance Considerations

While the case study emphasizes reduced maintenance compared to traditional automation, operating an AI agent in production creates its own maintenance challenges. The system depends on:

- The behavior of the underlying LLM, which can shift as models are updated
- The open-source browser-use framework and its release cadence
- BrowserStack for remote browser infrastructure
- A library of validated prompts that must track changes to the product
- The LLM-as-a-judge confidence scoring used to filter reported issues

As the Coinbase website evolves, prompts may need refinement even if they’re more resilient than coded selectors. The confidence scoring system may need retraining or adjustment as issue patterns change. These ongoing maintenance needs aren’t addressed in the case study but are important considerations for LLMOps practitioners.

Broader Implications for Software Quality Assurance

This case study represents a significant data point in the evolution of software testing. The shift from coded automation to natural language-driven AI agents could fundamentally change who can author tests, how quickly test coverage can expand, and the economics of quality assurance. However, it also introduces new challenges around AI explainability, trust in AI-identified issues, and the skills required for effective prompt engineering.

The cryptocurrency industry context is relevant—Coinbase operates in a high-stakes environment where security and correctness are paramount, and bugs can result in financial losses or regulatory issues. That they’re willing to deploy AI agents for quality assurance suggests confidence in the technology, though the 75% automation target (not 100%) indicates appropriate caution.

Conclusion

Coinbase’s qa-ai-agent represents a sophisticated production deployment of LLM technology for software quality assurance. The system demonstrates strong LLMOps practices including rigorous evaluation, human baseline comparison, production integration, and honest limitation acknowledgment. The results show meaningful productivity gains and cost reductions, though with accuracy tradeoffs that are mitigated through LLM-as-a-judge mechanisms. The case study provides valuable insights for organizations considering AI agents for testing while maintaining appropriate transparency about limitations and challenges. The 75% automation target and acknowledgment of human advantages in certain testing areas demonstrate pragmatic deployment strategy rather than overselling AI capabilities.
