Company
QyrusAI
Title
AI-Powered Shift-Left Testing Platform with Multiple LLM Agents
Industry
Tech
Year
2025
Summary (short)
QyrusAI developed a comprehensive shift-left testing platform that integrates multiple AI agents powered by Amazon Bedrock's foundation models. The solution addresses the challenge of maintaining quality while accelerating development cycles by implementing AI-driven testing throughout the software development lifecycle. Their implementation resulted in an 80% reduction in defect leakage, 20% reduction in UAT effort, and 36% faster time to market.
## Overview

QyrusAI is a suite of AI-driven testing tools that aims to enhance software testing throughout the entire software development lifecycle (SDLC). The core premise is "shift-left testing": moving testing activities earlier in the development process to identify and resolve problems sooner, thereby reducing costs and improving quality. This case study, published as an AWS blog post, describes how QyrusAI leverages Amazon Bedrock's foundation models to power various specialized AI agents for different testing tasks.

It's important to note that this content is co-authored with QyrusAI representatives and published on AWS's platform, so it naturally presents the technology in a favorable light. The claimed metrics (80% reduction in defect leakage, 20% reduction in UAT effort, 36% faster time to market) come from QyrusAI's own internal testing and pilot programs with select customers, which warrants a degree of caution when evaluating these results.

## Architecture and Infrastructure

The solution is deployed on AWS using a containerized architecture. The AI agents run as Amazon Elastic Container Service (ECS) tasks exposed through Application Load Balancers, which provides scalability and managed infrastructure for the AI-powered testing services. The architecture also leverages Amazon S3 for storage, Amazon Elastic File System (EFS) for the API virtualization service, and AWS Lambda for certain compute tasks.

The central integration point with LLMs is Amazon Bedrock, which provides access to multiple foundation models from different providers without requiring infrastructure management. This is a key architectural decision that allows QyrusAI to leverage models from multiple vendors (Anthropic, Meta, Cohere, Mistral) through a single managed service.

## Custom LLM Integration Layer: The QAI Package

A notable LLMOps aspect of this implementation is QyrusAI's custom-developed `qai` package, which builds on `aiobotocore`, `aioboto3`, and `boto3`. This package serves as a unified interface for accessing the diverse LLMs, VLMs, and embedding models available on Amazon Bedrock. The design philosophy behind this package reflects several LLMOps best practices.

The package standardizes interactions with various models, providing uniformity across the suite of testing agents. This is crucial when working with multiple model providers, as each may have different API conventions, prompt formats, and capabilities. By centralizing Amazon Bedrock interaction logic, QyrusAI has minimized code duplication and enhanced system maintainability, following the DRY (Don't Repeat Yourself) principle.

The architecture enables seamless updates: as Amazon Bedrock evolves and introduces new models or features, updating the `qai` package allows quick integration of these advancements without altering each agent individually. This is particularly important in the rapidly evolving LLM landscape, where new models are released frequently.

The package includes specialized class objects for different model types (LLMs and VLMs) and model families, optimizing interactions based on model requirements. This abstraction layer handles the nuances of different models, including their differing approaches to function calling and JSON mode, which the case study notes can vary significantly among models. Function calling and JSON mode are highlighted as critical requirements for QyrusAI's workflows and agents. To maximize compatibility across the diverse array of models, they implemented consistent interfaces for these features, with specialized classes for various models providing consistent capabilities despite underlying differences.
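The case study does not show the package's internals, but the pattern is straightforward to sketch. The minimal example below is an illustration only: the class name, method names, and model choices are assumptions rather than QyrusAI's actual code, while the Bedrock runtime Converse call follows its documented interface. It shows how a thin wrapper can give every agent one provider-agnostic chat interface with a portable JSON-mode fallback:

```python
# Minimal sketch of a provider-agnostic Bedrock wrapper in the spirit of
# the qai package. Class and method names are illustrative, not QyrusAI's.
import boto3


class UnifiedModel:
    """Wraps one Bedrock model behind a single chat interface."""

    def __init__(self, model_id: str, region: str = "us-east-1"):
        self.model_id = model_id
        self.client = boto3.client("bedrock-runtime", region_name=region)

    def chat(self, prompt: str, json_mode: bool = False) -> str:
        # The Converse API accepts the same message shape for Anthropic,
        # Meta, Mistral, and Cohere models hosted on Bedrock.
        if json_mode:
            # Not every model exposes a native JSON mode, so a portable
            # fallback is to enforce it through the instruction itself.
            prompt += "\nRespond with a single valid JSON object only."
        response = self.client.converse(
            modelId=self.model_id,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
        )
        return response["output"]["message"]["content"][0]["text"]


# Agents select models by role instead of hard-coding provider APIs.
analyst = UnifiedModel("meta.llama3-70b-instruct-v1:0")
judge = UnifiedModel("anthropic.claude-3-5-sonnet-20240620-v1:0")
```

Since the real package is built on `aiobotocore` and `aioboto3`, its calls would presumably be asynchronous; the synchronous `boto3` client is used here only to keep the sketch short.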
## AI Agent Architecture

QyrusAI implements multiple specialized AI agents, each designed for specific testing tasks across the SDLC.

**TestGenerator** is the requirements-to-test-case generator and employs a multi-model approach. Meta's Llama 70B analyzes requirements documents to identify key entities, user actions, and expected behaviors, leveraging in-context learning to infer possible scenarios and edge cases. Anthropic's Claude 3.5 Sonnet acts as a "judge" to evaluate generated test scenarios, assess comprehensiveness and accuracy, highlight missing scenarios, and rank test cases by relevance. Cohere's English Embed model embeds text from large documents such as requirement specifications, enabling semantic search and retrieval, and Pinecone (via AWS Marketplace) stores the embedded documents for fast lookup. This component implements a ReAct (Reasoning and Acting) agent approach in which the LLM thinks, observes, searches for specific requirements in documents, and generates comprehensive test scenarios. It is essentially a RAG (Retrieval-Augmented Generation) implementation for test case generation; a sketch of the retrieval step appears at the end of this section.

**VisionNova** is a design test case generator that uses Anthropic's Claude 3.5 Sonnet to analyze UX/UI design documents and generate design-based test cases. This leverages vision-language model capabilities to understand visual elements and translate them into testable scenarios.

**UXtract** is an agentic workflow that converts Figma prototypes into test scenarios. It uses a layered approach with multiple models: Anthropic's Claude 3 Opus processes prototype graph transitions to identify potential actions and interactions, while Claude 3.5 Sonnet generates detailed test steps and instructions based on those transitions and higher-level objectives. This demonstrates using different model strengths for different subtasks within a single workflow.

**API Builder** creates virtualized APIs for early frontend testing. It interprets API specifications and generates mock responses using various LLMs from Amazon Bedrock, enabling frontend development and testing to proceed before the actual backend is implemented. This service runs on AWS Lambda with Amazon EFS for dynamic data provisioning.

**Echo** generates synthetic test data using a blend of Claude 3 Sonnet, Mistral 8x7B Instruct, and Meta's Llama 70B. Using multiple models for synthetic data generation suggests they're leveraging different model strengths or using ensemble approaches.

**Rover** and **TestPilot** are multi-agent frameworks for exploratory and objective-based testing, respectively. They combine LLMs, VLMs, and embedding models to autonomously explore applications and uncover issues.

**Healer** addresses test maintenance by analyzing test scripts and their current state with various LLMs and VLMs to suggest fixes for common test failures caused by locator issues.
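As referenced in the TestGenerator description above, the retrieval half of such a RAG flow can be sketched as follows. This is a minimal illustration under stated assumptions: the index name, requirement chunks, and query text are hypothetical, while the Cohere Embed request shape on Bedrock and the Pinecone client calls follow their documented interfaces:

```python
# Illustrative retrieval step for a TestGenerator-style RAG flow:
# requirements are embedded with Cohere Embed (English) on Bedrock and
# searched in Pinecone. Index name, chunks, and query are hypothetical.
import json

import boto3
from pinecone import Pinecone

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
index = Pinecone(api_key="YOUR_API_KEY").Index("requirements")


def embed(texts: list[str], input_type: str) -> list[list[float]]:
    # Cohere's Bedrock embedding models distinguish documents
    # ("search_document") from queries ("search_query") via input_type.
    body = json.dumps({"texts": texts, "input_type": input_type})
    resp = bedrock.invoke_model(
        modelId="cohere.embed-english-v3",
        body=body,
        contentType="application/json",
    )
    return json.loads(resp["body"].read())["embeddings"]


# Index requirement chunks once at document-upload time...
chunks = [
    "Users can reset a forgotten password via an emailed link.",
    "Reset links expire after 30 minutes.",
]
index.upsert(vectors=[
    (str(i), vec, {"text": text})
    for i, (vec, text) in enumerate(zip(embed(chunks, "search_document"), chunks))
])

# ...then each ReAct iteration retrieves the requirements relevant to
# the scenario the agent is currently reasoning about.
query_vec = embed(["password recovery behavior"], "search_query")[0]
hits = index.query(vector=query_vec, top_k=5, include_metadata=True)
context = "\n".join(m["metadata"]["text"] for m in hits["matches"])
```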
## Multi-Model Strategy

A key characteristic of this LLMOps implementation is the strategic use of multiple models for different purposes. Rather than relying on a single model, QyrusAI assigns models based on their strengths. Meta's Llama models are used for document analysis and understanding requirements. Anthropic's Claude models serve dual purposes: Claude 3.5 Sonnet is frequently used as a "judge" or evaluator to assess outputs from other models, while also being used for detailed instruction generation, and Claude 3 Opus handles complex transition analysis in prototype workflows. Cohere's embedding models are dedicated to document embedding for semantic search, and Mistral models contribute to synthetic data generation.

This approach demonstrates a sophisticated understanding of model capabilities and tradeoffs. Using Claude as a judge/evaluator layer is particularly notable, as it implements a form of AI-based quality control over AI-generated outputs.

## Workflow Integration Across the SDLC

The case study emphasizes how these agents integrate across the entire software development lifecycle. During requirement analysis, TestGenerator creates initial test cases directly from requirements documentation. In the design phase, VisionNova and UXtract convert Figma designs and prototypes into detailed test cases and functional steps. Pre-implementation activities include API Builder creating virtualized APIs for early frontend testing and Echo generating synthetic test data. During implementation, teams use pre-generated test cases and virtualized APIs for continuous quality checks. In testing, Rover autonomously explores applications while TestPilot conducts objective-based testing. For maintenance, the system supports ongoing regression testing with test management, version control, and reporting features.

## Evaluation of Claims and Considerations

The claimed benefits (80% reduction in defect leakage, 20% reduction in UAT effort, 36% faster time to market) are significant but should be viewed with appropriate skepticism. The metrics are described as coming from "internal testing and pilot programs with select customers," which may not represent typical results across diverse organizations and use cases.

The shift-left testing approach itself is well established in software engineering, and applying LLMs to automate and enhance it is a logical extension. However, the effectiveness of AI-generated test cases depends heavily on the quality of the input requirements documents and the coverage achieved by the generated tests.

The multi-agent architecture with specialized models represents a sophisticated approach to LLMOps, demonstrating thoughtful model selection and orchestration. The custom `qai` package for standardizing model interactions is a practical pattern for organizations working with multiple LLM providers. Relying on Amazon Bedrock as the underlying infrastructure means the solution benefits from AWS's managed-services approach, reducing operational overhead for model hosting and scaling; however, it also creates tight coupling with the AWS ecosystem.

## Key LLMOps Patterns Demonstrated

This case study illustrates several important LLMOps patterns: the use of abstraction layers (the `qai` package) to manage multiple model providers, multi-agent architectures in which different agents specialize in different tasks, RAG implementations for document-based test generation, model composition in which outputs from one model are evaluated or refined by another, function calling and structured output generation for integration with downstream systems, and containerized deployment on AWS managed services for scalability. A minimal sketch of the generate-then-judge composition follows.
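The sketch below makes the model-composition pattern concrete, reusing the hypothetical `UnifiedModel` wrapper from the earlier example. The prompts, the scoring schema, and the revision step are assumptions rather than QyrusAI's actual logic:

```python
# Minimal generate-then-judge loop illustrating the model-composition
# pattern, reusing the hypothetical UnifiedModel wrapper sketched earlier.
# Prompts, the scoring schema, and the revision step are assumptions.
import json

generator = UnifiedModel("meta.llama3-70b-instruct-v1:0")
judge = UnifiedModel("anthropic.claude-3-5-sonnet-20240620-v1:0")

requirement = "Users must be able to reset their password via email."
draft = generator.chat(
    f"Write functional test cases for this requirement:\n{requirement}"
)

verdict = json.loads(judge.chat(
    "Evaluate the test cases below against the requirement. Return JSON "
    "with keys 'score' (1-5), 'missing_scenarios' (list of strings), and "
    f"'ranked_cases' (list of strings).\nRequirement: {requirement}\n"
    f"Test cases:\n{draft}",
    json_mode=True,
))

if verdict["score"] < 4:
    # Feed the judge's identified gaps back to the generator for a second pass.
    draft = generator.chat(
        "Revise these test cases to also cover: "
        f"{', '.join(verdict['missing_scenarios'])}\n\n{draft}"
    )
```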
The implementation shows a mature approach to productionizing LLMs, with attention to maintainability, scalability, and the practical challenges of working across model providers whose capabilities and interfaces differ.
