A case study exploring the application of LLMs (specifically GPT-3.5 Turbo) in automated test case generation for software applications. The research developed a semi-automated approach using prompt engineering and LangChain to generate test cases from software specifications. The study evaluated the quality of AI-generated test cases against manually written ones for the Da.tes platform, finding comparable quality metrics between AI and human-generated tests, with AI tests scoring slightly higher (4.31 vs 4.18) across correctness, consistency, and completeness factors.
This academic case study from Cesar explores the practical application of Large Language Models (LLMs) for automating test case construction in software engineering. The research team conducted a systematic investigation into using OpenAI’s GPT-3.5 Turbo model combined with the LangChain framework to generate test cases for Da.tes, a real-world production web platform that connects startups, investors, and businesses. The study provides valuable insights into both the potential and limitations of using LLMs for software testing automation, with a focus on practical implementation details and empirical evaluation.
Da.tes is a production web platform designed to create opportunities connecting startups, investors, and businesses. It features diverse functionalities including recommendation systems, matching algorithms for user profiles, and a data-driven ecosystem for business decision-making. The researchers deliberately chose this real-world application to evaluate LLM-based test case generation across various components and functionalities, providing a more realistic assessment than hypothetical scenarios would offer.
The implementation leverages the OpenAI API with GPT-3.5 Turbo as the underlying language model, orchestrated through the LangChain Python framework. LangChain was specifically chosen for its capabilities in managing prompts, inserting context, handling memory, and facilitating interactive multi-step workflows. This architectural decision reflects a common pattern in LLMOps where orchestration frameworks abstract away complexity in managing LLM interactions.
The researchers developed an interactive, multi-step approach to test case generation that breaks down the complex task into intermediate stages. This design follows the chain-of-thought prompting principle, allowing the model to engage in reasoning by decomposing tasks into manageable steps. The workflow consists of three main stages:
Stage 1 - Requirements Generation: The first prompt instructs the model to act as a QA Engineer working on a project. It receives the application description template and generates a requirements document. The output is structured to ensure it can be consumed by subsequent prompts.
Stage 2 - Test Conditions Generation: The second prompt takes the output from stage one along with the original template. It generates test conditions in JSON format, grouping requirements by functional or non-functional categories. This intermediate step helps the model “think” and analyze before producing final test cases.
Stage 3 - Test Cases Generation: The final stage combines outputs from both previous stages to generate structured test cases containing title, preconditions, steps, expected results, test data, and test classification. The output is formatted in markdown for easy conversion to PDF documentation.
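The three-stage flow amounts to a sequential prompt chain in which each stage's output is injected into the next stage's prompt. The sketch below is a plain-Python approximation of that pattern; the study used LangChain for orchestration, and the `call_llm` stub and prompt wordings here are illustrative assumptions, not the authors' actual prompts.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a GPT-3.5 Turbo call (e.g., via LangChain / the
    OpenAI API). Stubbed here so the chaining pattern is visible."""
    return f"<model output for: {prompt[:40]}...>"

def generate_test_cases(feature_template: str) -> str:
    # Stage 1: requirements document from the feature description template.
    requirements = call_llm(
        "You are a QA Engineer working on a project.\n"
        f"Application description:\n{feature_template}\n"
        "Produce a structured requirements document."
    )
    # Stage 2: test conditions in JSON, grouped by functional /
    # non-functional category, from the stage-1 output plus the template.
    conditions = call_llm(
        "You are a QA Engineer.\n"
        f"Template:\n{feature_template}\nRequirements:\n{requirements}\n"
        "List test conditions as JSON grouped by requirement category."
    )
    # Stage 3: final test cases in markdown, combining both prior outputs.
    return call_llm(
        "You are a QA Engineer.\n"
        f"Requirements:\n{requirements}\nTest conditions:\n{conditions}\n"
        "Write test cases in markdown with title, preconditions, steps, "
        "expected results, test data, and classification."
    )
```

The essential property is that stages 2 and 3 consume earlier outputs verbatim, which is why stage 1 and 2 prompts must constrain their output format.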
Each prompt in the pipeline employs role prompting, a technique where the model is given a specific persona (in this case, a QA Engineer) to influence how it interprets inputs and generates responses. This approach provides context that shapes the model’s technical perspective and improves response quality for domain-specific tasks.
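In chat-completion terms, role prompting typically means pinning the persona in the system message so that it conditions every subsequent turn. A minimal sketch (the exact persona wording below is assumed, not quoted from the study):

```python
def build_messages(persona: str, task: str) -> list[dict]:
    """Compose a chat-completion payload that fixes the model's persona.
    The system message shapes how all later user inputs are interpreted."""
    return [
        {"role": "system", "content": f"You are a {persona}."},
        {"role": "user", "content": task},
    ]

messages = build_messages(
    "QA Engineer working on a web platform",
    "Generate test conditions for the login feature.",
)
```

Keeping the persona in the system slot, rather than repeating it inside each user prompt, also makes it trivial to reuse across all three pipeline stages.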
A key finding from the research was the importance of structured input for preventing hallucinations and ensuring relevant test case generation. Through empirical experimentation, the team developed a template that captures the essential information about the application feature under test.
Critically, the researchers discovered that the template must be completed separately for each feature in the software. Attempting to describe all features in a single template produced worse results, as GPT-3.5 Turbo struggled to handle multiple software features in one context. This finding has important implications for production deployments, suggesting that complex applications require decomposition into feature-level inputs.
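The per-feature constraint translates naturally into a render-and-iterate loop: one template instance per feature, one generation run per template. The field names and feature examples below are hypothetical stand-ins; the study's actual template was developed empirically and is not reproduced here.

```python
# Hypothetical per-feature template fields (illustrative only).
features = [
    {
        "name": "Login",
        "description": "Users authenticate with email and password.",
        "inputs": "email, password",
        "expected_behavior": "Valid credentials open the dashboard; "
                             "invalid ones show an error.",
    },
    {
        "name": "Profile matching",
        "description": "Suggests investor matches for a startup profile.",
        "inputs": "startup profile data",
        "expected_behavior": "A ranked list of candidate investors.",
    },
]

def render_template(feature: dict) -> str:
    """One template per feature: the study found that packing several
    features into a single template degraded GPT-3.5 Turbo's output."""
    return (
        f"Feature: {feature['name']}\n"
        f"Description: {feature['description']}\n"
        f"Inputs: {feature['inputs']}\n"
        f"Expected behavior: {feature['expected_behavior']}\n"
    )

templates = [render_template(f) for f in features]  # one LLM run each
```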
The evaluation framework was designed around documentation quality factors identified in software testing literature. Given constraints on accessing code coverage metrics, execution history, and defect data for the Da.tes platform, the researchers focused on three key documentation quality sub-factors: Correctness, Completeness, and Consistency. These were selected based on literature identifying them as the most critical quality factors for test cases.
The study employed a blind evaluation approach where 10 QA engineer volunteers assessed test cases without knowing whether they were AI-generated or human-written:
Form 1 - Quality Factor Scoring: Evaluators rated 10 AI-generated and 10 human-written test cases on a 1-5 scale (Very Poor to Very Good) across the three quality factors.
Form 2 - A/B Comparison: Similar test cases (testing the same software paths) were placed side by side, with evaluators choosing their preferred option and providing justification for their selection.
While only 7 of 10 expected responses were collected, the researchers noted consistency across evaluator responses, suggesting the sample still provided meaningful insights.
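The Form 1 scores can be aggregated per source while keeping the evaluation blind: evaluators rate anonymized test-case IDs, and the ID-to-source mapping is revealed only at analysis time. The sketch below uses made-up ratings to show the shape of that analysis; the study's per-response data is not public.

```python
from statistics import mean

# Each record: (blinded test-case id, quality factor, 1-5 rating).
# Evaluators never see whether an id maps to the AI or human set.
ratings = [
    ("tc-01", "correctness", 4), ("tc-01", "completeness", 5),
    ("tc-01", "consistency", 4), ("tc-02", "correctness", 4),
    ("tc-02", "completeness", 4), ("tc-02", "consistency", 5),
]
# Revealed only after scoring, for the final AI-vs-human comparison.
source = {"tc-01": "ai", "tc-02": "human"}

def mean_score_by_source(ratings, source):
    buckets = {}
    for case_id, _factor, score in ratings:
        buckets.setdefault(source[case_id], []).append(score)
    return {src: round(mean(scores), 2) for src, scores in buckets.items()}

print(mean_score_by_source(ratings, source))  # {'ai': 4.33, 'human': 4.33}
```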
The evaluation results showed AI-generated test cases achieving comparable or slightly better scores than human-written ones, averaging 4.31 versus 4.18 across the three quality factors.
In the A/B comparison, 58.6% of responses (41 of 70) preferred AI-generated test cases. Interestingly, two specific test cases showed unanimous preference for human-written versions, while one showed unanimous preference for AI-generated content, suggesting context-dependent performance variations.
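The headline preference figure follows directly from the raw counts: 7 evaluators each judged 10 A/B pairs, giving 70 responses.

```python
evaluators, pairs = 7, 10
total = evaluators * pairs             # 70 A/B responses in all
ai_preferred = 41                      # responses favoring the AI version
share = round(100 * ai_preferred / total, 1)
print(share)  # 58.6
```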
Qualitative feedback highlighted that AI-generated test cases excelled in writing quality, simplicity, directness, clarity, and completeness; evaluators frequently cited these characteristics when justifying a preference for the AI-generated option.
The study identified several important limitations relevant to production deployments:
Feature Isolation Requirement: The model could not effectively handle multiple software features described in a single template. This necessitates feature-by-feature processing, which increases operational complexity.
Integration Testing Challenges: Cross-feature dependencies proved problematic. Integration tests require the QA engineer to explicitly describe feature interdependencies, increasing template size. However, larger templates correlated with degraded results, creating a difficult trade-off.
Human Input Dependency: The process is not fully automated and relies on human input to provide correctly formatted feature descriptions. The quality of output depends heavily on the quality of input templates.
Uneven Performance Across Features: While the model performed well with common features like login or signup, it struggled with more complex or unusual functionality, suggesting potential issues with less common software patterns.
This case study provides several insights relevant to LLMOps practitioners:
Orchestration Patterns: The use of LangChain for managing multi-step prompts, context injection, and memory handling demonstrates a common orchestration pattern for complex LLM workflows. The sequential prompt chaining with intermediate outputs feeding subsequent stages is a reusable architectural pattern.
Prompt Engineering as Development: The extensive iteration required to develop effective templates and prompts highlights that prompt engineering is a significant development effort, not a trivial configuration task. The empirical testing approach used to refine templates mirrors traditional software development practices.
Quality Evaluation Frameworks: The study’s approach to evaluating LLM outputs through blind comparison with human-produced artifacts provides a template for assessing LLM-generated content quality in production contexts.
Decomposition Strategy: The finding that complex inputs degrade performance suggests a general principle for LLMOps: decomposing complex tasks into smaller, focused operations may yield better results than attempting comprehensive single-shot generation.
Semi-Automation Reality: Despite automation goals, the study reveals that effective LLM-based test generation remains semi-automated, requiring skilled human input for context provision and potentially for review of outputs. This has implications for workforce planning and process design in organizations adopting these technologies.
The researchers identified additional studies needed around time, cost, and learning curve comparisons between LLM-assisted and manual test case creation. These operational metrics would be essential for organizations considering production deployment of similar systems.