A case study exploring the application of LLMs (specifically GPT-3.5 Turbo) in automated test case generation for software applications. The research developed a semi-automated approach using prompt engineering and LangChain to generate test cases from software specifications. The study evaluated the quality of AI-generated test cases against manually written ones for the Da.tes platform, finding comparable quality metrics between AI and human-generated tests, with AI tests scoring slightly higher (4.31 vs 4.18) across correctness, consistency, and completeness factors.
This academic case study from Cesar explores the practical application of Large Language Models (LLMs) for automating test case construction in software engineering. The research team conducted a systematic investigation into using OpenAI’s GPT-3.5 Turbo model combined with the LangChain framework to generate test cases for Da.tes, a real-world production web platform that connects startups, investors, and businesses. The study provides valuable insights into both the potential and limitations of using LLMs for software testing automation, with a focus on practical implementation details and empirical evaluation.
Da.tes is a production web platform designed to create opportunities connecting startups, investors, and businesses. It features diverse functionalities including recommendation systems, matching algorithms for user profiles, and a data-driven ecosystem for business decision-making. The researchers deliberately chose this real-world application to evaluate LLM-based test case generation across various components and functionalities, providing a more realistic assessment than hypothetical scenarios would offer.
The implementation leverages the OpenAI API with GPT-3.5 Turbo as the underlying language model, orchestrated through the LangChain Python framework. LangChain was specifically chosen for its capabilities in managing prompts, inserting context, handling memory, and facilitating interactive multi-step workflows. This architectural decision reflects a common pattern in LLMOps where orchestration frameworks abstract away complexity in managing LLM interactions.
The researchers developed an interactive, multi-step approach to test case generation that breaks down the complex task into intermediate stages. This design follows the chain-of-thought prompting principle, allowing the model to engage in reasoning by decomposing tasks into manageable steps. The workflow consists of three main stages:
Stage 1 - Requirements Generation: The first prompt instructs the model to act as a QA Engineer working on a project. It receives the application description template and generates a requirements document. The output is structured to ensure it can be consumed by subsequent prompts.
Stage 2 - Test Conditions Generation: The second prompt takes the output from stage one along with the original template. It generates test conditions in JSON format, grouping requirements by functional or non-functional categories. This intermediate step helps the model “think” and analyze before producing final test cases.
Stage 3 - Test Cases Generation: The final stage combines outputs from both previous stages to generate structured test cases containing title, preconditions, steps, expected results, test data, and test classification. The output is formatted in markdown for easy conversion to PDF documentation.
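The three-stage flow amounts to a sequential prompt chain in which each stage's output is injected into the next stage's prompt. The sketch below is a plain-Python approximation of that pattern; the study used LangChain for orchestration, and the `call_llm` stub and prompt wordings here are illustrative assumptions, not the authors' actual prompts.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a GPT-3.5 Turbo call (e.g., via LangChain / the
    OpenAI API). Stubbed here so the chaining pattern is visible."""
    return f"<model output for: {prompt[:40]}...>"

def generate_test_cases(feature_template: str) -> str:
    # Stage 1: requirements document from the feature description template.
    requirements = call_llm(
        "You are a QA Engineer working on a project.\n"
        f"Application description:\n{feature_template}\n"
        "Produce a structured requirements document."
    )
    # Stage 2: test conditions in JSON, grouped by functional /
    # non-functional category, from the stage-1 output plus the template.
    conditions = call_llm(
        "You are a QA Engineer.\n"
        f"Template:\n{feature_template}\nRequirements:\n{requirements}\n"
        "List test conditions as JSON grouped by requirement category."
    )
    # Stage 3: final test cases in markdown, combining both prior outputs.
    return call_llm(
        "You are a QA Engineer.\n"
        f"Requirements:\n{requirements}\nTest conditions:\n{conditions}\n"
        "Write test cases in markdown with title, preconditions, steps, "
        "expected results, test data, and classification."
    )
```

The essential property is that stages 2 and 3 consume earlier outputs verbatim, which is why stage 1 and 2 prompts must constrain their output format.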
Each prompt in the pipeline employs role prompting, a technique where the model is given a specific persona (in this case, a QA Engineer) to influence how it interprets inputs and generates responses. This approach provides context that shapes the model’s technical perspective and improves response quality for domain-specific tasks.
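In chat-completion terms, role prompting typically means pinning the persona in the system message so that it conditions every subsequent turn. A minimal sketch (the exact persona wording below is assumed, not quoted from the study):

```python
def build_messages(persona: str, task: str) -> list[dict]:
    """Compose a chat-completion payload that fixes the model's persona.
    The system message shapes how all later user inputs are interpreted."""
    return [
        {"role": "system", "content": f"You are a {persona}."},
        {"role": "user", "content": task},
    ]

messages = build_messages(
    "QA Engineer working on a web platform",
    "Generate test conditions for the login feature.",
)
```

Keeping the persona in the system slot, rather than repeating it inside each user prompt, also makes it trivial to reuse across all three pipeline stages.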
A key finding from the research was the importance of structured input for preventing hallucinations and ensuring relevant test case generation. Through empirical experimentation, the team developed a template that captures the essential information about the application feature under test.
Critically, the researchers discovered that the template must be completed separately for each feature in the software. Attempting to describe all features in a single template produced worse results, as GPT-3.5 Turbo struggled to handle multiple software features in one context. This finding has important implications for production deployments, suggesting that complex applications require decomposition into feature-level inputs.
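The per-feature constraint translates naturally into a render-and-iterate loop: one template instance per feature, one generation run per template. The field names and feature examples below are hypothetical stand-ins; the study's actual template was developed empirically and is not reproduced here.

```python
# Hypothetical per-feature template fields (illustrative only).
features = [
    {
        "name": "Login",
        "description": "Users authenticate with email and password.",
        "inputs": "email, password",
        "expected_behavior": "Valid credentials open the dashboard; "
                             "invalid ones show an error.",
    },
    {
        "name": "Profile matching",
        "description": "Suggests investor matches for a startup profile.",
        "inputs": "startup profile data",
        "expected_behavior": "A ranked list of candidate investors.",
    },
]

def render_template(feature: dict) -> str:
    """One template per feature: the study found that packing several
    features into a single template degraded GPT-3.5 Turbo's output."""
    return (
        f"Feature: {feature['name']}\n"
        f"Description: {feature['description']}\n"
        f"Inputs: {feature['inputs']}\n"
        f"Expected behavior: {feature['expected_behavior']}\n"
    )

templates = [render_template(f) for f in features]  # one LLM run each
```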
The evaluation framework was designed around documentation quality factors identified in software testing literature. Given constraints on accessing code coverage metrics, execution history, and defect data for the Da.tes platform, the researchers focused on three key documentation quality sub-factors: Correctness, Completeness, and Consistency. These were selected based on literature identifying them as the most critical quality factors for test cases.
The study employed a blind evaluation approach where 10 QA engineer volunteers assessed test cases without knowing whether they were AI-generated or human-written:
Form 1 - Quality Factor Scoring: Evaluators rated 10 AI-generated and 10 human-written test cases on a 1-5 scale (Very Poor to Very Good) across the three quality factors.
Form 2 - A/B Comparison: Similar test cases (testing the same software paths) were placed side by side, with evaluators choosing their preferred option and providing justification for their selection.
While only 7 of 10 expected responses were collected, the researchers noted consistency across evaluator responses, suggesting the sample still provided meaningful insights.
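The Form 1 scores can be aggregated per source while keeping the evaluation blind: evaluators rate anonymized test-case IDs, and the ID-to-source mapping is revealed only at analysis time. The sketch below uses made-up ratings to show the shape of that analysis; the study's per-response data is not public.

```python
from statistics import mean

# Each record: (blinded test-case id, quality factor, 1-5 rating).
# Evaluators never see whether an id maps to the AI or human set.
ratings = [
    ("tc-01", "correctness", 4), ("tc-01", "completeness", 5),
    ("tc-01", "consistency", 4), ("tc-02", "correctness", 4),
    ("tc-02", "completeness", 4), ("tc-02", "consistency", 5),
]
# Revealed only after scoring, for the final AI-vs-human comparison.
source = {"tc-01": "ai", "tc-02": "human"}

def mean_score_by_source(ratings, source):
    buckets = {}
    for case_id, _factor, score in ratings:
        buckets.setdefault(source[case_id], []).append(score)
    return {src: round(mean(scores), 2) for src, scores in buckets.items()}

print(mean_score_by_source(ratings, source))  # {'ai': 4.33, 'human': 4.33}
```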
The evaluation results showed AI-generated test cases achieving comparable or slightly better scores than human-written ones, averaging 4.31 versus 4.18 across the three quality factors.
In the A/B comparison, 58.6% of responses (41 of 70) preferred AI-generated test cases. Interestingly, two specific test cases showed unanimous preference for human-written versions, while one showed unanimous preference for AI-generated content, suggesting context-dependent performance variations.
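The headline preference figure follows directly from the raw counts: 7 evaluators each judged 10 A/B pairs, giving 70 responses.

```python
evaluators, pairs = 7, 10
total = evaluators * pairs             # 70 A/B responses in all
ai_preferred = 41                      # responses favoring the AI version
share = round(100 * ai_preferred / total, 1)
print(share)  # 58.6
```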
Qualitative feedback highlighted that AI-generated test cases excelled in writing quality, simplicity, directness, clarity, and completeness; evaluators frequently cited these characteristics when justifying a preference for the AI-generated option.
The study identified several important limitations relevant to production deployments:
Feature Isolation Requirement: The model could not effectively handle multiple software features described in a single template. This necessitates feature-by-feature processing, which increases operational complexity.
Integration Testing Challenges: Cross-feature dependencies proved problematic. Integration tests require the QA engineer to explicitly describe feature interdependencies, increasing template size. However, larger templates correlated with degraded results, creating a difficult trade-off.
Human Input Dependency: The process is not fully automated and relies on human input to provide correctly formatted feature descriptions. The quality of output depends heavily on the quality of input templates.
Uneven Performance Across Features: While the model performed well with common features like login or signup, it struggled with more complex or unusual functionality, suggesting potential issues with less common software patterns.
This case study provides several insights relevant to LLMOps practitioners:
Orchestration Patterns: The use of LangChain for managing multi-step prompts, context injection, and memory handling demonstrates a common orchestration pattern for complex LLM workflows. The sequential prompt chaining with intermediate outputs feeding subsequent stages is a reusable architectural pattern.
Prompt Engineering as Development: The extensive iteration required to develop effective templates and prompts highlights that prompt engineering is a significant development effort, not a trivial configuration task. The empirical testing approach used to refine templates mirrors traditional software development practices.
Quality Evaluation Frameworks: The study’s approach to evaluating LLM outputs through blind comparison with human-produced artifacts provides a template for assessing LLM-generated content quality in production contexts.
Decomposition Strategy: The finding that complex inputs degrade performance suggests a general principle for LLMOps: decomposing complex tasks into smaller, focused operations may yield better results than attempting comprehensive single-shot generation.
Semi-Automation Reality: Despite automation goals, the study reveals that effective LLM-based test generation remains semi-automated, requiring skilled human input for context provision and potentially for review of outputs. This has implications for workforce planning and process design in organizations adopting these technologies.
The researchers identified additional studies needed around time, cost, and learning curve comparisons between LLM-assisted and manual test case creation. These operational metrics would be essential for organizations considering production deployment of similar systems.