ZenML

LLM-Based Agents for User Story Quality Enhancement in Agile Development

Austrian Post Group 2024

Austrian Post Group IT explored the use of LLM-based agents to automatically improve user story quality in their agile development teams. They developed and implemented an Autonomous LLM-based Agent System (ALAS) with specialized agent profiles for Product Owner and Requirements Engineer roles. Using GPT-3.5-turbo-16k and GPT-4 models, the system demonstrated significant improvements in user story clarity and comprehensibility, though with some challenges around story length and context alignment. The effectiveness was validated through evaluations by 11 professionals across six agile teams.


Overview

This case study presents research conducted at Austrian Post Group IT, a logistics and postal services company with robust agile software development practices. The company operates multiple teams working synchronously across numerous systems and applications orchestrated within Agile Release Trains. The research team, in collaboration with Tampere University, developed and evaluated an Autonomous LLM-based Agent System (ALAS) designed to automatically enhance the quality of user stories in agile development environments.

User stories are fundamental to agile software projects, serving as brief descriptions of functionalities from the user’s perspective. Maintaining high-quality user stories is crucial but challenging, as they must be complete, consistent, unambiguous, and testable. The research explores whether LLM-based agents can assist in automating and improving this quality assurance process.

The Reference Model Architecture

The researchers proposed a reference model for LLM-based agent systems that conceptualizes how multiple AI agents can collaborate to complete tasks. The model comprises four basic constructs: tasks, agents, a shared knowledge base, and responses.

The task initiates the interaction and contains inputs defining the work scope and objectives. It includes a comprehensive description, contextual information for understanding and execution, and the expected outcome. The task may also prescribe procedural steps for agents to follow.

Each agent represents an instance of an LLM model with a unique profile and role. In this implementation, two agents were defined: Agent PO (Product Owner) focused on product vision, backlog management, and business value alignment, while Agent RE (Requirements Engineer) specialized in quality aspects like unambiguity and measurable acceptance criteria.

The shared knowledge base is a repository containing task information and conversation history. It provides a dynamic resource that informs subsequent agents of context and next steps, enabling coherent dialogue across the multi-agent interaction.

Responses are outputs generated by each agent based on task descriptions, which are then added to the shared knowledge base for subsequent agents to reference.
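The four constructs can be sketched as simple Python dataclasses. This is an illustrative data model, not code from the paper; the class and field names are our assumptions:

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Task:
    """Initiates the interaction: scope, context, expected outcome, optional steps."""
    description: str
    context: str
    expected_outcome: str
    steps: list[str] = field(default_factory=list)

@dataclass
class Agent:
    """An LLM instance with a unique profile and role (e.g. Agent PO, Agent RE)."""
    name: str
    profile: str

@dataclass
class Response:
    """Output produced by one agent for one subtask."""
    agent: str
    subtask: str
    content: str

class SharedKnowledgeBase:
    """Repository holding the task information plus the running conversation history."""
    def __init__(self, task: Task):
        self.task = task
        self.history: list[Response] = []

    def add(self, response: Response) -> None:
        # Each response is appended so subsequent agents can reference it.
        self.history.append(response)

    def last_response(self) -> Response | None:
        return self.history[-1] if self.history else None
```

The shared knowledge base is the coordination point: agents never talk to each other directly, they read from and write to this repository.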

Implementation Details

The ALAS implementation operates through two distinct phases: task preparation and task conduction.

Task Preparation Phase

This phase establishes the groundwork for agent interaction through careful prompt engineering. The researchers employed several sophisticated prompting techniques:

The persona pattern creates detailed profiles for each agent, guiding them to adopt specific characters or roles. The profile construction includes role definition with high expectations (described as “level 250 knowledge” compared to a human’s “level 10”), key responsibilities, practical tips, and tone specifications. For instance, the RE agent profile instructs: “Use clear and unambiguous language when documenting requirements to avoid any misunderstandings.”
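A sketch of how such a profile might be assembled into a prompt. Only the quoted RE instruction and the "level 250" framing come from the study; the helper function and remaining wording are illustrative:

```python
def build_profile(role: str, knowledge_level: int, responsibilities: list[str],
                  tips: list[str], tone: str) -> str:
    """Compose a persona-pattern profile string for one agent."""
    lines = [
        f"You are a {role} with level {knowledge_level} knowledge of your domain "
        "(a typical human practitioner is level 10).",
        "Key responsibilities:",
        *[f"- {r}" for r in responsibilities],
        "Practical tips:",
        *[f"- {t}" for t in tips],
        f"Tone: {tone}",
    ]
    return "\n".join(lines)

re_profile = build_profile(
    role="Requirements Engineer",
    knowledge_level=250,
    responsibilities=["Ensure unambiguity", "Define measurable acceptance criteria"],
    tips=["Use clear and unambiguous language when documenting requirements "
          "to avoid any misunderstandings."],
    tone="precise and professional",
)
```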

The k-shot prompting technique provides explicit instructions or examples of desired output, particularly useful when formulating task descriptions and context. For generating product vision statements, prompts might include example elements or templates from other products.
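A minimal k-shot prompt builder, with invented example vision statements standing in for the real templates the researchers used:

```python
def k_shot_prompt(instruction: str, examples: list[str], query: str) -> str:
    """Prepend k worked examples to the actual request."""
    shots = "\n\n".join(f"Example {i + 1}:\n{ex}" for i, ex in enumerate(examples))
    return f"{instruction}\n\n{shots}\n\nNow:\n{query}"

prompt = k_shot_prompt(
    "Write a one-sentence product vision statement.",
    ["For parcel recipients, TrackFast is an app that shows live delivery status.",
     "For dispatchers, RouteHub is a planner that optimizes daily tours."],
    "Write the vision statement for the mobile delivery application.",
)
```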

AI planning helps decompose the overall task into smaller, manageable subtasks and assigns responsible agents to each. The researchers used this pattern to generate a comprehensive list of key steps, which was then reviewed and refined by a Scrum master and Product Owner.

The fact check list pattern verifies and validates LLM outputs by instructing the model to create key facts that can be checked for errors or inconsistencies.

Chain of Thought (CoT) prompting guides LLMs through step-by-step reasoning processes to enhance performance on complex tasks.
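The fact check list and Chain of Thought patterns can be implemented as reusable instruction snippets appended to a base prompt. The snippet wording below is ours, not the study's:

```python
PATTERNS = {
    "fact_check_list": (
        "After your answer, list the key facts it relies on so they can be "
        "checked for errors or inconsistencies."
    ),
    "chain_of_thought": (
        "Reason step by step before giving your final answer."
    ),
}

def apply_patterns(base_prompt: str, pattern_names: list[str]) -> str:
    """Append the selected prompting patterns to a base prompt."""
    suffixes = [PATTERNS[name] for name in pattern_names]
    return "\n\n".join([base_prompt, *suffixes])
```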

Prompt Structure

The system uses two categories of prompts. Initial prompts prepare participating agents for their responsibilities:

Prompt_i = Profile_i + Task + Context of task + Subtask_i   (for 1 ≤ i ≤ k agents)

Follow-up prompts guide agents through the interaction steps:

Prompt_i = Subtask_i + Response_{i-1}   (for i > k)

This structure ensures each agent builds upon previous work while maintaining coherent dialogue flow.
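The two formulas translate directly into string composition. A minimal sketch, assuming plain-text concatenation of the prompt components:

```python
def initial_prompt(profile: str, task: str, context: str, subtask: str) -> str:
    """Prompt_i = Profile_i + Task + Context of task + Subtask_i, for the first k agents."""
    return "\n\n".join([profile, task, context, subtask])

def followup_prompt(subtask: str, previous_response: str) -> str:
    """Prompt_i = Subtask_i + Response_{i-1}, for agents beyond the first k."""
    return "\n\n".join([subtask, previous_response])
```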

Task Conduction Phase

In this phase, agents dynamically interact to execute subtasks. The process is iterative and incremental, mirroring actual agile team practices. Each agent sequentially engages with subtasks following the structured prompt, with previous responses informing current prompts to ensure relevance and continuity.
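The conduction phase can be sketched as a loop that round-robins subtasks across agents, threading each response into the next prompt. Here `call_llm` is a stub standing in for the real GPT call:

```python
def call_llm(prompt: str) -> str:
    """Stub for the actual model call; returns a canned response for illustration."""
    return f"[response to: {prompt.splitlines()[0]}]"

def conduct_task(agents: list[str], subtasks: list[str]) -> list[dict]:
    """Sequentially execute subtasks, with each prompt informed by the previous response."""
    history: list[dict] = []
    previous = ""
    for i, subtask in enumerate(subtasks):
        agent = agents[i % len(agents)]  # alternate between Agent PO and Agent RE
        prompt = subtask if not previous else f"{subtask}\n\nPrevious response:\n{previous}"
        previous = call_llm(prompt)
        history.append({"agent": agent, "subtask": subtask, "response": previous})
    return history
```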

Technical Configuration

The researchers faced significant challenges with token limits when agents engaged in response exchanges across subtasks. After adjustments, they selected two GPT models: GPT-3.5-turbo-16k and GPT-4.

The temperature parameter was set to 1, a medium value, though this still posed challenges for maintaining factual accuracy due to the increased risk of hallucination.
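A sketch of how this configuration might be expressed as chat-completion request parameters. The dictionary format follows the OpenAI messages convention; the actual client call is omitted and the helper function is our own:

```python
def build_request(model: str, system_prompt: str, user_prompt: str,
                  temperature: float = 1.0) -> dict:
    """Assemble chat-completion parameters; temperature=1 trades accuracy for creativity."""
    return {
        "model": model,
        "temperature": temperature,
        "messages": [
            {"role": "system", "content": system_prompt},  # agent profile goes here
            {"role": "user", "content": user_prompt},      # task/subtask goes here
        ],
    }

req = build_request("gpt-3.5-turbo-16k", "You are Agent PO.", "Improve US1.")
```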

Experimental Setup

The evaluation focused on 25 synthetic user stories for Austrian Post’s mobile delivery application, a tool helping postal workers prepare and deliver parcels. The task description included two supporting documents.

These documents provided agents with comprehensive background for executing tasks with sufficient understanding of both technical and strategic objectives.

Evaluation Methodology

The study surveyed 11 participants from six agile teams, including Product Owners, developers, a test manager, a Scrum master, requirements analysts, testers, and a train coach. Notably, 10 of 11 participants had been with the company over two years, and 9 had more than five years of agile project experience.

The questionnaire assessed user story quality based on the INVEST framework characteristics: Independent, Negotiable, Valuable, Estimable, Small, and Testable.

Participants rated user stories on a 1-5 Likert scale and answered open-ended questions about improvements, concerns, and suggestions.

Results and Findings

The evaluation compared original user stories against improved versions from both GPT-3.5-Turbo and GPT-4 models.

Original User Stories (US1, US2): Criticized for ambiguity, lack of essential details, vague business value descriptions, and inadequate error handling scenarios in acceptance criteria.

GPT-3.5-Turbo Improvements (v.1): Showed significant improvements in clarity and comprehensibility, with enhanced narrative flow and clearer acceptance criteria. However, participants noted overly creative titles and insufficient detail in some acceptance criteria.

GPT-4 Improvements (v.2): Recognized for comprehensive content and clearer business value expression. Acceptance criteria addressed previously ambiguous scenarios like printer connection problems. However, the increased detail led to significantly longer stories, with six participants expressing concerns about excessive length.

Quantitative Likert-scale ratings broadly favored the improved versions over the originals.

Key LLMOps Insights and Challenges

Context Alignment Gap

Despite improvements, the agents’ ability to learn from context revealed gaps in aligning with project-specific requirements. One developer noted that US1(v.2) included an authentication process that, while relevant, “seems to be out of scope.” This highlights the need for careful prompt preparation and human expert review during task preparation.

Human Oversight Requirements

The study emphasizes that ALAS outputs require manual validation by Product Owners to align with project goals and stakeholder expectations. Automated generation has limitations that necessitate human intervention to preserve practical utility.

Temperature Parameter Trade-offs

The Temperature setting at 1 represented a balance between creativity and accuracy. While it enabled novel content generation, it also increased AI hallucination risks, leading to plausible but potentially inaccurate or irrelevant outputs.

Future Improvements Identified

The researchers suggest incorporating additional specialized agents into future iterations of ALAS.

Prompt Engineering Importance

The iterative process of prompt formulation and optimization proved crucial for system effectiveness. Engaging domain experts (Product Owners, Scrum masters) during task preparation helped optimize prompts for desired outputs.

Conclusions for LLMOps Practice

This case study demonstrates both the potential and limitations of deploying LLM-based multi-agent systems in production software development environments. The research provides a foundational framework for AI-assisted requirements engineering while highlighting that successful deployment requires careful prompt engineering, domain-expert involvement during task preparation, human review of outputs, and alignment with project-specific context.

The study represents an early-stage exploration rather than a fully productionized system, but offers valuable insights for organizations considering similar LLMOps implementations in software engineering contexts.
