## Overview
This Carnegie Mellon research study provides a critical examination of the emerging "AI agent" ecosystem from both industry positioning and end-user experience perspectives. The study is particularly relevant to LLMOps because it investigates how LLM-based agents are being deployed in production environments for real end-users, revealing significant gaps between marketing claims and practical usability. The research team conducted both a comprehensive market analysis and empirical user testing to understand the challenges of deploying LLM agents at scale for commercial use.
The researchers framed their investigation around two key questions: how the technology industry conceives of and markets AI agents, and what challenges end-users face when attempting to use these commercial products for their advertised purposes. This dual approach provides valuable insights into the disconnect between aspirational technical capabilities and actual user experiences—a critical consideration for any organization looking to deploy LLM-based systems in production.
## Market Analysis and Taxonomy Development
The researchers began by conducting a systematic review of 102 commercial AI agent products and services to understand how the industry positions these tools. They sourced products from aggregator websites like the AI Agents Directory and Product Hunt, supplemented by targeted Google searches. The goal was not exhaustive coverage but rather capturing a representative sample broad enough to identify patterns in how AI agents are marketed to end-users.
Through open coding and inductive analysis, the team developed a taxonomy categorizing AI agents into three umbrella capabilities. **Orchestration agents** focus on automating GUI interactions and performing direct manipulation actions on behalf of users, using vision-language models to read interface states and controllers to simulate human input. Examples include Salesforce's Agentforce and Expedia's Trip Matching AI Agent. These agents promise to reduce manual labor in tasks like data entry and form completion. **Creation agents** specialize in generating structured documents with well-defined formats, such as websites, slide decks, emails, and marketing materials. Tools like Lovable (for app/website creation) and Gamma (for slide deck generation) fall into this category. Finally, **Insight agents** transform unstructured information queries into digestible insights, performing web searches, querying APIs, and accessing knowledge bases through iterative prompt-response loops. Examples include Perplexity's Deep Research and Spotify's AI DJ.
This taxonomy reveals how the industry is positioning LLM-based agents across different dimensions of knowledge work, from procedural task automation to cognitive assistance to creative production. The categorization itself has implications for LLMOps, as different agent types require different technical architectures, testing strategies, and deployment considerations.
## User Study Design and Methodology
To assess real-world usability, the researchers conducted a user study with 31 participants who attempted representative tasks from each taxonomy category using two popular commercial agents: OpenAI's Operator and Manus. The study employed think-aloud protocols combined with semi-structured exit interviews, providing both observational data on user behavior and self-reported reflections on the experience.
Participants were assigned tasks randomly selected from the three categories, allowing the researchers to evaluate agent performance across the full spectrum of marketed use cases. The tasks were designed to be relatively simple and achievable—reflecting the types of activities that appear in marketing materials and demo videos. This methodological choice is important: the study wasn't testing edge cases or adversarial scenarios, but rather the baseline usability of agents for their intended purposes.
## Critical Usability Barriers Identified
The empirical evaluation revealed five major usability challenges that emerged consistently across participants and tasks, even though participants generally completed their assigned tasks. These barriers have significant implications for deploying LLM agents in production settings.
**First, agent capabilities were frequently misaligned with user mental models.** Users approached the agents with expectations shaped by their experiences with other software tools, search engines, and human assistants. When agent behavior diverged from these mental models—for instance, taking unexpected actions or following reasoning paths that weren't intuitive—users became confused and lost confidence. This misalignment represents a fundamental challenge in LLMOps: the probabilistic, emergent nature of LLM behavior doesn't always map cleanly onto user expectations formed by deterministic software systems.
**Second, agents presumed trust without establishing credibility.** The agents frequently made decisions or took actions without adequately building user confidence first. This mirrors findings from the broader human-AI collaboration literature showing that trust calibration is critical for effective teamwork. From an LLMOps perspective, this suggests the need for deliberate design patterns around transparency, explainability, and gradual trust-building—particularly during initial user interactions when mental models are still forming.
**Third, agents failed to accommodate diverse collaboration styles.** Users had varying preferences for how much autonomy to delegate versus how much control to maintain, but the agents offered relatively rigid interaction patterns. Some users wanted to provide high-level goals and let the agent determine implementation details, while others preferred step-by-step oversight. The inability to adapt to these different working styles reduced effectiveness, suggesting that production LLM systems need configurable interaction modes rather than one-size-fits-all approaches.
**Fourth, agents generated overwhelming communication overhead.** While transparency is generally valuable, the agents often provided excessive detail about their reasoning, intermediate steps, and decision processes. This created cognitive burden for users who had to parse through lengthy explanations to extract relevant information. The challenge here relates to a fundamental tension in LLMOps: how to balance the explainability needed for trust and debugging with the information density that enables efficient user workflows.
**Fifth, agents lacked meta-cognitive abilities necessary for productive collaboration.** Effective human collaborators understand their own capabilities and limitations, can recognize when they need help, and can communicate uncertainty appropriately. The tested agents often struggled with this meta-awareness, sometimes confidently proceeding when they should have asked for clarification, or failing to recognize when they were outside their competence zone. This represents a significant gap in current LLM agent architectures and has important implications for reliability in production deployments.
## Technical Architecture Considerations
While the paper focuses primarily on user experience rather than technical implementation details, several architectural considerations emerge from the findings. The orchestration agents mentioned in the taxonomy employ a perception-reasoning-action loop: using vision-language models to perceive the GUI state, LLMs to generate appropriate commands based on user objectives, and controllers to execute those commands through simulated human input. This architecture enables agents to interact with existing software interfaces without requiring API access or custom integrations.
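To make the perception-reasoning-action loop concrete, here is a minimal, vendor-neutral sketch. The callables (`capture_screenshot`, `describe_state`, `plan_action`, `execute`) are hypothetical placeholders standing in for whatever vision-language model, planning LLM, and input controller a given product uses; none of them reflect the actual API of the agents reviewed in the study.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class GuiAction:
    kind: str        # e.g. "click", "type", "scroll"
    target: str      # element description produced by the VLM
    value: str = ""  # text to enter, if any

def run_gui_agent(
    user_goal: str,
    capture_screenshot: Callable[[], bytes],                            # perception input
    describe_state: Callable[[bytes], str],                             # VLM: pixels -> textual GUI state
    plan_action: Callable[[str, str, list[str]], Optional[GuiAction]],  # LLM: goal, state, history -> next action
    execute: Callable[[GuiAction], None],                               # controller: simulated human input
    max_steps: int = 20,
) -> list[str]:
    """Run a perception-reasoning-action loop until the planner signals completion."""
    history: list[str] = []
    for _ in range(max_steps):
        state = describe_state(capture_screenshot())      # perceive the current interface
        action = plan_action(user_goal, state, history)   # reason about the next step
        if action is None:                                # planner decided the goal is met
            break
        execute(action)                                   # act via simulated mouse/keyboard input
        history.append(f"{action.kind}:{action.target}")
    return history
```

The key property this sketch illustrates is that the agent operates purely through the GUI surface: nothing in the loop requires an API or custom integration with the target application.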
The insight agents described in the study use more complex workflows involving multiple tool calls, stateful multi-turn interactions, web searches, API queries, and access to custom knowledge bases in addition to pre-trained knowledge. This ensemble approach represents a common pattern in LLM agent architectures, where the language model serves as a reasoning engine that orchestrates various specialized tools and data sources.
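The sketch below compresses that iterative prompt-response loop into a few lines. The plain-text `CALL <tool>: <query>` convention is an assumption made for illustration, not any vendor's actual function-calling protocol, and the `llm` and `tools` callables are placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class InsightAgent:
    """Minimal tool-orchestration loop: the LLM decides which tool to call next."""
    llm: Callable[[str], str]               # full transcript -> model response
    tools: dict[str, Callable[[str], str]]  # e.g. {"web_search": ..., "kb_lookup": ...}
    transcript: list[str] = field(default_factory=list)

    def answer(self, question: str, max_turns: int = 5) -> str:
        self.transcript.append(f"USER: {question}")
        for _ in range(max_turns):
            reply = self.llm("\n".join(self.transcript))
            self.transcript.append(f"AGENT: {reply}")
            # Assumed convention for this sketch: the model requests a tool
            # with a line of the form "CALL <tool_name>: <query>".
            if reply.startswith("CALL "):
                name, _, query = reply[len("CALL "):].partition(": ")
                result = self.tools.get(name, lambda q: "unknown tool")(query)
                self.transcript.append(f"TOOL[{name}]: {result}")
            else:
                return reply  # no further tool use: treat as the final synthesis
        return self.transcript[-1]
```

The language model here acts as the reasoning engine; the tools dictionary is where web search, API queries, and knowledge-base lookups plug in as interchangeable components.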
The creation agents leverage document generation capabilities combined with formatting tools, using word processors, spreadsheet applications, data visualization platforms, and graphic design tools to produce structured outputs. This suggests a layered architecture where the LLM handles content generation while specialized tools handle presentation and formatting.
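A minimal sketch of that layered split might look like the following, where the LLM returns structured content and a separate rendering tool owns presentation. The prompt format, JSON schema, and `render` callable are all illustrative assumptions.

```python
import json
from typing import Callable

def create_slide_deck(
    brief: str,
    llm: Callable[[str], str],        # content layer: returns a JSON outline as text
    render: Callable[[dict], bytes],  # presentation layer: outline -> formatted document bytes
) -> bytes:
    """Two-layer creation pipeline: the LLM generates structure, a tool handles formatting."""
    prompt = (
        "Produce a slide deck outline as JSON with keys 'title' and 'slides', "
        f"where each slide has 'heading' and 'bullets'. Topic: {brief}"
    )
    outline = json.loads(llm(prompt))  # content generation (real systems would validate and retry)
    return render(outline)             # formatting delegated to a specialized tool
```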
## Evaluation and Testing Implications
The study's methodology—combining market analysis with empirical user testing—offers a model for evaluating LLM agents that goes beyond pure capability benchmarks. The researchers explicitly critique the industry focus on narrow technical metrics (like scoring 26% versus 24% on academic benchmarks) when users care about practical task completion for real-world activities like booking flights.
This perspective aligns with recent critiques in the evaluation literature about the validity gaps in AI assessments that focus narrowly on technical competence while ignoring human-centered dimensions like usability, interpretability, trust, and satisfaction. For LLMOps practitioners, this suggests the importance of incorporating user experience testing into deployment pipelines alongside traditional model performance metrics.
The think-aloud protocol employed in the study provides a practical methodology for uncovering usability issues that might not be apparent from success/failure metrics alone. Even when tasks were completed successfully, the protocol revealed significant friction points and cognitive burden that would degrade long-term adoption and satisfaction.
## Connections to Human-AI Collaboration Literature
The researchers situate their findings within the broader context of human-AI collaboration research, drawing connections to work on mental models, trust calibration, explanation quality, and complementary expertise. Several specific insights from this literature informed their analysis:
Research on mental models shows that users judge AI teammates differently than human counterparts, often assigning more blame for failures. The way AI agents are described through conceptual metaphors significantly influences adoption—paradoxically, emphasizing higher competence can actually reduce willingness to adopt if it creates unrealistic expectations that lead to disappointment.
Work on trust and explanations demonstrates that people's trust in AI is influenced by both stated and observed accuracy, highlighting the importance of calibrating user expectations appropriately. Recent studies show that LLM expressions of uncertainty can reduce overreliance, suggesting that communication style materially affects outcomes in human-AI teams.
The explainable AI (XAI) literature provides guidance on how agents should explain their reasoning. Research from social sciences suggests explanations should be contrastive (explaining why this choice rather than alternatives), carefully selected for contextual relevance rather than exhaustive, and incorporate natural conversational dynamics rather than technical jargon.
Findings on complementary expertise indicate that human-AI teams perform best when each party contributes distinct strengths, with humans evaluating task expertise based on confidence signals in AI communication. However, updating AI models to improve performance can actually degrade user experience by disrupting established mental models—a phenomenon observed during GPT-5's release and documented in prior work on model updates.
## Design Implications for Production Systems
Though the full design recommendations aren't detailed in the provided text excerpt, the identified usability barriers point toward several implications for organizations deploying LLM agents in production:
Systems need mechanisms for helping users develop accurate mental models of agent capabilities, perhaps through progressive disclosure, onboarding experiences, or capability demonstrations. The transparency-efficiency tradeoff in agent communication requires careful design, possibly through configurable verbosity levels or summarization of detailed reasoning. Supporting diverse collaboration styles suggests the need for adjustable autonomy settings that let users control the level of agent initiative versus human oversight.
Trust calibration mechanisms should be built into agent interactions, with appropriate expressions of uncertainty and confidence that align with actual capabilities. Meta-cognitive abilities need development, enabling agents to recognize and communicate their limitations, ask for clarification when needed, and escalate appropriately when facing tasks beyond their competence.
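One way such configurable interaction modes and escalation rules could be expressed in a production system is sketched below. The autonomy levels, field names, and confidence threshold are illustrative assumptions rather than recommendations drawn from the paper.

```python
from dataclasses import dataclass
from enum import Enum

class Autonomy(Enum):
    STEP_BY_STEP = "confirm every action"
    CHECKPOINTS = "confirm only consequential actions"
    AUTONOMOUS = "report results at the end"

@dataclass
class AgentPolicy:
    autonomy: Autonomy = Autonomy.CHECKPOINTS
    verbosity: str = "summary"     # "summary" | "full_reasoning"
    confidence_floor: float = 0.6  # below this, ask the user instead of acting

def should_ask_user(policy: AgentPolicy, action_is_consequential: bool, confidence: float) -> bool:
    """Decide whether to pause for user input before taking the next action."""
    if confidence < policy.confidence_floor:
        return True   # uncertain: escalate rather than guess
    if policy.autonomy is Autonomy.STEP_BY_STEP:
        return True   # user prefers oversight of every step
    if policy.autonomy is Autonomy.CHECKPOINTS and action_is_consequential:
        return True   # e.g. purchases, sending email, deleting data
    return False
```

A policy object like this makes the autonomy-versus-oversight tradeoff a per-user setting rather than a fixed property of the agent, directly addressing the rigidity the study observed.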
## Limitations and Context
The researchers acknowledge several important limitations. The "AI agent" category is rapidly evolving with new products launching continuously and significant variance in how agents are defined and distinguished from chatbots, LLM-enhanced products, and virtual assistants. The study makes no claim to comprehensiveness but rather provides a snapshot of the current landscape with a representative sample.
The user study focused on relatively simple, achievable tasks that reflect marketing materials—not edge cases or adversarial scenarios. This was a deliberate choice to evaluate baseline usability for intended purposes, but it means the findings may not capture all challenges that arise in more complex or open-ended real-world use.
The study examined only two commercial agents (Operator and Manus) out of the 102 reviewed in the taxonomy. While these were selected as popular representatives, different agents might exhibit different usability characteristics. The 31-participant sample, while substantial for a qualitative study, may not capture all the diversity of user needs, backgrounds, and expectations.
## Historical Context and Evolution
The paper situates current AI agents within a long history of interface agent research stretching back to Vannevar Bush's 1945 "Memex" concept and Licklider's 1960s vision of "Man-Machine Symbiosis." This historical framing is valuable for understanding both the longstanding aspirations for intelligent assistants and the recurring challenges that have emerged across different technological eras.
The 1990s saw conceptual work on interface agents as alternatives to direct manipulation, though capabilities were limited. The 2000s brought agent-oriented programming frameworks like GOAL and JADE, enabling multi-agent systems for tasks like e-commerce and email management, though systems like Microsoft Clippy highlighted the risks of overzealous automation. The 2010s represented the era of chatbots and voice assistants like Alexa and Siri, leveraging advances in speech recognition and NLP for intent detection and entity extraction.
The current 2020s wave of LLM-based agents builds on transformer architectures and large-scale language models, enabling GUI agents that can generate and execute commands to accomplish goal-directed actions in web browsers. This evolution demonstrates both increasing technical sophistication and persistent usability challenges that transcend specific implementations.
## Relevance for LLMOps Practice
This research provides several key takeaways for LLMOps practitioners. First, there's a significant gap between marketed capabilities and practical usability that needs addressing through better design, testing, and expectation setting. Organizations deploying LLM agents should invest in user research and usability testing rather than relying solely on capability demonstrations.
Second, the five identified usability barriers represent concrete failure modes to test for in production systems. Mental model alignment, trust calibration, collaboration flexibility, communication efficiency, and meta-cognitive awareness should all be evaluation criteria alongside traditional performance metrics.
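One lightweight way to operationalize those criteria is to track them as a rubric alongside success/failure metrics, for example across model updates. The 1-5 scale and field names below are illustrative assumptions, not instruments from the study.

```python
from dataclasses import dataclass

@dataclass
class AgentUxEvaluation:
    """Per-session ratings for the five usability barriers, alongside task success."""
    task_completed: bool
    mental_model_alignment: int     # did agent behavior match user expectations? (1-5)
    trust_calibration: int          # was confidence earned before autonomy was taken? (1-5)
    collaboration_flexibility: int  # could the user adjust autonomy and oversight? (1-5)
    communication_efficiency: int   # signal-to-noise ratio of agent explanations (1-5)
    metacognitive_awareness: int    # did the agent ask or escalate when uncertain? (1-5)

def flag_regressions(before: AgentUxEvaluation, after: AgentUxEvaluation) -> list[str]:
    """List dimensions that got worse between two snapshots (e.g. across a model update)."""
    dims = ["mental_model_alignment", "trust_calibration", "collaboration_flexibility",
            "communication_efficiency", "metacognitive_awareness"]
    return [d for d in dims if getattr(after, d) < getattr(before, d)]
```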
Third, different agent types (orchestration, creation, insight) may require different LLMOps approaches. Orchestration agents interacting with GUIs need robust error handling and state management. Creation agents need strong formatting and presentation capabilities. Insight agents need effective information synthesis and sourcing transparency.
Finally, the historical context reminds us that intelligent agents have repeatedly cycled through hype and disappointment. Success in deploying LLM agents requires learning from past failures like Clippy—particularly the dangers of ceding too much initiative, the costs of opaque automation, and the challenge of aligning agent actions with human goals. The most viable path forward likely involves careful attention to human-centered design principles alongside technical capability development.