Cisco's Customer Experience organization, which manages over $26 billion in recurring revenue, evolved its renewals system from a basic chatbot into a multi-agent system that operates as a "teammate." The initial system achieved 95% accuracy but plateaued in adoption because it remained an optional tool rather than being embedded in daily workflows. The solution was a hierarchical planner-based architecture built on LangGraph, with dynamic task decomposition, self-correction, and confidence-based human-in-the-loop interventions. The team also introduced deterministic workflow scheduling for business-critical tasks and personalization features. The key insight was shifting from AI-augmented workflows to AI-native workflows, where humans help software get work done rather than the reverse, with the focus on outcomes rather than process steps.
Cisco’s Customer Experience organization manages an enterprise-scale renewals operation that handles more than half of Cisco’s $60 billion annual business, specifically over $26 billion in recurring revenue. Carlos, a fellow and chief architect in the CX organization, presented their evolution from experimental chatbot to production-scale agentic system deployed across their 18,000-20,000 person organization. This case study is particularly notable for its focus on the practical realities of moving from proof-of-concept to production at massive scale, including the difficult lessons learned about adoption, architecture, and the limitations of accuracy alone.
The CX organization operates across a standard customer lifecycle: land, adopt, expand, and renew. The renewals team specifically manages the critical end-of-term renewal cycle for customers across hardware, software, and SaaS offerings. The technical challenge was building a system that could handle complex multi-faceted queries at scale while maintaining accuracy, personalization, and most critically, actual business adoption.
In 2025, Cisco deployed what they called the “Agentic Foundation” - a multi-agent system built on a chatbot interface that achieved 95% accuracy. The initial architecture consisted of several specialized agents:
The decision to keep risk prediction separate from the LLM components is notable and reflects a pragmatic understanding of tool limitations. The team recognized that LLMs are a poor fit for prediction tasks, so they built a traditional ML model for high-stakes risk scoring. This hybrid approach combining LLMs with traditional ML achieved 95% accuracy.
The initial system started with a guided interface where the renewals team predefined the questions users could ask, with GenAI only generating answers. This lasted approximately three months before users demanded open prompt capabilities, wanting freedom from interface constraints. The team accommodated this evolution, but a critical problem emerged: despite high accuracy and concurrent users generating many queries monthly, adoption plateaued. Users would try the system a handful of times and then ghost it, reverting to spreadsheets and familiar tools.
This adoption failure despite technical success represents a crucial lesson in LLMOps: accuracy is merely table stakes, not a sufficient condition for production success. The team realized the system was yet another optional tool rather than being embedded in the workflow and daily tasks of the renewals team. Human nature and organizational inertia meant users defaulted to their comfort zones despite having access to a more accurate system.
To address the adoption challenge, Cisco evolved from a chatbot paradigm to what they call a “teammate” - a system capable of receiving delegated tasks, proactively executing work, and focusing on workflows rather than just question-answering. This architectural evolution reflected a philosophical shift: instead of asking how AI can improve existing workflows, they asked whether those workflows were even needed and how they would solve outcomes differently if starting fresh.
The new architecture introduced several critical components:
Hierarchical Planner-Based System: Rather than having a supervisor directly route to agents, they introduced planning layers. The supervisor generates high-level plans mapping tasks to agents or subgraphs, but remains lean - focused only on metadata, policy, and routing rather than becoming a monolithic controller. The team explicitly noted that overloading the supervisor with intelligence and extensive prompting leads to maintenance nightmares and brittle systems.
Domain-Specific Planners: Within each major agent domain (renewals, sentiment, adoption), they introduced specialized planners that decompose questions into smaller tasks specific to that domain context. For example, the renewals planner handles Cisco-specific business logic like fiscal year definitions that no general LLM would know. When a user asks about Q1 fiscal year 2026, the renewals planner knows which months comprise Cisco’s Q1, something impossible to encode in a general model.
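The fiscal calendar mapping described above is the kind of domain rule a planner can resolve in plain code before any LLM call. The following is a minimal sketch, not Cisco's implementation: the function name `fiscal_quarter_months` and the August start month are assumptions made for illustration, and the real calendar should come from the business.

```python
from datetime import date

# Assumption for illustration only: the fiscal year starts in August and is
# named after the calendar year in which it ends. Verify against the real
# fiscal calendar before relying on this value.
FISCAL_YEAR_START_MONTH = 8


def fiscal_quarter_months(fiscal_year: int, quarter: int,
                          start_month: int = FISCAL_YEAR_START_MONTH) -> list[date]:
    """Return the first day of each calendar month in the given fiscal quarter."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be between 1 and 4")
    # The fiscal year begins in the previous calendar year unless it starts
    # in January.
    base_year = fiscal_year - 1 if start_month > 1 else fiscal_year
    months = []
    for offset in range(3):
        # Months elapsed since the start of the fiscal year.
        elapsed = (quarter - 1) * 3 + offset
        month_index = (start_month - 1) + elapsed  # zero-based, may exceed 11
        months.append(date(base_year + month_index // 12, month_index % 12 + 1, 1))
    return months


if __name__ == "__main__":
    # Under the August-start assumption, Q1 FY2026 covers Aug-Oct 2025.
    print(fiscal_quarter_months(2026, 1))
```

Resolving "Q1 fiscal year 2026" this way gives downstream steps concrete date ranges to filter on, so the model never has to guess the calendar.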
Dynamic Stage and Step Generation: The planner creates stages dynamically, where each stage can have multiple steps. After each stage completes, the system validates the result before proceeding. If the check fails, it triggers a replanning cycle. This continues until the plan completes or a retry threshold is reached. The architecture supports both sequential and parallel execution depending on dependencies between stages.
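The stage loop can be sketched in framework-free Python. This is an illustrative sketch under stated assumptions, not the production code: `Stage`, `RunResult`, `run_plan`, and the `max_replans` threshold are hypothetical names, and the planner callback is assumed to return only the work that remains.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Stage:
    name: str
    steps: list[Callable[[dict], None]]  # each step mutates the shared state


@dataclass
class RunResult:
    completed: bool
    replans: int
    state: dict = field(default_factory=dict)


def run_plan(make_plan: Callable[[dict], list[Stage]],
             stage_ok: Callable[[Stage, dict], bool],
             max_replans: int = 2) -> RunResult:
    """Run stages in order and replan whenever a stage check fails."""
    state: dict = {}
    replans = 0
    plan = make_plan(state)
    index = 0
    while index < len(plan):
        stage = plan[index]
        for step in stage.steps:
            step(state)
        if stage_ok(stage, state):
            index += 1
            continue
        if replans >= max_replans:
            # Threshold reached: stop instead of looping forever.
            return RunResult(False, replans, state)
        replans += 1
        # Ask the planner for a fresh plan covering the remaining work,
        # informed by whatever the failed stage left in the state.
        plan = make_plan(state)
        index = 0
    return RunResult(True, replans, state)
```

A caller supplies the planner and the check. In a real system both would likely involve an LLM call, while the loop and the threshold stay deterministic.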
Reflection and Replanning Loops: At both the supervisor level and within individual agent subgraphs, the system implements reflection and replanning mechanisms. This allows for self-correction without human intervention for many failure modes. The system can detect when a plan hasn’t worked and dynamically adjust rather than failing outright.
The power of this architecture becomes evident in handling complex real-world queries. For example, a typical renewals person might ask: “Give me my top 10 customers by ATR for Q1 fiscal year 26 that have Duo, along with the sentiment of each.” This single question requires:
The supervisor dynamically generates a plan with Stage 1 for renewals and Stage 2 for sentiment, recognizing that the second stage depends on the first. It uses LangGraph’s Send mechanism to manage parallel and sequential operations as appropriate. The renewals subgraph receives the query, and its internal planner breaks it down further into subtasks like fiscal calendar mapping, product filtering, and metric calculation.
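The dependency handling can be illustrated without LangGraph. The sketch below uses only the standard library's `graphlib` to group stages into waves: stages in the same wave have no unmet dependencies and could run in parallel, while the waves themselves run in sequence. The function name `execution_waves` and the stage names are assumptions for illustration. This is not LangGraph's API.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group stages into waves whose members have no unmet dependencies."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the plan is circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable order
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    # Each key maps a stage to the stages it depends on.
    plan = {
        "renewals": set(),
        "sentiment": {"renewals"},  # needs the customer list first
    }
    print(execution_waves(plan))
```

For the example query, sentiment lands in a later wave than renewals because it needs the list of customers that the renewals stage produces.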
This architectural approach avoids the hallucination and error problems that emerge with simpler architectures when handling complex queries at scale. By decomposing both at the supervisor level and within domain subgraphs, the system maintains coherence and accuracy even for questions with significant complexity and context requirements.
A critical insight from Cisco’s production experience concerns when to use LLM reasoning versus deterministic execution. The team discovered that giving planners few-shot examples and instructions to follow specific steps led to inconsistent behavior. The LLM would follow the instructions the first time, modify the approach the second time, drop steps the third time, and add steps the fourth time, a pattern Carlos humorously described as “a debate and an argument with LLM.”
Their solution was to adopt a slash-command pattern similar to the one used by coding agent tools like Claude Code, Codex, and Cursor. For deterministic, business-critical workflows, they created scheduled tasks with specific execution patterns that bypass LLM reasoning entirely. Examples include:
The key insight is that for well-defined business processes that must execute consistently, deterministic programming is superior to LLM reasoning. The LLM’s role is reserved for areas requiring interpretation, analysis, and dynamic planning - not for following established procedures. This hybrid approach uses each technology where it’s strongest: LLMs for reasoning and adaptation, deterministic code for consistency and reliability.
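One way to picture this split is a small dispatcher that sends recognized commands to fixed code and everything else to the planner. This is a minimal sketch under assumptions: the prefix character, the `expiring` command, and the function names are hypothetical and do not come from the presentation.

```python
from typing import Callable

COMMAND_PREFIX = "/"  # hypothetical; the prefix is a free design choice

# Registry of deterministic workflows, keyed by command name.
COMMANDS: dict[str, Callable[[list[str]], str]] = {}


def command(name: str):
    """Register a function as a deterministic workflow."""
    def register(func: Callable[[list[str]], str]):
        COMMANDS[name] = func
        return func
    return register


@command("expiring")
def expiring_report(args: list[str]) -> str:
    # Hypothetical workflow: fixed steps, same result shape on every run.
    days = int(args[0]) if args else 90
    return f"report: contracts expiring within {days} days"


def handle(message: str, llm_planner: Callable[[str], str]) -> str:
    """Route commands to fixed code and send everything else to the planner."""
    text = message.strip()
    if text.startswith(COMMAND_PREFIX):
        parts = text[len(COMMAND_PREFIX):].split()
        if not parts or parts[0] not in COMMANDS:
            return f"unknown command: {parts[0] if parts else ''}"
        return COMMANDS[parts[0]](parts[1:])
    # Open-ended questions still go through LLM planning.
    return llm_planner(text)
```

The same registry could back a scheduler, since a scheduled task is just a command invoked on a timer rather than by a user.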
The production system runs on LangGraph as its core orchestration framework, with extensive use of LangSmith for observability. The team emphasized that observability is “a no-brainer” for production systems and that the complexity of their hierarchical planner architecture makes tracing and debugging essential.
The system supports multiple interfaces beyond the chatbot:
The architecture runs in on-premises environments for some components, reflecting enterprise security and data governance requirements. They use multiple LLM models, though specific model choices weren’t detailed in the presentation. The emphasis was on architecture over model selection, with explicit warnings against “chasing benchmarks” or assuming newer, smarter models will solve architectural problems.
A central concept in their AI-native workflow approach is confidence scoring to determine when human judgment is required. Rather than having humans orchestrate every step (the traditional workflow pattern), the default is AI-driven execution with autonomy. The system only pauses for human judgment when confidence scores drop below thresholds, effectively inverting the traditional human-software relationship.
This represents a philosophical shift: humans help software get work done, rather than software helping humans get work done. The focus moves from step-based workflow logic to context that enables personalization. Authority is given by default to the AI for execution, with humans jumping in for judgment calls rather than orchestrating every action.
This approach requires sophisticated confidence scoring mechanisms that the presentation didn’t detail extensively, but the concept reflects a mature understanding of AI system design for production environments where complete automation isn’t possible or desirable, but constant human oversight isn’t scalable.
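Since the presentation did not describe the scoring mechanism, the following only sketches the gating step that sits after it. The thresholds, the `high_stakes` flag, and all names are assumptions chosen for illustration. How the confidence value is produced is deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"      # the agent proceeds on its own
    ASK_HUMAN = "ask_human"  # pause and request a judgment call


@dataclass(frozen=True)
class ProposedAction:
    description: str
    confidence: float          # assumed to lie in [0, 1]
    high_stakes: bool = False  # hypothetical flag for riskier actions


def gate(action: ProposedAction,
         threshold: float = 0.8,
         high_stakes_threshold: float = 0.95) -> Decision:
    """Execute by default and escalate only when confidence is too low."""
    if not 0.0 <= action.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    required = high_stakes_threshold if action.high_stakes else threshold
    if action.confidence >= required:
        return Decision.EXECUTE
    return Decision.ASK_HUMAN
```

The default branch is execution, which mirrors the inversion described above: the human is consulted as an exception rather than at every step.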
The evolution from optional tool to teammate required personalization features that make the system adapt to individual users rather than forcing users to adapt to the system. Cisco implemented long-term memory capabilities, leveraging LangChain contributions in this area, to enable the system to learn user preferences, communication styles, and work patterns over time.
This personalization layer addresses the adoption problem identified earlier. By making the system personal and contextually aware of individual user needs, it becomes less of an external tool and more of an integrated work partner. The presentation emphasized that making systems personal - adapting to users rather than forcing user adaptation - is critical for actual adoption in enterprise environments.
Cisco shared several hard-won lessons from their production deployment at scale:
Accuracy is Table Stakes: Achieving 95% accuracy was necessary but not sufficient. It created a floor for viability but didn’t drive adoption by itself. Personalization and workflow integration proved equally critical.
Move Past Chat: Chatbots are just one surface for interaction. Real production value came from diversifying interaction patterns - APIs, scheduled tasks, email delegation, proactive intelligence - rather than limiting the system to conversational interfaces.
Workflow Integration is Critical: They introduced a concept called “forced curiosity” with management to drive adoption, but the real solution was embedding the AI into existing workflows and consolidating tools rather than adding yet another optional system to the stack.
Proactive Intelligence: Making the system proactive rather than purely reactive - scheduling tasks, sending briefings, identifying issues before they’re queried - significantly improved value perception and adoption.
Beyond system design, Cisco shared infrastructure and automation insights:
Start Bounded, Then Extend: Let agents help before they act. Test reasoning quality in constrained contexts before allowing autonomous action at scale. The team discovered that reasoning quality varies significantly by task type, leading to their hybrid deterministic/LLM approach.
Self-Correction is Not Optional: The difference between pilot and production is implementing self-correction mechanisms where agents fix their own output. At scale, human intervention for every error is impossible.
Quality Ceiling Requires Upfront Investment: Raising the quality ceiling takes upfront investment, including investment in the knowledge supplied to the models themselves. The assumption that newer, smarter models will automatically solve domain-specific problems is a fallacy. Models are helpers, not magic solutions.
Routing is Your First Decision: Before building agents, decide how classification and routing will work. This is hard and critical - without proper routing, models hallucinate because they don’t understand business context. Routing quality determines whether subsequent agent work is even relevant.
Don’t Chase Benchmarks: Model benchmarks don’t translate directly to business value. Focus on architecture, integration, and solving actual business problems rather than deploying the highest-benchmark model.
Carlos candidly acknowledged organizational resistance to AI-native workflow redesign. The people being asked to change are often the same people who built the existing workflows years ago and who currently run the business. Expecting resistance is realistic, and managing that change is part of the LLMOps challenge beyond pure technical implementation.
The shift from workflows built over many years with “human DNA” and human orchestration to AI-native workflows, with a human in the loop only for judgment, requires not just technical transformation but cultural and organizational change. This honest assessment of non-technical barriers distinguishes this case study from purely technical presentations.
This case study represents a mature, production-focused perspective on enterprise LLMOps that acknowledges both successes and failures candidly. Several aspects merit critical consideration:
The 95% Accuracy Plateau: The admission that high accuracy alone didn’t drive adoption is valuable and often overlooked in vendor presentations. However, the case study doesn’t provide quantitative metrics on adoption rates before and after the teammate evolution, leaving questions about the magnitude of improvement.
Complexity vs. Maintainability: The hierarchical planner architecture with subgraphs containing their own planners is sophisticated but potentially complex to maintain. The presentation advocates for lean supervisors but doesn’t detail how they manage the overall system complexity or what observability tooling they’ve built around LangSmith.
Deterministic Workflows: The solution of bypassing LLM reasoning for scheduled tasks and predefined workflows is pragmatic but somewhat contradicts the AI-native workflow vision. It suggests limits to what LLMs can reliably execute, which is honest but indicates the technology hasn’t fully delivered on autonomous workflow execution promises.
Missing Metrics: The presentation lacks quantitative business impact metrics - renewal rate improvements, time savings, cost reductions, or specific adoption figures. Without these, evaluating true production success is difficult.
Model Choices: The deliberate avoidance of discussing specific models in favor of architecture is both a strength (it avoids model-chasing) and a limitation (readers cannot tell which capabilities were actually needed at each system layer).
Human-in-the-Loop Balance: While confidence scoring for human intervention is mentioned, the specifics of how thresholds are set, what percentage of tasks require human judgment, and how false positive/negative rates are managed remain unclear.
The case study is particularly valuable for its honesty about failures and its focus on production realities over demo success. The evolution from chatbot to teammate, the recognition that accuracy alone doesn’t drive adoption, and the hybrid approach combining LLM reasoning with deterministic execution represent mature thinking about enterprise AI deployment. However, readers should recognize this as a work in progress rather than a fully solved problem, with ongoing challenges around complexity management, organizational change, and measuring true business impact.