Company
Kentauros AI
Title
Building Production-Grade AI Agents: Overcoming Reasoning and Tool Challenges
Industry
Tech
Year
2023
Summary (short)
Kentauros AI presents their experience building production-grade AI agents, detailing the challenges in developing agents that can perform complex, open-ended tasks in real-world environments. They identify key challenges in agent reasoning (big brain, little brain, and tool brain problems) and propose solutions through reinforcement learning, generalizable algorithms, and scalable data approaches. Their evolution from G2 to G5 agent architectures demonstrates practical solutions to memory management, task-specific reasoning, and skill modularity.
## Overview

This case study comes from a presentation by Daniel Jeff of Kentauros AI, discussing the practical challenges of building AI agents capable of performing complex, open-ended tasks through GUI (graphical user interface) navigation. The talk, titled "Illogical Logic: Why Agents Are Stupid and What We Can Do About It," provides a candid and technically grounded assessment of current frontier model limitations when deployed in agentic systems, along with Kentauros AI's evolving approach to addressing these limitations.

Daniel Jeff brings an interesting perspective as both a long-time science fiction author (30 years of writing) and an applied AI practitioner. His experience has led him to be skeptical of AGI hype while remaining deeply engaged in pushing the boundaries of what current systems can accomplish. The key insight from this presentation is that actually building production agents quickly reveals the gap between benchmark performance and real-world capability.

## Problem Statement: Why Agents Fail

The presentation identifies that true AI agents—systems capable of doing complex, open-ended tasks in the real world over extended time periods without unrecoverable errors—remain extremely difficult to build. This is contrasted with simpler use cases like "talk to your PDF" or basic web scraping, which don't qualify as genuine agentic reasoning systems.

Kentauros AI has developed a colloquial taxonomy for the problems they repeatedly encounter:

**Big Brain Problems** relate to higher-level reasoning, strategic planning, abstraction, expert understanding, and common sense. These represent failures at the level of understanding what needs to be done and why. For example, Claude's computer use demo was observed getting "bored" and looking at photos instead of completing tasks, or getting stuck on European cookie popups without understanding that scrolling would reveal the submit button.

**Little Brain Problems** involve tactical, in-the-moment decision-making—the subclusters of actions that make up larger tasks. When checking out at a store, for instance, there are numerous micro-decisions (picking the shortest line, answering questions, making small talk) that require contextual reasoning.

**Tool Brain Problems** concern the quality and precision of the interfaces and appendages available to the agent. Even if the reasoning is correct, broken tools prevent successful execution—analogous to knowing how to pick up an object but having a broken hand.

## Cascading Errors: The Core Production Challenge

A crucial insight for LLMOps is that agent failures compound in ways that single-prompt failures do not. In interactive chat scenarios, an incorrect response can be corrected through re-prompting. In agentic systems, errors cascade through the task chain, creating compounding problems that are far more severe.

The presentation provides several concrete examples of this phenomenon. Frontier models frequently revert to outdated API patterns (such as the legacy `openai.ChatCompletion.create` call versus the current `client.chat.completions.create` syntax), causing downstream errors that are difficult to diagnose because they occur many steps into a workflow.
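To make this failure mode concrete, the snippet below contrasts the two OpenAI Python SDK call styles that generated code tends to mix up. This is an illustrative sketch, not code from the Kentauros AI stack; the model name is a placeholder.

```python
# Illustrative only: the two OpenAI SDK call styles that models often conflate.

# Legacy pattern (pre-1.0 SDK) -- code emitted in this style fails against
# current versions of the openai package:
#
#   import openai
#   openai.ChatCompletion.create(
#       model="gpt-4o",
#       messages=[{"role": "user", "content": "hello"}],
#   )

# Current pattern (1.x SDK) -- a client object exposes the endpoint:
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "hello"}],
)
print(response.choices[0].message.content)
```

When a model silently emits the legacy form deep inside a generated tool or script, the resulting `AttributeError` surfaces many steps later in the workflow, which is exactly the cascading-error pattern described above.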
In GUI navigation testing, an agent tasked with finding information about Edward the Confessor on Wikipedia successfully navigated to the correct page, but then inexplicably clicked the language button and spent two minutes trying to switch to English (though it was already viewing the English version) before finally returning the answer that had been visible in the first paragraph all along.

## Lack of Common Sense and Abstract Reasoning

The talk draws on insights from "A Brief History of Intelligence" to highlight fundamental gaps in current models. The example given: if told "I threw a baseball 100 feet above my head, I reached up to catch it and jumped," GPT will typically describe successfully catching the ball—failing to recognize the physical impossibility. The model lacks the common sense world model that humans develop through embodied experience.

Abstract pattern generalization is another gap. Humans who get cut by a knife once immediately abstract concepts of "pain" and "sharp" that transfer to jagged rocks, fence spikes, pikes, and swords without ever seeing them. Multimodal models require extensive visual training on each category of sharp object without naturally generalizing the underlying concept. Similarly, a fish trained to touch its nose to a picture of a frog for food will recognize that frog from any angle—demonstrating a built-in 3D world model that current AI lacks. Multimodal diffusion models need training images from many angles to recognize objects from novel perspectives.

## Technical Architecture: The Five-Part Brain Model

For their G3 (third generation) agent, Kentauros AI developed a five-component cognitive architecture:

- **The Strategist** handles big-brain reasoning—long-term planning and high-level task understanding
- **The Tactician** manages little-brain reasoning—determining the specific steps needed
- **The Critic** provides ongoing evaluation, detecting when the agent is stuck in loops or heading in the wrong direction
- **The Translator** converts semantic actions ("click the blue submit button") into precise coordinates through various tool-brain interfaces
- **The Arm** executes the physical actions

While this can be conceptualized as a multi-agent system, the team prefers to think of it as parts of a unified brain. In practice, this architecture involves multiple models: two OCR models, a 7B Molmo model for clicking precision, and larger frontier models (Claude, GPT-4, Molmo 72B) for higher-level reasoning.

## Tool Brain Improvements: The Soda Click Model

A significant technical contribution involves improving click accuracy for GUI navigation. The team found that frontier models struggled to return exact coordinates for UI elements, so their solution evolved in stages.

Initial experiments with PaliGemma achieved approximately 63% click accuracy. They released the Wave UI dataset on Hugging Face along with the original fine-tuned model. Testing revealed that the Molmo 7B model achieves approximately 83% accuracy, while the Molmo 72B version (originally trained on robotics data with 2 million pairs) achieves approximately 90% accuracy despite never being specifically trained on GUIs.

The team encountered fascinating edge cases requiring custom solutions. Calendar interfaces proved particularly challenging—when zooming in on dates, OCR engines would miss random numbers (e.g., returning coordinates for some dates but mysteriously omitting 2, 7, and 9). The solution required mathematically reconstructing grid positions from the partial OCR results to determine where clicks should land.
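The presentation does not show the reconstruction code, but the idea can be sketched as follows, assuming the OCR returns center coordinates for at least a couple of detected dates in different columns and rows of a standard seven-column month view. All names and the exact fitting strategy here are hypothetical.

```python
# Hypothetical sketch of reconstructing a calendar grid from partial OCR hits.
# `ocr_hits` maps each day number the OCR did find to the (x, y) center of its
# cell; the function names and signature are illustrative, not from the
# Kentauros AI codebase.

def reconstruct_calendar_clicks(ocr_hits: dict[int, tuple[float, float]],
                                first_weekday: int,
                                days_in_month: int = 31,
                                columns: int = 7) -> dict[int, tuple[float, float]]:
    """Estimate a click point for every day, including days the OCR missed."""

    def grid_pos(day: int) -> tuple[int, int]:
        # Column = weekday slot, row = week index within the month view.
        index = first_weekday + day - 1
        return index % columns, index // columns

    # Anchor the grid on one detected day, then fit cell spacing from any other
    # detected day in a different column (width) or different row (height).
    known = [(grid_pos(day), xy) for day, xy in ocr_hits.items()]
    (c0, r0), (x0, y0) = known[0]
    cell_w = next(((x - x0) / (c - c0) for (c, _), (x, _) in known[1:] if c != c0), 0.0)
    cell_h = next(((y - y0) / (r - r0) for (_, r), (_, y) in known[1:] if r != r0), 0.0)

    # Extrapolate every day's center from the fitted grid, filling in the
    # dates the OCR silently dropped.
    clicks = {}
    for day in range(1, days_in_month + 1):
        col, row = grid_pos(day)
        clicks[day] = (x0 + (col - c0) * cell_w, y0 + (row - r0) * cell_h)
    return clicks
```

With the full grid estimated, the translator layer can click a date even when the OCR output omitted it; the approach assumes the month view uses evenly spaced cells.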
## Memory Systems: Beyond Simple RAG

The presentation argues forcefully that storing data in a RAG database is "only the most basic level of memory and it's not even the most important." The retrieval system—how memories are accessed and utilized—is where real agent capability emerges. Kentauros AI built a system supporting multiple memory operations:

- Synthetic memory generation for training
- Annotation of memories as successes or failures
- Human annotation to correct specific behaviors (e.g., "stop clicking the language button, just read the screen")
- AI-driven annotation of sequences
- Retrieval of similar past experiences to inform current actions

In practice, when an agent encounters a situation, the system retrieves five successful and five failed memories of similar scenarios (a minimal sketch of this retrieval pattern appears further below). This can reduce task completion from 25 steps down to 8 in favorable cases, though the approach isn't universally effective when the current problem differs significantly from stored memories.

## The Roadmap: G3 Through G5

The team has outlined a clear progression:

**G3 (current)** focuses on the five-part brain architecture with improved precision tooling and memory integration. This version is described as nearly complete.

**G4** will introduce shared memory across agent swarms (when one agent learns, all benefit) and task-specific reasoning, with a fine-tuning pipeline leveraging workflow data to improve the "big brain."

**G5** targets hot-swappable skills—the ability to load specialized capabilities (e.g., Amazon navigation, Salesforce workflows, specific database interfaces) as needed. This includes real-time skill swapping based on entropy during token generation.

**Reinforcement Learning (RL Strong)** represents a more distant goal where agents are "unleashed" to learn on cloned websites. Drawing from the LLM4Teach paper, this approach uses the LLM to set initial reward policies, allowing agents to quickly learn and eventually exceed the limitations of the LLM reward model through exploration and exploitation.

## Practical Wisdom for LLMOps Teams

Several observations are particularly relevant for production LLM deployments.

The "bitter lesson" is frequently misunderstood as simply "throw more compute at it." The actual insight is that expert knowledge built into systems eventually gets superseded by generalized learning algorithms. However, for practical engineering, expert-system components may be necessary stepping stones toward generalization—the key is being "willing to let it go" when generalized solutions emerge.

For small teams with limited resources, Daniel emphasizes that necessity drives innovation. The constraint of not having billions of dollars forces creative solutions that well-funded labs sometimes miss. Scientific iteration—trying things daily, including approaches that will fail—is essential.

The "magical fix" skepticism is noteworthy: whenever Twitter announces something as a complete solution with responses like "we're all cooked, AGI tomorrow," discount that by approximately 75%. Multi-agent approaches and chain-of-thought reasoning help but are not panaceas. The team observed agents sometimes "thinking themselves out of the answer" with extended reasoning—five steps in they have the correct solution, then ten more steps of reasoning lead them to a wrong conclusion.
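Returning to the memory system described earlier, a minimal sketch of the balanced success/failure retrieval pattern might look like the following. The data structures, embedding similarity, and prompt layout are assumptions for illustration, not the actual Kentauros AI implementation.

```python
# Hypothetical sketch of retrieving balanced success/failure memories for an agent.
# Assumes memories are stored with an embedding of the situation they arose in
# and that embeddings are normalized, so a dot product works as similarity.
from dataclasses import dataclass

import numpy as np


@dataclass
class Memory:
    situation_embedding: np.ndarray  # embedding of the state the agent was in
    transcript: str                  # what the agent did, step by step
    succeeded: bool                  # annotated by a human or another model


def retrieve_balanced_memories(query_embedding: np.ndarray,
                               store: list[Memory],
                               k_per_class: int = 5) -> list[Memory]:
    """Return the k most similar successes and the k most similar failures."""
    def top_k(memories: list[Memory]) -> list[Memory]:
        scored = sorted(
            memories,
            key=lambda m: float(np.dot(query_embedding, m.situation_embedding)),
            reverse=True,
        )
        return scored[:k_per_class]

    successes = top_k([m for m in store if m.succeeded])
    failures = top_k([m for m in store if not m.succeeded])
    return successes + failures


def memories_to_prompt(memories: list[Memory]) -> str:
    """Fold retrieved experiences into the context given to the planner model."""
    blocks = [
        f"[{'SUCCESS' if m.succeeded else 'FAILURE'}]\n{m.transcript}"
        for m in memories
    ]
    return "Relevant past experiences:\n\n" + "\n\n".join(blocks)
```

In a setup like this, the prompt built from retrieved memories would accompany the current observation when the planning components decide the next action, which is how past successes and failures steer the agent away from repeating mistakes.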
## Open Source Contributions

Kentauros AI has made several contributions to the open source community, including:

- The Wave UI dataset on Hugging Face
- The Robbie G2 agent (described as "pretty good" but not production-ready)
- Plans to release their fine-tuned Molmo-based click model
- Documentation of techniques and algorithms for improving agent click accuracy

## Assessment

This presentation offers a refreshingly candid view of production agent challenges from a team actively building these systems. The taxonomy of big brain/little brain/tool brain problems provides useful mental models for debugging agent failures. The emphasis on cascading errors as the fundamental difference between chat and agent deployments is particularly valuable for teams considering agent development.

The technical approaches described—multi-component architectures, specialized fine-tuned models for specific subtasks, sophisticated memory systems with retrieval optimization, and progressive capability development—represent practical engineering patterns rather than theoretical solutions. The acknowledgment that many components are still imperfect, combined with clear roadmaps for improvement, provides realistic expectations for what production agents can currently achieve.

The speaker's skepticism of AGI timelines, grounded in daily experience with frontier model limitations, offers useful context for organizations evaluating agent technology investments. The core message is that building real agents capable of complex, long-running tasks without unrecoverable errors remains an unsolved problem, but systematic engineering approaches can steadily improve capabilities.
