ZenML

Production AI Agents with Dynamic Planning and Reactive Evaluation

Hex 2023

Hex successfully implemented AI agents in production for data science notebooks by developing a unique approach to agent orchestration. They solved key challenges around planning, tool usage, and latency by constraining agent capabilities, building a reactive DAG structure, and optimizing context windows. Their success came from iteratively developing individual capabilities before combining them into agents, keeping humans in the loop, and maintaining tight feedback cycles with users.

Industry

Tech

Overview

This case study comes from a podcast interview with Brian Bishoff, who leads AI at Hex, a data science notebook platform. Hex has built a product called “Magic” that uses AI agents to help data scientists and analysts perform multi-step data analysis tasks. The interview provides valuable insights into the practical challenges of deploying AI agents in production and the specific solutions the Hex team developed to overcome them.

Brian brings a background as a physicist with a PhD in mathematics and prior data science experience at companies like Stitch Fix and Blue Bottle Coffee. This background informs his pragmatic, data-driven approach to LLMOps challenges. The conversation covers agent architecture, prompt engineering, latency optimization, evaluation strategies, and the broader philosophy of building AI-powered products.

The Agent Architecture Challenge

Hex’s initial approach to agents followed what has become a common pattern: having one agent generate a plan, then dispatching sub-agents to execute individual steps of that plan. However, their first attempt did not go well. The team encountered what Brian describes as “too high entropy” - not quite the death loops that plague some agent applications, but unexpected behaviors that made the system unreliable.

The core problem was that they allowed for high diversity in the planning stage. The agent could decide how to structure its plan and orchestrate sub-agents with essentially “anything goes” flexibility. While this initially produced exciting results, it proved unsustainable for production use.

The solution came from being more prescriptive about the types of plans that could be generated. Crucially, this didn’t mean reducing the agent’s planning capabilities, but rather constraining the specific types of steps that could be executed within a plan. Brian uses the weather API example to illustrate: instead of building an ever-expanding library of narrow tools (get_user_location, get_weather, get_precipitation, get_humidity), they built more general but still bounded tools that represent core capabilities.

The key insight is that there’s an inherent tension between tool generality and reliability. Tools that are too general allow the agent to go off the rails; tools that are too narrow require exponentially more tooling and create a rigid, limited interface. Hex found their sweet spot by asking: “What are the core ways the agent should be able to respond and interface with our product?”
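The weather example above can be sketched in code. This is an illustrative toy, not Hex's implementation: one "general but bounded" tool replaces a pile of narrow ones (get_weather, get_precipitation, get_humidity), with a whitelist marking the boundary of what the agent may request. All names and values here are invented.

```python
# Hypothetical "general but bounded" tool: one entry point covers many
# narrow requests, but a whitelist keeps the agent from going off the rails.
ALLOWED_FIELDS = {"temperature", "precipitation", "humidity", "wind"}

def query_weather(location: str, fields: list[str]) -> dict:
    """One bounded tool instead of get_weather/get_precipitation/etc."""
    bad = set(fields) - ALLOWED_FIELDS
    if bad:
        raise ValueError(f"unsupported fields: {sorted(bad)}")
    # A real implementation would call a weather API; we fake the response.
    fake_api = {"temperature": 21.5, "precipitation": 0.0,
                "humidity": 0.43, "wind": 12.0}
    return {"location": location, **{f: fake_api[f] for f in fields}}
```

The whitelist is the "bounded" part: the agent keeps planning freedom over which fields to ask for, but cannot invent capabilities the product doesn't support.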

Bootstrapping from Human Workflows

One of the most valuable insights from this case study is how Hex bootstrapped their agent capabilities from existing human-facing features. Before building a sub-agent that generates SQL as part of answering a question, they first built the standalone capability for users to generate SQL. This approach provided several advantages.

When it came time to make these capabilities available to a supervising agent, they could leverage all these learnings. This mirrors an argument that incumbents with existing workflow tooling may have an advantage in building AI agents - they already have the APIs and abstractions that work for humans, which can then be exposed to AI agents.

The DAG-Based Reactive Architecture

Hex’s agent system constructs a directed acyclic graph (DAG) under the hood based on the inferred plan. This DAG structure serves multiple purposes:

The system infers which steps in the plan require upstream response values, allowing for proper dependency management. Some prompts for downstream agent requests need information from the original user prompt or from upstream agent responses. By understanding these dependencies, the system can execute independent steps in parallel while properly sequencing dependent ones.

The reactive aspect is particularly interesting: if a user makes a change to an upstream agent’s response, the system knows which downstream agents need to be re-run. This reactivity allows users to intervene and correct agent behavior without having to restart the entire pipeline.
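The dependency and reactivity ideas can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Hex's code: each step declares which upstream steps it needs, and editing an upstream response tells us exactly which downstream steps must be re-run.

```python
# Minimal sketch of a reactive DAG of agent steps: each step names the
# upstream steps whose responses it needs; editing an upstream response
# invalidates exactly its transitive downstream steps.
from collections import defaultdict

class AgentDAG:
    def __init__(self):
        self.deps = {}                    # step -> set of upstream steps
        self.children = defaultdict(set)  # step -> downstream steps

    def add_step(self, name, needs=()):
        self.deps[name] = set(needs)
        for upstream in needs:
            self.children[upstream].add(name)

    def invalidated_by(self, changed):
        """All downstream steps that must re-run after `changed` is edited."""
        stale, stack = set(), [changed]
        while stack:
            for child in self.children[stack.pop()]:
                if child not in stale:
                    stale.add(child)
                    stack.append(child)
        return stale

# The short chain from the data-science workflow: get data -> transform -> chart.
dag = AgentDAG()
dag.add_step("get_data")
dag.add_step("transform", needs=["get_data"])
dag.add_step("chart", needs=["transform"])
```

Steps with no dependency path between them never appear in each other's `invalidated_by` sets, which is also what licenses running them in parallel.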

Latency Optimization Through Context Minimization

One of the most striking aspects of Hex’s implementation is the speed of their agent system. When the interviewer expressed surprise at how fast the system responds despite calling multiple models, Brian revealed their approach is fundamentally about context minimization rather than model fine-tuning or using smaller models.

The team uses GPT-4 Turbo as their primary model, falling back to GPT-3.5 only when they’ve identified it’s capable enough for specific tasks. The speed comes from aggressive pruning of what goes into the context window.
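One way to read this model policy is as eval-gated routing: default to the strongest model, and downshift only for task types that evaluations have shown the cheaper model handles. The sketch below is an assumption about the shape of such a router, with invented task names.

```python
# Hypothetical task-based model routing: strongest model by default,
# cheaper model only for task types vetted by evals. Task names invented.
CHEAP_MODEL_OK = {"format_sql", "rename_columns"}  # proven via evals

def pick_model(task_type: str) -> str:
    return "gpt-3.5-turbo" if task_type in CHEAP_MODEL_OK else "gpt-4-turbo"
```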

Brian describes a Michelangelo-like approach to prompt engineering: “I’m trying to get rid of all the stone that shouldn’t be there.” This contrasts with what he observed at AI meetups in summer 2023, where the focus was on jamming more and more into expanding context windows.
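A minimal sketch of the chisel-away idea, assuming a relevance-scored pool of candidate context snippets and a crude word-count token proxy (real systems would use a proper tokenizer): keep only the highest-value context that fits a tight budget, rather than filling the window.

```python
# Hedged sketch of context minimization: rank candidate context snippets
# by relevance and keep only what fits a tight token budget.
def prune_context(snippets, budget_tokens):
    """snippets: list of (relevance_score, text); returns kept texts."""
    kept, used = [], 0
    for score, text in sorted(snippets, key=lambda s: -s[0]):
        cost = len(text.split())  # crude token proxy; use a tokenizer in practice
        if used + cost <= budget_tokens:
            kept.append(text)
            used += cost
    return kept
```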

Their process typically starts big and goes small: begin with generous context, then iteratively prune anything that doesn't measurably change the output.

The Text-to-SQL Challenge

Brian provides an illuminating walkthrough of why text-to-SQL products fail, which has broader implications for any RAG or context-augmented LLM system. He describes a progression of challenges:

The naive approach assumes the model knows SQL. Even if it’s perfect at SQL, it doesn’t know your schema. Adding RAG over the schema gets you to “level zero.” But then there’s institutional knowledge: before 2020, customers were in a different table due to a data warehouse migration. The data team knows this, some GTM team members know it, but it’s not documented anywhere.

This leads to the need for a data dictionary with “meta-metadata” - not just table schemas but all the caveats and institutional knowledge about the data. Brian jokes that “every table has a caveat.”
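The "meta-metadata" idea can be made concrete with a toy data-dictionary entry. This is illustrative only (the table and caveat below are invented, echoing the migration example): the caveats a new analyst would learn by word of mouth travel alongside the schema into the text-to-SQL prompt.

```python
# Illustrative "data dictionary" entry carrying meta-metadata: not just
# the schema, but the caveats that usually live only in people's heads.
DATA_DICTIONARY = {
    "customers": {
        "columns": ["id", "name", "signup_date"],
        "caveats": [
            "Pre-2020 customers live in legacy_customers due to a "
            "warehouse migration; union both tables for full history.",
        ],
    },
}

def schema_context(table: str) -> str:
    """Render schema plus caveats for inclusion in a text-to-SQL prompt."""
    entry = DATA_DICTIONARY[table]
    lines = [f"TABLE {table}({', '.join(entry['columns'])})"]
    lines += [f"-- CAVEAT: {c}" for c in entry["caveats"]]
    return "\n".join(lines)
```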

The takeaway for LLMOps practitioners is that AI systems can’t do things that humans can’t do without context. A new colleague joining the team would get this tacit knowledge from conversations with colleagues - it’s not all documented. Unless an AI system can somehow acquire this tacit information, it will hit the same walls.

Hex’s solution involves making documentation easy (through their Data Manager), using existing known context about how people query the data, leveraging latent representations of data relationships, and applying sophisticated search and retrieval methods from RAG and recommendation systems.

Keeping Humans in the Loop

A common critique of agents is that unreliable steps compound exponentially - if each step is 95% reliable and you chain 10 of them, overall reliability drops dramatically. Brian’s response is instructive: keep the chain short.
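The arithmetic behind that critique is worth spelling out: if each step succeeds independently with probability p, a chain of n steps succeeds with probability p^n.

```python
# Compounding-error arithmetic behind "keep the chain short": per-step
# reliability p over n independent steps gives end-to-end reliability p**n.
def chain_reliability(p: float, n: int) -> float:
    return p ** n

# Ten 95%-reliable steps drop to roughly 60% end-to-end,
# while a three-step chain stays around 86%.
```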

For a data science workflow, the core components are: get data, transform data, make a chart. A surprising percentage of analytics questions can be answered with just this set of agent responses. Rather than building Devin-style “click go, come back in two days” systems, Hex focuses on keeping the user close to the agent pipeline.

The agents handle doing more things simultaneously and reactively, but everything comes from an interactivity paradigm. Users are right there to observe and reflect. This reduces death loops and increases productive iteration.

The UX lesson generalizes: the shift from completion to chat paradigms changed user expectations. In a chat paradigm, users expect to take turns and correct things. Similarly, keeping humans close to agent actions gives more opportunities to steer the system back on course.

Evaluation Philosophy

Brian’s perspective on evaluation comes from his background in recommendation systems (fashion recommendations at Stitch Fix, coffee recommendations), where there’s no easy ground truth. His claim is that current LLM evaluation challenges aren’t fundamentally new - “you oversimplified or you worked downstream of the hard work.”

The approach at Hex involves building a family of evaluators where each tests one specific thing they care about. Using a hypothetical children’s education product as an example, he illustrates this one-evaluator-per-concern approach.

At Hex, many evaluators aren’t broadly reusable across hundreds of test cases. Instead, each evaluation case comes with its own set of assertions. Some are reusable (did it generate a data frame?), but there’s significant nuance in determining correctness.
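A sketch of what "each case carries its own assertions" might look like, with invented names: some checks are reusable across cases (did it produce a data frame?), while others are written for one case only.

```python
# Sketch: each eval case bundles its own assertions. Some checks are
# reusable, others case-specific. All names here are invented.
def has_dataframe(result) -> bool:          # reusable assertion
    return "dataframe" in result

EVAL_CASES = [
    {
        "prompt": "monthly revenue by region",
        "assertions": [
            has_dataframe,
            lambda r: "region" in r.get("columns", []),  # case-specific
        ],
    },
]

def run_case(case, result) -> bool:
    return all(check(result) for check in case["assertions"])
```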

Execution evaluation is particularly valuable for code generation: executing the result and comparing the environment before and after provides rich information about correctness. The team simulates environments and compares target code outputs against agent predictions.
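The environment-comparison idea can be sketched as follows, under stated assumptions (trusted eval fixtures executed with Python's `exec`; a real system would sandbox and compare richer state): run the reference code and the agent's code in fresh namespaces and compare the variables each produces, rather than diffing the code text.

```python
# Sketch of execution-based evaluation: compare what the code *produces*,
# not what it *says*. Only safe for trusted fixtures; sandbox otherwise.
def run_in_env(code: str) -> dict:
    env: dict = {}
    exec(code, {}, env)  # populate a fresh namespace with the code's outputs
    return env

def outputs_match(reference: str, candidate: str, targets: list[str]) -> bool:
    """True if both snippets produce equal values for each target variable."""
    ref, cand = run_in_env(reference), run_in_env(candidate)
    return all(ref.get(t) == cand.get(t) for t in targets)
```

Note how this accepts semantically equivalent but textually different code, which is exactly the property string-matching evaluators miss.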

Brian sees evals as the new feature spec for AI features: “these are the evals I want to pass.” It resembles an unusual form of test-driven development, but it provides a concrete target: what the user asks, what the user should get in the ideal case, distilled down to the important aspects.

Adapting Research to Production

A key insight is that research results rarely translate directly to production applications. Hex uses a variant of Chain of Thought that’s crucial for their planning stage performance, but it’s not the precise version from papers. The same applies to other techniques - they’ve had to take existing knowledge and resources and reframe them for their specific paradigm.

This translation work is important enough that Hex has a research part of their organization focused specifically on adapting techniques for applied use. The lesson for LLMOps practitioners is that research papers provide inspiration and direction, but the specific implementation will always need customization.

The Data-Driven Mindset

Brian’s parting advice emphasizes the importance of constantly examining data. Engineers from software backgrounds focus on systems and reliability; those from data science backgrounds focus obsessively on the data - inputs, outputs, understanding what’s happening.

His friend Eugene spends hours every week just looking at application data. Hex has a weekly “evals party” inspired by conversations with AI managers at Notion - dedicated time to immerse in the data of what their system is producing. This is presented as the key “alpha” for AI practitioners: you will answer most hard product questions by looking at the data generated by users and agents.
