ZenML

Building and Deploying Production AI Agents for Enterprise Data Analysis

Asterrave 2024

Asterrave's CTO (the company formerly known as Rosco) shares their two-year journey of rebuilding their product around AI agents for enterprise data analysis. They focused on enabling agents to reason rather than rely on static knowledge, developing discrete tool calls for data warehouse queries, and creating effective agent-computer interfaces. The team discovered key insights about model selection, response formatting, and multi-agent architectures while avoiding fine-tuning and third-party frameworks. Their solution successfully enabled AI agents to query enterprise data warehouses with proper security credentials and user permissions.

Industry

Tech

Overview

Patrick, co-founder and CTO of Asterrave (formerly Rosco), shares extensive lessons learned from a two-year journey of rebuilding their entire product around AI agents. The product enables AI agents to search and query enterprise data warehouses on behalf of users, essentially allowing natural language interaction with complex SQL databases. This case study provides valuable insights into the practical challenges of deploying AI agents in production enterprise environments.

Agent Definition and Architecture

Patrick establishes a precise definition for what constitutes an AI agent, requiring three specific criteria: the ability to take directions (human or AI-provided) toward a specific objective, access to at least one tool with response capability, and autonomous reasoning about how and when to use tools. Critically, he emphasizes that predefined sequences of tool calls in a prompt-chained setup do not qualify as true agents—the system must demonstrate autonomous reasoning capabilities.

This definition has significant implications for production systems because it means the agent architecture must support dynamic decision-making rather than fixed workflows. The team found this flexibility essential for handling the diverse and unpredictable nature of enterprise data queries.
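The three criteria above imply a particular loop shape: the model, not the developer, decides which tool to call next. A minimal sketch of that loop, with the model's decision-making stubbed out as a plain Python function (in production this step would be an LLM call; all names here are illustrative):

```python
# Minimal agent loop matching the definition above: an objective, a set of
# tools, and an autonomous "choose next step" policy. The policy is a stub
# standing in for an LLM call.

def run_agent(objective, tools, choose_next_step, max_steps=10):
    """Loop until the policy decides the objective is met."""
    history = [("objective", objective)]
    for _ in range(max_steps):
        decision = choose_next_step(history)       # autonomous reasoning step
        if decision["action"] == "finish":
            return decision["answer"]
        tool = tools[decision["action"]]           # dynamic tool selection
        result = tool(**decision.get("args", {}))
        history.append((decision["action"], result))
    raise RuntimeError("step budget exhausted")

# Stub policy: search first, then answer from what was found.
def demo_policy(history):
    seen = {step for step, _ in history}
    if "search_tables" not in seen:
        return {"action": "search_tables", "args": {"query": "churn"}}
    return {"action": "finish", "answer": history[-1][1]}

tools = {"search_tables": lambda query: f"tables matching {query!r}: []"}
print(run_agent("calculate churn", tools, demo_policy))
```

The contrast with a prompt-chained pipeline is that the tool sequence is not fixed in advance: a different policy (or model) could take a different path through the same tools.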

Reasoning Over Knowledge: A Core Architectural Decision

One of the most significant lessons from Asterrave’s experience was the importance of enabling agents to think rather than relying on what the underlying model knows. This insight led to a fundamental shift away from traditional RAG (Retrieval Augmented Generation) approaches where content is inserted into system prompts. Instead, the team focused on discrete tool calls that allowed agents to perform retrieval and gather relevant context dynamically during task execution.

The SQL generation use case perfectly illustrates this principle. When agents were given comprehensive table schemas with all columns upfront, they frequently failed to reason correctly about which tables and columns to use. The models became overwhelmed by the token count in the prompt and either chose incorrect options or wrote queries that failed to execute. The solution was to implement simpler building blocks of tool calls such as “search tables,” “get table detail,” or “profile a column.” The agent then used these iteratively to find the right columns for the right query.

Model Selection and Behavior Differences

The case study provides valuable comparative insights across different LLM models in production agent scenarios. Patrick contrasts GPT-4o’s behavior with reasoning models like o1, using a practical example involving Salesforce data schemas (accounts, contacts, opportunities tables).

When asked to write a query calculating customer churn, GPT-4o was heavily incentivized to produce SQL regardless of whether the underlying data could support the query. It made assumptions and wrote SQL with poor definitions for calculating churn, potentially leading analysts to incorrect conclusions. The model showed no inclination to push back or consider whether the task was actually possible given the schema.

In contrast, o1 reasoned through the various aspects of the question and accurately concluded that there was no way to calculate churn status given the schema provided. This demonstrates the critical importance of model selection for agent orchestration tasks where stopping to reason before acting is essential.

Patrick notes that Claude 3.5 Sonnet remains his preferred model for agent orchestration, citing its balance between speed, cost, and decision quality. However, he acknowledges that cheaper models can be appropriate for specific tool calls or sub-prompts, as long as the core decision-making about which tool to call next runs on a generally intelligent model.

Agent Computer Interface (ACI) Optimization

A substantial portion of the lessons learned relates to what Patrick calls the Agent Computer Interface—the exact syntax and structure of tool calls, including both input arguments and response formats. The team found that seemingly trivial tweaks to the ACI could have massive impacts on agent accuracy and performance.

Specific examples include format optimization for different models. With GPT-4o, the team initially formatted search result payloads as markdown. However, they observed cases where the agent would claim a column did not exist even when it was clearly visible in the tool response. This was particularly problematic for customers with large tables (500-1000 columns) producing responses of 30,000+ tokens. Switching the response format from markdown to JSON immediately resolved this issue for GPT-4o.

Interestingly, when working with Claude, the team discovered that XML formatting was significantly more effective than JSON. This model-specific behavior likely stems from differences in training data and underscores the importance of experimenting with response formats in production deployments.
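One way to operationalize this finding is to make the response serializer a per-model switch. A small sketch, where the model-to-format mapping encodes the observations above (JSON for GPT-4o, XML for Claude) and the payload fields are illustrative:

```python
import json
from xml.sax.saxutils import escape

# Render the same tool-response payload in the format a given model handles
# best. The mapping below is an assumption drawn from the case study.

PREFERRED_FORMAT = {"gpt-4o": "json", "claude-3-5-sonnet": "xml"}

def format_tool_response(payload: dict, model: str) -> str:
    fmt = PREFERRED_FORMAT.get(model, "json")
    if fmt == "xml":
        fields = "".join(
            f"<{k}>{escape(str(v))}</{k}>" for k, v in payload.items()
        )
        return f"<tool_response>{fields}</tool_response>"
    return json.dumps(payload)

payload = {"table": "accounts", "column_count": 512}
print(format_tool_response(payload, "gpt-4o"))
print(format_tool_response(payload, "claude-3-5-sonnet"))
```

Keeping the serializer separate from the tool logic means a format change (like the markdown-to-JSON switch described above) touches one function rather than every tool.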

Learning from Agent Failure Modes

Patrick shares a valuable debugging technique: observing how agents hallucinate can reveal what the model expects from tool calls. If an agent consistently ignores the provided JSON schema and provides arguments in a different format, this indicates the model’s native expectations. Adapting tool definitions to match these expectations—rather than forcing the model into a different format—generally improves agent performance by aligning with training data patterns.
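In practice this can be as simple as logging the arguments the model actually emits and mapping recurring "hallucinated" keys back onto the declared schema. A hedged sketch (the alias table and tool schema are hypothetical examples of the pattern, not Asterrave's actual definitions):

```python
import json

# Adapt tool definitions to the model's native expectations: parse the raw
# argument string the model emitted, map observed hallucinated key names
# onto the declared schema, and surface anything still unexplained.

DECLARED_ARGS = {"get_table_detail": {"table"}}
ALIASES = {"table_name": "table", "tbl": "table"}  # observed in agent traces

def normalize_args(tool: str, raw: str) -> dict:
    """Parse model-emitted args, mapping known aliases to schema keys."""
    args = json.loads(raw)
    fixed = {ALIASES.get(k, k): v for k, v in args.items()}
    unknown = set(fixed) - DECLARED_ARGS[tool]
    if unknown:
        # Recurring unknown keys are candidates for renaming the schema
        # itself, per the advice above.
        raise ValueError(f"unmapped keys for {tool}: {unknown}")
    return fixed

print(normalize_args("get_table_detail", '{"table_name": "accounts"}'))
```

If the same alias keeps appearing, the stronger fix is the one Patrick describes: rename the declared parameter to match the model's expectation rather than translating forever.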

Fine-Tuning: A Cautionary Tale

The team concluded that fine-tuning models was largely a waste of time for agent applications. If the premise is that agents should focus on reasoning over inherent knowledge, then fine-tuning provides little benefit for reasoning improvement. In Asterrave’s experience, fine-tuning actually decreased reasoning capabilities in many cases because it overfit the model to perform specific sequences of tasks rather than stopping to evaluate whether it was making the right decision.

Patrick recommends investing time in ACI iteration rather than building fine-tuned models for agent applications. This represents a significant departure from common assumptions about LLM optimization and has important implications for LLMOps resource allocation.

Framework Selection and Custom Implementation

When asked about framework choices, the team ultimately decided against using abstraction libraries like LangGraph or CrewAI for two reasons. First, when development started two years ago, these frameworks weren’t publicly available. Second, and more importantly, production requirements created significant blockers that made framework adoption impractical.

A key example was the need for end-user security credentials to cascade down to agents. When users query their Snowflake accounts through an agent, they may have granular permissions controlling what data they can access. The agent needed to run with those user-specific permissions using OAuth integration. Managing the authentication process and underlying service keys within a third-party framework proved extremely difficult to build and scale.
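The credential-cascading requirement can be sketched as binding each agent's query tool to a session opened with the end user's own OAuth token rather than a shared service key. This is a minimal illustration of the pattern only; the session class below stands in for a real warehouse connection, and all names are hypothetical:

```python
from dataclasses import dataclass

# Sketch of end-user credential cascading: every tool call runs inside a
# session scoped to one user's OAuth token, so warehouse-side permission
# checks (e.g. Snowflake row/column grants) apply to the agent's queries.

@dataclass
class UserScopedSession:
    user: str
    oauth_token: str           # the user's token, never a shared service key

    def run(self, sql: str) -> str:
        # A real implementation would hand the token to the warehouse
        # driver (e.g. an OAuth authenticator in the Snowflake connector).
        return f"[{self.user}] {sql}"

def make_query_tool(user: str, oauth_token: str):
    """Bind the agent's query tool to a single user's credentials."""
    session = UserScopedSession(user, oauth_token)
    return lambda sql: session.run(sql)

query = make_query_tool("analyst@example.com", "tok-123")
print(query("SELECT 1"))
```

The point of the closure is that the agent never sees or chooses credentials; it only receives a tool that is already scoped to the requesting user, which is the property that was hard to express inside third-party frameworks.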

The lesson for LLMOps practitioners is to consider end goals before becoming dependent on frameworks. Patrick notes that building an agent or even a multi-agent system requires less code than commonly assumed. While abstractions can accelerate prototyping and validation, production deployments often benefit from custom implementation that provides full control over security, authentication, and integration requirements.

System Prompts and Competitive Moats

Patrick challenges the common assumption that system prompts represent significant intellectual property. He argues that the most valuable and defensible aspects of agent products are the ecosystem around the agent—including user experience design for agent interaction and the security protocols and connections the agent must follow. These elements constitute the most time-consuming aspects of building production-quality agents and represent the true competitive differentiation.

Multi-Agent System Design

Approximately one year into their agent transition, Asterrave introduced multi-agent concepts as customers became comfortable with single agents. Several key principles emerged from this experience.

The team found that implementing a manager agent within a hierarchy was essential. The manager agent owns the final outcome but delegates subtasks to specialized worker agents with more specific instructions and tool calls. Giving all information to a single manager agent caused it to become overwhelmed and make poor decisions.

A “two pizza rule” similar to Amazon’s team design philosophy applied to agent teams. Limiting multi-agent systems to approximately 5-8 agents produced the best results. Systems with 25 or 50 agents strongly decreased the likelihood of achieving desired outcomes due to infinite loops and unrecoverable paths.

Incentivization proved more effective than forcing worker agents through discrete steps. The goal should be to describe and “reward” the manager agent for accomplishing the overall objective while relying on it to manage underlying worker agents and validate that their output contributes to the broader outcome.
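The three principles above (a manager that owns the outcome, a hard cap on team size, and guardrails against infinite loops) can be sketched together. Worker logic is stubbed and all names are hypothetical; in production the manager and workers would each be LLM-driven agents:

```python
# Manager/worker sketch following the multi-agent principles above:
# a "two pizza" team cap, a step budget against unrecoverable loops, and a
# manager that owns the final outcome while delegating subtasks.

MAX_WORKERS = 8   # ~5-8 agents produced the best results
MAX_STEPS = 20    # guard against infinite delegation loops

def run_manager(objective, workers, plan):
    if len(workers) > MAX_WORKERS:
        raise ValueError("team too large; accuracy degrades past ~8 agents")
    contributions = {}
    for step, (worker_name, subtask) in enumerate(plan):
        if step >= MAX_STEPS:
            raise RuntimeError("step budget exhausted")
        output = workers[worker_name](subtask)
        # The manager validates that each worker's output contributes to
        # the broader objective before accepting it.
        if output is not None:
            contributions[worker_name] = output
    return {"objective": objective, "contributions": contributions}

workers = {
    "schema_scout": lambda task: f"tables for {task}",
    "sql_writer": lambda task: f"SELECT ... -- {task}",
}
plan = [("schema_scout", "churn"), ("sql_writer", "churn query")]
print(run_manager("calculate churn", workers, plan))
```

Note that the plan here is just a list; the incentivization point above suggests the real manager should generate and revise this plan itself, being "rewarded" for the overall objective rather than marched through fixed steps.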

Production Deployment Considerations

Throughout the case study, several production-specific concerns emerge that distinguish this from experimental or prototype work. Security credential management and OAuth integration for enterprise data sources represent significant engineering challenges. Performance optimization through format selection and ACI tuning directly impacts user experience and system reliability. The choice of orchestration model affects both cost and quality in ways that require careful balancing. Multi-agent coordination at scale introduces failure modes (infinite loops, lost context) that don’t appear in single-agent prototypes.

These lessons represent hard-won knowledge from two years of production deployment and iteration, providing practical guidance for teams building similar enterprise AI agent systems.
