ZenML

Building Production Analytics Agents with Semantic Layer Integration

Wobby 2025
View original source

Wobby, a company that helps business teams get insights from their data warehouses in under one minute, shares their journey building production-ready analytics agents over two years. The team developed three specialized agents (Quick, Deep, and Steward) that work with semantic layers to answer business questions. Their solution emphasizes Slack/Teams integration for adoption, building their own semantic layer to encode business logic, preferring prompt-based logic over complex workflows, implementing comprehensive testing strategies beyond just evals, and optimizing for latency through caching and progressive disclosure. The approach led to successful adoption by clients, with analytics agents being actively used in production to handle ad-hoc business intelligence queries.

Industry

Tech

Technologies

Overview

Wobby is a company that has spent approximately two years building production analytics agents designed to help business teams extract insights from their data warehouses in under one minute. The presentation by Quinton, who leads AI engineering at Wobby, offers a deeply pragmatic and sometimes contrarian perspective on building and deploying LLM-based agents in production environments. The company has developed three distinct analytics agents: the Quick agent for straightforward everyday questions, the Deep agent for open-ended research queries, and the Steward agent which helps build and maintain the semantic layer that encodes business logic.

The case study is particularly notable for its emphasis on non-technical factors like user adoption through communication channels, the critical importance of semantic layers and business context, and a strong preference for comprehensive testing over evaluation-heavy approaches. Wobby’s journey represents a realistic account of the challenges, pivots, and pragmatic decisions required to build agents that actually get used in production business environments.

Communication Channels and User Adoption

One of Wobby’s most distinctive insights is placing communication channel integration and user experience at the forefront of their success formula, rather than treating it as an afterthought. The team found that adoption of an agent is fundamentally more than just a chat UI—it’s about creating a “group feeling” where people naturally adopt the technology. This insight was validated by client feedback indicating that Slack and Teams integrations were key reasons for rapid agent adoption.

The company integrated their agents directly into Slack and Microsoft Teams, allowing business users to interact with analytics agents in their existing workflows. These integrations proved more complex than anticipated, requiring significant engineering effort to handle different UX constraints between platforms. For example, Slack allows onboarding workflows that help business teams understand how to query agents and receive answers in different modes, while Teams has different capabilities and limitations.

A critical benefit of the group chat approach is the emergence of “bright business people” who ask creative questions, helping demonstrate successful agent usage to peers. This creates healthy peer pressure and social proof that encourages broader adoption. Additionally, having customer support personnel present in these channels allows the team to observe failure modes in real-time, adding problematic queries to their evaluation datasets for debugging and improvement.

Business Logic Awareness and Semantic Layers

Wobby’s second major insight centers on the absolute necessity of encoding business logic and context for text-to-SQL applications. The team uses a compelling analogy: if you gave a smart person full access to your data catalog who had never seen it before and asked them to produce valid queries for key business metrics, they would likely fail—and so would an agent without proper context.

Initially, the team overindexed on text-to-SQL benchmarks like Spider and BIRD, attempting to achieve high scores on these academic datasets. However, they discovered that even the best teams in the world plateau at around 80% performance on these benchmarks, which contain broken and ambiguous queries. The fundamental problem isn’t SQL syntax—it’s the contextual understanding of what the data represents and how it relates to business processes.

Wobby’s first approach involved building agents that used RAG with parameterized query templates—essentially a library of “golden” SQL queries that could be reused with different parameters. The hope was that providing context through these examples would help agents adopt to business-specific needs. However, this approach resulted in 100+ similar-looking SQL queries that created more review work for analysts and didn’t align with how BI teams actually operate day-to-day.

The breakthrough came with their second version, which involved building a semantic layer—a logical abstraction that sits between raw database tables (the physical layer) and dashboards. This semantic layer encodes business logic using concepts familiar to BI teams: metrics, dimensions, and organized SQL queries that follow the “don’t repeat yourself” principle. In database terminology, this separates the physical layer (raw data) from the logical layer (business concepts).

This architectural decision led to a multi-agent system: the analytics agents (Quick and Deep) only interact with the logical semantic layer and never touch raw data, while the Steward agent has access to both the physical layer and can suggest and create new dimensions, metrics, and KPIs in the semantic layer. This division better aligns with how BI teams and data governance processes actually work.

Wobby made the ambitious decision to build their own semantic layer rather than solely integrating with existing solutions like dbt’s Metric Flow, Cube, or DataBricks’ offerings. This decision was enabled by modern AI-assisted coding and the availability of open-source semantic layer projects to reference. The team needed a solution that was “agent-first” and could provide integrations across multiple existing semantic layer platforms. Their semantic layer includes its own SQL compiler that references entities in the logical layer and translates queries to different SQL dialects.

The Steward agent plays a unique role in this ecosystem. It can connect to documentation sources (ideally through protocols like MCP in secure environments, though in practice often through PDF exports) and helps build out the semantic layer by suggesting dimensions and metrics. There’s a lifecycle management approach where changes to the semantic layer are tested in separate environments before being merged into production, similar to software development practices.

Code Architecture and Framework Decisions

Wobby’s approach to code architecture reflects hard-won lessons about framework selection and maintainability. When they started building two years ago, popular agent frameworks were not yet in their v1 release cycles, and the team was skeptical about adopting frameworks after experiencing “questionable abstractions and breaking APIs” from early versions of tools like LangChain.

Rather than adopting multi-agent network frameworks or MCP (Model Context Protocol) server architectures, Wobby opted for clean architecture principles based on Uncle Bob’s approach. They built around stable interfaces that different parts of the software stack depend on, giving them direct control over what happens at the app layer versus the agent layer. This flexibility was critical because some responsibilities naturally belonged in the application layer rather than being managed entirely by an agent framework.

Useful abstractions they implemented include:

Notably, several popular patterns proved not useful for their use case:

Testing Philosophy: Tests Before Evals

One of Wobby’s most contrarian positions is that evaluations (evals) are “overhyped for early stage projects.” Coming from machine learning engineering backgrounds where the focus is data → model → product, the team realized that in production engineering, you must focus on product first. This means erring on the side of comprehensive test cases before investing heavily in evaluation infrastructure.

The key insight is recognizing when something is a bug versus when it’s a performance question. If a tool doesn’t return correct data due to an access management issue, or if a semantic search doesn’t yield expected results, or if API response processing loses information—these are bugs that should be caught by automated tests in CI/CD pipelines, not discovered through evals.

Wobby uses a “Swiss cheese model” for thinking about testing in layers, where bugs travel from left to right and different layers catch different types of issues. Earlier layers (like unit tests for prompt formatting, checking for consistent markdown headers, handling completion service downtime) are easier to implement and should catch common issues. Later layers (like comprehensive evals) are harder to implement and should focus on agent performance across different configurations rather than catching bugs that could be prevented earlier.

This doesn’t mean evals aren’t important—they are critical for measuring agent performance on representative datasets and comparing different agent configurations. But the team found that mixing concerns by putting basic functionality testing into the eval layer makes debugging harder and creates unnecessary complexity.

Wobby’s eval infrastructure consists of a Streamlit app that can spawn as a job container, potentially integrating into CI pipelines. They emphasize documentation and collaboration, encouraging their own team to use agents against their own database so that when errors occur, they can be added to eval datasets. This creates a feedback loop between real usage and quality improvement.

Prompts Over Workflows

Another significant architectural decision was preferring logic in prompts over deterministic workflows. Initially, the conventional wisdom suggested that workflows were better than agents because they’re easier to test, more deterministic, and represent better system design. Agents were seen as suitable only for rare, highly non-deterministic cases like open-ended chat.

However, Wobby found that as large language models improved in instruction-following capabilities, putting logic in prompts became increasingly practical and offered significant advantages:

The team acknowledges this is domain-specific and that workflows still have their place for single-completion scenarios. But for their text-to-SQL analytics use case, prompt-based logic aligned better with modern LLM capabilities. This is what Andrej Karpathy calls “programming the neural net”—as small language models become exceptionally good at following instructions, there’s a strong case for more portable logic encoded directly in prompts.

That said, Wobby recognizes that simple questions don’t always have simple answers, so they manage user expectations about response times and ensure users understand when complex queries might take longer.

Evaluation Methodology

For evaluation, Wobby developed a “composite quality metric” approach designed specifically for text-to-SQL applications. The fundamental challenge is that text-to-SQL is inherently an offline evaluation problem—you cannot judge if a generated SQL query is correct in an online setting without a reference query to compare against.

Their evaluation framework includes:

The composite scoring approach is particularly sophisticated. Different agent trajectories (paths through the tool-calling process) can end in different states, and it’s important to have a single comparable metric across all trajectories. Their weighted score considers both the trajectory taken and the expected outcomes for that specific path, allowing for interpretable comparisons even when agents solve problems in different ways.

This “Overall Agent Effectiveness” score enables the team to track performance across agent runs regardless of how the agent reaches its conclusions, providing a single interpretable metric for agent quality.

Latency Optimization

Wobby places significant emphasis on latency optimization, operating under the principle that faster agents get used more frequently. Their latency optimization strategies include:

These optimizations reflect a deep understanding of the token-level economics of LLM applications and the user experience impact of response latency.

Observability and Monitoring

For observability, Wobby takes a multi-tool approach using both LangFuse and Logfire, recognizing that different tools have different strengths:

Their strategy involves using decorators to wrap functions that call LLMs, ensuring that inputs and outputs are captured regardless of whether the function succeeds, errors, or calls other functions. They built their own wrappers that call both LangFuse and Logfire, rather than depending directly on either API. This abstraction layer protects them from breaking changes when observability platforms update their APIs from v2 to v3, preventing disruption to their observability infrastructure.

The team emphasizes building wrappers around all external APIs, not just observability tools, to maintain flexibility and reduce the impact of external changes on their codebase.

Data Representation and Format

Wobby experimented with different data representation formats for presenting query results and schema information to agents. They evaluated trending formats like Tson (which uses abbreviations and compression) but were concerned that since Tson isn’t widely represented in LLM training data, models might not reliably follow its conventions without extensive system prompt additions.

Instead, they use the Tablib library to format dataframes in various formats and tested which table representations led to better agent performance. They found that GitHub-flavored markdown tables work well, likely because large amounts of LLM training data comes from GitHub. The compression benefits of formats like Tson matter, but standard markdown tables proved sufficient for their needs.

Business Context and Data Quality Challenges

A recurring theme in the presentation is that the technical challenge of text-to-SQL is fundamentally a context and data quality problem. Many companies don’t have trustworthy, well-governed data ecosystems. Simply unleashing an agent on poorly organized data with unclear business definitions will produce wrong results.

The iterative process of building and refining the semantic layer is actually where the value emerges. As business teams use agents and encounter errors, they identify missing context or ambiguous definitions. Adding these to the semantic layer and evaluation datasets creates a virtuous cycle where the agent becomes increasingly accurate for that specific business context.

Wobby positions themselves not just as a tool provider but as a partner helping data engineering teams build proper semantic layers and well-governed data ecosystems. The promise of useful agents provides motivation for companies to invest in data governance—something that was often deprioritized despite everyone recognizing its importance.

For companies with data distributed across multiple systems with non-interoperable language, Wobby’s pragmatic approach is to focus on teams that have already consolidated data into a data warehouse. They considered building a cross-data-source query engine (referencing IBIS engine as an example) but chose not to pursue this initially, focusing instead on working with single data warehouses that aggregate data from multiple sources.

Competitive Landscape

The competitive landscape includes major data platform vendors building their own agent capabilities:

Wobby differentiates by building agent-first semantic layer capabilities with interoperability across multiple platforms, rather than being locked into a single data platform vendor’s ecosystem.

Recommendations for New Startups

When asked what advice they’d give to a new startup designing data systems from scratch to be agent-ready, Quinton emphasized:

He referenced Dave Knemeyer’s work on enterprise ontology, noting there’s a spectrum from very formal business ontology with RDF mappings to more lightweight YAML representations that developers find practical. The right balance depends on the use case, but having some formal representation of business semantics is crucial.

Deployment and Production Realities

The case study is refreshingly honest about the realities of production deployment. Wobby acknowledges that it’s “not a walk in the park”—you can’t just give business teams an agent and expect success. It requires:

The vision of answering queries in “under one minute” is aspirational and requires alignment across multiple teams with different responsibilities—data engineering, data governance, business analysts. Wobby positions themselves as facilitating this alignment, helping teams think through their data organization and agent deployment together.

The user adoption curve shown in the presentation (linear growth over time with two different analytics agents) demonstrates real production usage in logistics and finance analytics contexts, with agents being assigned to increasing numbers of tasks over time.

Philosophical Approach

Throughout the presentation, Quinton emphasizes pragmatism over dogma. He’s willing to challenge conventional wisdom (evals over tests, workflows over agents) when their production experience suggests different tradeoffs. At the same time, he acknowledges these are domain-specific insights for their particular use case (business intelligence and analytics), and different domains might reach different conclusions.

The emphasis on “clean code” principles, stable interfaces, comprehensive testing, and gradual iteration reflects software engineering discipline applied to the relatively new field of production LLM applications. Rather than chasing the latest trends (multi-agent networks, MCP servers, trending data formats), the team focuses on what actually works in production for their specific problem domain.

This balanced, experience-based perspective makes the case study particularly valuable for teams building production LLM applications, offering both technical specifics and higher-level strategic guidance grounded in real deployment challenges.

More Like This

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.

healthcare fraud_detection customer_support +90

Reinforcement Learning for Code Generation and Agent-Based Development Tools

Cursor 2025

This case study examines Cursor's implementation of reinforcement learning (RL) for training coding models and agents in production environments. The team discusses the unique challenges of applying RL to code generation compared to other domains like mathematics, including handling larger action spaces, multi-step tool calling processes, and developing reward signals that capture real-world usage patterns. They explore various technical approaches including test-based rewards, process reward models, and infrastructure optimizations for handling long context windows and high-throughput inference during RL training, while working toward more human-centric evaluation metrics beyond traditional test coverage.

code_generation code_interpretation data_analysis +61

Building Economic Infrastructure for AI with Foundation Models and Agentic Commerce

Stripe 2025

Stripe, processing approximately 1.3% of global GDP, has evolved from traditional ML-based fraud detection to deploying transformer-based foundation models for payments that process every transaction in under 100ms. The company built a domain-specific foundation model treating charges as tokens and behavior sequences as context windows, ingesting tens of billions of transactions to power fraud detection, improving card-testing detection from 59% to 97% accuracy for large merchants. Stripe also launched the Agentic Commerce Protocol (ACP) jointly with OpenAI to standardize how agents discover and purchase from merchant catalogs, complemented by internal AI adoption reaching 8,500 employees daily using LLM tools, with 65-70% of engineers using AI coding assistants and achieving significant productivity gains like reducing payment method integrations from 2 months to 2 weeks.

fraud_detection chatbot code_generation +57