Building a Rust-Based AI Agentic Framework for Multimodal Data Quality Monitoring

Zectonal 2024

Zectonal, a data quality monitoring company, developed a custom AI agentic framework in Rust to scale their multimodal data inspection capabilities beyond traditional rules-based approaches. The framework enables specialized AI agents to autonomously call diagnostic function tools for detecting defects, errors, and anomalous conditions in large datasets, while providing full audit trails through "Agent Provenance" tracking. The system supports multiple LLM providers (OpenAI, Anthropic, Ollama) and can operate both online and on-premise, packaged as a single binary executable that the company refers to as their "genie-in-a-binary."

Industry

Tech

Company Overview and Use Case

Zectonal is a software company specializing in characterizing and monitoring multimodal datasets for defects, errors, poisoned data, and other anomalous conditions that deviate from defined baselines. Their core mission is to detect and prevent bad data from polluting data lakes and causing faulty analysis in business intelligence and AI decision-making tools. The company has developed what they call “Zectonal Deep Data Inspection” - a set of proprietary algorithms designed to find anomalous characteristics inside multimodal files in large data stores, data pipelines, or data lakes at sub-second speeds.

The company’s journey into LLMOps began with a simple GPT-3 chat interface for answering basic questions about data formats and structures. However, they recognized early on that the real value would come from applying then-emerging AI capabilities to enhance their core data quality monitoring functionality. This vision led them to develop a sophisticated AI agentic framework built entirely in Rust.

Technical Architecture and LLMOps Implementation

Rust-Based Agentic Framework

Zectonal made the strategic decision to build their AI agentic framework in Rust rather than using existing Python-based solutions like LangChain or LlamaIndex. This choice was driven by several factors: their previous negative experiences with Python’s performance limitations (particularly the Global Interpreter Lock), the need for blazing fast performance when processing large datasets, and Rust’s memory safety guarantees that eliminate entire classes of runtime bugs and vulnerabilities.

The framework operates as a single compiled binary executable that the company calls their “genie-in-a-binary.” This approach provides several advantages for production deployment: simplified distribution, consistent performance across environments, and the ability to maintain what they term “Agent Provenance” - a complete audit trail of all agent communications and decision-making processes.

Function Tool Calling Architecture

The core innovation in Zectonal’s LLMOps approach is their implementation of LLM function tool calling to replace traditional rules-based analysis. Instead of pre-defining rules for when to execute specific diagnostic algorithms, they allow LLMs to dynamically decide which function tools to call based on the characteristics of the data being analyzed. This represents a significant scaling improvement as the number of supported data sources and file codecs has grown.

The framework supports what they call “needle-in-a-haystack” algorithms - specialized diagnostic capabilities designed to detect anomalous characteristics within multimodal files. For example, the system can detect malformed text inside a single cell in a spreadsheet with millions of rows and columns at sub-second speed, and can analyze similar files flowing through data pipelines continuously on a 24/7 basis.
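
To make the mechanism concrete, the sketch below shows what LLM-driven tool dispatch can look like in Rust. It is a minimal illustration, not Zectonal's actual code: the tool name, the `detect_malformed_cells` function, and the JSON shape are assumptions modeled on OpenAI-style tool calling, and it relies on the `serde` and `serde_json` crates.

```rust
// Hypothetical sketch of LLM-driven tool dispatch (names are illustrative,
// not Zectonal's actual API). The model returns its chosen tool as JSON;
// the framework routes it to the matching diagnostic function instead of
// consulting a hand-written rules table.
use serde::Deserialize;
use serde_json::Value;

#[derive(Deserialize)]
struct ToolCall {
    name: String,     // tool chosen by the LLM
    arguments: Value, // JSON arguments supplied by the LLM
}

fn detect_malformed_cells(args: &Value) -> Result<String, String> {
    let path = args["path"].as_str().ok_or("missing `path` argument")?;
    // ... the sub-second "needle-in-a-haystack" scan would run here ...
    Ok(format!("scanned {path}: no malformed cells found"))
}

fn dispatch(call: &ToolCall) -> Result<String, String> {
    match call.name.as_str() {
        "detect_malformed_cells" => detect_malformed_cells(&call.arguments),
        // Each new codec or data source registers another arm here.
        other => Err(format!("model requested unknown tool `{other}`")),
    }
}

fn main() {
    // Simulated model output, as parsed from a tool-calling response.
    let raw = r#"{"name":"detect_malformed_cells","arguments":{"path":"s3://lake/orders.csv"}}"#;
    let call: ToolCall = serde_json::from_str(raw).expect("valid tool call JSON");
    println!("{:?}", dispatch(&call));
}
```

The key inversion relative to a rules engine is that the match arm is selected by the model's output rather than by hand-written conditions on the data, which is what lets the approach scale as new codecs are added.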

Multi-Provider LLM Support

One of the key architectural decisions was to build an extensible framework that supports multiple LLM providers rather than being locked into a single vendor. The system currently integrates with OpenAI and Anthropic for hosted models, and with Ollama for local, on-premise model serving.

This multi-provider approach allows Zectonal to implement what they call “hybrid best-of-breed” deployments, where different agents can run on different platforms based on cost, performance, and security requirements.
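
A trait-based abstraction is one natural way to express this in Rust. The sketch below is an assumption about how such a framework might be organized; the provider names come from the case study, but the trait, method signatures, and routing policy are illustrative only.

```rust
// A minimal sketch of provider abstraction behind a single trait. The
// provider names come from the case study; everything else here is an
// illustrative assumption, not Zectonal's actual API.
trait LlmProvider {
    fn name(&self) -> &'static str;
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

struct OpenAi;
struct Anthropic;
struct Ollama; // served locally, keeping sensitive data on-premise

impl LlmProvider for OpenAi {
    fn name(&self) -> &'static str { "openai" }
    fn complete(&self, _prompt: &str) -> Result<String, String> {
        Err("HTTPS call to the hosted API would go here".into())
    }
}

impl LlmProvider for Anthropic {
    fn name(&self) -> &'static str { "anthropic" }
    fn complete(&self, _prompt: &str) -> Result<String, String> {
        Err("HTTPS call to the hosted API would go here".into())
    }
}

impl LlmProvider for Ollama {
    fn name(&self) -> &'static str { "ollama" }
    fn complete(&self, _prompt: &str) -> Result<String, String> {
        Err("local HTTP call to an Ollama server would go here".into())
    }
}

/// Bind an agent to a backend by policy: sensitive workloads stay
/// on-premise, the rest can use a premium hosted model.
fn provider_for(data_is_sensitive: bool) -> Box<dyn LlmProvider> {
    if data_is_sensitive { Box::new(Ollama) } else { Box::new(OpenAi) }
}
```

Binding each agent to a `Box<dyn LlmProvider>` chosen by policy is what makes a "hybrid best-of-breed" deployment possible: the same agent logic runs unchanged whether it is backed by a hosted model or a local Ollama instance.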

Agent Provenance and Audit Trails

A critical aspect of Zectonal’s LLMOps implementation is their “Agent Provenance” system: comprehensive tracking of agent communications from initial tasking through task completion. This system addresses a common enterprise concern about AI transparency and auditability.

This audit capability has proven essential for debugging agent behavior, optimizing prompts and function tool descriptions, and fine-tuning models more efficiently. The company notes that one of their first observations when deploying AI agents was how “chatty” they are when communicating amongst themselves, making this tracking capability crucial for both operational and cost management purposes.
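
The case study does not publish the provenance schema, but an append-only log along the following lines would support both the audit and cost-management uses described above. All field and type names here are assumptions.

```rust
// Assumed shape of a provenance record: an immutable, append-only log of
// every agent message, replayable from initial tasking to completion.
use std::time::SystemTime;

#[derive(Debug)]
struct ProvenanceRecord {
    timestamp: SystemTime,
    agent_id: String,
    step: String,      // e.g. "tasked", "tool_call", "completed"
    payload: String,   // prompt, tool arguments, or result
    tokens_used: u32,  // feeds cost accounting (agents are "chatty")
}

#[derive(Default)]
struct ProvenanceLog {
    records: Vec<ProvenanceRecord>, // append-only in this sketch
}

impl ProvenanceLog {
    fn append(&mut self, agent_id: &str, step: &str, payload: &str, tokens: u32) {
        self.records.push(ProvenanceRecord {
            timestamp: SystemTime::now(),
            agent_id: agent_id.to_string(),
            step: step.to_string(),
            payload: payload.to_string(),
            tokens_used: tokens,
        });
    }

    /// Total tokens per agent: useful both for debugging chatty agents
    /// and for cost forecasting.
    fn tokens_for(&self, agent_id: &str) -> u32 {
        self.records.iter()
            .filter(|r| r.agent_id == agent_id)
            .map(|r| r.tokens_used)
            .sum()
    }
}
```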

Production Challenges and Solutions

The company has encountered and addressed several typical LLMOps challenges:

Cost Management: Token-based billing for LLM services creates unpredictable costs that are difficult to budget and forecast. Zectonal has developed configuration options and fine-tuning techniques to mitigate token usage, though they note that cost forecasting remains “no science, all art.” They’re working toward an equilibrium where some highly specialized agents run on premium LLM services while others operate on-premise via Ollama to optimize costs.
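
One plausible mitigation, sketched below under stated assumptions (the case study describes configuration options for limiting token usage but not their mechanism), is a per-agent token budget that fails over to a local model when exhausted:

```rust
// Hypothetical per-agent token budget. When the budget for a premium
// hosted model is exhausted, the call is routed to a local Ollama-served
// model instead of failing outright. Names and policy are assumptions.
struct TokenBudget {
    limit: u32, // tokens allowed on the premium provider per billing window
    spent: u32,
}

enum Route {
    Premium,   // hosted model (e.g. OpenAI or Anthropic)
    OnPremise, // local model served via Ollama
}

impl TokenBudget {
    fn route(&mut self, estimated_tokens: u32) -> Route {
        if self.spent + estimated_tokens > self.limit {
            return Route::OnPremise; // degrade gracefully, don't overspend
        }
        self.spent += estimated_tokens;
        Route::Premium
    }
}
```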

Error Detection and Reliability: Unlike traditional software where wrong function calls would typically result in clear errors, LLM-driven function calling can fail silently when the wrong tool is selected. Zectonal found that existing Python frameworks relied too heavily on human intuition to detect these issues, which they recognized as non-scalable. Their Rust framework includes better mechanisms for detecting when inappropriate tools have been called.
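
One way to catch a silently wrong tool selection is to validate the model's choice against a registry of tool specifications before executing anything. The sketch below is an assumption about how such a check might look; none of these types come from Zectonal's framework.

```rust
// Illustrative pre-execution validation of an LLM's tool choice: verify
// the tool exists, its required arguments are present, and it is actually
// applicable to the input, surfacing a typed error instead of a silent
// mis-diagnosis.
use std::collections::HashMap;

struct ToolSpec {
    required_args: Vec<&'static str>,
    accepts: fn(file_extension: &str) -> bool, // is this tool applicable?
}

fn validate(
    registry: &HashMap<&str, ToolSpec>,
    tool: &str,
    args: &HashMap<String, String>,
    file_extension: &str,
) -> Result<(), String> {
    let spec = registry.get(tool)
        .ok_or_else(|| format!("unknown tool `{tool}`"))?;
    for arg in &spec.required_args {
        if !args.contains_key(*arg) {
            return Err(format!("tool `{tool}` missing argument `{arg}`"));
        }
    }
    if !(spec.accepts)(file_extension) {
        // The model picked a real tool, but the wrong one for this input:
        // the failure mode that would otherwise pass silently.
        return Err(format!(
            "tool `{tool}` is not applicable to `.{file_extension}` files"
        ));
    }
    Ok(())
}
```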

Model Provider Dependencies: The rapid evolution of LLM capabilities means constant adaptation. The company tracks the Berkeley Function Calling Leaderboard daily and has had to adapt to changes like OpenAI’s transition from “function calling” to “tool calling” terminology and the introduction of structured outputs.

Advanced Capabilities and Future Directions

Autonomous Agent Creation: Zectonal is experimenting with meta-programming concepts applied to AI agents - autonomous agents that can create and spawn new agents based on immediate needs without human interaction. This includes scenarios where agents might determine that a specialized capability is needed (such as analyzing a new file codec) and automatically create both new agents and the tools they need to operate.

Dynamic Tool Generation: The system includes agents specifically designed to create new software tools, enabling a form of “meta AI programming” where tools can create other tools. This capability raises interesting questions about governance, security, and the potential for diminishing returns as agent populations grow.
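
The sketch below gestures at this idea in heavily simplified form: a runtime tool registry that grows when an agent encounters a codec it has no tool for. In reality, "creating a tool" would involve generating and loading new code; here a closure stands in for the generated tool, and everything shown is an illustrative assumption.

```rust
// Heavily simplified sketch of "tools creating tools": a registry that an
// agent extends at runtime when it detects a capability gap. A closure
// stands in for genuinely generated code.
use std::collections::HashMap;

type Tool = Box<dyn Fn(&str) -> String>;

struct Registry {
    tools: HashMap<String, Tool>,
}

impl Registry {
    fn handle(&mut self, codec: &str, input: &str) -> String {
        if !self.tools.contains_key(codec) {
            // An agent decides a specialized capability is needed and
            // "creates" a new tool for the unfamiliar codec.
            let name = codec.to_string();
            self.tools.insert(
                codec.to_string(),
                Box::new(move |data| {
                    format!("[{name}] inspected {} bytes", data.len())
                }),
            );
        }
        (self.tools[codec])(input)
    }
}
```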

Utility Function Optimization: The company is developing a general-purpose utility function for their agents based on the rich data collected through Agent Provenance. This includes metrics for intent, confidence, and other factors that can be used to reward useful information and penalize poor performance.
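
The case study names intent and confidence as inputs but does not publish a formula; a scalar utility over provenance-derived metrics might look like the following sketch, with entirely assumed weights.

```rust
// Assumed scoring function over provenance metrics: reward useful,
// confident output and penalize token cost. Weights are illustrative.
struct AgentMetrics {
    intent_match: f64, // how well output matched the tasking, 0..1
    confidence: f64,   // reported confidence, 0..1
    tokens_used: f64,  // cost proxy from the provenance log
}

fn utility(m: &AgentMetrics) -> f64 {
    0.6 * m.intent_match + 0.3 * m.confidence - 0.1 * (m.tokens_used / 1_000.0)
}
```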

Deployment and Operations

The production system is designed for both cloud and on-premise deployment scenarios. Zectonal deliberately chose not to be a SaaS product, instead providing software that customers can deploy in their own environments. This decision was based on their experience selling to Fortune 50 companies and their concerns about data liability and security breaches associated with external hosting.

The single binary deployment model simplifies operations significantly compared to typical Python-based ML deployments that require complex dependency management. The system can analyze data flowing through pipelines continuously, processing files every few seconds on a 24/7 basis while maintaining the performance characteristics necessary for real-time monitoring.

Technical Lessons and Insights

Several key insights emerge from Zectonal’s LLMOps implementation:

Language Choice Impact: The decision to use Rust rather than Python for their AI infrastructure has provided significant advantages in terms of performance, reliability, and deployment simplicity. While Python remains dominant in the AI/ML space, Rust’s characteristics make it particularly well-suited for production AI systems that require high performance and reliability.

Framework vs. Library Approach: Rather than assembling existing tools, building a custom framework allowed Zectonal to optimize for their specific use cases and requirements, particularly around audit trails and multi-provider support.

Hybrid Deployment Strategy: The ability to support both cloud-based and on-premise LLM providers has proven crucial for enterprise adoption, allowing customers to balance performance, cost, and security requirements.

Observability Requirements: The need for comprehensive monitoring and audit trails in production AI systems is critical, particularly for enterprise use cases where transparency and explainability are essential for adoption and compliance.

This case study demonstrates how a thoughtful approach to LLMOps architecture, with careful consideration of performance, security, auditability, and deployment flexibility, can enable sophisticated AI capabilities while meeting enterprise requirements for reliability and transparency.
