ZenML

Building and Managing Production Agents with Testing and Evaluation Infrastructure

Nearpod 2023

Nearpod, an edtech company, implemented a sophisticated agent-based architecture to help teachers generate educational content. They developed a framework for building, testing, and deploying AI agents with robust evaluation capabilities, ensuring 98-100% accuracy while managing costs. The system includes specialized agents for different tasks, an agent registry for reuse across teams, and extensive testing infrastructure to ensure reliable production deployment of non-deterministic systems.

Industry

Education

Overview

Nearpod is a K-12 EdTech company operating globally that has embarked on a significant journey into LLM-powered agent systems. Zach Wallace, an engineering manager at the company, shares insights from their transition from building a robust data platform to implementing production-grade multi-agent systems. The case study illustrates how the foundational data infrastructure work enabled the team to build reliable, cost-effective AI agents for educational content generation.

The Data Platform Foundation

Before diving into agents, Nearpod invested significant effort in building a unified data platform. This foundational work proved essential for their subsequent LLM operations. The company faced challenges with data scattered across approximately 20 disparate sources, including Aurora DB and DynamoDB instances, spanning millions to billions of rows.

Their solution leveraged several key technologies to unify these sources into a single platform.

The team built what they call a “data product exchange,” which represents the underpinning of a data mesh architecture. They define a data product as the intersection of data and its definition—for example, user usage patterns aggregated from login times and application interactions. This data platform became crucial for agent quality because, as Wallace notes, “without data [agents are] not going to have the quality associated that you need to provide reliable and confident answers.”
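The data-product idea above — data paired with its definition — can be sketched as a small structure. This is an illustrative sketch only; the class, field names, and the login-aggregation example are hypothetical, not Nearpod's actual schema.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class DataProduct:
    """A data product: the intersection of data and its definition."""
    name: str
    definition: str               # human-readable meaning of the data
    compute: Callable[..., Any]   # how the product is derived from raw sources

def login_counts(events: list[dict]) -> dict[str, int]:
    """Aggregate raw login events into per-user usage counts."""
    usage: dict[str, int] = {}
    for e in events:
        usage[e["user_id"]] = usage.get(e["user_id"], 0) + 1
    return usage

usage_product = DataProduct(
    name="user_usage",
    definition="Login counts per user, aggregated from raw login events",
    compute=login_counts,
)

events = [{"user_id": "t1"}, {"user_id": "t1"}, {"user_id": "t2"}]
print(usage_product.compute(events))  # {'t1': 2, 't2': 1}
```

Publishing such products with an explicit `definition` is what lets downstream agents consume them with confidence about what the numbers mean.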

Agent Architecture and Use Case

The primary use case driving Nearpod’s agent development is question generation for teachers. The edtech context presents unique challenges: considerations for students, parents, state legislation, international cultural sensitivities, and language nuances. Wallace describes LLMs as being good at translation but struggling with “trans-adaptations”—culturally appropriate adaptations of content.

The team approaches agent development with what Wallace calls the “three-year-old consultant” metaphor. Agents start with limited capability and require significant input from subject matter experts to reach production quality. This framing helps set appropriate expectations: the initial proof of concept takes only about 7 hours to build (compared to 60+ days using traditional methods), but achieving production quality requires substantial iteration with domain experts.

Their agent architecture is built from several specialized components.

The team explicitly chose multi-agent architectures over single monolithic agents for several reasons. Specialized agents have smaller token requirements, leading to faster processing and lower costs. They’re also easier to understand, debug, and adjust. Wallace emphasizes that breaking down problems into domain-specific agents follows the principle of being a “master of one” rather than a “jack of all trades.”
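The decomposition can be illustrated by chaining two narrow agents where a monolith would otherwise handle both jobs. The agent names, prompts, and the stubbed `run` method below are hypothetical; a real agent would send its small system prompt plus the task to a model endpoint.

```python
class Agent:
    """A specialized agent with a narrow system prompt (LLM call stubbed)."""
    def __init__(self, name: str, system_prompt: str):
        self.name = name
        self.system_prompt = system_prompt  # small prompt -> fewer tokens per call

    def run(self, task: str) -> str:
        # Stand-in for an LLM call using self.system_prompt + task.
        return f"[{self.name}] handled: {task}"

question_writer = Agent("question_writer", "Write one grade-appropriate question.")
trans_adapter = Agent("trans_adapter", "Adapt content culturally, not just linguistically.")

def generate_question(topic: str, locale: str) -> str:
    """Compose the two specialists: draft a question, then trans-adapt it."""
    draft = question_writer.run(f"topic={topic}")
    return trans_adapter.run(f"{draft} locale={locale}")

print(generate_question("fractions", "es-MX"))
```

Because each agent carries only its own small prompt, the per-call token count stays low, and each stage can be debugged or swapped independently.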

Agent Registry and Cross-Departmental Collaboration

One of the most innovative aspects of Nearpod’s approach is their agent registry. This registry allows agents to be discovered, shared, and composed across the organization. Engineers from any of the company’s 12-13 teams can browse available agents and assemble them into new product features.

This architecture fundamentally changes organizational dynamics. Wallace describes how previously isolated departments—curriculum development, legal, sales, marketing—now collaborate much more closely with engineering. The rapid proof-of-concept cycle (hours rather than months) enables stakeholders to review working prototypes and provide actionable feedback early. As Wallace puts it, engineers can now say “I’m a three-year-old [in your domain], so what did I not understand from this language gap?”

The registry model also addresses a common agent anti-pattern: recreating similar functionality across different teams. By centralizing agents, the organization avoids duplication and enables reuse of validated, tested components.
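A minimal sketch of such a registry might look like the following — the class and method names are hypothetical, and real entries would carry metadata like owners, validation status, and cost estimates rather than bare callables.

```python
from typing import Callable

class AgentRegistry:
    """Register validated agents once; let any team discover and compose them."""
    def __init__(self):
        self._agents: dict[str, Callable[[str], str]] = {}

    def register(self, name: str, agent: Callable[[str], str]) -> None:
        if name in self._agents:
            raise ValueError(f"agent '{name}' already registered")
        self._agents[name] = agent

    def get(self, name: str) -> Callable[[str], str]:
        return self._agents[name]

    def compose(self, *names: str) -> Callable[[str], str]:
        """Pipe the output of each named agent into the next."""
        def pipeline(text: str) -> str:
            for n in names:
                text = self._agents[n](text)
            return text
        return pipeline

registry = AgentRegistry()
registry.register("summarize", lambda t: f"summary({t})")
registry.register("translate", lambda t: f"translated({t})")

feature = registry.compose("summarize", "translate")
print(feature("lesson"))  # translated(summary(lesson))
```

The duplicate-name guard in `register` is the code-level expression of the anti-pattern fix: one validated agent per capability, reused rather than rebuilt.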

Evaluation Framework

Nearpod built a custom evaluation framework that serves as the “single source of truth” for their agent systems.

The evaluation framework was built specifically to address gaps in available tooling. While Python engineers could use OpenAI’s eval framework directly, TypeScript engineers (common in Nearpod’s stack) lacked similar tooling. The custom framework supports both languages and integrates with CI/CD pipelines for automated testing.
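The CI/CD integration can be sketched as an evaluation gate: run the agent over labeled cases and fail the build if accuracy drops below a threshold such as the 98% bar mentioned above. This is a hypothetical sketch — the substring check stands in for whatever graders or model-based scoring the real framework uses.

```python
from typing import Callable

def evaluate(agent: Callable[[str], str],
             cases: list[tuple[str, str]],
             threshold: float = 0.98) -> bool:
    """Score an agent over (prompt, expected) cases; gate on a pass rate."""
    passed = sum(1 for prompt, expected in cases if expected in agent(prompt))
    accuracy = passed / len(cases)
    print(f"accuracy: {accuracy:.2%}")
    return accuracy >= threshold

# Trivial agent and cases for illustration only.
echo_agent = lambda prompt: f"answer: {prompt}"
cases = [("2+2", "2+2"), ("capital of France", "France")]

# In CI, a False result would fail the pipeline (e.g. sys.exit(1)).
print(evaluate(echo_agent, cases))
```

Running the same gate on every commit is what makes deploying a non-deterministic system tolerable: regressions in agent quality surface before release, not after.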

Wallace is candid about the time distribution in agent development: building the initial POC takes only 10-20% of total effort, while 80-90% goes into debugging false positives, prompt tuning, and quality assessment. This insight is valuable for teams planning agent development timelines.

Cost Optimization

Cost management is deeply integrated into Nearpod’s agent operations. Their evaluation framework includes cost estimation capabilities that approximate production costs.

This allows teams to see projected costs before deploying new agents or agent compositions. Wallace notes they’ve achieved high accuracy (98-100%) at lower cost by using cheaper models.

The cost visibility extends to the agent registry, where teams can estimate the expense of combining multiple agents for a new product feature before committing to production deployment.
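A cost projection of this kind reduces to token counts times per-token prices times call volume. The sketch below is illustrative: the model names and the prices in `PRICE_PER_1K` are placeholders, not real rates.

```python
# USD per 1K tokens (input, output) — placeholder values, not real pricing.
PRICE_PER_1K = {
    "small-model": (0.0005, 0.0015),
    "large-model": (0.0050, 0.0150),
}

def monthly_cost(model: str, in_tokens: int, out_tokens: int,
                 calls_per_month: int) -> float:
    """Project monthly spend from average per-call token counts."""
    p_in, p_out = PRICE_PER_1K[model]
    per_call = in_tokens / 1000 * p_in + out_tokens / 1000 * p_out
    return per_call * calls_per_month

# Specialized agents keep per-call token counts small, so a cheaper model
# at the same volume costs a fraction of a large one:
print(round(monthly_cost("small-model", 500, 300, 100_000), 2))
print(round(monthly_cost("large-model", 2000, 800, 100_000), 2))
```

Attaching a function like this to each registry entry is what lets a team price a composition of agents before committing it to production.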

Production Deployment

Deploying non-deterministic agents to production requires accepting higher risk than traditional software. Nearpod manages that risk through its testing and evaluation infrastructure.

Wallace references the Air Canada chatbot incident as an example of what can go wrong, emphasizing that teams must accept some level of uncertainty with LLM-based systems while minimizing risk through comprehensive testing.

Deterministic vs. Non-Deterministic Orchestration

The team uses both deterministic and non-deterministic orchestration patterns depending on the use case. Non-deterministic orchestration allows an orchestrating agent to decide which subordinate agents to invoke based on context. For example, in an e-commerce analogy Wallace provides, an orchestrator might call a user interests agent, a business needs agent, or both, depending on the situation.

This flexibility comes with trade-offs. Non-deterministic flows are harder to predict and debug but offer greater adaptability. The team has learned to carefully consider where non-determinism adds value versus where simpler deterministic flows suffice.
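The contrast between the two patterns can be shown side by side. In this hypothetical sketch the router is a plain function; in a real non-deterministic flow, an orchestrating LLM would make that choice from context.

```python
from typing import Callable

def user_interests_agent(ctx: dict) -> str:
    return f"interests for {ctx['user']}"

def business_needs_agent(ctx: dict) -> str:
    return f"business needs for {ctx['user']}"

def deterministic_flow(ctx: dict) -> list[str]:
    """Always runs the same agents in the same order — predictable, easy to debug."""
    return [user_interests_agent(ctx), business_needs_agent(ctx)]

def route(ctx: dict) -> list[Callable[[dict], str]]:
    """Stand-in for an orchestrating LLM choosing which agents to invoke."""
    agents = []
    if ctx.get("has_history"):
        agents.append(user_interests_agent)
    if ctx.get("is_campaign"):
        agents.append(business_needs_agent)
    return agents

def nondeterministic_flow(ctx: dict) -> list[str]:
    """Runs whichever agents the router selects — adaptable, harder to predict."""
    return [agent(ctx) for agent in route(ctx)]

print(deterministic_flow({"user": "u1"}))
print(nondeterministic_flow({"user": "u1", "has_history": True}))
```

The deterministic flow's output is fully determined by its input, which is why the team defaults to it unless the adaptability of routing genuinely earns its debugging cost.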

Feedback Loops and Continuous Improvement

While acknowledging that hosted services like OpenAI’s don’t support direct retraining, Nearpod builds feedback loops around its agents in production.

This creates a virtuous cycle where agent performance data feeds back into improvements, leveraging the data platform infrastructure built earlier.
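One way to picture this loop: log production outcomes to the data platform, then mine them to refine prompts and examples. The log table, function names, and acceptance metric below are hypothetical.

```python
import time

feedback_log: list[dict] = []  # stand-in for a data-platform table

def record_feedback(agent: str, prompt: str, output: str, accepted: bool) -> None:
    """Log whether a teacher accepted an agent's output."""
    feedback_log.append({
        "agent": agent, "prompt": prompt, "output": output,
        "accepted": accepted, "ts": time.time(),
    })

def acceptance_rate(agent: str) -> float:
    """Mine the log for a quality signal that drives prompt iteration."""
    rows = [r for r in feedback_log if r["agent"] == agent]
    return sum(r["accepted"] for r in rows) / len(rows)

record_feedback("question_gen", "fractions quiz", "Q1 ...", True)
record_feedback("question_gen", "fractions quiz", "Q2 ...", False)
print(acceptance_rate("question_gen"))  # 0.5
```

Since the model itself cannot be retrained, the acceptance rate steers what the team can change: prompts, few-shot examples, and agent composition.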

Organizational Implications

Perhaps the most significant insight from Nearpod’s experience is how agent development changes organizational dynamics. The speed of prototyping (hours instead of months) changes how teams plan, staff, and collaborate on new features.

Wallace predicts that agents may eventually enable teams to deprecate legacy codebases that have been technically challenging to maintain, replacing them with more flexible agent-based implementations. This represents a fundamental shift in how software organizations might evolve their systems over time.

Critical Assessment

While the case study presents an optimistic view of agent development, some of its claims warrant scrutiny.

Nevertheless, Nearpod’s approach demonstrates thoughtful integration of data infrastructure, evaluation frameworks, and organizational change management in deploying LLM-based agents at scale.
