ZenML

Iterative Development Process for Production AI Features

Zapier 2024

Zapier's journey in developing and deploying AI products demonstrates a pragmatic, iterative approach to LLMOps. Their methodology focuses on rapid prototyping with advanced models like GPT-4 Turbo and Claude Opus, followed by quick deployment of initial versions (even with sub-50% accuracy), systematic collection of user feedback, and establishment of comprehensive evaluation frameworks. This approach has enabled them to improve their AI products from sub-50% to over 90% accuracy within 2-3 months, while successfully managing costs and maintaining product quality.

Industry: Tech

Overview

Zapier is described as the world’s leading workflow automation platform, connecting over 6,000 apps and serving developers and non-developers at more than 2 million companies. The company was an early adopter of generative AI, launching AI features as early as 2022 and continuing to expand their AI capabilities with products like custom AI chatbots, Zapier Central (AI bots), AI Actions, and semantic search features. This case study, published in partnership with Braintrust (an AI evaluation platform), outlines Zapier’s methodology for taking AI products from initial concept to production-ready deployments.

It’s worth noting that this content originates from Braintrust’s blog and is essentially a promotional piece highlighting how Zapier uses their platform. While the insights shared are valuable, readers should be aware that the case study emphasizes Braintrust’s role in the process and may not represent the complete picture of Zapier’s AI development infrastructure.

The Seven-Step Development Process

Zapier’s approach to building production-ready AI products follows a structured seven-step methodology that emphasizes rapid iteration, user feedback, and systematic evaluation.

Prototyping and Initial Validation

The process begins with quickly validating whether an AI feature idea is feasible with existing models. During this phase, the team focuses exclusively on using the most capable (though expensive and slow) models available—specifically GPT-4 Turbo and Claude Opus. The rationale is clear: if the smartest models cannot accomplish the task, there’s no point in proceeding. The team rapidly cycles through different prompts and examples to develop an intuition for what works and what doesn’t.

This approach to prototyping reflects a pragmatic understanding of LLM capabilities. By starting with frontier models, teams can quickly determine the ceiling of what’s possible before investing significant engineering effort. Braintrust’s playground environment is mentioned as the tool of choice for this experimentation phase, allowing quick testing of various prompts and model configurations.
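The prototyping loop described above amounts to a cross product of prompt variants and candidate models, reviewed by hand. A minimal sketch, assuming a stubbed `call_model` in place of a real provider API (the source names GPT-4 Turbo and Claude Opus, exercised through Braintrust's playground); the prompt templates here are invented for illustration:

```python
# Sketch of the prototyping loop: run every prompt variant against every
# candidate frontier model and inspect the outputs by hand. The model call
# is a placeholder; in practice it would hit a real provider API.

PROMPT_VARIANTS = [
    "Summarize what this Zap does in one sentence: {zap}",
    "Explain this automation to a non-developer: {zap}",
]
CANDIDATE_MODELS = ["gpt-4-turbo", "claude-3-opus"]

def call_model(model: str, prompt: str) -> str:
    # Placeholder for a real API call; returns a canned string for the sketch.
    return f"[{model}] {prompt[:48]}..."

def prototype(zap_description: str) -> list[tuple[str, str, str]]:
    """Return (model, template, output) triples for manual review."""
    results = []
    for model in CANDIDATE_MODELS:
        for template in PROMPT_VARIANTS:
            prompt = template.format(zap=zap_description)
            results.append((model, template, call_model(model, prompt)))
    return results

if __name__ == "__main__":
    for model, _, output in prototype("New form submission -> add spreadsheet row"):
        print(model, output)
```

The point of starting this wide is to find the capability ceiling cheaply: if no (model, prompt) pair in the grid works, the feature idea is shelved before any engineering investment.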

Shipping Early with Acceptable Quality Thresholds

One of the more notable aspects of Zapier’s methodology is their willingness to ship v1 products with sub-50% accuracy. This may seem counterintuitive, but the reasoning is sound: getting real users to interact with the product is the fastest way to understand actual usage patterns and collect diverse inputs that will inform future improvements.

The team acknowledges that at this early stage, “vibes”—essentially having team members manually sanity-check outputs—are sufficient for making progress. This pragmatic approach recognizes that perfect is the enemy of good in AI product development. Zapier does, however, mitigate the risk of shipping early versions, most notably by rolling them out to a limited audience first.

This staged rollout approach is a common pattern in LLMOps, allowing teams to gather real-world data while limiting potential negative impacts from errors.
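A staged rollout is often implemented as a deterministic hash bucket that gates which users see the v1 feature. This is a generic sketch of the pattern, not Zapier's actual implementation:

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically assign a user to one of 100 buckets and enable the
    feature for the first `percent` of them. The assignment is stable across
    sessions, so a user's experience doesn't flip-flop as the rollout grows."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent

# Widen exposure as accuracy improves, e.g. 5% -> 25% -> 100%:
enabled_users = [u for u in ("u1", "u2", "u3") if in_rollout(u, "ai-draft", 25)]
```

Because buckets are keyed on `(feature, user_id)`, raising `percent` only ever adds users; everyone already in the rollout stays in, which keeps feedback data from early cohorts comparable over time.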

User Feedback Collection

After shipping v1, Zapier obsessively collects every piece of feedback available. This includes both explicit feedback mechanisms (thumbs up/down ratings, star ratings) and implicit signals (error rates, whether users accepted suggestions or asked follow-up questions). The team generally weights explicit feedback more heavily, which makes sense as explicit signals represent conscious user evaluations rather than potentially ambiguous behavioral signals.

This growing collection of examples with associated feedback becomes the foundation for future improvements. Negative feedback helps identify specific areas requiring improvement, while positive feedback validates successful patterns.

Establishing Evaluations

The establishment of a robust evaluation framework is presented as a critical inflection point in the development process. The text describes evaluations as a “natural-language test suite” that accurately scores applications across a broad set of examples. The effectiveness of these evaluations depends on having test sets that are both diverse and reflective of real customer usage patterns.

Zapier bootstraps their test sets by leveraging customer examples collected in the feedback phase. Both positive and negative feedback contribute value: negative examples expose specific failure modes to fix and guard against, while positive examples anchor the behavior the product should preserve.

Over time, these examples are consolidated into “golden datasets” that benchmark performance and prevent regressions. This approach to building evaluation datasets from real user interactions is particularly valuable because it ensures the test suite reflects actual usage rather than synthetic scenarios that may not represent production conditions.

The workflow described involves logging user interactions, analyzing logs, tracking customer feedback, filtering based on that feedback, and directly adding interesting logs to test sets. Dataset versioning is also mentioned as an important capability, addressing the challenge of managing evolving test sets over time.
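That log-to-dataset workflow can be sketched as a filter over logged interactions, keeping only those with an explicit user judgment and tagging each by polarity. The record field names and version string are hypothetical:

```python
def build_golden_dataset(logs: list[dict], version: str) -> dict:
    """Filter logged interactions down to those with explicit feedback and
    turn them into test cases. Positive examples anchor expected behavior;
    negative examples become regression cases to fix and guard against."""
    cases = []
    for rec in logs:
        fb = rec.get("explicit_feedback")
        if fb is None:
            continue  # no conscious user judgment; leave out of the golden set
        cases.append({
            "input": rec["prompt"],
            "expected": rec["output"] if fb > 0 else None,
            "kind": "anchor" if fb > 0 else "regression",
        })
    return {"version": version, "cases": cases}

logs = [
    {"prompt": "a", "output": "good answer", "explicit_feedback": 1},
    {"prompt": "b", "output": "bad answer", "explicit_feedback": -1},
    {"prompt": "c", "output": "unrated", "explicit_feedback": None},
]
golden = build_golden_dataset(logs, version="2024-06-01")
```

Stamping each snapshot with a version string is the simplest form of the dataset versioning the text mentions: eval results stay comparable because each run records which dataset version it was scored against.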

Iterative Improvement

With the evaluation framework in place, teams can confidently iterate to improve product quality. The ability to immediately test whether an update moved the product in the right direction creates a virtuous cycle: absorb customer feedback, make changes, test changes, and ship with confidence.
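A minimal version of that ship-with-confidence check: score a candidate change against the golden dataset and compare it to the current baseline. The exact-match scorer is a deliberately simple stand-in; production evals typically use richer scorers, including model-graded ones:

```python
from typing import Callable

def run_eval(app: Callable[[str], str], dataset: list[dict],
             scorer: Callable[[str, str], float]) -> float:
    """Average score of the app over a dataset of input/expected pairs."""
    scores = [scorer(case["expected"], app(case["input"])) for case in dataset]
    return sum(scores) / len(scores)

def exact_match(expected: str, actual: str) -> float:
    return 1.0 if expected == actual else 0.0

def safe_to_ship(candidate_score: float, baseline_score: float,
                 tolerance: float = 0.0) -> bool:
    """Gate deploys: ship only if the candidate doesn't regress the baseline."""
    return candidate_score >= baseline_score - tolerance

dataset = [{"input": "2+2", "expected": "4"},
           {"input": "3+3", "expected": "6"}]
baseline = run_eval(lambda q: "4", dataset, exact_match)            # 0.5
candidate = run_eval(lambda q: str(eval(q)), dataset, exact_match)  # 1.0
```

Wiring `safe_to_ship` into CI closes the loop the text describes: every prompt or model change is scored before it reaches users, so regressions are caught against the golden dataset rather than in production.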

The case study claims impressive results: Zapier has improved many of their AI products from sub-50% accuracy to 90%+ within 2-3 months using this feedback loop. As accuracy increases, product availability and capabilities can expand—shipping to more users and allowing AI systems to take more independent actions.

These accuracy improvement claims are notable, though the text doesn’t provide detailed methodology for how accuracy is measured or which specific products achieved these results. Readers should interpret these figures as illustrative rather than rigorously documented benchmarks.

Optimization

The final phase focuses on optimizing cost and latency. By this point, the team has a robust set of evaluations that make it straightforward to benchmark how swapping in cheaper or smaller models impacts product accuracy. This approach—start with expensive frontier models, establish quality baselines, then optimize—is increasingly common in LLMOps as it allows teams to make data-driven decisions about cost-quality tradeoffs.

The text notes that optimization can come earlier if the product is prohibitively expensive with frontier models or requires very low latency. This flexibility acknowledges that business constraints sometimes require cost optimization before ideal quality levels are achieved.

Tools and Technologies

The case study prominently features Braintrust as the evaluation platform of choice. Specific capabilities mentioned include the prompt playground used during prototyping, logging of user interactions, tracking and filtering of customer feedback, dataset versioning, and evaluation runs against golden datasets.

Models mentioned include GPT-4 Turbo, GPT-4o, and Claude Opus, all of which are frontier models used during the prototyping and initial development phases.

Critical Assessment

While this case study provides valuable insights into Zapier’s AI development methodology, several caveats are worth noting:

The content is published by Braintrust and naturally emphasizes their platform’s role in the process. The actual complexity of Zapier’s AI infrastructure likely extends beyond what’s described here, potentially including additional tools, custom solutions, and engineering practices not mentioned.

The accuracy improvement claims (sub-50% to 90%+) are impressive but lack methodological detail. Questions remain about how accuracy is defined, whether these figures represent user satisfaction, task completion, or some other metric, and which specific products achieved these results.

The seven-step process is presented as somewhat linear, but real-world AI development is often more iterative and messy, with steps overlapping and teams cycling back to earlier phases as needed.

Despite these caveats, the case study offers a practical framework for teams looking to operationalize LLM-based products. The emphasis on shipping early, collecting real user feedback, and building evaluation datasets from production data reflects emerging best practices in the LLMOps community. The staged rollout strategies and risk mitigation approaches provide actionable guidance for teams concerned about the risks of deploying imperfect AI systems.

More Like This

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.

Building Production AI Agents for Enterprise HR, IT, and Finance Platform

Rippling 2025

Rippling, an enterprise platform providing HR, payroll, IT, and finance solutions, has evolved its AI strategy from simple content summarization to building complex production agents that assist administrators and employees across their entire platform. Led by Anker, their head of AI, the company has developed agents that handle payroll troubleshooting, sales briefing automation, interview transcript summarization, and talent performance calibration. They've transitioned from deterministic workflow-based approaches to more flexible deep agent paradigms, leveraging LangChain and LangSmith for development and tracing. The company maintains a dual focus: embedding AI capabilities within their product for customers running businesses on their platform, and deploying AI internally to increase productivity across all teams. Early results show promise in handling complex, context-dependent queries that traditional rule-based systems couldn't address.

AI Agents in Production: Multi-Enterprise Implementation Strategies

Canva / KPMG / Autodesk / Lightspeed 2026

This comprehensive case study examines how multiple enterprises (Autodesk, KPMG, Canva, and Lightspeed) are deploying AI agents in production to transform their go-to-market operations. The companies faced challenges around scaling AI from proof-of-concept to production, managing agent quality and accuracy, and driving adoption across diverse teams. Using the Relevance AI platform, these organizations built multi-agent systems for use cases including personalized marketing automation, customer outreach, account research, data enrichment, and sales enablement. Results include significant time savings (tasks taking hours reduced to minutes), improved pipeline generation, increased engagement rates, faster customer onboarding, and the successful scaling of AI agents across multiple departments while maintaining data security and compliance standards.
