Company
Zapier
Title
Iterative Development Process for Production AI Features
Industry
Tech
Year
2024
Summary (short)
Zapier's journey in developing and deploying AI products demonstrates a pragmatic, iterative approach to LLMOps. Their methodology focuses on rapid prototyping with advanced models like GPT-4 Turbo and Claude Opus, followed by quick deployment of initial versions (even with sub-50% accuracy), systematic collection of user feedback, and establishment of comprehensive evaluation frameworks. This approach has enabled them to improve their AI products from sub-50% to over 90% accuracy within 2-3 months, while successfully managing costs and maintaining product quality.
## Overview

Zapier is described as the world's leading workflow automation platform, connecting over 6,000 apps and serving developers and non-developers at more than 2 million companies. The company was an early adopter of generative AI, launching AI features as early as 2022 and continuing to expand their AI capabilities with products like custom AI chatbots, Zapier Central (AI bots), AI Actions, and semantic search features.

This case study, published in partnership with Braintrust (an AI evaluation platform), outlines Zapier's methodology for taking AI products from initial concept to production-ready deployments. It's worth noting that this content originates from Braintrust's blog and is essentially a promotional piece highlighting how Zapier uses their platform. While the insights shared are valuable, readers should be aware that the case study emphasizes Braintrust's role in the process and may not represent the complete picture of Zapier's AI development infrastructure.

## The Seven-Step Development Process

Zapier's approach to building production-ready AI products follows a structured seven-step methodology that emphasizes rapid iteration, user feedback, and systematic evaluation.

### Prototyping and Initial Validation

The process begins with quickly validating whether an AI feature idea is feasible with existing models. During this phase, the team focuses exclusively on using the most capable (though expensive and slow) models available, specifically GPT-4 Turbo and Claude Opus. The rationale is clear: if the smartest models cannot accomplish the task, there's no point in proceeding. The team rapidly cycles through different prompts and examples to develop an intuition for what works and what doesn't.

This approach to prototyping reflects a pragmatic understanding of LLM capabilities. By starting with frontier models, teams can quickly determine the ceiling of what's possible before investing significant engineering effort. Braintrust's playground environment is mentioned as the tool of choice for this experimentation phase, allowing quick testing of various prompts and model configurations.
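The case study itself contains no code, but the prototyping step is easy to make concrete. The minimal sketch below (not from Zapier or Braintrust) sends the same candidate prompt to GPT-4 Turbo and Claude Opus so the outputs can be compared by hand, playground-style. It assumes the OpenAI and Anthropic Python SDKs with API keys in the environment; the system prompt and test inputs are invented placeholders.

```python
# Minimal sketch of the prototyping step: run one candidate prompt against
# two frontier models and compare outputs by hand ("vibes" checking).
# Assumes `pip install openai anthropic` and OPENAI_API_KEY / ANTHROPIC_API_KEY.
from openai import OpenAI
from anthropic import Anthropic

SYSTEM_PROMPT = "You suggest a short, human-readable name for an automation workflow."
TEST_INPUTS = [  # placeholder examples, not real customer data
    "Trigger: new Stripe payment. Action: add row to Google Sheets.",
    "Trigger: new Gmail attachment. Action: upload file to Dropbox.",
]

openai_client = OpenAI()
anthropic_client = Anthropic()

def ask_gpt4_turbo(user_input: str) -> str:
    resp = openai_client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_input},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content

def ask_claude_opus(user_input: str) -> str:
    resp = anthropic_client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=256,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": user_input}],
    )
    return resp.content[0].text

for example in TEST_INPUTS:
    print(f"--- {example}")
    print("gpt-4-turbo :", ask_gpt4_turbo(example))
    print("claude-opus :", ask_claude_opus(example))
```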
### Shipping Early with Acceptable Quality Thresholds

One of the more notable aspects of Zapier's methodology is their willingness to ship v1 products with sub-50% accuracy. This may seem counterintuitive, but the reasoning is sound: getting real users to interact with the product is the fastest way to understand actual usage patterns and collect diverse inputs that will inform future improvements. The team acknowledges that at this early stage, "vibes"—essentially having team members manually sanity-check outputs—is sufficient for making progress. This pragmatic approach recognizes that perfect is the enemy of good when it comes to AI product development.

However, Zapier employs several risk mitigation strategies when shipping early versions:

- Labeling features as "beta" to set appropriate user expectations
- Shipping initially to internal users before external release
- Limiting availability to small subsegments of external users
- Making features opt-in rather than default
- Keeping humans in the loop through suggestion-based interfaces

This staged rollout approach is a common pattern in LLMOps, allowing teams to gather real-world data while limiting potential negative impacts from errors.

### User Feedback Collection

After shipping v1, Zapier obsessively collects every piece of feedback available. This includes both explicit feedback mechanisms (thumbs up/down ratings, star ratings) and implicit signals (error rates, whether users accepted suggestions or asked follow-up questions). The team generally weights explicit feedback more heavily, which makes sense: explicit signals represent conscious user evaluations rather than potentially ambiguous behavioral cues.

This growing collection of examples with associated feedback becomes the foundation for future improvements. Negative feedback helps identify specific areas requiring improvement, while positive feedback validates successful patterns.

### Establishing Evaluations

The establishment of a robust evaluation framework is presented as a critical inflection point in the development process. The text describes evaluations as a "natural-language test suite" that accurately scores applications across a broad set of examples. The effectiveness of these evaluations depends on having test sets that are both diverse and reflective of real customer usage patterns.

Zapier bootstraps their test sets by leveraging customer examples collected in the feedback phase. Both positive and negative feedback contribute value:

- Positive feedback provides input/output pairs that serve as examples of desired behavior
- Negative feedback, once corrected, becomes test cases representing areas where the product previously struggled

Over time, these examples are consolidated into "golden datasets" that benchmark performance and prevent regressions. This approach to building evaluation datasets from real user interactions is particularly valuable because it ensures the test suite reflects actual usage rather than synthetic scenarios that may not represent production conditions.

The workflow described involves logging user interactions, analyzing logs, tracking customer feedback, filtering based on that feedback, and adding interesting logs directly to test sets. Dataset versioning is also mentioned as an important capability, addressing the challenge of managing evolving test sets over time.
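The case study doesn't show what this loop looks like in code, so the following framework-agnostic sketch illustrates the shape of it: feedback-labeled production logs are filtered into a golden dataset, and a candidate prompt or model is scored against that dataset before shipping. This is not Braintrust's SDK; the log format, field names, and exact-match scoring rule are invented for illustration.

```python
# Framework-agnostic sketch of the eval loop described above: feedback-labeled
# production logs become a golden dataset, and each change is scored against it.
# Field names, the log format, and the scoring rule are invented placeholders.
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenExample:
    input: str
    expected: str   # thumbs-up output, or a corrected thumbs-down output
    source: str     # "positive_feedback" | "corrected_negative_feedback"

def build_golden_dataset(log_path: str) -> list[GoldenExample]:
    """Filter logged interactions by user feedback into a test set."""
    examples = []
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            if record["feedback"] == "thumbs_up":
                examples.append(GoldenExample(record["input"], record["output"],
                                              "positive_feedback"))
            elif record["feedback"] == "thumbs_down" and record.get("corrected_output"):
                examples.append(GoldenExample(record["input"], record["corrected_output"],
                                              "corrected_negative_feedback"))
    return examples

def run_eval(task: Callable[[str], str], dataset: list[GoldenExample]) -> float:
    """Score a candidate prompt/model ("task") against the golden dataset."""
    passed = 0
    for ex in dataset:
        output = task(ex.input)
        # Simplistic pass/fail check; real evals would use rubric- or
        # model-graded scoring rather than exact string matching.
        if output.strip().lower() == ex.expected.strip().lower():
            passed += 1
    return passed / len(dataset)

# Example: compare two prompt versions before shipping a change.
# accuracy_v1 = run_eval(task_v1, build_golden_dataset("logs/2024-05.jsonl"))
# accuracy_v2 = run_eval(task_v2, build_golden_dataset("logs/2024-05.jsonl"))
```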
### Iterative Improvement

With the evaluation framework in place, teams can confidently iterate to improve product quality. The ability to immediately test whether an update moved the product in the right direction creates a virtuous cycle: absorb customer feedback, make changes, test the changes, and ship with confidence. The case study claims impressive results: **Zapier has improved many of their AI products from sub-50% accuracy to 90%+ within 2-3 months** using this feedback loop. As accuracy increases, product availability and capabilities can expand: shipping to more users and allowing AI systems to take more independent actions.

These accuracy improvement claims are notable, though the text doesn't provide detailed methodology for how accuracy is measured or which specific products achieved these results. Readers should interpret these figures as illustrative rather than rigorously documented benchmarks.

### Optimization

The final phase focuses on optimizing cost and latency. By this point, the team has a robust set of evaluations that makes it straightforward to benchmark how swapping in cheaper or smaller models impacts product accuracy (a sketch of this kind of benchmark appears at the end of this writeup). This approach—start with expensive frontier models, establish quality baselines, then optimize—is increasingly common in LLMOps as it allows teams to make data-driven decisions about cost-quality tradeoffs.

The text notes that optimization can come earlier if the product is prohibitively expensive with frontier models or requires very low latency. This flexibility acknowledges that business constraints sometimes require cost optimization before ideal quality levels are achieved.

## Tools and Technologies

The case study prominently features Braintrust as the evaluation platform of choice. Specific capabilities mentioned include:

- Playground environments for prompt testing and model comparison
- Logging and analysis of user interactions
- Customer feedback tracking and filtering
- Dataset management and versioning
- Evaluation running and results analysis
- Comparison views showing which examples improved or degraded between versions

Models mentioned include GPT-4 Turbo, GPT-4o, and Claude Opus, all of which are frontier models used during the prototyping and initial development phases.

## Critical Assessment

While this case study provides valuable insights into Zapier's AI development methodology, several caveats are worth noting.

The content is published by Braintrust and naturally emphasizes their platform's role in the process. The actual complexity of Zapier's AI infrastructure likely extends beyond what's described here, potentially including additional tools, custom solutions, and engineering practices not mentioned.

The accuracy improvement claims (sub-50% to 90%+) are impressive but lack methodological detail. Questions remain about how accuracy is defined, whether these figures represent user satisfaction, task completion, or some other metric, and which specific products achieved these results.

The seven-step process is presented as somewhat linear, but real-world AI development is often more iterative and messy, with steps overlapping and teams cycling back to earlier phases as needed.

Despite these caveats, the case study offers a practical framework for teams looking to operationalize LLM-based products. The emphasis on shipping early, collecting real user feedback, and building evaluation datasets from production data reflects emerging best practices in the LLMOps community. The staged rollout strategies and risk mitigation approaches provide actionable guidance for teams concerned about the risks of deploying imperfect AI systems.
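As a closing illustration of the optimization step referenced above, the sketch below benchmarks a cheaper model against the frontier baseline on the same golden dataset, comparing accuracy and rough token usage before any swap. It assumes the OpenAI Python SDK; the model pairing (GPT-4 Turbo vs. GPT-4o), dataset path, and exact-match scoring are illustrative assumptions, not details from the case study.

```python
# Sketch of the optimization step: benchmark a cheaper model against the
# frontier baseline on the same golden dataset before swapping it in.
# Assumes `pip install openai` and OPENAI_API_KEY; the dataset format and
# exact-match scoring are simplifying assumptions for illustration.
import json
from openai import OpenAI

client = OpenAI()
SYSTEM_PROMPT = "You suggest a short, human-readable name for an automation workflow."

def run_model(model: str, user_input: str) -> tuple[str, int]:
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_input},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content, resp.usage.total_tokens

def benchmark(model: str, golden_path: str) -> None:
    correct, total_tokens, examples = 0, 0, 0
    with open(golden_path) as f:
        for line in f:
            case = json.loads(line)  # {"input": ..., "expected": ...}
            output, tokens = run_model(model, case["input"])
            correct += int(output.strip().lower() == case["expected"].strip().lower())
            total_tokens += tokens
            examples += 1
    print(f"{model}: accuracy={correct / examples:.0%}, avg_tokens={total_tokens / examples:.0f}")

# Compare the frontier baseline with a cheaper candidate on the golden dataset;
# only swap models if accuracy stays above the bar set during earlier iterations.
for candidate in ("gpt-4-turbo", "gpt-4o"):
    benchmark(candidate, "datasets/golden_v3.jsonl")
```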
