## Overview
Zed represents a compelling case study in applying rigorous engineering practices to AI-powered software development tools. Founded by Nathan Sobo and his team, Zed is an AI-enabled code editor that distinguishes itself from competitors like VS Code by being built entirely from scratch in Rust. The system is engineered like a video game, with approximately 1,200 lines of shader programs running on the GPU and a target of rendering at 120 frames per second. Development began in the 2018-2021 period, and the team recently launched "Agentic Editing" capabilities, which presented unique challenges in maintaining their traditionally rigorous testing standards while dealing with the inherent non-determinism of large language models.
## The Testing Philosophy Challenge
Zed's development team has maintained an extremely empirical approach to software development, with tens of thousands of tests ensuring system reliability. Their testing methodology includes sophisticated concurrent testing where they simulate servers, create multiple clients, and run thousands of iterations with randomized concurrent operations to catch edge cases. This approach has historically allowed them to maintain fully deterministic testing environments without flaky CI tests. However, the introduction of LLM-powered features fundamentally changed this paradigm, as the non-deterministic nature of language model outputs made traditional deterministic testing impossible.
The team recognized that even with controlled sampling from logits, changing a single token in the input could completely alter the output, requiring a fundamental shift from deterministic to stochastic testing approaches. This represents a common challenge in LLMOps where traditional software engineering practices must be adapted to accommodate the probabilistic nature of AI systems.
## Multi-Layered Evaluation Strategy
Zed developed a sophisticated multi-layered approach to testing their agentic editing capabilities. Their strategy combines three main types of evaluations, each serving different purposes in ensuring system reliability.
The first layer consists of traditional data-driven evaluations similar to those used in machine learning, resembling frameworks like SWE-bench. These evaluations compile a headless version of Zed, check out repositories, run agents, and attempt to make them perform coding tasks. However, the team quickly discovered that when these high-level evaluations failed, diagnosing the root cause among potentially millions of failure modes became extremely difficult.
The second layer involves more programmatic, stochastic unit tests that focus on specific aspects of the system while still interacting with the LLM. These tests run hundreds of iterations (typically 100-200) and establish statistical thresholds for pass/fail criteria. For example, they might require 100% of 200 test runs to pass for a test to be considered successful. This approach allows them to maintain quality standards while accounting for the inherent variability in LLM outputs.
The third layer comprises traditional deterministic unit tests that test the algorithmic components supporting the AI features. These tests focus on parsing, matching, and text manipulation logic that can be tested deterministically, even when they're critical to handling LLM outputs correctly.
## Technical Implementation Details
### Search and Code Understanding
One of the first challenges Zed encountered was in their implementation of code search functionality. Their initial "dumb" implementation of the grep tool would show context that could confuse the model. For instance, when searching for a function definition, the model might see truncated or incomplete context that didn't provide proper syntactic boundaries. To address this, they integrated Tree-sitter, Max Brunsfeld's parsing framework, to expand search matches to proper syntactic boundaries. This ensured that when the LLM examined code, it always received complete, syntactically well-formed spans rather than partial snippets.
This improvement was driven by their stochastic testing approach, where they could identify that the agent was failing to properly understand code structure due to incomplete context. The solution involved a deterministic algorithmic improvement that was validated through continued stochastic testing.
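The expansion step itself can be sketched with Tree-sitter's Rust bindings (the `tree_sitter` crate): starting from the smallest syntax node that spans a grep match, climb the tree until a suitably complete enclosing construct is found. The size budget and climbing heuristic below are assumptions for illustration, not Zed's actual logic.

```rust
use tree_sitter::Node;

/// Expand a raw grep match (given as a byte range) outward to an enclosing
/// syntactic unit, so the model sees a complete construct instead of a
/// truncated fragment. `max_bytes` caps how far we expand.
fn expand_to_syntactic_boundaries(
    root: Node,
    match_start: usize,
    match_end: usize,
    max_bytes: usize,
) -> (usize, usize) {
    // Smallest node that spans the whole match.
    let Some(mut node) = root.descendant_for_byte_range(match_start, match_end) else {
        return (match_start, match_end);
    };

    // Climb toward the root while the enclosing node still fits the budget,
    // stopping before we swallow the entire file.
    while let Some(parent) = node.parent() {
        let parent_len = parent.end_byte() - parent.start_byte();
        if parent_len > max_bytes || parent.id() == root.id() {
            break;
        }
        node = parent;
    }

    (node.start_byte(), node.end_byte())
}
```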
### Streaming Edits Architecture
The team initially implemented editing through traditional tool calls, but discovered significant user experience issues with this approach. Tool calls don't stream well, providing all results at once rather than showing progressive updates. To address this, they developed a two-stage approach: first, a small tool call describes the intended edits, then they loop back to the same model (taking advantage of its already-populated cache) to emit "old text/new text" blocks that can be streamed to users in real time.
This streaming approach required sophisticated parsing and handling of the LLM's output format, as the model would generate structured text blocks that needed to be parsed, validated, and applied to the codebase incrementally. The challenge was ensuring robust handling of malformed, incomplete, or incorrectly structured output from the LLM.
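A minimal sketch of such an incremental parser is shown below, assuming a simple `<old_text>`/`<new_text>` tag format (the actual format and tag names Zed uses may differ): streamed chunks are appended to a buffer, and complete edit blocks are emitted as soon as both halves have arrived.

```rust
/// One completed edit extracted from the model's streamed output.
#[derive(Debug, PartialEq)]
struct EditBlock {
    old_text: String,
    new_text: String,
}

/// Incremental parser: feed it chunks as they stream in from the model and it
/// yields completed old/new pairs; incomplete blocks stay buffered.
#[derive(Default)]
struct EditStreamParser {
    buffer: String,
}

impl EditStreamParser {
    fn push_chunk(&mut self, chunk: &str) -> Vec<EditBlock> {
        self.buffer.push_str(chunk);
        let mut edits = Vec::new();
        while let Some(edit) = self.try_extract_one() {
            edits.push(edit);
        }
        edits
    }

    fn try_extract_one(&mut self) -> Option<EditBlock> {
        let old = extract_tag(&self.buffer, "old_text")?;
        let new = extract_tag(&self.buffer[old.end..], "new_text")?;
        let consumed = old.end + new.end;
        let edit = EditBlock {
            old_text: old.content,
            new_text: new.content,
        };
        // Drop the consumed prefix; anything after it belongs to later edits.
        self.buffer.replace_range(..consumed, "");
        Some(edit)
    }
}

struct TagMatch {
    content: String,
    end: usize, // byte offset just past the closing tag
}

/// Strict extraction of one `<tag>…</tag>` body; returns None if the block
/// has not fully streamed in yet.
fn extract_tag(text: &str, tag: &str) -> Option<TagMatch> {
    let open = format!("<{tag}>");
    let close = format!("</{tag}>");
    let start = text.find(&open)? + open.len();
    let end = start + text[start..].find(&close)?;
    Some(TagMatch {
        content: text[start..end].to_string(),
        end: end + close.len(),
    })
}
```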
### XML Parsing and Robustness
A significant portion of Zed's LLMOps challenges centered around parsing structured output from LLMs. They implemented extensive testing for XML tag parsing, as models would frequently generate malformed XML structures. Common issues included mismatched tags, where the model would start with an "old text" tag but end with a "new text" tag, or generate empty tags that provided no useful information for the matching algorithm.
Through prompt engineering, they were able to reduce XML tag mismatches from 40% to about 5%, but recognized that completely eliminating such errors through prompting alone was unrealistic. Instead, they built robust parsing systems that could handle malformed XML gracefully, recovering from errors and extracting meaningful information even from imperfect model outputs.
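A lenient extractor along these lines might accept a mismatched closing tag from the known vocabulary and discard empty blocks rather than failing the whole edit. The sketch below illustrates that recovery idea under the assumption that the only tags in play are `old_text` and `new_text`; it is not Zed's implementation.

```rust
/// Tolerant extraction of a tag body: accept the matching closing tag, or a
/// mismatched one from the same small vocabulary, since models sometimes open
/// `<old_text>` and close with `</new_text>` (or vice versa). Returns the
/// content and the byte offset just past the closer, or None if the block is
/// incomplete or empty.
fn extract_tag_lenient(text: &str, tag: &str) -> Option<(String, usize)> {
    let open = format!("<{tag}>");
    let start = text.find(&open)? + open.len();
    let rest = &text[start..];

    // Accept the correct closer, but fall back to any known closer so a
    // mismatched tag doesn't abort the whole edit.
    let closers = [
        format!("</{tag}>"),
        "</old_text>".to_string(),
        "</new_text>".to_string(),
    ];
    let (offset, closer) = closers
        .iter()
        .filter_map(|c| rest.find(c.as_str()).map(|i| (i, c)))
        .min_by_key(|(i, _)| *i)?;

    let content = rest[..offset].to_string();
    if content.trim().is_empty() {
        // Empty tags carry no information for the matching algorithm; skip them.
        return None;
    }
    Some((content, start + offset + closer.len()))
}
```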
### Fuzzy Matching Implementation
To handle cases where the LLM generates text that's similar but not identical to the target code, Zed implemented a fuzzy matching algorithm using dynamic programming. This system allows for approximate matching when the model's output is slightly incorrect but semantically appropriate. The fuzzy matching proved critical in reducing tool calling failures, as it could bridge the gap between what the model intended to match and what it actually generated.
The fuzzy matching algorithm itself is fully deterministic and can be tested traditionally, but its integration with LLM outputs required the stochastic testing approach to ensure it worked correctly across the wide variety of model behaviors.
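One common way to implement such a matcher is a semi-global edit-distance dynamic program, in which the needle (the model's "old text") may begin and end anywhere in the buffer and small discrepancies are tolerated up to an error budget. The sketch below is a generic version of that idea, not Zed's algorithm; it returns only the end position and cost of the best match, whereas a production version would also recover the match's start (for example via traceback).

```rust
/// Find the best approximate occurrence of `needle` inside `haystack` using a
/// semi-global edit-distance DP: edits inside the match are penalized, but the
/// match may start and end anywhere in the haystack. Returns the end index (in
/// characters) and edit distance of the best match if it is within
/// `max_error_rate` (e.g. 0.1 = up to 10% of the needle may differ).
fn fuzzy_find(haystack: &str, needle: &str, max_error_rate: f64) -> Option<(usize, usize)> {
    let haystack: Vec<char> = haystack.chars().collect();
    let needle: Vec<char> = needle.chars().collect();
    let (n, m) = (needle.len(), haystack.len());
    if n == 0 {
        return None;
    }

    // prev[j] = cost of matching needle[..i] ending at haystack position j.
    // Row 0 is all zeros: the match may begin at any haystack offset for free.
    let mut prev = vec![0usize; m + 1];
    let mut curr = vec![0usize; m + 1];

    for i in 1..=n {
        curr[0] = i; // matching needle[..i] against nothing costs i deletions
        for j in 1..=m {
            let substitution = prev[j - 1] + usize::from(needle[i - 1] != haystack[j - 1]);
            let deletion = prev[j] + 1; // a needle char missing from the haystack
            let insertion = curr[j - 1] + 1; // an extra haystack char inside the match
            curr[j] = substitution.min(deletion).min(insertion);
        }
        std::mem::swap(&mut prev, &mut curr);
    }

    // Best score over all possible end positions in the haystack.
    let (best_end, &best_cost) = prev
        .iter()
        .enumerate()
        .skip(1)
        .min_by_key(|(_, &cost)| cost)?;

    let max_cost = (n as f64 * max_error_rate).floor() as usize;
    (best_cost <= max_cost).then_some((best_end, best_cost))
}
```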
### Streaming Diff Algorithm
One of the most sophisticated technical challenges was implementing a streaming diff algorithm that could handle real-time comparisons between old and new text as the model generates output. The system needs to make dynamic decisions about whether missing text from the original content has been deleted intentionally or simply hasn't been streamed yet. This requires careful state management and predictive logic to provide smooth user experiences.
The streaming diff implementation involves comparing incoming text tokens against the existing codebase in real-time, maintaining multiple possible interpretations of the edit until enough context is available to make definitive decisions. This algorithmic complexity is fully deterministic and testable, but its integration with streaming LLM outputs required extensive stochastic validation.
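The core decision can be illustrated with a greedy, line-level sketch: old lines the stream has already skipped past are marked deleted, while lines not yet reached stay pending until the stream finishes. This is a deliberate simplification (a real implementation would also buffer the trailing partial line and use a more robust diff), intended only to show the pending-versus-deleted distinction.

```rust
/// How each line of the original text should be displayed while the model is
/// still streaming its replacement.
#[derive(Debug, PartialEq, Clone)]
enum LineState {
    Kept,    // present in both old and new text
    Deleted, // definitely removed: the new text has moved past it
    Pending, // not seen yet in the new text, but it may still arrive
}

/// Greedy line-level sketch of a streaming diff: walk the streamed new text
/// against the old text, marking old lines as kept or deleted, and leaving
/// everything beyond the current position as pending rather than deleted.
fn streaming_line_states(old_text: &str, new_text_so_far: &str, stream_done: bool) -> Vec<LineState> {
    let old_lines: Vec<&str> = old_text.lines().collect();
    let mut states = vec![LineState::Pending; old_lines.len()];
    let mut cursor = 0; // next old line we expect to encounter

    for new_line in new_text_so_far.lines() {
        // Does this streamed line correspond to an upcoming old line?
        if let Some(offset) = old_lines[cursor..].iter().position(|l| *l == new_line) {
            // Old lines we skipped over were definitely deleted.
            for state in &mut states[cursor..cursor + offset] {
                *state = LineState::Deleted;
            }
            states[cursor + offset] = LineState::Kept;
            cursor += offset + 1;
        }
        // Otherwise the line is an insertion; it doesn't change old-line state.
    }

    // Once the stream is finished, anything still pending really was deleted.
    if stream_done {
        for state in &mut states[cursor..] {
            if *state == LineState::Pending {
                *state = LineState::Deleted;
            }
        }
    }
    states
}
```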
## Prompt Engineering and Model Behavior Management
### Insertion Boundary Handling
Zed discovered that models frequently struggled with insertions at document boundaries (beginning or end of files). Models would often generate empty "old text" tags when trying to insert content at these locations, which broke their old text/new text matching logic. Through prompt engineering, they were able to significantly reduce this behavior by explicitly instructing the model about proper formatting for boundary insertions.
However, even with improved prompting, the behavior still occurred 1-2% of the time, necessitating robust handling in their parsing logic. This exemplifies a common LLMOps pattern where prompt engineering can significantly improve model behavior but rarely eliminates problematic patterns entirely, requiring system-level robustness.
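One hypothetical form this system-level robustness could take is a fallback in the edit applier: rather than rejecting an edit whose old-text block is empty, treat it as a boundary insertion. The sketch below, including the append-at-end choice, is an assumption made for illustration and not Zed's actual behavior.

```rust
/// Apply one old-text/new-text edit to a buffer. An empty old-text block is
/// treated as a pure insertion at the end of the buffer (a hypothetical
/// fallback) instead of failing the whole edit.
fn apply_edit(buffer: &mut String, old_text: &str, new_text: &str) -> Result<(), String> {
    if old_text.trim().is_empty() {
        // The model wanted an insertion but gave us nothing to anchor on.
        if !buffer.ends_with('\n') {
            buffer.push('\n');
        }
        buffer.push_str(new_text);
        return Ok(());
    }
    match buffer.find(old_text) {
        Some(start) => {
            buffer.replace_range(start..start + old_text.len(), new_text);
            Ok(())
        }
        None => Err(format!("old_text not found: {old_text:?}")),
    }
}
```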
### Indentation Normalization
A particularly interesting challenge emerged around indentation handling. Models would sometimes generate code with completely flattened indentation when trying to modify nested code structures. For example, when modifying an inner function within an outer function, the model might output the replacement text with no leading whitespace, even though the original code was properly indented.
Zed solved this through an indentation delta detection system that analyzes both the original code's indentation and the model's generated replacement, computing a normalization factor to maintain proper code structure. This system can detect when the model has provided semantically correct code with incorrect formatting and automatically adjust the indentation to match the surrounding context.
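The idea can be sketched as computing an indentation delta from the first non-empty line of each side and shifting every replacement line by that delta, preserving relative nesting. The uniform shift and space-only handling below are simplifications, not Zed's exact algorithm.

```rust
/// Compute the indentation delta between the original code and the model's
/// replacement and shift every replacement line by that delta, so flattened
/// output is restored to the surrounding nesting level.
fn normalize_indentation(original: &str, replacement: &str) -> String {
    let target = leading_indent_width(original);
    let actual = leading_indent_width(replacement);

    replacement
        .lines()
        .map(|line| {
            if line.trim().is_empty() {
                return line.to_string();
            }
            let current = line.len() - line.trim_start().len();
            if target >= actual {
                // Model under-indented (often flattened to column 0): add the delta.
                format!("{}{}", " ".repeat(current + target - actual), line.trim_start())
            } else {
                // Model over-indented: remove up to the delta without going negative.
                let remove = (actual - target).min(current);
                line[remove..].to_string()
            }
        })
        .collect::<Vec<_>>()
        .join("\n")
}

/// Width of the whitespace prefix of the first non-empty line.
fn leading_indent_width(text: &str) -> usize {
    text.lines()
        .find(|line| !line.trim().is_empty())
        .map(|line| line.len() - line.trim_start().len())
        .unwrap_or(0)
}
```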
### Escaping and Special Character Handling
The team encountered significant challenges with models generating incorrect escaping in various contexts, particularly with Rust's raw string syntax that allows quotes and other special characters. Models, especially Gemini, would sometimes apply HTML escape codes, backslash escaping, or double-escape newlines when generating code modifications.
These escaping issues were primarily addressed through prompt engineering, explicitly instructing models about proper character handling in different contexts. However, the team acknowledges this as an ongoing area for improvement, with opportunities to implement more sophisticated detection and correction of escaping errors.
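A detection pass of the kind hinted at might look for tell-tale signatures of over-escaping, such as HTML entities or literal `\n` sequences in output that contains no real newlines. The heuristic below is purely illustrative of that direction.

```rust
/// Illustrative heuristic for spotting output a model has HTML-escaped or
/// double-escaped when it was asked to emit code verbatim.
fn looks_over_escaped(text: &str) -> bool {
    let html_entities = ["&quot;", "&amp;", "&lt;", "&gt;"];
    let has_html_escapes = html_entities.iter().any(|e| text.contains(e));
    // Literal backslash-n sequences with no real newlines usually indicate a
    // double-escaped multi-line block rather than an intentional escape.
    let has_doubled_newlines = text.contains("\\n") && !text.contains('\n');
    has_html_escapes || has_doubled_newlines
}
```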
## Statistical Testing Methodology
Zed's approach to statistical testing represents a sophisticated adaptation of traditional software engineering practices to the LLMOps domain. Rather than accepting the inherent non-determinism as a limitation, they embraced it while maintaining rigorous quality standards through statistical methods.
Their tests typically run 100-200 iterations of specific scenarios, establishing clear pass/fail thresholds. For critical functionality, they require 100% success rates across all iterations, while for less critical features, they might accept 95% success rates. This approach allows them to identify and address edge cases that might only appear in a small percentage of interactions but could significantly impact user experience.
The statistical testing methodology also enables them to measure the impact of changes to prompts, model parameters, or system architecture quantitatively. They can compare success rates before and after modifications, providing objective measures of improvement or regression.
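In code, a stochastic test of this kind can be expressed as an ordinary unit test that drives a scenario repeatedly and asserts a pass-rate threshold. The sketch below is a minimal version; `run_stochastic_eval` and the placeholder scenario function are illustrative, not Zed's actual test API.

```rust
/// Run one LLM-backed scenario `iterations` times and fail the test if the
/// observed pass rate drops below `required_pass_rate` (e.g. 1.0 or 0.95).
fn run_stochastic_eval(
    iterations: usize,
    required_pass_rate: f64,
    mut scenario: impl FnMut(usize) -> bool,
) {
    let mut passes = 0;
    let mut failed_iterations = Vec::new();
    for i in 0..iterations {
        if scenario(i) {
            passes += 1;
        } else {
            failed_iterations.push(i);
        }
    }
    let pass_rate = passes as f64 / iterations as f64;
    assert!(
        pass_rate >= required_pass_rate,
        "pass rate {:.1}% below required {:.1}%; failed iterations: {:?}",
        pass_rate * 100.0,
        required_pass_rate * 100.0,
        failed_iterations
    );
}

// Placeholder: in a real test this would drive the model through one concrete
// editing task and verify the outcome.
fn agent_completes_edit_task(_iteration: usize) -> bool {
    true
}

#[test]
fn streaming_edit_applies_cleanly() {
    // Critical behavior: require all 200 iterations to pass (100% threshold).
    run_stochastic_eval(200, 1.0, agent_completes_edit_task);
}
```

Because the threshold and iteration count are explicit parameters, the same harness can quantify the effect of a prompt or model change by comparing pass rates before and after.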
## Integration with Development Workflow
Zed has integrated their LLMOps testing directly into their standard development workflow rather than using external evaluation frameworks. The eval functions are part of their main test suite and run as part of their continuous integration process. This integration ensures that AI-powered features are held to the same reliability standards as traditional software components.
The team's approach demonstrates that LLMOps testing doesn't necessarily require specialized tools or frameworks - many of the principles and infrastructure used for traditional software testing can be adapted effectively. Their use of the same underlying testing infrastructure for both deterministic and stochastic tests creates consistency in their development process.
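In Rust terms, such an eval can live in the normal test suite as an ordinary `#[test]` that gates itself on model credentials and applies the same statistical thresholds described above. The environment variable name and placeholder driver below are assumptions for illustration, not Zed's actual setup.

```rust
#[test]
fn eval_agent_renames_symbol_across_files() {
    // Gate on credentials so the same `cargo test` invocation works locally
    // and in CI; the variable name here is an illustrative assumption.
    let Ok(api_key) = std::env::var("EVAL_MODEL_API_KEY") else {
        eprintln!("skipping eval: EVAL_MODEL_API_KEY not set");
        return;
    };

    let iterations: usize = 100;
    let passes = (0..iterations)
        .filter(|&i| run_rename_eval(&api_key, i))
        .count();

    // Less critical path: accept a 95% success rate.
    assert!(
        passes as f64 / iterations as f64 >= 0.95,
        "eval pass rate too low: {passes}/{iterations}"
    );
}

// Placeholder for a driver that would check out a fixture repository, run the
// agent headlessly, and verify the resulting edits.
fn run_rename_eval(_api_key: &str, _iteration: usize) -> bool {
    true
}
```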
## Results and Real-World Performance
The comprehensive testing approach has enabled Zed to deploy reliable agentic editing capabilities that work effectively with frontier models like Claude 4. Nathan Sobo reports being able to write Rust code agentically with high efficiency, suggesting that their testing methodology has successfully addressed the reliability challenges inherent in LLM-powered development tools.
The system's ability to handle streaming edits, parse complex code structures, and manage the various edge cases in LLM behavior demonstrates the effectiveness of their multi-layered testing approach. By combining high-level evaluations with focused stochastic tests and deterministic unit tests, they've created a robust system that can handle the unpredictability of LLM outputs while maintaining user confidence.
## Lessons for LLMOps Practitioners
Zed's experience offers several valuable insights for LLMOps practitioners. First, traditional software engineering rigor remains fundamental even when working with non-deterministic AI systems. The principles of comprehensive testing, empirical validation, and systematic debugging apply equally to AI-powered systems, though the implementation must be adapted.
Second, the multi-layered testing approach - combining broad evaluations, focused stochastic tests, and deterministic unit tests - provides a comprehensive framework for ensuring system reliability. Each layer serves a different purpose and catches different types of issues.
Third, many challenges in LLMOps are not advanced machine learning problems but rather basic software engineering issues related to parsing, formatting, and handling edge cases in model outputs. Robust system design and careful attention to data processing can address many reliability issues.
Finally, statistical approaches to testing, while requiring more sophisticated infrastructure than traditional deterministic tests, can provide rigorous quality assurance for AI-powered features while accommodating the inherent variability in LLM behavior.
The case study demonstrates that with careful engineering and adapted testing methodologies, it's possible to build reliable, production-ready AI-powered development tools that meet the high reliability standards expected in professional software development environments.