Company
Ragas, Various
Title
Systematic AI Application Improvement Through Evaluation-Driven Development
Industry
Tech
Year
2025
Summary (short)
This case study presents Ragas' comprehensive approach to improving AI applications through systematic evaluation practices, drawn from their experience working with various enterprises and early-stage startups. The problem addressed is the common challenge of AI engineers making improvements to LLM applications without clear measurement frameworks, leading to ineffective iteration cycles and poor user experiences. The solution involves a structured evaluation methodology encompassing dataset curation, human annotation, LLM-as-judge scaling, error analysis, experimentation, and continuous feedback loops. The results demonstrate that teams can move from subjective "vibe checks" to objective, data-driven improvements that systematically enhance AI application performance and user satisfaction.
This case study presents Ragas' methodology for improving AI applications through systematic evaluation practices, based on two years of experience working with various enterprises and dozens of early-stage startups. While the text serves as content marketing for Ragas' evaluation infrastructure platform, it offers practical insight into the LLMOps challenges and solutions teams face when running LLM applications in production.

The core problem is a common one in the LLMOps landscape: AI engineers struggle to improve their applications systematically because they lack proper evaluation frameworks. Teams often resort to subjective "vibe checks" rather than objective measurement, leading to ineffective iteration cycles. The authors note that while iterating on an LLM-based system is fast to implement (changing prompts, tools, or routing logic takes minutes), measuring the impact of those changes is slow and often requires human review of sample responses. This creates a bottleneck in which evaluation can consume roughly 30% of a team's iteration loop, particularly when testing prompt tweaks or retrieval changes across many edge cases.

The solution framework consists of several interconnected components that together form an evaluation-driven development approach. It begins with defining what to evaluate, distinguishing end-to-end evaluations (which test the complete pipeline, like integration tests) from component-wise evaluations (which test specific parts, like unit tests). The authors recommend starting with end-to-end evaluation, since it reflects what users actually experience, and then drilling down into component-level testing for debugging.

Dataset curation is a critical foundation of the approach. For pre-production scenarios, they recommend starting with 10-30 realistic inputs that represent expected production use, prioritizing intent diversity over volume. Post-production, they suggest sampling 30-50 real system inputs and reviewing them manually for quality, diversity, and edge cases. To improve test-data diversity, they propose a clustering-based approach: sample system inputs from production, deduplicate them, embed them with an embedding model, cluster the embeddings (with an algorithm such as k-means or DBSCAN), and then sample from each cluster to ensure comprehensive coverage.

The methodology also covers generating high-quality synthetic test data with LLMs. Their approach conditions the generator on key variables such as persona (who is asking), topic (what the query is about), and query complexity (short vs. long, ambiguous vs. specific). They give practical examples, such as generating healthcare chatbot queries by varying personas from PhD students to experienced academics to specialist doctors, combined with different query complexities and medical topics. The authors emphasize using LLMs for extrapolation while keeping human validation in place to ensure quality.

Human review and annotation form another crucial component of the framework. The authors stress that many teams underestimate the importance of human review while overestimating the effort it requires. The process involves defining what to measure (typically 1-3 dimensions per input-response pair), choosing appropriate metric types (binary pass/fail, numerical scores, or rankings), and collecting justifications alongside ratings. For example, in RAG systems they focus on response correctness and citation accuracy, while for coding agents they emphasize syntactic correctness, runtime success, and code style. They recommend building custom annotation UIs tailored to the use case to make review faster and less tedious. The sketches below illustrate the clustering-based sampling, the conditioned synthetic query generation, and an annotation record of the kind described in this section.
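A minimal sketch of the clustering-based diversity sampling step, assuming production inputs are available as plain strings and that sentence-transformers and scikit-learn are installed; the model name, DBSCAN parameters, and function name are illustrative choices rather than part of Ragas' tooling.

```python
import random
from collections import defaultdict

from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN


def sample_diverse_inputs(inputs: list[str], per_cluster: int = 2, seed: int = 0) -> list[str]:
    random.seed(seed)

    # 1. Deduplicate exact matches after light normalization.
    unique = list(dict.fromkeys(q.strip().lower() for q in inputs))

    # 2. Embed each input (assumed embedding model).
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(unique)

    # 3. Cluster the embeddings; eps is a guess that would need tuning on real data.
    labels = DBSCAN(eps=0.3, min_samples=2, metric="cosine").fit_predict(embeddings)

    # 4. Sample a few inputs from every cluster. Noise points (label -1) are kept
    #    as their own group, since outliers are often interesting edge cases.
    clusters = defaultdict(list)
    for text, label in zip(unique, labels):
        clusters[label].append(text)

    sample = []
    for members in clusters.values():
        sample.extend(random.sample(members, min(per_cluster, len(members))))
    return sample
```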
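A sketch of conditioning synthetic query generation on persona, topic, and complexity, following the healthcare chatbot example above. The `generate` callable stands in for whatever LLM client the team uses; the personas, topics, and prompt wording are hypothetical.

```python
from itertools import product
from typing import Callable

PERSONAS = ["first-year PhD student", "experienced academic", "specialist doctor"]
TOPICS = ["drug interactions", "clinical trial design", "post-operative care"]
COMPLEXITIES = ["short and ambiguous", "long and highly specific"]

PROMPT_TEMPLATE = (
    "You are helping build a test set for a healthcare chatbot.\n"
    "Write one realistic user query.\n"
    "Persona: {persona}\n"
    "Topic: {topic}\n"
    "Style: {complexity}\n"
    "Return only the query text."
)


def build_test_queries(generate: Callable[[str], str]) -> list[dict]:
    """Generate one candidate query per (persona, topic, complexity) combination."""
    samples = []
    for persona, topic, complexity in product(PERSONAS, TOPICS, COMPLEXITIES):
        prompt = PROMPT_TEMPLATE.format(persona=persona, topic=topic, complexity=complexity)
        samples.append(
            {
                "persona": persona,
                "topic": topic,
                "complexity": complexity,
                # Synthetic outputs still go through human review before entering the test set.
                "query": generate(prompt),
            }
        )
    return samples
```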
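And a sketch of an annotation record capturing one to three dimensions per input-response pair, a binary or numerical rating, and a justification alongside each rating; the field names are illustrative, not a Ragas schema.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    dimension: str              # e.g. "response_correctness" or "citation_accuracy"
    passed: bool | None = None  # binary pass/fail metric
    score: float | None = None  # optional numerical score (e.g. 1-5)
    justification: str = ""     # why the annotator gave this rating


@dataclass
class AnnotatedSample:
    input_text: str
    response_text: str
    annotations: list[Annotation] = field(default_factory=list)


sample = AnnotatedSample(
    input_text="What are common interactions between warfarin and ibuprofen?",
    response_text="...model output...",
    annotations=[
        Annotation("response_correctness", passed=True,
                   justification="Matches the cited guideline."),
        Annotation("citation_accuracy", passed=False,
                   justification="Second citation does not support the claim."),
    ],
)
```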
A key element of the approach is scaling evaluation through LLM-as-judge systems. Once teams have established a solid test set with human annotations, repeated manual review can be replaced with automated evaluation powered by LLMs prompted to act like domain experts. The goal is at least 80% agreement between LLM judges and human expert labels. The implementation strategy involves few-shot prompting with high-quality examples, retrieval-based prompting that fetches similar reviewed samples from the dataset, and including the verbal feedback collected during annotation to help the judge reason. To measure alignment between judge outputs and human ratings, the authors suggest agreement and correlation statistics such as Cohen's kappa or Spearman's rank correlation.

Error analysis is the diagnostic phase that helps teams understand why the system fails and what to do next. It involves two steps: developing error hypotheses by manually inspecting logs for underperforming samples, and categorizing those hypotheses into groups for prioritization. The authors recommend using observability tools such as Sentry, Datadog, or specialized LLM observability platforms to trace and log the necessary information. Categorization can be done manually or with LLM assistance, and teams should prioritize the most frequent error categories for maximum impact.

The experimentation framework emphasizes structured changes with clear measurement protocols. Each experiment tweaks a single component of the pipeline, evaluates it on the existing test data, compares the results against a baseline, and leads to an informed deployment decision. This removes guesswork from iteration and gives teams confidence to ship improvements. The authors note that the real challenge in AI applications is not the first 80% of problems but the long-tailed 20%: rare, unpredictable scenarios encountered only in production.

The methodology concludes with ML feedback loops for handling those production edge cases. Teams capture signals through explicit feedback (direct user input such as thumbs-down or ratings) and implicit feedback (behavioral signals such as users not copying output or abandoning sessions). The authors caution that feedback must be interpreted carefully; negative feedback may reflect latency rather than quality. Closing the loop means adding real failures to the test dataset, running targeted experiments, and shipping improved versions.

Throughout the case study, the authors reference implementations by notable companies including GitHub Copilot, Casetext, and Notion, and they provide practical code examples and tool recommendations, including open-source LLM observability platforms such as Langfuse, Phoenix, Helicone, Laminar, Traceloop, and OpenLit. The sketches below illustrate the judge-alignment check and the baseline-versus-variant experiment comparison described above.
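A minimal sketch of checking LLM-as-judge alignment against human labels, assuming binary pass/fail labels collected for the same samples; the label arrays are illustrative, and scikit-learn and SciPy provide the statistics named above.

```python
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

human_labels = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]  # illustrative expert annotations
judge_labels = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]  # illustrative LLM-judge outputs

raw_agreement = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)
kappa = cohen_kappa_score(human_labels, judge_labels)  # chance-corrected agreement
rho, p_value = spearmanr(human_labels, judge_labels)   # rank correlation

print(f"raw agreement: {raw_agreement:.0%}, kappa: {kappa:.2f}, spearman: {rho:.2f}")

# The text's rule of thumb: rely on the judge only once raw agreement with
# human experts reaches roughly 80%.
if raw_agreement < 0.80:
    print("Judge not yet aligned - add few-shot examples or verbal feedback and re-check.")
```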
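And a sketch of the experiment protocol: change one component, run the same test set through both the baseline and the variant, score each response with the aligned judge, and compare pass rates before deciding to ship. `run` signatures, `judge`, and the `min_gain` threshold are hypothetical stand-ins for a team's own application and evaluator.

```python
from typing import Callable


def pass_rate(
    pipeline: Callable[[str], str],
    judge: Callable[[str, str], bool],
    test_inputs: list[str],
) -> float:
    """Fraction of test inputs whose responses the judge marks as passing."""
    passed = sum(judge(q, pipeline(q)) for q in test_inputs)
    return passed / len(test_inputs)


def compare_experiment(
    baseline: Callable[[str], str],
    variant: Callable[[str], str],
    judge: Callable[[str, str], bool],
    test_inputs: list[str],
    min_gain: float = 0.02,
) -> bool:
    """Return True if the variant beats the baseline by at least `min_gain`."""
    base = pass_rate(baseline, judge, test_inputs)
    new = pass_rate(variant, judge, test_inputs)
    print(f"baseline: {base:.0%}  variant: {new:.0%}")
    return new - base >= min_gain
```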
While the case study effectively demonstrates practical LLMOps challenges and solutions, it serves as content marketing for Ragas' evaluation infrastructure platform, and the claims about effectiveness and the specific figures cited (such as the 80% agreement threshold for LLM-as-judge systems) should be read in that context. That said, the overall methodology aligns with industry best practices and addresses real challenges faced by teams deploying LLM applications in production.

The approach represents a mature perspective on LLMOps that goes beyond prompt engineering or model selection to encompass systematic evaluation, continuous improvement, and production monitoring. It acknowledges the complexity of deploying AI systems at scale while providing actionable frameworks for common challenges, and its emphasis on moving from subjective evaluation to objective measurement reflects the field's evolution toward more rigorous engineering practice.
