Anaconda developed a systematic approach called Evaluations Driven Development (EDD) to improve their AI coding assistant's performance through continuous testing and refinement. Using their in-house "llm-eval" framework, they achieved dramatic improvements in their assistant's ability to handle Python debugging tasks, increasing success rates from 0-13% to 63-100% across different models and configurations. The case study demonstrates how rigorous evaluation, prompt engineering, and automated testing can significantly enhance LLM application reliability in production.
Anaconda, the company behind the popular Python distribution and data science platform, published a detailed case study describing their approach to building and improving their “Anaconda Assistant” — an AI-powered coding companion designed to help data scientists with tasks like code generation, debugging, and explanation. The central thesis of the case study is their development of a methodology they call “Evaluations Driven Development” (EDD), which focuses on systematic testing and iterative prompt refinement rather than model fine-tuning to improve LLM performance in production applications.
The Anaconda Assistant is built on top of existing language models (the case study specifically mentions GPT-3.5-Turbo and Mistral 7B Instruct), and its primary use cases include generating code snippets, suggesting code improvements, providing explanations of functions and modules, recommending data preprocessing techniques, and most notably, intelligent debugging. According to their telemetry data, approximately 60% of user interactions with the Assistant involve debugging help, making error handling accuracy a critical capability to optimize.
The case study presents a concrete and honest assessment of the initial performance challenges they faced. When testing their error handling capabilities on a specific debugging scenario (a ValueError raised by invalid input to a function), the initial success rates were remarkably low, ranging from roughly 0% to 13% depending on the model and configuration.
These results were based on 500 iterations per configuration, providing statistically meaningful baselines. The transparency about these initial poor results is notable, as many vendor case studies tend to gloss over challenges. This established a clear problem: the underlying LLMs, when used with naive prompting, could not reliably identify bugs and provide working solutions.
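The value of running 500 iterations per configuration can be made concrete with a back-of-the-envelope calculation. The helper below is illustrative (not part of llm-eval) and uses a normal-approximation confidence interval: at n = 500, the 95% interval around a measured success rate is never wider than about ±4.4 percentage points, which is why these baselines are statistically meaningful.

```python
import math

def success_rate_with_ci(successes: int, trials: int, z: float = 1.96):
    """Point estimate of a pass rate plus a normal-approximation 95% interval.

    Illustrative only: shows why ~500 trials per configuration give a tight
    baseline. The half-width z * sqrt(p(1-p)/n) is maximized at p = 0.5,
    where it is about 0.044 for n = 500.
    """
    p = successes / trials
    half = z * math.sqrt(p * (1 - p) / trials)
    return p, max(0.0, p - half), min(1.0, p + half)
```

For example, a configuration passing 65 of 500 runs yields a 13% success rate with an interval of roughly 10% to 16%, clearly distinguishable from a 63% configuration.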
Central to their EDD approach is an in-house testing framework called “llm-eval.” This framework enables automated evaluation of LLM responses across multiple dimensions. The key components of their evaluation methodology include:
Defining Evaluation Criteria: The team focused on metrics that matter to end users, specifically the accuracy of error explanations and the clarity of code explanations. Rather than relying on abstract benchmarks like perplexity or generic NLP metrics, they prioritized real-world task completion.
Test Case Curation: They assembled diverse test cases spanning common data science challenges, from simple syntax errors to complex issues involving data types and performance. This is a critical LLMOps practice — having representative test suites that mirror actual production workloads.
Automated Execution and Validation: The llm-eval framework executes generated code snippets in a controlled environment, capturing execution details including errors and exceptions, and compares outputs to expected results. This is essentially a form of code execution-based evaluation, which is more rigorous than purely semantic or LLM-as-judge approaches for code generation tasks.
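A minimal sketch of what execution-based validation involves is shown below. The function names and result format are assumptions for illustration, not llm-eval's actual API: each generated snippet runs in a separate interpreter process so exceptions are isolated, and captured stdout is compared against the expected output.

```python
import os
import subprocess
import sys
import tempfile

def run_generated_snippet(code: str, timeout: float = 10.0) -> dict:
    """Execute a generated code snippet in a subprocess and capture the result.

    Illustrative sketch of execution-based evaluation: a crash or uncaught
    exception in the snippet cannot affect the evaluator process, and both
    output streams are captured for later comparison.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
        return {"ok": proc.returncode == 0,
                "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": "timeout"}
    finally:
        os.unlink(path)

def passes(code: str, expected_stdout: str) -> bool:
    """A proposed fix 'passes' if it runs cleanly and prints the expected output."""
    result = run_generated_snippet(code)
    return result["ok"] and result["stdout"].strip() == expected_stdout.strip()
```

Because success is defined as "the code actually runs and produces the right answer," this check cannot be fooled by a fluent but wrong explanation, which is the advantage over LLM-as-judge scoring for code tasks.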
Iterative Refinement: Based on evaluation results, the team identifies weaknesses and edge cases, then refines prompts, queries, and knowledge bases accordingly. This creates a feedback loop where evaluation insights directly drive improvement.
The framework appears to have revealed specific patterns where the Assistant struggled, including “errors involving complex data structures or multiple interrelated files.” This level of diagnostic insight is valuable for prioritizing improvement efforts.
Rather than fine-tuning the underlying models (which is expensive, time-consuming, and may not transfer well across model updates), Anaconda focused on prompt engineering to improve performance. They employed several established techniques:
Few-Shot Learning: By including examples of previously explained errors and their fixes in the prompt, they aimed to guide the model toward generating more accurate error explanations and code corrections. These examples were selected from a dataset of common Python errors and their solutions.
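The mechanics of this can be sketched as follows. The example pairs and the chat-message format below are illustrative assumptions, not Anaconda's actual dataset: worked error/fix pairs are prepended as prior conversation turns so the model sees both the desired reasoning and the desired output format before the real error arrives.

```python
# Hypothetical few-shot examples; a production system would select these
# from a curated dataset of common Python errors and their solutions.
ERROR_EXAMPLES = [
    {
        "error": 'TypeError: can only concatenate str (not "int") to str',
        "explanation": "You are adding an integer to a string. Convert the "
                       "integer with str() before concatenating.",
        "fix": "age = 30\nmessage = 'Age: ' + str(age)  # cast int to str",
    },
    {
        "error": "KeyError: 'price'",
        "explanation": "The dictionary has no 'price' key. Use .get() to "
                       "supply a default instead of indexing directly.",
        "fix": "price = row.get('price', 0.0)  # default when key is missing",
    },
]

def build_few_shot_messages(user_error: str) -> list:
    """Prepend worked error/fix pairs so the model imitates their format."""
    messages = []
    for ex in ERROR_EXAMPLES:
        messages.append({"role": "user",
                         "content": f"How can I fix this error?\n{ex['error']}"})
        messages.append({"role": "assistant",
                         "content": f"{ex['explanation']}\n```python\n{ex['fix']}\n```"})
    messages.append({"role": "user",
                     "content": f"How can I fix this error?\n{user_error}"})
    return messages
```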
Chain-of-Thought Prompting: They structured prompts to request an explanation of the error before requesting the fixed code. This encourages the model to reason step-by-step about the problem, mimicking how a human developer approaches debugging.
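A simple way to enforce this ordering is in the prompt template itself. The wording below is an illustrative reconstruction, not the production prompt: the request is structured so the explanation must come before the corrected code.

```python
# Illustrative chain-of-thought template: reasoning is requested first,
# the fix second, mirroring how a developer approaches debugging.
COT_TEMPLATE = (
    "You will debug a Python error in two steps.\n"
    "Step 1: Explain, in plain language, why the error occurred.\n"
    "Step 2: Only after the explanation, provide the corrected code in a "
    "single Python block, with comments marking each change.\n\n"
    "Error:\n{traceback}\n\nFailing code:\n{code}\n"
)

def build_cot_prompt(traceback: str, code: str) -> str:
    """Order the request so reasoning precedes the fix."""
    return COT_TEMPLATE.format(traceback=traceback, code=code)
```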
The case study includes concrete examples of their prompts, showing a system prompt that positions the Assistant as a “Jupyter Notebook expert” with specific guidelines about explaining errors, asking permission before providing corrected code, and including comments in code to highlight changes. This level of detail about prompt structure is valuable for practitioners.
One of the more innovative aspects of their approach is what they call “Agentic Feedback Iteration.” This technique uses LLMs themselves to analyze evaluation results and suggest prompt improvements.
This is essentially using LLMs for meta-optimization of prompts — a form of automated prompt engineering. The case study provides specific examples of changes made through this process, such as changing the user prompt from “Explain this error” to “How can I fix this error?” — a subtle but meaningful shift that focuses the model on actionable solutions rather than just explanations.
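The loop can be sketched as follows. This is a hypothetical reconstruction of the idea, not Anaconda's implementation: failing evaluation cases and the current prompt are handed to a model, which proposes a revision that is carried into the next round. The `llm` parameter is any callable from string to string (e.g. a wrapper around a chat API), kept abstract here so the loop itself is testable offline.

```python
def agentic_prompt_iteration(current_prompt, failures, llm, rounds=3):
    """Meta-optimization sketch: an LLM revises the prompt based on failures.

    Each round, the current prompt and the failing evaluation cases are
    packaged into a critique request; the model's reply becomes the new
    prompt. Returns the final prompt and the full revision history.
    """
    prompt = current_prompt
    history = [prompt]
    for _ in range(rounds):
        critique_request = (
            "You are optimizing a system prompt for a debugging assistant.\n"
            f"Current prompt:\n{prompt}\n\n"
            "These evaluation cases failed:\n"
            + "\n".join(f"- {f}" for f in failures)
            + "\n\nRewrite the prompt to address the failures. "
              "Return only the revised prompt."
        )
        prompt = llm(critique_request)  # model proposes a revision
        history.append(prompt)
    return prompt, history
```

In practice each revision would be re-scored with the evaluation suite before being accepted, closing the loop between evaluation and prompt refinement.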
The system prompt was also refined from a brief instruction to a more structured format with explicit numbered guidelines covering: providing code snippets with output errors, explaining errors in simple terms, asking permission before providing corrected code, providing corrected code in a single Python block, and including comments in code to highlight changes and reasons.
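Put together, the refined system prompt would look roughly like the constant below. The wording is paraphrased from the case study's description of the five guidelines, not the verbatim production prompt.

```python
# Illustrative reconstruction of the structured system prompt; the exact
# production wording is not published.
SYSTEM_PROMPT = """You are a Jupyter Notebook expert helping data scientists debug Python code.
Follow these guidelines:
1. Ask the user to provide the code snippet along with its output or error.
2. Explain the error in simple terms.
3. Ask permission before providing corrected code.
4. Provide corrected code in a single Python code block.
5. Include comments in the code highlighting each change and the reason for it.
"""
```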
After applying their EDD methodology with prompt engineering and Agentic Feedback Iteration, the results showed substantial improvement: success rates rose to the 63-100% range across the tested models and configurations.
These improvements are dramatic, particularly for the Mistral model, which went from essentially unusable (0-2% success) to highly reliable (87-100% success). The fact that Mistral 7B at temperature 1 achieved 100% success is notable, as it suggests that for this particular task, the combination of well-engineered prompts and some output diversity (higher temperature) actually improved reliability.
It’s worth noting that these results are self-reported and based on a specific test scenario. While the improvements are impressive, real-world performance may vary across different error types and user contexts. The case study does acknowledge this is “just the beginning” and outlines plans for expanding their evaluation framework.
The case study mentions that the Assistant “is always getting smarter thanks to our Evaluations Driven Development process. Every interaction with users who have consented to data collection is an opportunity to refine the prompts and queries.” This indicates a production feedback loop where real user interactions inform ongoing improvements — a key LLMOps practice for maintaining and improving production AI systems over time.
Anaconda outlines several planned improvements, chiefly expanding their evaluation framework beyond the single error scenario tested here.
While the case study presents compelling results, a balanced assessment should note several considerations:
The evaluation was demonstrated on a single specific error scenario (a ValueError). While the methodology is sound, the 100% success rate should be understood in this limited context rather than as a general claim about the Assistant’s capabilities.
The improvements achieved through prompt engineering are impressive but may not generalize across all use cases. Different types of errors, programming patterns, or user query styles may require different prompt optimizations.
The “Agentic Feedback Iteration” concept, while innovative, raises questions about reproducibility and interpretability — using LLMs to optimize LLM prompts adds another layer of complexity and potential variability.
That said, the overall approach of rigorous evaluation, systematic prompt engineering, and iterative refinement represents sound LLMOps practice. The transparency about initial poor performance and the specific metrics-driven improvements add credibility to the case study. The focus on real-world task completion (actually fixing bugs in code) rather than proxy metrics is also a strength of their methodology.