Canva developed a systematic framework for evaluating LLM outputs in their design transformation feature called Magic Switch. The framework focuses on establishing clear success criteria, codifying these into measurable metrics, and using both rule-based and LLM-based evaluators to assess content quality. They implemented a comprehensive evaluation system that measures information preservation, intent alignment, semantic order, tone appropriateness, and format consistency, while also incorporating regression testing principles to ensure prompt improvements don't negatively impact other metrics.
This case study presents Canva’s approach to evaluating LLM outputs in their “Magic Switch” feature, which transforms user designs into different content formats. Canva is an online design platform that aims to democratize design for everyone. The presenter, Shin, a machine learning engineer at Canva with experience in NLP, computer vision, and MLOps, shares their methodology for building a robust LLM evaluation system to ensure quality outputs when dealing with creative, subjective, and non-deterministic content generation.
The core challenge Canva faced was a fundamental one in LLMOps: how do you evaluate whether creative LLM outputs are actually “good”? Unlike traditional ML tasks where there’s often a ground truth to compare against, LLM-generated creative content (summaries, emails, songs) exists in a subjective space where traditional accuracy metrics don’t apply. The team recognized that LLMs provide creativity and non-deterministic outputs, but don’t guarantee correctness, objectivity, or determinism—making evaluation particularly challenging.
Canva’s Magic Switch feature must handle a wide variety of design inputs, which can be quite challenging.
These diverse inputs need to be transformed into “delightful content” such as summaries (for quick high-level understanding), emails (for communication), or even creative outputs like songs. The transformation must preserve meaning while adapting to the target format’s conventions.
A key insight shared in this case study is that the development process with LLMs should be reversed from the typical engineering approach. Instead of starting with prompts and evaluating outputs afterward, Canva’s team advocated for first defining success criteria, then codifying those criteria into measurable metrics, and only then engineering prompts against them.
This approach prevents the common pitfall of “goalpost moving” where teams unconsciously adjust their definition of success based on what the LLM produces.
Canva established two main areas for quality assessment:
Content Criteria (assessed through questions like): does the output preserve the input’s information, align with the user’s intent, maintain semantic order, and strike an appropriate tone?

Format Criteria (assessed through questions like): does the output follow the structural conventions of the target format, such as the expected sections of an email?
The team converted their quality criteria into specific, measurable metrics:
For example, the format criteria for email outputs was defined as: “The email should have the format of a subject line, proper greeting, concise content, appropriate call to action, and salutation.”
Canva implemented two types of evaluators based on the complexity of the criteria being assessed:
Rule-Based Evaluators: Used for small, straightforward criteria. For example, checking if a summary has a title can be done with regex-based rules. These are objective, deterministic, and computationally efficient.
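The talk does not show implementation details, but a rule-based evaluator along these lines could be sketched as follows. The function names and the exact regex patterns (a markdown-style title, a `Subject:` line) are illustrative assumptions, not Canva's actual rules:

```python
import re


def has_title(summary: str) -> bool:
    # Illustrative rule: treat a leading markdown heading as the title.
    stripped = summary.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    return bool(re.match(r"^#{1,3}\s+\S+", first_line))


def email_has_subject(email: str) -> bool:
    # Illustrative rule for the email format criteria: an explicit subject line.
    return bool(re.search(r"^Subject:\s*\S+", email, flags=re.MULTILINE))


print(has_title("# Quarterly Review\nKey points..."))  # True
print(email_has_subject("Hi team,\nSee attached."))    # False
```

Checks like these are deterministic and cheap, which is why they suit the small, structural criteria rather than the subjective ones.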
LLM-as-Judge Evaluators: Used for complex, abstract, higher-level criteria. For instance, evaluating whether an email has a “proper greeting” or “appropriate call to action” is conceptual and difficult to capture with simple rules. LLMs are quite effective at making these subjective assessments.
The evaluation architecture combines both evaluator types and aggregates their normalized scores into overall quality metrics.
For LLM-based scoring, instructions ask the model to provide a score between 0 and 1, where 0 means criteria are not followed and 1 means all criteria are fully met. The presenter clarified this is a continuous scale (not binary), normalized for easy aggregation and comparison across iterations.
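A judge of this kind reduces to two pieces: a prompt that instructs the model to emit a 0-1 score, and a parser that extracts and clamps it. The prompt wording and the `SCORE:` response convention below are assumptions for illustration, not Canva's actual instructions:

```python
import re


def build_judge_prompt(criteria: str, output: str) -> str:
    # Assemble an LLM-as-judge prompt (wording is illustrative).
    return (
        "You are evaluating generated content against quality criteria.\n"
        f"Criteria: {criteria}\n"
        f"Content:\n{output}\n"
        "Return a score between 0 and 1, where 0 means the criteria are not "
        "followed and 1 means they are fully met. Respond as: SCORE: <value>"
    )


def parse_score(model_response: str) -> float:
    # Extract the continuous 0-1 score, clamping out-of-range values.
    match = re.search(r"SCORE:\s*([01](?:\.\d+)?)", model_response)
    if match is None:
        raise ValueError("no score found in judge response")
    return min(max(float(match.group(1)), 0.0), 1.0)


print(parse_score("Reasoning: greeting is informal. SCORE: 0.6"))  # 0.6
```

Normalizing every judge to the same 0-1 scale is what makes scores easy to aggregate and compare across prompt iterations.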
A critical question in LLM-as-judge approaches is: “Can we trust the scores generated by LLMs?” The team addressed this through ad-hoc ablation score checks—deliberately degrading outputs and verifying that scores dropped appropriately.
This ablation approach validates that the LLM evaluator is responding appropriately to quality degradation. The presenter acknowledged that future improvements could include more systematic checks with human-in-the-loop processes to calculate correlation between LLM evaluator scores and human labels.
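The ablation check itself is simple to express: score an original output and a deliberately degraded copy, and require the score to drop. The helper below is a sketch under that assumption; the toy evaluator stands in for a real LLM judge:

```python
def ablation_check(evaluate, original: str, degraded: str,
                   min_drop: float = 0.1) -> bool:
    # Verify the evaluator penalizes a deliberately degraded output.
    drop = evaluate(original) - evaluate(degraded)
    return drop >= min_drop


def toy_evaluator(email: str) -> float:
    # Stand-in for an LLM judge: rewards the presence of a greeting.
    return 1.0 if email.lower().startswith(("hi", "dear", "hello")) else 0.3


print(ablation_check(toy_evaluator, "Dear Alex,\nThanks", "no greeting"))  # True
```

If the check fails, the judge is not sensitive to the degradation being tested, which is exactly the trust problem the team wanted to surface.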
Each iteration of prompt engineering triggers an evaluation run against LLM test input/output pairs. The evaluation compares metrics before and after prompt changes, serving dual purposes: confirming that intended improvements actually materialize, and catching unintended regressions in other metrics.
This mirrors software engineering regression testing practices, applied to the inherently non-deterministic domain of LLM outputs.
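The regression comparison can be sketched as a diff over per-metric scores from two evaluation runs. The metric names and the `tolerance` threshold below are illustrative assumptions:

```python
def compare_runs(before: dict, after: dict, tolerance: float = 0.02):
    """Compare per-metric scores from evaluation runs before and after a
    prompt change; a drop beyond `tolerance` counts as a regression."""
    improved = [m for m in before if after.get(m, 0.0) - before[m] > tolerance]
    regressed = [m for m in before if before[m] - after.get(m, 0.0) > tolerance]
    return improved, regressed


before = {"information_preservation": 0.82, "tone": 0.75, "format": 0.90}
after = {"information_preservation": 0.88, "tone": 0.70, "format": 0.91}
print(compare_runs(before, after))  # (['information_preservation'], ['tone'])
```

A non-empty `regressed` list would block the prompt change, just as a failing regression suite blocks a code change.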
The presenter shared several important takeaways from this work: define success criteria before engineering prompts, codify those criteria into measurable metrics, and use evaluation runs to guard against both regressions and goalpost moving.
The team identified several areas for improvement, most notably adding human-in-the-loop checks to measure the correlation between LLM evaluator scores and human labels.
The presenter also discussed how their approach relates to RAG (Retrieval Augmented Generation), noting that their current system already operates similarly—user design input serves as context/knowledge, and the expected output type serves as the query, combined to guide the LLM transformation. The emphasis remains on quality and format rather than factual accuracy, reflecting the creative nature of the use case.
This case study demonstrates a mature approach to LLMOps evaluation that recognizes the unique challenges of creative content generation while applying systematic engineering practices to ensure quality at scale.
OpenAI's applied evaluation team presented best practices for implementing LLMs in production through two case studies: Morgan Stanley's internal document search system for financial advisors and Grab's computer vision system for Southeast Asian mapping. Both companies started with simple evaluation frameworks using just 5 initial test cases, then progressively scaled their evaluation systems while maintaining CI/CD integration. Morgan Stanley improved their RAG system's document recall from 20% to 80% through iterative evaluation and optimization, while Grab developed sophisticated vision fine-tuning capabilities for recognizing road signs and lane counts in Southeast Asian contexts. The key insight was that effective evaluation systems enable rapid iteration cycles and clear communication between teams and external partners like OpenAI for model improvement.
This case study presents the deployment of Dust.tt's AI platform across multiple companies including Payfit and Alan, focusing on enterprise-wide productivity improvements through LLM-powered assistants. The companies implemented a comprehensive AI strategy involving both top-down leadership support and bottom-up adoption, creating custom assistants for various workflows including sales processes, customer support, performance reviews, and content generation. The implementation achieved significant productivity gains of approximately 20% across teams, with some specific use cases reaching 50% improvements, while addressing challenges around security, model selection, and user adoption through structured rollout processes and continuous iteration.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.