## Overview
This case study comes from a conference presentation featuring Sheila (founder of Tola Capital, a venture capital firm) and Nishal (co-founder and CTO of Klarity), discussing the critical importance of AI evaluations in production systems. Klarity is an enterprise software company that automates back-office document workflows, particularly for finance and accounting teams. The company raised $70 million in Series B funding and has been building AI systems for approximately eight years, transitioning from traditional ML to generative AI in recent years.
The presentation emphasizes that as AI systems become more agentic and automated, the need for robust evaluation frameworks becomes paramount. The speakers argue that current evaluation approaches are often inadequate for the complexity of modern LLM-based systems, and they share both philosophical perspectives on the challenges and practical solutions Klarity has implemented.
## The Problem Space
Klarity tackles document-oriented workflows that have historically been impossible to automate because they are cognitive and non-repetitive. These include revenue recognition, invoice-to-purchase-order matching, and tax withholding processing. The documents are completely unstructured PDFs that humans traditionally had to read and process manually. As one speaker noted during the keynote, document processing tasks are particularly challenging for LLMs due to issues like page rotation, poor scan quality, graphs, images, and embedded tables.
The transition from traditional ML to generative AI introduced several new challenges for evaluation:
**Non-deterministic performance**: Unlike traditional ML models, LLMs produce variable outputs. Uploading the same PDF multiple times can yield completely different responses, making evaluation significantly more complex; a lightweight way of quantifying this variability is sketched after this list of challenges.
**Novel user experiences**: Generative models enable entirely new types of experiences that are difficult to evaluate. Klarity built features like an "Architect" tool where users record their business workflows and the system generates a comprehensive 10-page business requirements document with flowcharts and images. Other features include natural language analytics for generating charts and visualizations from conversational queries. There are no established evaluation frameworks or precedents for experiences like these.
**Accelerated feature development cycles**: In the traditional deep neural network era, features took 5-6 months to build, including dedicated annotation teams and GPU training infrastructure. With generative AI, features can ship in days—Klarity launched document chat within 12 hours of ChatGPT's release. This means evaluation can no longer take weeks without becoming a bottleneck.
**Benchmark divergence from real-world performance**: The speakers highlighted significant gaps between standard benchmarks (like MMLU) and actual production performance. Models that score higher on benchmarks sometimes perform worse on Klarity's internal evaluations, particularly because their use cases push the frontier of human cognition.
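To illustrate the non-determinism point above: the source does not show how Klarity quantifies output variability, but a minimal sketch, assuming a generic `call_model` client function (a placeholder, not any particular API), could look like this:

```python
from collections import Counter

def output_stability(call_model, prompt: str, n_runs: int = 5) -> float:
    """Run the same prompt n_runs times and return the fraction of runs
    that agree with the most common output. `call_model` is a placeholder
    for whatever client function returns the model's text for a prompt."""
    outputs = [call_model(prompt).strip() for _ in range(n_runs)]
    _, top_count = Counter(outputs).most_common(1)[0]
    return top_count / n_runs
```

A stability score of 1.0 means every run returned the same answer; lower scores flag outputs that need tighter prompting or post-processing before they can be trusted in an automated workflow.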
## Broader Industry Context on Evaluations
The presentation provides important context on the state of AI evaluations more broadly:
**Benchmark hacking**: Current leaderboards reward "solving for X" rather than genuine understanding. Models can score high on AP exams by being trained on similar questions without truly understanding the subject matter. The speakers cite research on dynamic benchmarks using synthetic data where objects are moved around to prevent memorization—testing spatial reasoning and visual prompting without allowing models to game the evaluation.
**User-centric evaluation gaps**: End users are often an afterthought in evaluation systems, despite being the ultimate recipients of value. The speakers advocate for putting user experience at the center of evaluation frameworks.
**Black-box model challenges**: Evaluation is always a proxy for capability, not truth itself. As AI systems become more complex, understanding what neural networks have learned becomes increasingly difficult, making AI interpretability research critical.
**Values and societal implications**: The presentation raises philosophical questions about whether AI systems reinforce user beliefs (echo chambers) and how to balance appeasing users versus challenging them, present versus aspirational societal values.
## Klarity's Evaluation Solutions
### Embrace Imperfection and Frontload UX Testing
Klarity advocates for "giving yourself the gift of imperfection." Rather than requiring high-quality evals before any feature development, they frontload user testing. Because each generative AI feature delivers a novel experience, it is treated almost as its own product-market fit experiment. Many features fail at the UX stage before evals would even matter, so the approach is to let them die there rather than investing in comprehensive evaluation infrastructure up front. Once a feature is committed to production at scale, however, rigorous evaluation becomes essential.
### Backward-Chaining from User Experience
Rather than starting from what's easy to measure (like F1 scores) and hoping it correlates with user value, Klarity works backward from the end-user outcome. Their framework involves a three-level hierarchy:
- End-user outcome and business value being driven
- Indicators of adoption and utilization
- Technical health metrics (JSON adherence, output variability, etc.)
No single eval captures the entirety of the experience, but in aggregate the evaluations should reflect the user experience.
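As an illustration of the bottom level of this hierarchy, a minimal JSON-adherence check (the exact metric definitions are not given in the source) might be:

```python
import json

def json_adherence(raw_outputs: list[str], required_keys: set[str]) -> float:
    """Fraction of raw model outputs that parse as JSON and contain the
    expected keys, one example of a bottom-level technical health metric."""
    if not raw_outputs:
        return 0.0
    ok = 0
    for raw in raw_outputs:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict) and required_keys.issubset(parsed.keys()):
            ok += 1
    return ok / len(raw_outputs)
```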
### Customer-Specific Evaluation Metrics
Because Klarity operates as "bespoke enterprise AI," each customer effectively provides their own set of labels. The company annotates data for each customer's specific use case as part of the UAT (User Acceptance Testing) process and builds use-case-specific accuracy metrics. Some customers need matching accuracy; others need extraction accuracy. User feedback loops exist but are not relied upon exclusively—the company explicitly rejects the assumption that absence of user feedback indicates positive feedback.
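A hypothetical sketch of how such per-customer metrics might be organized, with invented customer names, metrics, and thresholds purely for illustration:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class CustomerEval:
    customer: str
    use_case: str                                  # e.g. "po_matching" or "field_extraction"
    metric: Callable[[Sequence, Sequence], float]  # (predictions, labels) -> score
    threshold: float                               # acceptance bar agreed during UAT

def exact_match_accuracy(preds: Sequence, labels: Sequence) -> float:
    """Simple placeholder metric: share of predictions equal to their labels."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

# Each customer contributes its own labelled UAT set and the metric that
# captures *their* definition of correctness.
CUSTOMER_EVALS = [
    CustomerEval("customer_a", "po_matching", exact_match_accuracy, threshold=0.95),
    CustomerEval("customer_b", "field_extraction", exact_match_accuracy, threshold=0.90),
]
```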
### Data Drift Monitoring
Klarity has invested in tools to monitor data drift, identifying when incoming documents differ significantly from the population the system has been trained on. This helps maintain awareness of when models may be operating outside their reliable performance envelope.
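The presentation does not describe the drift tooling itself; one simple proxy, assuming document embeddings are available as arrays (one vector per document), is to compare the centroid of the reference population against the centroid of recently ingested documents:

```python
import numpy as np

def drift_score(reference_embeddings: np.ndarray, incoming_embeddings: np.ndarray) -> float:
    """Crude drift signal: cosine distance between the centroid of the document
    embeddings the system was validated on and the centroid of recently
    ingested documents (each array is n_docs x embedding_dim)."""
    a = reference_embeddings.mean(axis=0)
    b = incoming_embeddings.mean(axis=0)
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 1.0 - cosine  # 0 means identical centroids; larger values mean more drift
```

A threshold on this score, tuned per use case, can trigger a review of whether the incoming document population still matches what the prompts and models were validated against.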
### Synthetic Data Generation Investment
After surveying six-plus providers and finding them inadequate for their needs, Klarity built their own synthetic data generation stack. All example documents shown in the presentation were synthetically generated. The approach is applied selectively: once a use case grows large enough across multiple customers, they invest in customer-agnostic evals using synthetic data. A dedicated team monitors that synthetic data remains distributionally similar to actual customer data. They also track accuracy scores to ensure models neither over-perform nor under-perform on synthetic data compared to production data.
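A sketch of the two monitoring checks described, distributional similarity and accuracy parity, with illustrative placeholder thresholds rather than values from the source:

```python
def synthetic_health_check(synthetic_acc: float, production_acc: float,
                           distribution_distance: float,
                           acc_tolerance: float = 0.05,
                           dist_threshold: float = 0.10) -> dict:
    """Check that (1) synthetic documents stay distributionally close to real
    customer documents and (2) accuracy on synthetic data neither over- nor
    under-shoots production accuracy by more than the tolerance."""
    gap = synthetic_acc - production_acc
    return {
        "distribution_ok": distribution_distance <= dist_threshold,
        "accuracy_gap": gap,
        "accuracy_ok": abs(gap) <= acc_tolerance,
    }
```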
### Reducing Degrees of Freedom
Klarity's architecture involves numerous features, each requiring custom prompts for each customer across 100+ customers. Combined with multiple LLMs that perform differently on different tasks and customers, this creates an exponential explosion in evaluation complexity. Their solution: reduce degrees of freedom by standardizing which LLM is used for specific types of tasks (like automated prompt engineering). While this may sacrifice marginal accuracy gains, building grid search infrastructure across all dimensions is too expensive. They describe this as "dimensionality reduction at a project management level" and commit to continuous reevaluation as new models emerge.
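Conceptually, this standardization amounts to a small task-to-model routing table rather than a per-customer, per-feature model search. The sketch below is illustrative only; the model names are placeholders and the task names mirror the features described elsewhere in this case study:

```python
# Illustrative task-to-model routing (model names are placeholders, not the
# models Klarity actually uses). Fixing one model per task type collapses the
# customer x feature x model grid into a much smaller evaluation surface.
TASK_MODEL_ROUTING = {
    "automated_prompt_engineering": "model_a",
    "free_text_extraction": "model_b",
    "tabular_extraction": "model_b",
    "matching": "model_c",
    "table_composition": "model_b",
}

def model_for_task(task: str) -> str:
    """Look up the single standardized model for a given task type."""
    return TASK_MODEL_ROUTING[task]
```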
### Automated Prompt Engineering (APE)
Klarity uses automated prompt engineering rather than writing prompts by hand, which addresses the challenge of maintaining thousands of prompts across customers and features. The system includes features for free-text extraction, tabular extraction, matching, and table composition, each requiring customer-specific prompts that are generated automatically.
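The presentation does not detail how the automated prompt engineer works internally; a minimal generate-and-select loop, with both callables left as assumptions rather than Klarity's actual components, might look like this:

```python
def auto_engineer_prompt(generate_candidates, evaluate, base_instructions: str,
                         eval_examples, n_candidates: int = 8) -> str:
    """Minimal automated-prompt-engineering loop. `generate_candidates` is
    assumed to ask an LLM to propose prompt variants from the base instructions
    plus a few labelled examples; `evaluate` is assumed to score a candidate
    prompt against the customer's eval set. Neither is specified in the source."""
    candidates = generate_candidates(base_instructions, eval_examples, n_candidates)
    scored = [(evaluate(prompt, eval_examples), prompt) for prompt in candidates]
    best_score, best_prompt = max(scored, key=lambda pair: pair[0])
    return best_prompt
```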
### Future-Facing Evaluation Scaffolding
Beyond evaluating current capabilities, Klarity maintains a "wish list" of future use cases they want to grow into over the next 3-5 years. They recognize that the technology evolves faster than expected, but they don't build production-grade evaluations (on the order of MMLU or BBH) for hypothetical future workflows. Instead, they maintain lightweight, scrappy evaluations. For example, when GPT-4V (OpenAI's vision-capable model) launched, within an hour they had formed an organizational mental model: good at checkmarks, potentially weaker at pie charts, and so on. This muscle of quickly forming qualitative assessments of new capabilities enables faster strategic planning.
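One way to keep such scrappy evaluations organized, purely as an illustration with placeholder file names and an assumed `describe_document` call to the new model, is a small set of hand-picked probes per wish-list capability:

```python
# A handful of hand-picked documents per skill, run once when a new model
# launches to form a rough mental model ("good at checkmarks, weaker at pie
# charts") rather than a production-grade metric. File names are placeholders.
CAPABILITY_PROBES = {
    "checkbox_detection": ["probe_checkboxes_01.pdf", "probe_checkboxes_02.pdf"],
    "pie_chart_reading": ["probe_pie_chart_01.pdf"],
    "table_structure": ["probe_nested_table_01.pdf"],
}

def run_probes(describe_document) -> dict:
    """`describe_document` stands in for a call to the new model; the outputs
    are reviewed by hand rather than scored automatically."""
    return {skill: [describe_document(doc) for doc in docs]
            for skill, docs in CAPABILITY_PROBES.items()}
```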
## Scale and Production Metrics
Klarity now processes over 500,000 documents for customers, runs more than 15 unique LLM use cases in production, and operates with more than 10 LLMs under the hood. Often, a single use case has multiple LLMs working together. This multi-model architecture adds complexity to evaluation but is necessary for meeting the diverse cognitive demands of enterprise document processing.
## Key Takeaways
The case study illustrates several important LLMOps principles:
The traditional ML evaluation paradigms (train/test splits, F1 scores, ROC curves, shared benchmark tasks) don't translate directly to generative AI applications. New evaluation frameworks must be developed that account for non-determinism, novel user experiences, and rapid development cycles.
User experience should drive evaluation design, not the reverse. Abstract benchmarks that don't correlate with user value are insufficient and potentially misleading.
Pragmatism is essential—embracing imperfection, reducing dimensions of complexity, and knowing when to invest in rigorous evaluation versus when to accept lightweight assessments.
Synthetic data generation is becoming a critical capability for production LLM systems, but off-the-shelf solutions may not meet specialized enterprise needs.
The speakers also emphasize the broader responsibility of AI practitioners to think about values, interpretability, and societal implications as AI systems become more autonomous and agentic. They call for the community to reinvent benchmarks, bring depth and multifaceted nature to evaluations, and share learnings with one another—as demonstrated by Klarity's willingness to discuss both successes and failures in their evaluation journey.