Company
DocETL
Title
Systematic Approach to Building Reliable LLM Data Processing Pipelines Through Iterative Development
Industry
Research & Academia
Year
2025
Summary (short)
UC Berkeley researchers studied how organizations struggle with building reliable LLM pipelines for unstructured data processing, identifying two critical gaps: data understanding and intent specification. They developed DocETL, a research framework that helps users systematically iterate on LLM pipelines by first understanding failure modes in their data, then clarifying prompt specifications, and finally applying accuracy optimization strategies, moving beyond the common advice of simply "iterate on your prompts."
## Overview

This case study presents research from UC Berkeley's DocETL project, focusing on the challenges practitioners face when building LLM pipelines for processing unstructured data. The talk, delivered by Shreya (a PhD candidate at UC Berkeley), provides a research-oriented perspective on why "prompt iteration" is so difficult and what kinds of tooling and methodologies can help practitioners build more reliable LLM systems. Unlike vendor case studies, this is an academic research perspective that offers insights applicable across industries and use cases.

The core insight from this research is that organizations are increasingly trying to use LLMs to extract insights from large collections of unstructured documents—customer service reviews, sales emails, contracts, safety incident reports—but they consistently report that "prompts don't work" and that iteration feels like "hacking away at nothing." The research team applied HCI (Human-Computer Interaction) research methods to understand these challenges systematically.

## The Problem Space: Data Processing Agents

The research focuses on what they call "data processing agents"—LLM-based systems designed to extract, analyze, and summarize insights from organizational document collections. The examples given include:

- Customer service reviews: extracting themes and summarizing actionable insights
- Sales communications: analyzing why deals didn't close and determining next steps
- Real estate contracts: extracting pet policy clauses by neighborhood
- Safety domains (traffic, aviation): identifying causes of accidents and mitigation strategies

A typical pipeline architecture involves a sequence of LLM operations: map operations that extract outputs from each document, additional map operations for classification or categorization, and aggregate operations that summarize results by grouping criteria (neighborhood, city, etc.). This is a common pattern in production LLM systems dealing with document processing at scale.
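To make the map → map → aggregate pattern concrete, here is a minimal Python sketch using the pet-policy example from the talk. This is an illustration, not the DocETL API: the `call_llm` wrapper, the model name, the prompts, the category labels, and the document fields (`text`, `neighborhood`) are all assumptions made for this sketch.

```python
from collections import defaultdict
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any LLM client would do here

def call_llm(prompt: str) -> str:
    """Minimal wrapper around a chat model; swap in whatever provider you use."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content or ""

def extract_clauses(doc: dict) -> dict:
    """Map step 1: pull pet policy clauses out of a single contract."""
    clauses = call_llm(
        "Extract every pet policy clause from this lease, one per line:\n\n" + doc["text"]
    )
    return {**doc, "clauses": clauses}

def categorize_clauses(doc: dict) -> dict:
    """Map step 2: tag which clause categories appear in the document."""
    categories = call_llm(
        "Which of these categories appear in the clauses below: breed_restriction, "
        "pet_count_limit, service_animal_exemption, other? "
        "Answer with a comma-separated list.\n\n" + doc["clauses"]
    )
    return {**doc, "categories": categories.strip()}

def summarize_by_neighborhood(docs: list[dict]) -> dict[str, str]:
    """Aggregate step: group documents and summarize clauses per neighborhood."""
    groups: dict[str, list[str]] = defaultdict(list)
    for d in docs:
        groups[d["neighborhood"]].append(d["clauses"])
    return {
        hood: call_llm("Summarize the common pet policies in these clauses:\n\n" + "\n".join(items))
        for hood, items in groups.items()
    }

# Usage (documents are dicts with "text" and "neighborhood" fields):
# processed = [categorize_clauses(extract_clauses(d)) for d in docs]
# summaries = summarize_by_neighborhood(processed)
```

A production pipeline would add batching, retries, caching, and output validation, but the shape (per-document map operations followed by a grouped aggregation) is the part the research focuses on.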
## Two Critical Gaps Identified

The research identified two fundamental gaps that explain why LLM pipeline development is so challenging.

### The Data Understanding Gap

The first gap is between the developer and their data. When practitioners start building LLM pipelines, they often don't know what the "right question" is to ask. In the real estate example, a developer might initially request "all pet policy clauses" and only later realize they specifically need "dog and cat pet policy clauses" after examining outputs. This discovery process is inherently iterative and requires looking at actual data and outputs.

The challenges here include identifying the types of documents in the dataset and understanding the unique failure modes for each document type. For pet policy clauses alone, the research found multiple distinct clause types: breed restriction clauses, clauses on the number of pets, service animal exemptions, and others that developers don't anticipate until they examine the data.

The research observed a "long tail" of failure modes across virtually every application domain. This echoes broader patterns in machine learning systems, where edge cases are numerous and difficult to enumerate in advance. Specific failure modes observed include:

- Unusually phrased clauses that LLMs miss during extraction
- LLMs overfitting to certain keywords
- Extraction of unrelated content due to keyword associations
- Unexpected document formats or structures

The researchers note it's "not uncommon to see people flag hundreds of issues in a thousand-document collection," suggesting that production LLM pipelines face substantial quality challenges at scale.

### The Intent Specification Gap

The second gap is between what the developer wants and how they specify it to the LLM. Even when developers believe they understand their task, translating that understanding into unambiguous prompts is surprisingly difficult. Things that seem unambiguous to humans—like "dog and cat policy clauses"—are actually ambiguous to LLMs, which may need explicit specification of weight limits, breed restrictions, quantity limits, and other details.

When developers encounter hundreds of failure modes, figuring out how to translate observations into pipeline improvements becomes overwhelming. The options include prompt engineering, adding new operations, task decomposition, or analyzing document subsections—but practitioners "often get very lost" navigating these choices.

## Proposed Solutions and Tooling

The research team is developing tooling to address these gaps, available through the DocETL project.

### For the Data Understanding Gap

The team is building tools that automatically help users find anomalies and failure modes in their data. Key approaches include:

- Automatic clustering of outputs so users can see patterns
- Annotation capabilities that help organize and label failure modes
- Generation of evaluation datasets based on identified failure mode categories
- Potential strategies like generating alternative phrasings with LLMs or implementing hybrid keyword/LLM checks

The goal is to help users design "evals on the fly" for each different failure mode, rather than requiring comprehensive evaluation sets upfront. This reflects a practical reality: in production LLM systems, evaluations are never "done first"—teams are constantly discovering new failure modes as they run pipelines.

### For the Intent Specification Gap

The research stack includes capabilities to take user-provided notes and automatically translate them into prompt improvements. This is presented through an interactive interface where users can provide feedback, edit suggestions, and maintain revision history—making the system "fully steerable." The emphasis on revision history and user control reflects good practices for production LLM systems, where transparency and reproducibility matter.

## Key Takeaways for LLMOps Practitioners

The research offers several insights that apply beyond data processing pipelines to LLM operations more broadly.

### Evaluations Are Never Done

In every domain studied, evaluations are "very, very fuzzy" and never complete upfront. Teams are constantly collecting new failure modes as they run pipelines and creating new subsets of documents or example traces that represent future evaluations. This suggests that production LLM systems need infrastructure for continuous evaluation collection rather than one-time benchmark creation.

### The Long Tail of Failure Modes

The research consistently observed users tracking 10-20+ different failure modes. This has implications for monitoring and observability in production LLM systems—simple aggregate metrics may miss important quality issues hiding in the long tail.
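To make the "evaluations are never done" and long-tail points concrete, here is a minimal sketch of failure-mode-keyed evaluation collection: each flagged output is appended to a per-failure-mode JSONL file that later serves as a regression eval. The storage layout and the `record_failure`/`eval_coverage` helpers are hypothetical, not part of DocETL.

```python
import json
import time
from pathlib import Path

EVAL_STORE = Path("evals")  # hypothetical layout: one JSONL file per failure mode

def record_failure(doc_id: str, failure_mode: str, note: str, output: str) -> None:
    """Append a flagged pipeline output to the eval set for its failure mode."""
    EVAL_STORE.mkdir(exist_ok=True)
    record = {"doc_id": doc_id, "note": note, "output": output, "flagged_at": time.time()}
    with (EVAL_STORE / f"{failure_mode}.jsonl").open("a") as f:
        f.write(json.dumps(record) + "\n")

def eval_coverage() -> dict[str, int]:
    """Count flagged examples per failure mode: a cheap view of the long tail."""
    return {p.stem: len(p.read_text().splitlines()) for p in EVAL_STORE.glob("*.jsonl")}

# Example: flag an extraction miss while reviewing outputs, then check coverage.
# record_failure("lease_0042", "unusually_phrased_clause",
#                "missed 'no four-legged companions' wording", output="[]")
# print(eval_coverage())  # e.g. {"unusually_phrased_clause": 17, "keyword_overfit": 4}
```

Each file can double as a regression suite: rerunning the pipeline on the documents referenced in a file shows whether that particular failure mode has actually improved, rather than relying on a single aggregate metric.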
### Distinct Iteration Stages

Perhaps the most actionable insight is that practitioners benefit from unpacking the iteration cycle into distinct stages rather than trying to optimize everything simultaneously:

- **Stage 1 - Data Understanding**: First, understand your data without worrying about accuracy. Know what's happening with your failure modes.
- **Stage 2 - Intent Specification**: Get prompts as well specified as possible. Eliminate ambiguity to the point where a human would not misinterpret them.
- **Stage 3 - Accuracy Optimization**: Only after the first two stages should practitioners apply well-known accuracy optimization strategies (query decomposition, prompt optimization, etc.).

This staged approach is notable because it goes against the temptation to immediately apply sophisticated techniques like prompt optimization or task decomposition. The research suggests these techniques only yield "really good gains" when applied to already well-specified pipelines.

## Implications and Assessment

This research provides valuable insights for LLMOps practitioners, though it should be noted that the solutions described are still at the prototyping/research stage rather than production-hardened tools. The observations about the difficulty of LLM pipeline development align with broader industry experience, lending credibility to the findings.

The emphasis on human-in-the-loop tooling and interactive interfaces reflects a pragmatic view that fully automated LLM pipelines remain challenging—production systems benefit from structured ways for humans to understand and intervene in LLM processing.

The three-stage iteration framework (data understanding → intent specification → accuracy optimization) offers a practical methodology that could help teams avoid spinning their wheels on optimization before fundamental specification issues are resolved. However, the research doesn't provide quantitative evidence for how much this methodology improves outcomes compared to less structured approaches.

For organizations building document processing pipelines or similar LLM applications, the key actionable insights are: invest in understanding your data's failure modes before optimizing, treat evaluation as an ongoing process rather than a one-time setup, and ensure prompts are specified clearly enough that a human reader could not misinterpret them before expecting reliability from the LLM.
