ZenML

Systematic Approach to Building Reliable LLM Data Processing Pipelines Through Iterative Development

DocETL 2025

UC Berkeley researchers studied how organizations struggle with building reliable LLM pipelines for unstructured data processing, identifying two critical gaps: data understanding and intent specification. They developed DocETL, a research framework that helps users systematically iterate on LLM pipelines by first understanding failure modes in their data, then clarifying prompt specifications, and finally applying accuracy optimization strategies, moving beyond the common advice of simply "iterate on your prompts."

Industry

Research & Academia

Overview

This case study presents research from UC Berkeley’s DocETL project, focusing on the challenges practitioners face when building LLM pipelines for processing unstructured data. The talk, delivered by Shreya Shankar (a PhD candidate at UC Berkeley), provides a research-oriented perspective on why “prompt iteration” is so difficult and what kinds of tooling and methodologies can help practitioners build more reliable LLM systems. Unlike vendor case studies, this is an academic research perspective that offers insights applicable across industries and use cases.

The core insight from this research is that organizations are increasingly trying to use LLMs to extract insights from large collections of unstructured documents—customer service reviews, sales emails, contracts, safety incident reports—but they consistently report that “prompts don’t work” and that iteration feels like “hacking away at nothing.” The research team applied HCI (Human-Computer Interaction) research methods to understand these challenges systematically.

The Problem Space: Data Processing Agents

The research focuses on what they call “data processing agents”—LLM-based systems designed to extract, analyze, and summarize insights from organizational document collections, such as the customer service reviews, sales emails, contracts, and safety incident reports mentioned earlier.

A typical pipeline architecture involves sequences of LLM operations: map operations that extract outputs from each document, additional map operations for classification or categorization, and aggregate operations that summarize results by grouping criteria (neighborhood, city, etc.). This is a common pattern in production LLM systems dealing with document processing at scale.
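As a concrete sketch of that pattern, the pipeline below chains two map steps and one aggregate step over lease documents. The function names and the rule-based stubs standing in for LLM calls are hypothetical illustrations, not DocETL's actual API:

```python
from collections import defaultdict


def extract_clause(doc: str) -> str:
    """Map step 1: pull the pet-policy clause out of a lease document.
    In a real pipeline this would be an LLM call with an extraction prompt."""
    return doc.split("PET POLICY:")[-1].strip()


def classify_clause(clause: str) -> str:
    """Map step 2: categorize the extracted clause (stub for an LLM classifier)."""
    text = clause.lower()
    if "breed" in text:
        return "breed_restriction"
    if "no more than" in text:
        return "pet_count_limit"
    return "other"


def run_pipeline(docs: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Aggregate step: group classified clauses by a criterion (here, neighborhood).
    docs is a list of (neighborhood, document_text) pairs."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for neighborhood, text in docs:
        grouped[neighborhood].append(classify_clause(extract_clause(text)))
    return dict(grouped)
```

Swapping the deterministic stubs for actual LLM calls preserves the same map → map → aggregate shape.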

Two Critical Gaps Identified

The research identified two fundamental gaps that explain why LLM pipeline development is so challenging:

The Data Understanding Gap

The first gap is between the developer and their data. When practitioners start building LLM pipelines, they often don’t know what the “right question” is to ask. In the real estate example, a developer might initially request “all pet policy clauses” and only later realize they specifically need “dog and cat pet policy clauses” after examining outputs. This discovery process is inherently iterative and requires looking at actual data and outputs.

The challenges here include identifying the types of documents in the dataset and understanding unique failure modes for each document type. For pet policy clauses alone, the research found multiple distinct clause types: breed restriction clauses, clauses on number of pets, service animal exemptions, and others that developers don’t anticipate until they examine the data.

The research observed a “long tail” of failure modes across virtually every application domain, and the talk walked through specific failure modes observed in practice. This echoes broader patterns in machine learning systems, where edge cases are numerous and difficult to enumerate in advance.

The researchers note it’s “not uncommon to see people flag hundreds of issues in a thousand document collection,” suggesting that production LLM pipelines face substantial quality challenges at scale.

The Intent Specification Gap

The second gap is between what the developer wants and how they specify it to the LLM. Even when developers believe they understand their task, translating that understanding into unambiguous prompts is surprisingly difficult. Things that seem unambiguous to humans—like “dog and cat policy clauses”—are actually ambiguous to LLMs, which may need explicit specification of weight limits, breed restrictions, quantity limits, and other details.
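To make this concrete, here is a hypothetical illustration of closing the gap: the same intent expressed first vaguely, then expanded from an explicit specification. The `spec` dimensions are examples drawn from the discussion above, not taken verbatim from the talk:

```python
VAGUE_PROMPT = "Extract all dog and cat policy clauses from this lease."


def build_explicit_prompt(spec: dict[str, str]) -> str:
    """Expand an intent specification into an unambiguous instruction."""
    lines = [
        "Extract pet policy clauses from this lease.",
        "A clause is relevant if it covers any of the following:",
    ]
    for dimension, detail in spec.items():
        lines.append(f"- {dimension}: {detail}")
    lines.append("Exclude clauses about animals other than dogs and cats.")
    return "\n".join(lines)


spec = {
    "species": "dogs and cats only; treat service animals as a separate category",
    "weight limits": "any clause stating a maximum allowed pet weight",
    "breed restrictions": "any list of prohibited or restricted breeds",
    "quantity limits": "any cap on the number of pets per unit",
}
```

The point is not this particular template but that each ambiguous dimension becomes an explicit line the LLM cannot misread.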

When developers encounter hundreds of failure modes, figuring out how to translate observations into pipeline improvements becomes overwhelming. The options include prompt engineering, adding new operations, task decomposition, or analyzing document subsections—but practitioners “often get very lost” navigating these choices.

Proposed Solutions and Tooling

The research team is developing tooling to address these gaps, available through the DocETL project:

For the Data Understanding Gap

The team is building tools that automatically help users find anomalies and failure modes in their data.

The goal is to help users design “evals on the fly” for each different failure mode, rather than requiring comprehensive evaluation sets upfront. This reflects a practical reality: in production LLM systems, evaluations are never “done first”—teams are constantly discovering new failure modes as they run pipelines.
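One way to realize “evals on the fly” is a registry that grows as failure modes are discovered: each mode contributes the documents that exposed it plus a small targeted check. This is a minimal sketch under assumed names, not DocETL's implementation:

```python
from typing import Callable

# failure_mode -> (example documents that exposed it, check on pipeline output)
EVALS: dict[str, tuple[list[str], Callable[[str], bool]]] = {}


def register_eval(failure_mode: str, examples: list[str],
                  check: Callable[[str], bool]) -> None:
    """Add a targeted eval the moment a new failure mode is flagged."""
    EVALS[failure_mode] = (examples, check)


def run_evals(pipeline: Callable[[str], str]) -> dict[str, float]:
    """Report a pass rate per failure mode rather than one aggregate score."""
    scores = {}
    for mode, (examples, check) in EVALS.items():
        passed = sum(check(pipeline(doc)) for doc in examples)
        scores[mode] = passed / len(examples)
    return scores
```

Because each check is tied to the documents that exposed it, re-running the registry after every pipeline change doubles as a regression suite.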

For the Intent Specification Gap

The research stack includes capabilities to take user-provided notes and automatically translate them into prompt improvements. This is presented through an interactive interface where users can provide feedback, edit suggestions, and maintain revision history—making the system “fully steerable.” The emphasis on revision history and user control reflects good practices for production LLM systems where transparency and reproducibility matter.
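A revision history that keeps both the suggested prompt and the user note that motivated it is one simple way to get this steerability. The class below is a hypothetical sketch, not the DocETL interface:

```python
from dataclasses import dataclass, field


@dataclass
class PromptHistory:
    """Each revision pairs the feedback note that motivated it with the prompt."""
    revisions: list[tuple[str, str]] = field(default_factory=list)

    def propose(self, note: str, prompt: str) -> None:
        """Record a suggested rewrite alongside the feedback that produced it."""
        self.revisions.append((note, prompt))

    def current(self) -> str:
        return self.revisions[-1][1]

    def revert(self, index: int) -> str:
        """Restore an earlier version by re-appending it, preserving the trail."""
        note, prompt = self.revisions[index]
        self.revisions.append((f"revert to revision {index}: {note}", prompt))
        return prompt
```

Reverts append rather than delete, so the full trail of suggestions and user decisions stays reproducible.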

Key Takeaways for LLMOps Practitioners

The research offers several insights that apply beyond data processing pipelines to LLM operations more broadly:

Evaluations Are Never Done

In every domain studied, evaluations are “very, very fuzzy” and never complete upfront. Teams are constantly collecting new failure modes as they run pipelines and creating new subsets of documents or example traces that represent future evaluations. This suggests that production LLM systems need infrastructure for continuous evaluation collection rather than one-time benchmark creation.

The Long Tail of Failure Modes

The research consistently observed users tracking 10-20+ different failure modes. This has implications for monitoring and observability in production LLM systems—simple aggregate metrics may miss important quality issues hiding in the long tail.
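A small illustration of why the long tail matters for monitoring: pass rates computed per failure-mode category can expose a mode that an aggregate metric averages away. The helper below is a generic sketch:

```python
from collections import Counter


def per_mode_pass_rate(results: list[tuple[str, bool]]) -> dict[str, float]:
    """results: (failure_mode_category, passed) pairs from labeled outputs."""
    totals: Counter = Counter()
    passes: Counter = Counter()
    for mode, passed in results:
        totals[mode] += 1
        passes[mode] += int(passed)
    return {mode: passes[mode] / totals[mode] for mode in totals}
```

With nine passing outputs in a common category and one failing output in a rare one, the aggregate pass rate reads 90% while the rare mode sits at 0%.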

Distinct Iteration Stages

Perhaps the most actionable insight is that practitioners benefit from unpacking the iteration cycle into distinct stages rather than trying to optimize everything simultaneously: first close the data understanding gap, then the intent specification gap, and only then apply accuracy optimization strategies.

This staged approach is notable because it goes against the temptation to immediately apply sophisticated techniques like prompt optimization or task decomposition. The research suggests these techniques only yield “really good gains” when applied to already well-specified pipelines.

Implications and Assessment

This research provides valuable insights for LLMOps practitioners, though it should be noted that the solutions described are still in the prototyping/research stage rather than production-hardened tools. The observations about the difficulty of LLM pipeline development align with broader industry experience, lending credibility to the findings.

The emphasis on human-in-the-loop tooling and interactive interfaces reflects a pragmatic view that fully automated LLM pipelines remain challenging—production systems benefit from structured ways for humans to understand and intervene in LLM processing.

The three-stage iteration framework (data understanding → intent specification → accuracy optimization) offers a practical methodology that could help teams avoid spinning their wheels on optimization before fundamental specification issues are resolved. However, the research doesn’t provide quantitative evidence for how much this methodology improves outcomes compared to less structured approaches.

For organizations building document processing pipelines or similar LLM applications, the key actionable insights are: invest in understanding your data’s failure modes before optimizing, treat evaluation as an ongoing process rather than a one-time setup, and ensure prompts are specified clearly enough to eliminate ambiguity before expecting reliable LLM behavior.
