Company
DocETL
Title
Systematic Approach to Building Reliable LLM Data Processing Pipelines Through Iterative Development
Industry
Research & Academia
Year
2025
Summary (short)
UC Berkeley researchers studied how organizations struggle with building reliable LLM pipelines for unstructured data processing, identifying two critical gaps: data understanding and intent specification. They developed DocETL, a research framework that helps users systematically iterate on LLM pipelines by first understanding failure modes in their data, then clarifying prompt specifications, and finally applying accuracy optimization strategies, moving beyond the common advice of simply "iterate on your prompts."
This case study presents research conducted at UC Berkeley on the challenges organizations face when building reliable LLM pipelines for data processing tasks. The research, led by PhD candidate Shreya, addresses a fundamental problem in LLMOps: while there are numerous tools for improving LLM accuracy once a pipeline is well-specified, there is virtually no tooling to help users understand their data and specify their intent clearly.

The research emerged from observing how organizations across various domains struggle with unstructured data processing tasks. These include customer service review analysis for theme extraction and actionable insights, sales email analysis to identify missed opportunities, and safety analysis in the traffic and aviation domains to understand accident causes. The common thread across all these applications is that users consistently report that "prompts don't work" and are typically advised to simply "iterate on your prompts" without systematic guidance.

To illustrate the complexity of the problem, the researchers used a real estate example in which an agent wants to identify neighborhoods with restrictive pet policies. This seemingly straightforward task requires a sequence of LLM operations: mapping documents to extract relevant information, categorizing clauses, and aggregating results by neighborhood (a hypothetical sketch of this decomposition appears below). However, users quickly discover that their initial assumptions about what they want to extract are often incomplete or incorrect.

The research identified two critical gaps in current LLMOps practices. The first is the data understanding gap: users don't initially know what types of documents exist in their dataset or what unique failure modes occur for each document type. For instance, pet policy clauses might include breed restrictions, quantity limits, weight restrictions, service animal exemptions, and various other categories that users only discover through data exploration. The challenge is compounded by the long tail of failure modes, where hundreds of different issues can emerge in even a thousand-document collection. The second gap is intent specification: even when users identify problems, they struggle to translate their understanding of failure modes into concrete pipeline improvements, often getting lost deciding whether to use prompt engineering, add new operations, implement task decomposition, or apply other optimization strategies.

DocETL, the research framework developed by the team, addresses these gaps through several approaches. For the data understanding gap, the system automatically extracts and clusters different types of outputs, allowing users to identify failure modes and design targeted evaluations (see the clustering sketch below). The tool organizes failure modes and helps users create datasets for specific evaluation scenarios. For example, in the real estate case, the system might reveal that clauses are phrased unusually, that LLMs overfit to certain keywords, or that extraction fires on unrelated content because of keyword associations. For the intent specification gap, DocETL provides functionality to translate user-provided notes into prompt improvements through an interactive interface. This lets users maintain a revision history and gives them full steerability over the optimization process, so they can improve their pipelines systematically rather than by trying approaches at random.
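To make the real estate example concrete, the sketch below shows one plausible decomposition into LLM operations: a map step that extracts pet-related clauses from each lease, a second map step that categorizes each clause, and a reduce step that aggregates counts by neighborhood. This is an illustrative Python sketch under stated assumptions, not DocETL's actual pipeline syntax; the `call_llm` helper is a hypothetical stand-in for any chat-completion client.

```python
# Hypothetical sketch of the real estate pipeline's decomposition into LLM operations.
# `call_llm` is an assumed helper, not part of DocETL.
from collections import defaultdict


def call_llm(prompt: str) -> str:
    """Assumed stand-in for a call to an LLM provider."""
    raise NotImplementedError("wire this up to your model provider")


def extract_pet_clauses(document: str) -> list[str]:
    # Map step: pull out clauses that restrict or regulate pets from one lease document.
    response = call_llm(
        "List every clause in the following lease that restricts or regulates pets, "
        "one per line:\n\n" + document
    )
    return [line.strip("- ").strip() for line in response.splitlines() if line.strip()]


def categorize_clause(clause: str) -> str:
    # Second map step: label each clause with a policy category.
    return call_llm(
        "Categorize this pet policy clause as one of: breed restriction, weight limit, "
        "quantity limit, service animal exemption, other.\n\nClause: " + clause
    ).strip()


def aggregate_by_neighborhood(documents: list[dict]) -> dict[str, dict[str, int]]:
    # Reduce step: count clause categories per neighborhood to surface restrictive areas.
    counts: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for doc in documents:  # each doc: {"neighborhood": str, "text": str}
        for clause in extract_pet_clauses(doc["text"]):
            counts[doc["neighborhood"]][categorize_clause(clause)] += 1
    return counts
```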
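Similarly, the output-clustering idea behind the data understanding gap can be illustrated with a minimal sketch that embeds extracted clauses and groups them so a user can inspect each group for recurring categories and failure modes. It assumes the `sentence-transformers` and `scikit-learn` packages and shows the general technique only, not DocETL's implementation.

```python
# Minimal sketch: cluster extracted LLM outputs to surface recurring categories
# and failure modes. Illustrative only; not DocETL's actual implementation.
from collections import defaultdict

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Hypothetical clauses produced by a map operation over lease documents.
extracted_clauses = [
    "No more than two cats per unit.",
    "Dogs over 50 lbs are prohibited.",
    "Pit bulls and rottweilers are not allowed.",
    "Service animals are exempt from all pet restrictions.",
    "Tenants must not feed stray animals on the premises.",  # likely a spurious extraction
]

# Embed each output so semantically similar clauses land near each other.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(extracted_clauses)

# Cluster the outputs; each cluster is a candidate category or failure mode
# for the user to inspect, label, and turn into a targeted evaluation set.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)

clusters = defaultdict(list)
for clause, label in zip(extracted_clauses, labels):
    clusters[int(label)].append(clause)

for label, clauses in sorted(clusters.items()):
    print(f"Cluster {label}:")
    for clause in clauses:
        print(f"  - {clause}")
```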
The research revealed several important insights about LLMOps practices that extend beyond data processing applications. First, evaluations in real-world LLM applications are inherently fuzzy and never truly complete. Users continuously discover new failure modes as they run their pipelines, constantly creating new evaluation subsets and test cases (a sketch of this pattern appears below). This challenges the traditional notion of a fixed evaluation dataset. Second, failure modes consistently exhibit a long-tail distribution: users typically end up tracking tens of distinct failure modes that require ongoing monitoring and testing. This complexity makes it impossible to rely on simple accuracy metrics and necessitates more sophisticated evaluation frameworks. Third, the research emphasized the importance of unpacking the iteration cycle into distinct stages rather than attempting to optimize everything simultaneously. The recommended approach involves three sequential phases: first, understanding the data and identifying failure modes without worrying about accuracy; second, achieving well-specified prompts that eliminate ambiguity; and third, applying established accuracy optimization strategies.

This staged approach challenges common LLMOps practice, in which teams attempt to optimize accuracy while simultaneously trying to understand their data and refine their objectives. The research suggests that gains from well-known optimization strategies only materialize after the foundational work of data understanding and intent specification is complete.

The implications for LLMOps practitioners are significant. The research suggests that much of the current tooling ecosystem focuses on the final stage of optimization while neglecting the earlier, more fundamental challenges of data understanding and intent specification. This leaves practitioners struggling with the foundational aspects of their pipelines while holding sophisticated tools for problems they're not yet ready to solve.

The work also highlights the importance of human-in-the-loop approaches for LLMOps. Rather than pursuing fully automated optimization, the research demonstrates the value of tools that help users systematically explore their data, understand failure modes, and iteratively refine their specifications with appropriate tooling support.

From a broader LLMOps perspective, this research contributes to understanding the full lifecycle of LLM pipeline development. It suggests that successful LLM deployments require not just technical infrastructure for model serving and monitoring, but also sophisticated tooling for data exploration, failure mode analysis, and iterative specification refinement.

The research methodology itself offers lessons for LLMOps teams. By combining systems research with human-computer interaction (HCI) methodologies, the team identified systematic patterns in how users struggle with LLM pipeline development. This suggests that LLMOps practices can benefit from more systematic study of user workflows and challenges rather than focusing solely on technical optimization.

While the research presents promising approaches through DocETL, it's important to note that this represents early-stage academic work rather than a mature commercial solution. The practical applicability of these approaches in large-scale production environments remains to be validated. However, the systematic analysis of LLMOps challenges and the proposed framework for addressing them provide valuable insights for practitioners working on similar problems.
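To illustrate the idea of evaluations that grow as new failure modes surface, the sketch below keeps a separate evaluation subset per failure mode and reports per-mode accuracy rather than a single aggregate score. The names here (`EvalSuite`, `EvalSubset`) are hypothetical and chosen for illustration; this is not tooling shipped with DocETL.

```python
# Minimal sketch: failure-mode-specific evaluation subsets that grow over time.
# `EvalSuite` and `EvalSubset` are hypothetical names used only for illustration.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalSubset:
    """A small test set targeting one observed failure mode."""
    failure_mode: str
    examples: list[dict] = field(default_factory=list)  # each: {"input": ..., "expected": ...}


@dataclass
class EvalSuite:
    subsets: dict[str, EvalSubset] = field(default_factory=dict)

    def add_example(self, failure_mode: str, example: dict) -> None:
        # New failure modes get their own subset as soon as they are discovered;
        # the suite is never "complete" and keeps growing alongside the pipeline.
        self.subsets.setdefault(failure_mode, EvalSubset(failure_mode)).examples.append(example)

    def run(self, pipeline: Callable[[str], str], grader: Callable[[str, dict], bool]) -> dict[str, float]:
        # Report accuracy per failure mode instead of one aggregate number.
        return {
            name: sum(grader(pipeline(ex["input"]), ex) for ex in subset.examples) / len(subset.examples)
            for name, subset in self.subsets.items()
        }


# Example usage with trivial stand-ins for the real pipeline and grader:
suite = EvalSuite()
suite.add_example(
    "service_animal_exemptions",
    {"input": "No pets allowed, except service animals as required by law.",
     "expected": "service animal exemption"},
)
scores = suite.run(
    pipeline=lambda doc: "service animal exemption",      # stand-in for the real LLM pipeline
    grader=lambda output, ex: output == ex["expected"],    # stand-in for a real grading function
)
print(scores)
```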
The emphasis on evaluation and testing throughout the research aligns with broader trends in LLMOps toward more sophisticated evaluation frameworks. The recognition that evaluations are never complete and must continuously evolve reflects the dynamic nature of LLM applications in production environments.
