Company: Outropy
Title: Evolution from Monolithic to Task-Oriented LLM Pipelines in a Developer Assistant Product
Industry: Tech
Year: 2025

Summary (short): The case study details how Outropy evolved their LLM inference pipeline architecture while building an AI-powered assistant for engineering leaders. They started with simple pipelines for daily briefings and context-aware features, but faced challenges with context windows, relevance, and error cascades. The team transitioned from monolithic pipelines to component-oriented design, and finally to task-oriented pipelines using Temporal for workflow management. The product successfully scaled to 10,000 users and expanded from a Slack-only tool to a comprehensive browser extension.
## Overview

Outropy developed an AI-powered "Chief of Staff" product targeting engineering leaders—managers, directors, and VPs. The product was designed to function as "VSCode for Everything Else," providing the same kind of instant contextual awareness that IDEs offer for code, but applied to non-coding work activities like meetings, RFCs, and project decisions. The system integrated with multiple workplace tools including Slack, Jira, GitHub, Figma, and Google Workspace to surface relevant information proactively. The product scaled to 10,000 users and later became the foundation for the Outropy platform.

This case study is particularly valuable for LLMOps practitioners because it documents the evolution of inference pipeline architecture over roughly two years of production operation, starting just two months after ChatGPT's release in late 2022. The engineering lessons cover the full spectrum of challenges that teams face when deploying LLMs at scale: from initial naive implementations through to mature, production-grade architectures.

## Initial Implementation and Early Challenges

The first feature launched was a "Personal Daily Briefing" that appeared when users first became active on Slack, surfacing the three most important and time-sensitive topics for each user. The initial architecture was deceptively simple: worker processes stored Slack content in PostgreSQL, then a two-step process retrieved relevant messages and asked ChatGPT to identify and summarize the most important stories.

This naive approach quickly encountered several production realities. The first was context window limitations—in 2023, models had approximately 4,000-token windows, and even with today's larger windows, performance degrades when too much irrelevant information is included. The team observed that more noise consistently led to worse results regardless of model capacity. Writing style became another issue, as users reacted negatively to briefings that referred to them in the third person or framed their own actions as external events. This required personalization at the content generation level. Relevance scoring also proved critical—different users cared about different things, and those interests evolved over time. Simply surfacing "important" topics wasn't sufficient; the system needed to rank stories based on each user's actual interests.

The most challenging problem was duplicate summaries. Slack discussions often spanned multiple channels, requiring the system to recognize and merge duplicate topics rather than treating them as separate events. This led to the development of topic tracking and exponential decay algorithms to maintain fresh user interest profiles.

## Pipeline Evolution and Cascading Error Problems

To address these challenges, the team evolved to a more sophisticated multi-stage pipeline. The new flow included four major steps: summarizing discussions in each channel (with topic identification), consolidating summaries across channels to deduplicate similar topics, ranking summaries based on user preferences, and finally generating personalized summaries tailored to the user's perspective.

This chained LLM approach introduced a new class of failures that is particularly relevant for LLMOps. Unlike traditional data retrieval, LLM responses are generative—every response becomes a decision point where inaccuracies or hallucinations can be treated as facts by subsequent stages.
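To make the cascading-error risk concrete, here is a minimal sketch of a chained pipeline along these lines. The stage names mirror the Summarize, Consolidate, Rank, and Generate flow described above, but all function names, prompts, and the model choice are hypothetical illustrations rather than Outropy's actual code; it assumes the modern `openai` Python package and an `OPENAI_API_KEY` in the environment.

```python
# Hypothetical sketch of a chained "daily briefing" pipeline.
# Each stage feeds its free-text output straight into the next prompt,
# which is why a hallucination in an early stage propagates downstream.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def call_llm(prompt: str) -> str:
    """Single LLM call; every invocation is a new decision point."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def summarize_channel(messages: list[str]) -> str:
    return call_llm(
        "Summarize the discussion below and list the topics it covers:\n"
        + "\n".join(messages)
    )


def consolidate(channel_summaries: list[str]) -> str:
    return call_llm(
        "Merge these channel summaries, deduplicating topics that describe "
        "the same event:\n" + "\n---\n".join(channel_summaries)
    )


def rank(consolidated: str, user_interests: list[str]) -> str:
    return call_llm(
        f"Rank these topics for a user interested in {', '.join(user_interests)} "
        f"and keep only the top three:\n{consolidated}"
    )


def generate_briefing(ranked: str, user_name: str) -> str:
    return call_llm(
        f"Write a daily briefing addressed to {user_name}, in second person, "
        f"covering:\n{ranked}"
    )


def daily_briefing(channels: dict[str, list[str]], user_name: str,
                   user_interests: list[str]) -> str:
    summaries = [summarize_channel(msgs) for msgs in channels.values()]
    consolidated = consolidate(summaries)          # errors here...
    ranked = rank(consolidated, user_interests)    # ...are treated as facts here...
    return generate_briefing(ranked, user_name)    # ...and end up in the briefing.
```

Because each stage consumes only the previous stage's generated text, there is no cheap point at which a wrong inference can be re-checked against the source data, which is exactly the failure mode the flu example below illustrates.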
The team observed that even minor model upgrades or slight shifts in data formatting could cause misinterpretations that snowballed into serious distortions. A memorable example illustrates this cascading error pattern: an engineer mentioned in Slack that they "might be out with the flu tomorrow." The importance detection stage flagged this correctly, but by the time the contextualization stage processed it, the system had somehow linked it to COVID-19 protocols. The final daily briefing then advised their manager to enforce social distancing measures—despite the team being entirely remote.

Debugging these issues proved extremely difficult. By the time an error surfaced in the final output, tracing it back required digging through layers of model interactions, intermediate outputs, and cached results. The team added guardrail stages to catch nonsensical outputs before they reached users, but the initial design flaw remained: once an error was detected, the only options were to rerun the entire pipeline or escalate to human intervention.

## Technical Debt and the Component-Oriented Approach

As the product expanded beyond Slack to integrate with GitHub, Figma, Jira, and other tools, complexity grew further. The system now needed entity extraction to identify projects, teams, and key entities, enabling connections across discussions happening on different platforms. This exacerbated the cascading error problem to the point where most briefings were being rejected by the guardrails and required manual review. The team's response was to add more substages for auto-correction and validation at multiple points, arriving independently at techniques now known as Corrective Retrieval-Augmented Generation (CRAG) and RAG-Fusion. However, this increased pipeline complexity substantially.

When building the second feature—Click-to-Context, which allowed users to right-click any Slack message for a contextual explainer—the team faced the temptation to reuse code from the daily briefing pipeline. They initially added context-aware branching inside components, where each function checked which feature it was serving and branched into the appropriate logic. This quickly devolved into "a tangled mess of if-else statements," a classic case of control coupling. To ship quickly, they made a deliberate trade-off: copy-pasting code for each new pipeline rather than untangling the shared logic.

This technical debt accumulated faster than expected. Maintaining slightly different versions of the same hundred-line methods across scattered locations turned even simple changes, like switching from OpenAI's APIs to Azure's, into multi-week exercises. The team then attempted a component-oriented design, consolidating duplicated code into reusable components with well-defined interfaces. They separated concerns—for example, splitting a component that both fetched Slack messages and calculated social proximity scores into a pure data-fetching service and a separate ranking algorithm. This improved testability and reusability, but the fundamental problem remained: the pipelines themselves were still tangled with multiple concerns, including data retrieval, error handling, processing logic, and context-aware adaptations.

The author notes that this component-oriented approach mirrors how frameworks like LangChain and LlamaIndex operate—they provide pre-built conveniences for assembling pipelines by chaining components, but do little to solve the real challenges of inference pipeline design. This bottom-up approach tends to produce brittle, single-purpose pipelines that don't scale beyond their initial use case.
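The contrast between control coupling and separated concerns is easier to see in code. The sketch below is illustrative only: the class names (`ContextAwareSlackComponent`, `SlackMessageFetcher`, `SocialProximityRanker`), the `feature` flag, and the proximity heuristic are hypothetical stand-ins rather than Outropy's actual components, and the fetcher assumes something like `slack_sdk`'s `WebClient`.

```python
from dataclasses import dataclass


@dataclass
class SlackMessage:
    author_id: str
    text: str


# Before: one component serves every feature and branches on a flag passed in
# by the caller -- the "tangled mess of if-else statements" (control coupling).
class ContextAwareSlackComponent:
    def run(self, feature: str, messages: list[SlackMessage], user_id: str):
        if feature == "daily_briefing":
            # briefing-specific ranking mixed into a generic component
            return sorted(messages, key=lambda m: m.author_id == user_id, reverse=True)
        elif feature == "click_to_context":
            # context-explainer logic living in the same place
            return [m for m in messages if user_id in m.text]
        raise ValueError(f"unknown feature: {feature}")


# After: separated concerns with well-defined interfaces. The fetcher only
# does data access; the ranker is pure logic and trivially unit-testable.
class SlackMessageFetcher:
    def __init__(self, slack_client):
        self.slack_client = slack_client  # e.g. slack_sdk.WebClient

    def fetch(self, channel_id: str) -> list[SlackMessage]:
        raw = self.slack_client.conversations_history(channel=channel_id)
        return [SlackMessage(m["user"], m["text"]) for m in raw["messages"]]


class SocialProximityRanker:
    def rank(self, messages: list[SlackMessage], user_id: str) -> list[SlackMessage]:
        # placeholder proximity heuristic: the user's own messages first
        return sorted(messages, key=lambda m: m.author_id == user_id, reverse=True)
```

The second half fixes testability and reuse at the component level, but, as the case study notes, it does nothing about the pipeline-level tangle of retrieval, error handling, and context-aware logic.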
## Task-Oriented Design: The Breakthrough

The breakthrough came when the team stopped thinking bottom-up and instead mapped their pipelines from the top down. Looking at their pipeline steps—Summarize, Consolidate, Rank, Generate—they recognized these weren't just stages in a workflow but standalone, reusable tasks. Everything else was implementation detail.

The key insight was treating these tasks as self-contained pipelines that could be composed and reused across different workflows. Unlike component-oriented pipelines where stages directly depend on each other, task-oriented pipelines take specific inputs, produce specific outputs, and make no assumptions about where inputs come from or how outputs will be used. This decoupling improved maintainability and unlocked reuse across multiple AI workflows. A pipeline that summarized Slack discussions could equally summarize GitHub code review discussions or Google Docs comments without modification. The task-oriented approach became the main unit of abstraction for developers building on the Outropy platform.

## Infrastructure: Temporal for Durable Workflows

Translating task-oriented concepts into working code presented its own challenges. The team emphasizes that AI product development is "mostly trial and error"—if the architecture slows down daily iterations, it directly impacts product quality and user experience. After evaluating several solutions, they chose Temporal to implement durable workflows.

Temporal separates business logic (workflows, which must be deterministic) from side effects (activities, which handle external calls like OpenAI API requests or database writes). This separation allows Temporal to manage retries, timeouts, and failures automatically. Initially, each pipeline was modeled as a single Temporal workflow, with stages implemented as Python objects wired together by a custom dependency injection system. As the task-oriented approach matured, each task pipeline received its own Temporal workflow. However, this created a new problem: Temporal's reliability benefits were isolated to individual pipelines, while communication between pipelines still relied on external coordination.

The solution was making the agents themselves Temporal workflows. The agent would call each task pipeline as a subworkflow (a Temporal child workflow), allowing Temporal to manage everything as a single transaction. The agent workflow could act as a supervisor, tracking state across the entire inference pipeline and handling errors not automatically resolved by Temporal. This enabled rapid iteration on individual tasks without disrupting the production system—crucial for maintaining AI experience quality.
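A rough sketch of this agent-as-workflow pattern, using Temporal's Python SDK (`temporalio`), might look like the following. The workflow, activity, and agent names (`SummarizeTask`, `RankTask`, `DailyBriefingAgent`, `call_openai`) are hypothetical, the activity body is a stub, and worker registration and client setup are omitted; the point is only the structure, with non-deterministic LLM calls isolated in activities and task pipelines invoked by the agent as child workflows.

```python
from datetime import timedelta

from temporalio import activity, workflow


# Activities hold the side effects (LLM calls, database writes); Temporal
# retries them automatically on failure.
@activity.defn
async def call_openai(prompt: str) -> str:
    # Placeholder: a real system would call the LLM provider here. The call
    # is non-deterministic, which is why it lives in an activity.
    return f"[LLM output for a prompt of {len(prompt)} characters]"


# Each task is its own workflow: specific input, specific output, and no
# assumptions about where the input came from or how the output is used.
@workflow.defn
class SummarizeTask:
    @workflow.run
    async def run(self, discussion: str) -> str:
        return await workflow.execute_activity(
            call_openai,
            f"Summarize this discussion:\n{discussion}",
            start_to_close_timeout=timedelta(minutes=2),
        )


@workflow.defn
class RankTask:
    @workflow.run
    async def run(self, summaries: list[str]) -> list[str]:
        ranked = await workflow.execute_activity(
            call_openai,
            "Rank these summaries by importance:\n" + "\n".join(summaries),
            start_to_close_timeout=timedelta(minutes=2),
        )
        return ranked.splitlines()


# The agent itself is a workflow that composes task pipelines as child
# workflows, so Temporal tracks state across the whole inference pipeline
# and the agent can supervise errors it does not resolve automatically.
@workflow.defn
class DailyBriefingAgent:
    @workflow.run
    async def run(self, discussions: list[str]) -> list[str]:
        summaries = [
            await workflow.execute_child_workflow(SummarizeTask.run, d)
            for d in discussions
        ]
        return await workflow.execute_child_workflow(RankTask.run, summaries)
```

Because each task workflow only sees its typed input and output, a task like `SummarizeTask` can be reused by any agent workflow, which is the composition property the task-oriented design was after.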
## Practical Lessons for LLMOps

Several practical lessons emerge from this case study. First, the team experienced firsthand that LLMs are highly sensitive to input variations—even minor changes in data formatting or model versions can cause unpredictable behavior changes. Continuous adjustment of pipelines remained necessary even after deployment.

Second, the importance of guardrails and validation at multiple stages cannot be overstated. The team learned that catching errors early in the pipeline prevents compounding problems downstream, but this must be balanced against pipeline complexity.

Third, the evolution from monolithic to task-oriented architecture demonstrates that traditional software engineering principles—decomposition, separation of concerns, loose coupling—apply directly to AI systems. The author explicitly connects their experience to Data Mesh principles, noting that most data teams still build monolithic, one-off pipelines despite the known problems, and that AI engineers have inherited these patterns.

Fourth, infrastructure choices matter significantly for iteration speed. The adoption of Temporal for durable workflows provided built-in reliability while enabling the rapid experimentation that AI product development requires.

Finally, the case study serves as a reminder that production AI systems face all the usual maintenance challenges of traditional software—plus unique pains around output variability, cascading errors through generative stages, and sensitivity to input formatting changes.
