## Overview
Moveworks built Brief Me, a production AI system embedded within their Copilot platform that enables employees to upload their own documents and interact with them conversationally. The system is designed to handle complex content generation tasks including summarization, question-answering, document comparison, and insight extraction from PDFs, Word documents, and PowerPoint files. This represents a comprehensive LLMOps implementation addressing real-time document processing at enterprise scale.
The case study provides detailed technical documentation of their agentic architecture, though it's important to note this is a first-party account from Moveworks describing their own system. While the technical details are extensive and the evaluation metrics provided appear strong, independent validation of these claims would strengthen the assessment. The system aims to transform hours of manual document review into minutes of AI-assisted interaction.
## System Architecture
The Brief Me system operates through a two-stage pipeline architecture: online source data ingestion and online content generation. The ingestion stage processes uploaded files in real-time with a P90 latency target of under 10 seconds, while the content generation stage handles all user queries once the system is "locked into Brief Me mode." The architecture is explicitly designed for agentic operation, meaning the system can autonomously execute complex tasks, reason through problems, and adapt strategies based on context.
When users upload files or provide URLs, they enter a focused session where only the processed sources are available for querying. This session-based approach ensures context isolation—once a session closes, all sources from that session are removed. This design choice reflects practical production considerations around memory management and user privacy.
## Content Ingestion Pipeline
The ingestion pipeline orchestrates several sequential processes: source fetching, document chunking using proprietary techniques, metadata generation from sources, chunk embedding, and indexing. While the blog post focuses primarily on the content generation stage, the ingestion pipeline is critical infrastructure that enables the entire system. The sub-10-second P90 latency requirement for ingestion is particularly notable given the complexity of operations involved, suggesting significant engineering investment in optimization.
The chunking strategy appears to involve multiple levels of granularity, which becomes important later in the retrieval phase. Metadata generation from sources enriches the context available to downstream components. The use of OpenSearch for indexing provides the foundation for their hybrid search approach.
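As a rough, hedged sketch of that sequential flow (chunk, attach metadata, embed, index), the Python below assumes naive fixed-size chunking, a generic `embed` callable, and the `opensearch-py` client; Moveworks' actual chunking is proprietary and multi-granularity, and every function and index name here is illustrative.

```python
from dataclasses import dataclass, asdict
from opensearchpy import OpenSearch  # assumed client; the post only confirms OpenSearch is used

@dataclass
class Chunk:
    doc_id: str
    chunk_id: int
    text: str
    metadata: dict
    embedding: list

def chunk_document(text: str, max_chars: int = 1200) -> list:
    """Naive fixed-size chunking as a stand-in for the proprietary multi-granularity scheme."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def ingest(doc_id: str, text: str, metadata: dict, embed, client: OpenSearch,
           index: str = "brief_me_chunks") -> None:
    """Chunk -> attach metadata -> embed -> index, mirroring the stages described above."""
    for i, piece in enumerate(chunk_document(text)):
        chunk = Chunk(doc_id=doc_id, chunk_id=i, text=piece,
                      metadata=metadata, embedding=embed(piece))
        client.index(index=index, id=f"{doc_id}-{i}", body=asdict(chunk))
```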
## Multi-Turn Conversation Support
The system implements query rewriting to enable multi-turn conversational capabilities. The Reasoning Engine examines up to n previous turns in conversation history (the specific value of n is not disclosed) and identifies the most relevant turns to generate a new query with full contextual awareness. Interaction history is modeled as a series of user-assistant response pairs with additional metadata injected from each turn based on reasoning steps.
Currently, Moveworks uses GPT-4o with in-context learning for this task, representing a dependency on external LLM infrastructure. However, they note ongoing efforts to fine-tune an in-house model trained on synthetic data representative of enterprise queries. This transition from external to internal models is a common LLMOps pattern reflecting concerns about cost, latency, control, and potentially data privacy. The synthetic data generation approach for training suggests they're building datasets that mirror their actual use cases.
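A minimal sketch of how such a rewrite step could be wired up against the OpenAI chat API; the prompt wording, the five-turn window, and the history format are assumptions, since Moveworks does not disclose its actual prompt or the value of n.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rewrite_query(history: list[dict], new_query: str, n_turns: int = 5) -> str:
    """Rewrite the latest user query so it is self-contained, using up to
    n_turns of prior user/assistant message pairs (the real value of n is undisclosed)."""
    recent = history[-2 * n_turns:]  # each turn is one user message plus one assistant message
    messages = [
        {"role": "system",
         "content": "Rewrite the user's latest question so it can be answered without "
                    "the conversation history. Return only the rewritten query."},
        *recent,
        {"role": "user", "content": new_query},
    ]
    resp = client.chat.completions.create(model="gpt-4o", messages=messages, temperature=0)
    return resp.choices[0].message.content.strip()
```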
## Operation Planning: The Core Reasoning Module
The operation planner represents the "crux" of how the system handles complex and varied queries. This module performs planning to determine what operations need to be executed over provided sources. The system defines two atomic actions that can be combined to support most enterprise use cases:
**SEARCH Action**: Indicates the need to search over a document to retrieve specific snippets. This action has two parameters: a search query and the target document. The Reasoning Engine constrains the search space to generate precise responses. For example, if a user has access to four documents but only three are relevant to a query, the system limits the search to those three documents. This filtering demonstrates the system's ability to reason about relevance before executing retrieval operations.
**READ Action**: Indicates that the entire document needs to be processed rather than searched. This action is crucial for summarization tasks where comprehensive understanding is required and selective retrieval would be counterproductive. The context for a READ operation is the entire document, and the system employs a sophisticated map-reduce algorithm (discussed later) to process complete documents while respecting token limits.
The operation planner uses a combination of in-context learning (ICL) and supervised fine-tuning (SFT), with SFT noted as being in progress. Moveworks has annotated a large dataset through both dogfooding (internal usage) and synthetic data generation. The planner's output is structured and intentionally kept short to minimize latency, and running ICL at temperature 0 keeps outputs effectively deterministic over the constrained action space.
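To make the planner's structured, low-token output concrete, here is a hedged sketch of what a plan could look like; the JSON shape, field names, and example query are assumptions rather than Moveworks' actual schema.

```python
from dataclasses import dataclass
from typing import Literal
import json

@dataclass
class Operation:
    action: Literal["SEARCH", "READ"]   # the two atomic actions described above
    document_id: str                    # the target document the planner constrains to
    search_query: str | None = None     # only populated for SEARCH actions

def parse_plan(raw: str) -> list[Operation]:
    """Parse the planner's compact JSON output into typed operations."""
    return [Operation(**op) for op in json.loads(raw)]

# Hypothetical planner output for "Compare Q3 revenue in report.pdf with targets in plan.docx"
plan = parse_plan(json.dumps([
    {"action": "SEARCH", "document_id": "report.pdf", "search_query": "Q3 revenue"},
    {"action": "SEARCH", "document_id": "plan.docx", "search_query": "Q3 revenue targets"},
]))
```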
One acknowledged limitation is that the current architecture supports only parallel execution of actions. Some use cases require sequential execution, where the result of one action should influence the next. Moveworks notes that a lightweight model could be introduced to decide when sequential execution is needed, but the current focus is on queries whose operations can be executed in parallel.
The evaluation metrics for operation planning are strong: 97.24% correct actions, 97.35% correct resources, and 93.80% search query quality on their test set. However, these metrics should be viewed in context—they're evaluated on a specific dataset of 2,200 query-response pairs based on real usage data, which may not capture all edge cases or future use patterns.
## Context Building Through Hybrid Search
After operations are predicted, the agent executes each action to build context. Since actions are independent in the current architecture, they can execute in parallel to reduce latency. For SEARCH actions, the system implements a sophisticated retrieval and ranking framework.
The hybrid search approach combines two complementary retrieval methods:
**Embeddings-based retrieval** provides higher recall and generalizability by capturing semantic relationships beyond keyword matches. Moveworks trains and fine-tunes their own internal embedding models that understand enterprise jargon. They use an MPNet model fine-tuned with a bi-encoder approach—a 100M parameter model that showed strong performance on open-source benchmarks like GLUE and SQuAD. The training data consists of approximately 1 million query-document pairs from both human-annotated real-world queries and synthetically curated data. Training employs standard bi-encoder methods with contrastive and triplet loss.
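A hedged sketch of fine-tuning an MPNet bi-encoder with a triplet objective using the sentence-transformers library; the base checkpoint, batch size, and toy training triplet are assumptions, and the real setup over roughly 1 million pairs is not disclosed.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Base MPNet checkpoint wrapped as a bi-encoder (mean pooling is added automatically).
model = SentenceTransformer("microsoft/mpnet-base")

# Each triplet: (enterprise-style query, relevant chunk, irrelevant chunk) -- a toy example.
train_examples = [
    InputExample(texts=[
        "how do I reset my SSO token",
        "To reset your SSO token, open the identity portal and select 'Reset credentials'.",
        "The cafeteria menu for next week includes a vegetarian option on Tuesdays.",
    ]),
    # ... in the real system, ~1M human-annotated and synthetic pairs/triplets
]
loader = DataLoader(train_examples, shuffle=True, batch_size=32)
loss = losses.TripletLoss(model=model)  # contrastive and triplet objectives are both mentioned

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```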
During inference, uploaded documents are divided into chunks at various granularity levels, each embedded and stored in OpenSearch for fast retrieval. Incoming queries are embedded in the same space, and the most relevant snippets (3-5 per query) are retrieved using approximate nearest neighbor (ANN) search.
**Keyword-based retrieval** using standard BM25 provides higher precision, especially for queries containing proper nouns that don't generalize well with embedding models. The combination of semantic and keyword approaches addresses the weaknesses of each individual method.
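The post does not say how the two retrievers are merged; reciprocal rank fusion is one common strategy, sketched below with the `rank_bm25` package and a generic `embed` callable, both of which are assumptions.

```python
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_search(query: str, chunks: list[str], embed, top_k: int = 3, k_rrf: int = 60):
    """Blend BM25 and embedding rankings with reciprocal rank fusion (RRF)."""
    # Keyword side: BM25 over whitespace-tokenized chunks.
    bm25 = BM25Okapi([c.split() for c in chunks])
    bm25_rank = np.argsort(-bm25.get_scores(query.split()))

    # Semantic side: cosine similarity between query and chunk embeddings.
    chunk_vecs = np.array([embed(c) for c in chunks])
    q = np.array(embed(query))
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    dense_rank = np.argsort(-sims)

    # RRF: score each chunk by its summed reciprocal rank across both retrievers.
    scores = np.zeros(len(chunks))
    for rank_list in (bm25_rank, dense_rank):
        for pos, idx in enumerate(rank_list):
            scores[idx] += 1.0 / (k_rrf + pos + 1)
    return [chunks[i] for i in np.argsort(-scores)[:top_k]]
```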
The investment in custom embedding models trained on enterprise data represents a significant LLMOps commitment. Many organizations rely on off-the-shelf embedding models, but Moveworks has chosen to invest in domain-specific training to better capture enterprise-specific language and relationships. This decision involves ongoing model development, training infrastructure, and evaluation frameworks but potentially provides competitive differentiation.
## Window Expansion for Improved Context
After retrieving top snippets, the system employs dynamic window expansion to augment results through contextual lookups. For each relevant snippet, this technique expands the retrieval context to broaden the receptive field. This is particularly valuable when answers span multiple pages and paragraph-level embedding search alone is insufficient.
For a retrieved chunk ID 'k', the system considers all chunks from {k - cb, k - cb + 1, ..., k, k + 1, ..., k + cf}, where cb and cf are backward and forward expansion constants determined through experimentation. This heuristic-based approach allows examining chunks before and after relevant ones.
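A minimal sketch of this window expansion heuristic; the cb and cf values shown are placeholders, since the experimentally tuned constants are not disclosed.

```python
def expand_window(hit_ids: list[int], num_chunks: int, cb: int = 1, cf: int = 2) -> list[int]:
    """For each retrieved chunk id k, include neighbours k-cb .. k+cf (clamped to the
    document bounds), then deduplicate while preserving order; cb/cf are illustrative."""
    expanded, seen = [], set()
    for k in hit_ids:
        for j in range(max(0, k - cb), min(num_chunks, k + cf + 1)):
            if j not in seen:
                seen.add(j)
                expanded.append(j)
    return expanded

# e.g. hits [4, 9] in a 12-chunk document -> [3, 4, 5, 6, 8, 9, 10, 11]
```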
They're experimenting with two advanced approaches: (a) semantic window expansion using embeddings to dynamically adjust chunks based on queries and retrieved snippets, connecting non-neighboring chunks that share common topics, and (b) contextual chunking performed during ingestion where each chunk is enriched with information about relevant chunks and metadata to improve latency.
## Chunk Filtering and Ranking
While retrieval casts a wide net, ranking is crucial for organizing the retrieved context and pruning unnecessary information. The system implements several steps, sketched in code after this list:
- Filter out retrieved chunks below an empirically determined threshold based on evaluation metrics and precision-recall curve analysis
- Deduplicate chunks after context expansion through the dynamic window approach
- Rank chunks using sophisticated feature engineering and filter based on ranking scores
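A hedged sketch of these three steps; the threshold value and the single-score ranking stand in for Moveworks' undisclosed precision-recall tuning and feature engineering.

```python
def filter_and_rank(chunks: list[dict], top_n: int = 5, score_threshold: float = 0.35) -> list[dict]:
    """chunks: dicts with 'id', 'text', and a retrieval 'score'.
    Threshold filtering, deduplication, and a toy ranking score illustrate the pipeline."""
    # 1. Drop chunks below the empirically tuned score threshold.
    kept = [c for c in chunks if c["score"] >= score_threshold]

    # 2. Deduplicate chunks pulled in more than once by window expansion.
    seen, deduped = set(), []
    for c in kept:
        if c["id"] not in seen:
            seen.add(c["id"])
            deduped.append(c)

    # 3. Rank by a simple combined score (the real feature set is not disclosed).
    deduped.sort(key=lambda c: c["score"], reverse=True)
    return deduped[:top_n]
```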
Feature engineering for ranking is described as an active focus area reused across multiple systems, though specific features aren't detailed in the blog post. The retrieval precision@3 metric of 65.11% on their test set indicates that about two-thirds of the time, the top 3 retrieved chunks are relevant. This is respectable performance but also suggests there's room for improvement in retrieval quality.
## Map-Reduce Approach for Long Context Generation
The output generation component addresses a fundamental challenge with current language models: they struggle to effectively utilize information from long input contexts. Performance is typically highest when relevant information appears at the beginning or end of input context (the "lost in the middle" problem), and significantly degrades when models must access information in the middle of long contexts. Additionally, longer inputs increase latency and may exceed context limits.
Moveworks developed a novel map-reduce algorithm to process contextual data:
**Splitter Function**: Takes entire contextual data gathered from previous steps (SEARCH or READ) and dynamically breaks it down into smaller chunks, adhering to token limits, compute capacities, and performance guarantees. This dynamic approach is more sophisticated than simple fixed-size chunking.
**Mapper Function**: Splits are passed to LLM workers to generate intermediate outputs. Each worker has access to the user query, predicted operations (SEARCH or READ), and metadata from the document store (like document descriptions) to generate grounded and relevant outputs. Importantly, the mapper has access to both the original query and the rewritten query—empirical studies showed that omitting the original query reduces instruction-following ability due to information loss through rewrites.
**Reducer Function**: Aggregates and combines intermediate results to produce final output. This uses a specifically tuned prompting library designed to combine information from various sources. For example, the system maintains distinctions between information sourced from PowerPoint presentations versus Word documents, recognizing that these different formats may require different handling.
This map-reduce approach enables horizontal scaling, making it possible to handle extremely long contexts without sacrificing quality. The architecture allows the system to process documents that would exceed the context window of even the largest available LLMs.
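A simplified sketch of the map-reduce pattern described above, using a thread pool for parallel mapper calls; the prompts, the character-based split budget, and the generic `llm` callable are assumptions, not the production implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def split(context: str, max_chars: int = 6000) -> list[str]:
    """Splitter: break gathered context into budget-sized pieces (characters as a token proxy)."""
    return [context[i:i + max_chars] for i in range(0, len(context), max_chars)]

def run_map_reduce(context: str, query: str, llm, max_workers: int = 8) -> str:
    """Mapper workers process each split against the query; the reducer merges their outputs.
    `llm(prompt) -> str` is a placeholder for whatever model call is used in production."""
    splits = split(context)

    def map_one(piece: str) -> str:
        return llm(f"Using only the context below, extract what is relevant to: {query}\n\n{piece}")

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(map_one, splits))

    return llm("Combine these partial answers into one grounded response to: "
               f"{query}\n\n" + "\n---\n".join(partials))
```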
## Engineering Enhancements for Production
Several additional engineering optimizations improve latency, scalability, and reliability:
**Truncated Map-Reduce**: Limits the number of splits to avoid overburdening worker nodes; once a maximum limit is reached, the system stops accepting new splits. Two strategies are employed: stopping naively, which can skip tail chunks and lose information, or selectively skipping highly similar chunks while widening the chunk window in each split to enhance the receptive field.
**Streaming Splitter**: Yields splits as they're generated in sequence rather than waiting for all splits to complete before processing. This allows workers to start processing early splits while later splits are still being generated, reducing overall latency through pipelining.
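A hedged sketch combining the truncation and streaming ideas as a Python generator; the character-based splitting and the limit of 20 splits are illustrative assumptions.

```python
from typing import Iterator

def streaming_splits(context: str, max_chars: int = 6000,
                     max_splits: int = 20) -> Iterator[str]:
    """Yield splits as they are produced so mapper workers can start immediately;
    max_splits implements the truncation described above (tail chunks may be skipped)."""
    for n, start in enumerate(range(0, len(context), max_chars)):
        if n >= max_splits:
            break  # truncated map-reduce: stop accepting further splits
        yield context[start:start + max_chars]
```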
**Reference Preservation**: The reducer stage maintains references to original splits even though it only accesses intermediate outputs. This design enables generating grounded attributions to source chunks in the output, which is crucial for the citation system.
These optimizations reflect mature production engineering practices. The streaming approach in particular demonstrates attention to end-to-end latency rather than just individual component performance.
## Citation System for Grounding and Verification
Brief Me provides citations at the paragraph level, going beyond document-level citations offered by most existing tools. This granularity enables users to verify specific claims and place greater trust in generated responses. The system employs a combination of LLM-based and heuristic methods including n-gram matching.
The citation pipeline begins with a model that predicts citation IDs based on the output response and the relevant context. Citations must be complete (not omitting relevant attributions) and appropriately sparse (avoiding overwhelming users with hundreds of citations for a single paragraph); balancing completeness against sparsity is a key design challenge.
Once citation IDs are generated, a heuristic-based system enhances citation quality and links them to source documents through three operations:
- **Collator**: Combines consecutive citation chunk IDs (e.g., [10][8-13][9-12] becomes [8-13]); a sketch follows this list
- **Deduper**: Identifies appropriate granularity for displaying citations using heuristics, custom rules, and regex, then deduplicates IDs across response sentences
- **Referencer**: Maps citation IDs to actual document chunks while ensuring granularity, sparsity, and completeness
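A minimal sketch of the Collator step, assuming citations are represented as (start, end) chunk-ID ranges; that representation is itself an assumption.

```python
def collate_citations(citations: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge overlapping or adjacent chunk-ID ranges,
    e.g. [(10, 10), (8, 13), (9, 12)] -> [(8, 13)]."""
    merged: list[list[int]] = []
    for start, end in sorted(citations):
        if merged and start <= merged[-1][1] + 1:   # overlaps or touches the previous range
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]

print(collate_citations([(10, 10), (8, 13), (9, 12)]))  # [(8, 13)]
```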
The hybrid approach of using both LLM prediction and heuristic post-processing reflects practical experience that LLMs alone may not provide sufficiently consistent or correctly formatted citations for production use. The heuristics provide guardrails and ensure citations meet system requirements.
## Evaluation Framework
Moveworks implements an extensive evaluation framework to measure efficacy of individual components for both single-turn queries and multi-turn session performance. The evaluation dataset consists of 2,200 query-response pairs based on real usage data from diverse domains including research papers, financial reports, security documents, HR policies, competitive intelligence, and sales pitches. Usage data was logged, monitored, and sent to trained human annotators for review.
Tasks are evaluated on a 3-point scale: "Yes" (score 1), "Somewhat" (score 2), and "No" (score 3). Each query turn is evaluated on ten different parameters, with a session-based evaluation strategy under development. Evaluation is conducted through both human labelers and LLM graders, with the LLM grader approach expanding their capacity to annotate larger datasets.
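A hedged sketch of what an LLM grader on this 3-point scale could look like; the prompt, the choice of groundedness as the graded parameter, and the generic `llm` callable are illustrative assumptions.

```python
GRADER_PROMPT = """You are grading an AI assistant's answer for GROUNDEDNESS.
Question: {question}
Retrieved context: {context}
Answer: {answer}
Reply with exactly one of: Yes, Somewhat, No."""

SCORE = {"Yes": 1, "Somewhat": 2, "No": 3}  # 3-point scale from the evaluation framework

def grade(question: str, context: str, answer: str, llm) -> int:
    """`llm(prompt) -> str` is a placeholder model call; returns the 1/2/3 score."""
    verdict = llm(GRADER_PROMPT.format(question=question, context=context, answer=answer)).strip()
    return SCORE.get(verdict, 3)  # treat unparseable grader output conservatively as "No"
```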
Key metrics span multiple components:
**Operation Planner**: Action accuracy (97.24% "Yes"), resource prediction accuracy (97.35% "Yes"), and search query quality (93.80% "Yes") demonstrate strong performance on correctly identifying required operations and resources.
**Search Metrics**: Retrieval precision@3 shows 65.11% "Yes", 26.74% "Somewhat", and 9.50% "No". While the majority of retrievals are precise, there's notable room for improvement. Retrieval recall@k measures the proportion of relevant documents retrieved and is described as the most useful relevance metric for retrieval.
**Output Quality**: Completeness (97.98% "Yes") assesses whether responses are reasonable answers to user queries in conversation context, though notably this "does not address the correctness of the assistant's response"—an important caveat. Groundedness (89.21% "Yes") evaluates whether generated content is faithful to retrieved content and user query information. Citation correctness measures whether output citations properly attribute to actual document sources.
The evaluation approach demonstrates LLMOps maturity with multi-dimensional assessment of different pipeline stages. However, some limitations should be noted: the evaluation is conducted on usage data from their own system, which may introduce selection bias. The "completeness" metric explicitly doesn't measure correctness, which is a critical quality dimension. The 2,200-sample test set, while substantial, represents a snapshot and may not capture all failure modes or edge cases that emerge at scale.
The move toward LLM-based graders for evaluation reflects a common pattern in LLMOps where human annotation is augmented or partially replaced by LLM evaluation. This enables faster iteration and larger-scale evaluation but introduces its own challenges around the reliability and potential biases of LLM graders.
## LLMOps Considerations and Production Challenges
Several aspects of the Brief Me system highlight broader LLMOps challenges and considerations:
**Model Dependencies**: The system currently relies on GPT-4o for query rewriting and likely other components, creating dependencies on external LLM providers with implications for cost, latency, data privacy, and service reliability. The noted work toward fine-tuning in-house models represents a strategic move toward greater control and potentially lower operational costs at the expense of upfront investment in model development.
**Synthetic Data Generation**: Moveworks extensively uses synthetic data generation for both training embedding models and fine-tuning operation planning components. This approach is increasingly common in LLMOps as organizations struggle to obtain sufficient high-quality labeled data for specialized domains. The quality of synthetic data generation becomes a critical determinant of overall system performance.
**Multi-Model Orchestration**: The system orchestrates multiple models and components—embedding models for retrieval, GPT-4o for query rewriting and operation planning, additional LLMs for mapping and reducing, models for citation prediction. This complexity introduces challenges in debugging, monitoring, and understanding failure modes. When the system produces a poor output, isolating which component failed requires sophisticated observability.
**Latency Management**: The P90 latency target of under 10 seconds for ingestion and the various latency optimizations (parallel execution, streaming splitters, truncated map-reduce) demonstrate that production LLM systems must carefully balance quality with performance. The numerous engineering optimizations suggest that achieving acceptable latency required significant effort beyond simply chaining together model calls.
**Context Window Management**: The map-reduce approach addresses fundamental limitations in current LLMs around long context handling. This represents a practical engineering solution to work within constraints of existing models rather than waiting for future models with larger context windows. However, it adds complexity and potential points of failure.
**Evaluation at Scale**: The comprehensive evaluation framework with ten parameters and multiple assessment methods reflects the reality that evaluating production LLM systems requires much more than simple accuracy metrics. The ongoing work on session-based evaluation acknowledges that multi-turn performance is critical but harder to assess than single-turn interactions.
**Hybrid Approaches**: Throughout the system, Moveworks combines learned approaches (embeddings, LLMs) with heuristic methods (BM25, n-gram matching for citations, various filtering and ranking rules). This pragmatic hybrid approach suggests that purely learned systems may not yet provide the reliability and control required for production deployment.
## Balanced Assessment
The Brief Me system represents a sophisticated production implementation of agentic AI for document processing. The technical depth of the description and the comprehensive evaluation framework suggest serious engineering investment. The reported metrics are strong across most dimensions, and the system addresses real enterprise needs around document analysis.
However, several caveats warrant consideration. This is a first-party description by Moveworks of their own system, so claims should be viewed accordingly. The evaluation is conducted on their own usage data, which may not represent all real-world scenarios. Some metrics show room for improvement—retrieval precision at 65% and groundedness at 89% indicate the system doesn't always return perfectly relevant information or stay fully grounded in sources.
The reliance on external LLMs (GPT-4o) for critical components creates dependencies that may affect cost, performance, and data handling considerations for enterprises. The noted work toward in-house models suggests Moveworks recognizes this limitation. The current architecture's restriction to parallel execution of operations limits handling of certain complex reasoning tasks.
The extensive use of synthetic data for training raises questions about how well the system generalizes to truly novel enterprise scenarios not represented in the synthetic data generation process. The map-reduce approach, while clever, adds complexity and latency compared to single-pass processing.
Despite these considerations, the system demonstrates many LLMOps best practices: comprehensive evaluation across multiple dimensions, careful attention to latency and scalability, hybrid approaches combining multiple techniques, custom model training for domain-specific performance, and thoughtful architecture for handling long contexts and providing verifiable citations. The production deployment of such a complex multi-component system and the detailed technical disclosure provide valuable insights for the broader LLMOps community.