ZenML

Scaling AI-Powered File Understanding with Efficient Embedding and LLM Architecture

Dropbox 2024

Dropbox implemented AI-powered file understanding capabilities for previews on the web, enabling summarization and Q&A features across multiple file types. They built a scalable architecture using their Riviera framework for text extraction and embeddings, implemented k-means clustering for efficient summarization, and developed an intelligent chunk selection system for Q&A. The system achieved significant improvements with a 93% reduction in cost-per-summary, 64% reduction in cost-per-query, and latency improvements from 115s to 4s for summaries and 25s to 5s for queries.

Industry

Tech

Overview

Dropbox, the cloud storage and file synchronization company, built AI-powered summarization and question-answering (Q&A) features directly into their web file preview experience. The goal was to help knowledge workers suffering from information overload by enabling them to quickly understand file contents without reading entire documents, watching full videos, or remembering exactly where specific information was stored. The system can summarize files of any length and format, answer questions about file contents, and even work across multiple files simultaneously.

This case study offers a comprehensive look at how a large-scale production system integrates LLM capabilities into an existing infrastructure, with particular attention to performance optimization, cost management, and architectural decisions that enable scale.

The Riviera Framework: Foundation for LLM Integration

The foundation of Dropbox’s LLM features is their existing file conversion framework called Riviera. This system was originally designed to convert complex file types (like CAD drawings) into web-consumable formats (like PDF) for file previews. Riviera operates at enormous scale, processing approximately 2.5 billion requests per day—totaling nearly an exabyte of data—across roughly 300 file types.

The architecture consists of a frontend that routes requests through plugins, with each plugin running in a “jail” (an isolated container designed for safe execution of third-party code). Riviera maintains a graph of possible conversions and can chain multiple plugins together to perform complex multi-step transformations. For example, extracting text from a video might follow the pipeline: Video (.mp4) -> Audio (.aac) -> Transcript (.txt).

A crucial architectural decision was treating embeddings as just another file conversion type within Riviera. This allowed the team to leverage existing infrastructure features, particularly the sophisticated caching layer. The pipeline becomes: Video (.mp4) -> Audio (.aac) -> Transcript (.txt) -> AIEmbedding. By separating embedding generation from summary/Q&A generation, the system can reuse embeddings across multiple requests—if a user summarizes a video and then asks follow-up questions, the embeddings only need to be generated once.
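The chained-conversion idea can be sketched as a small graph search: given a registry of single-step conversions, find a chain of plugins that turns a source format into a target format. This is a hypothetical illustration of the concept; the plugin names and data structures are invented here and are not Dropbox's actual API.

```python
from collections import deque

# Hypothetical conversion registry: (source format, target format) -> plugin name.
CONVERSIONS = {
    ("mp4", "aac"): "extract_audio",
    ("aac", "txt"): "transcribe",
    ("txt", "embedding"): "embed_chunks",
}

def find_chain(src, dst, conversions):
    """Breadth-first search for the shortest chain of plugins converting src -> dst."""
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        fmt, path = queue.popleft()
        if fmt == dst:
            return path
        for (a, b), plugin in conversions.items():
            if a == fmt and b not in seen:
                seen.add(b)
                queue.append((b, path + [plugin]))
    return None  # no conversion chain exists

print(find_chain("mp4", "embedding", CONVERSIONS))
# ['extract_audio', 'transcribe', 'embed_chunks']
```

Because embeddings are just another node in the graph, any file type with a path to text automatically gains a path to embeddings, which is the property the article highlights.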

Embedding Generation Strategy

The embeddings plugin takes text data extracted from various file types and converts it into vector representations. A key design decision was to chunk the text into paragraph-sized segments and calculate an embedding for each chunk, rather than generating a single embedding for the entire file. This approach increases the granularity of stored information, capturing more detailed and nuanced semantic meaning.

The same chunking and embedding method is applied for both summarization and Q&A features, allowing them to share the same embedding cache within Riviera. This design choice significantly reduces redundant computation and API calls.
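The chunk-then-cache pattern might look like the following sketch, where embeddings are cached by content hash so that a second feature (or a follow-up request) reuses vectors instead of recomputing them. The function names, the cache structure, and `embed_fn` are assumptions for illustration only.

```python
import hashlib

def chunk_paragraphs(text):
    """Split extracted text into paragraph-sized chunks."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

# Illustrative in-memory stand-in for a shared caching layer.
_embedding_cache = {}

def embed_chunks(chunks, embed_fn):
    """Embed each chunk, caching by content hash so summarization and
    Q&A requests over the same file reuse the same vectors."""
    vectors = []
    for chunk in chunks:
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in _embedding_cache:
            _embedding_cache[key] = embed_fn(chunk)
        vectors.append(_embedding_cache[key])
    return vectors
```

A summarize-then-ask session would call `embed_chunks` twice but only pay for the embedding model once, which is the cache reuse behavior described above.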

Summarization: K-Means Clustering Approach

For summarization, Dropbox needed to define what constitutes a “good summary”—one that identifies all the different ideas or concepts in a document and provides the gist of each. The solution uses k-means clustering to group text chunks based on their embeddings in multi-dimensional vector space.

The process works as follows: chunks with similar semantic content are grouped into clusters, major clusters are identified (representing the main ideas of the file), and a representative chunk from each cluster is concatenated to form the context. This context is then sent to an LLM for final summary generation.
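The cluster-and-pick-representatives step can be illustrated with a minimal pure-Python k-means over chunk embeddings. This is a sketch under stated assumptions (tiny vectors, centroids initialized on the first k points, one representative per cluster), not Dropbox's implementation.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iters=10):
    """Minimal k-means: centroids start on the first k vectors (deterministic)."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: dist(v, centroids[c]))
            clusters[nearest].append(v)
        for i, members in enumerate(clusters):
            if members:  # recompute each centroid as the mean of its members
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids

def representative_chunks(chunks, vectors, k):
    """Pick the chunk nearest each cluster centroid; concatenating these
    forms the context sent to the LLM for summary generation."""
    centroids = kmeans(vectors, k)
    return [min(zip(chunks, vectors), key=lambda cv: dist(cv[1], c))[0]
            for c in centroids]
```

With embeddings that form two well-separated groups, each group contributes exactly one representative, so the summary context covers every major idea rather than oversampling one topic.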

The team explicitly compared this approach to alternatives like the map-reduce “summary of summaries” approach and found k-means clustering superior in two key ways.

Q&A: Similarity-Based Retrieval

The Q&A plugin operates in a conceptually opposite manner to summarization. While summarization selects chunks for dissimilarity (to capture diverse topics), Q&A selects chunks for similarity to the user’s query.

The process generates an embedding for the user’s question, then computes the distance between this query embedding and each text chunk embedding. The closest chunks are selected as context and sent to the LLM along with the query. The relevant chunk locations are returned to users as sources, allowing them to reference specific parts of the file that contributed to the answer.
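The retrieval step described above is, in essence, a top-k nearest-neighbor search over the cached chunk embeddings. A minimal sketch using cosine similarity (a common choice, though the article does not name the distance metric) might look like this; the function names are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is zero-length)."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def retrieve(query_vec, chunk_vecs, chunks, top_k=3):
    """Rank chunks by similarity to the query embedding; return the selected
    context plus source indices so the UI can point back into the file."""
    scored = sorted(
        ((cosine(query_vec, v), i) for i, v in enumerate(chunk_vecs)),
        reverse=True,
    )
    picked = scored[:top_k]
    context = [chunks[i] for _, i in picked]
    sources = [i for _, i in picked]
    return context, sources
```

Returning the indices alongside the text is what enables the "sources" feature: the answer can cite exactly which parts of the file it drew from.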

Both summarization and Q&A features also request context-relevant follow-up questions from the LLM using function calling and structured outputs. Testing showed that follow-up questions help users more naturally explore file contents and topics of interest.

Multi-File Support: Power Law Dynamics

Expanding from single-file to multi-file processing required significant evolution of infrastructure, UI, and algorithms. A key challenge was determining how many relevant chunks or files to send to the LLM for any given query—direct questions might need only a few chunks, while broad questions require more context.

The team discovered that this cannot be determined from the question alone; it also depends on the context. The question “What is Dropbox?” could be direct (if asked about a list of tech companies) or broad (if asked about the Dropbox Wikipedia page).

The solution uses power law dynamics: the system takes the top 50 text chunks by relevance score and cuts off the bottom 20% of the score spread between max and min. Direct questions have steeper power law curves, meaning fewer chunks pass the threshold (more are discarded). Broad questions have flatter curves, allowing more chunks to be included.

For example, if the most relevant chunk scores 0.9 and the least scores 0.2, everything below 0.34 is discarded, leaving about 15 chunks. If scores range from 0.5 to 0.2, the threshold drops to 0.26, leaving about 40 chunks. This gives direct questions a smaller amount of highly relevant context, while broad questions receive more context for expanded answers.
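The cutoff rule reduces to a few lines of code: take the top 50 chunks, compute the score spread, and discard everything in the bottom 20% of that spread. This sketch reproduces the worked examples above; the function name and input shape are assumptions.

```python
def select_chunks(scored_chunks, top_n=50, cut_fraction=0.2):
    """Keep the top-N (score, chunk) pairs by relevance, then drop everything
    in the bottom `cut_fraction` of the spread between max and min scores."""
    top = sorted(scored_chunks, key=lambda sc: sc[0], reverse=True)[:top_n]
    hi, lo = top[0][0], top[-1][0]
    threshold = lo + cut_fraction * (hi - lo)
    return [(s, c) for s, c in top if s >= threshold]
```

With scores spanning 0.9 down to 0.2 the threshold lands at 0.2 + 0.2 × 0.7 = 0.34, so a steep (direct-question) score curve clears out most chunks; with a flat spread of 0.5 to 0.2 the threshold is only 0.26, so most chunks survive.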

Critical LLMOps Decisions

Several strategic decisions shaped the production system:

Real-time vs. Pre-computation: The team chose real-time processing over pre-computing embeddings and responses. While pre-computation would offer lower latency, real-time processing allows users to choose exactly which files to share with the LLM—and only those files. Privacy and security were the primary drivers, though avoiding the cost of pre-computing requests that users might never make was an additional benefit.

Chunk Priority Calculation: To optimize token usage, the system calculates priority tiers for chunks. The first two chunks chronologically receive top priority, followed by clustering-based selection. This maximizes topic breadth for summaries and relevancy for answers.
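The tiering described above might be sketched as follows: the first two chunks of the file (often a title and introduction) are always included, followed by the clustering- or similarity-selected chunks, deduplicated in order. The function and its exact ordering rules are assumptions for illustration.

```python
def prioritize(chunks, selected):
    """Hypothetical priority tiers: the first two chunks chronologically come
    first, then the cluster/similarity picks, skipping any duplicates."""
    head = chunks[:2]
    return head + [c for c in selected if c not in head]
```

Budgeting tokens this way means a truncated context still leads with the file's framing material while filling the rest with the most relevant content.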

Caching Strategy: The initial version did not cache embeddings, resulting in redundant LLM calls for the same document. The optimized version caches embeddings, reducing API calls and enabling summaries and Q&A to share the same chunks and embeddings.

Segmentation Benefits: By sending only the most relevant parts of a file to the LLM in a single request, the system achieves lower latency, lower cost, and better quality results. The team explicitly notes that “sending garbage into the LLM means garbage out from the LLM.”

Results and Performance Improvements

The optimizations yielded substantial improvements: a 93% reduction in cost-per-summary, a 64% reduction in cost-per-query, and latency reductions from 115 seconds to 4 seconds for summaries and from 25 seconds to 5 seconds for queries.

These metrics demonstrate significant success in both cost management and user experience improvement.

Considerations and Limitations

The article notes that these features are still in early access and not yet available to all users. They are also optional and can be turned on or off for individual users or teams. Dropbox mentions adhering to a set of AI principles as part of their commitment to responsible AI use.

The case study represents a well-documented example of integrating LLM capabilities into existing infrastructure at scale, with particular emphasis on architectural decisions that enable efficient caching, cost optimization, and quality improvements through clustering-based content selection rather than naive full-document processing.
