Company
Dust.tt
Title
Building Synthetic Filesystems for AI Agent Navigation Across Enterprise Data Sources
Industry
Tech
Year
2025
Summary (short)
Dust.tt observed that their AI agents were attempting to navigate company data using filesystem-like syntax, prompting them to build synthetic filesystems that map disparate data sources (Notion, Slack, Google Drive, GitHub) into Unix-inspired navigable structures. They implemented five filesystem commands (list, find, cat, search, locate_in_tree) that allow agents to both structurally explore and semantically search across organizational data, transforming agents from search engines into knowledge workers capable of complex multi-step information tasks.
## Overview Dust.tt, a company building AI agent platforms for enterprise knowledge work, discovered an interesting behavioral pattern in their production AI systems in April 2025. Their agents were spontaneously attempting to use filesystem-like syntax (`file:front/src/some-file-name.tsx`, `path:/notion/engineering/weekly-updates`) when trying to access specific resources, rather than relying solely on semantic search capabilities. Rather than treating this as a bug, Dust recognized this as valuable insight into how AI agents naturally want to interact with structured information, leading them to develop what they call "synthetic filesystems" for enterprise data navigation. ## The Core Problem and Solution Architecture The fundamental challenge Dust identified was that while semantic search excels at finding information based on meaning, it falls short when agents need to navigate structural relationships within organizational data. When an agent needs "the TeamOS section of last week's team meeting notes," it's not searching for semantic meaning—it's following a known structural path. The company realized they needed to provide agents with both semantic search capabilities and structural navigation tools working in concert. Building on their existing "content nodes architecture," Dust developed a system that imposes coherent, navigable structures on data sources that lack inherent filesystem organization. This approach creates synthetic filesystems that map disparate enterprise data sources into Unix-inspired directory structures that agents can navigate intuitively. ## Technical Implementation Details ### Synthetic Filesystem Mapping Dust's implementation involves creating logical filesystem abstractions across different platforms: - **Slack channels** become directories containing files representing individual threads - **Notion workspaces** become root folders, with databases functioning as special directories that are both navigable containers and searchable tables - **GitHub repositories** maintain their natural hierarchical structure - **Google and Microsoft spreadsheets** become folders containing table representations This mapping was built upon work they had previously done in their content nodes architecture migration, though initially they hadn't recognized the navigation potential of this hierarchical organization. ### Unix-Inspired Command Set The system implements five core commands modeled after Unix filesystem operations: - **`list`** - Displays folder contents (equivalent to Unix `ls`) - **`find`** - Searches for files by name within hierarchies - **`cat`** - Reads file contents with pagination support - **`search`** - Performs semantic search within specific subtrees - **`locate_in_tree`** - Shows complete paths leading to specific files Each command operates on the synthetic filesystem while treating documents from various sources (Notion pages, Slack messages, Google Drive files) as unified file or folder entities within a Unix-like system paradigm. ### Context Window Management A significant technical challenge emerged with the `cat` command implementation. Unlike traditional Unix `cat` which dumps entire file contents, AI agents operate under strict context window limitations. Dust solved this by adding `limit` and `offset` parameters to their `cat` implementation: ``` cat: { nodeId: string, offset?: number, // Start position limit?: number, // Max characters grep?: string // Filter lines } ``` This approach allows agents to read documents in manageable chunks, jump to specific sections, and filter content without exceeding context limits. The implementation treats LLMs similarly to computer programs with limited working memory that must intelligently sample file portions to understand content while working within strict memory constraints. ### Handling Dual-Nature Objects Traditional filesystems maintain clear distinctions between files and directories, but platforms like Notion break this assumption with documents that can contain other documents recursively. Dust resolved this by allowing objects to function as both files (readable via `cat`) and directories (listable via `ls`). The system informs models whether a given item contains nested elements, making this the primary criterion for listability. This dual nature enables agents to read page content and then navigate into sub-pages using familiar filesystem commands. ## Production Deployment and Agent Behavior When deployed in production, the system demonstrates sophisticated multi-step reasoning capabilities. For a query like "What was in the TeamOS section of the last team weekly's notion doc?", agents now follow logical navigation patterns: - Use `find` to locate the team weeklies database - Call `list` to examine recent entries - Identify the most recent document - Use `cat` with grep filtering to extract the specific TeamOS section This structural navigation capability doesn't replace semantic search but complements it. Agents can narrow scope through navigation, then apply precise semantic search within that context. This mirrors human information-seeking behavior where people browse, explore adjacent content, and build mental maps of information location. ## Advanced Workflow Capabilities The system enables complex investigative workflows that combine both structural and semantic approaches. For example, when debugging a system issue, an agent might: - Begin with semantic search across the entire codebase for error messages or stack traces - Use `locate_in_tree` on results to understand where related files exist within the architecture - Navigate to parent directories and use `list` to discover related modules and configuration files - Apply focused semantic search within specific subtrees to understand broader context - Finally use `cat` to examine specific implementation details This workflow demonstrates how the synthetic filesystem becomes scaffolding that makes semantic search more efficient, while semantic search provides the capability to quickly locate relevant content scattered across knowledge bases, which then serves as exploration starting points. ## Technical Architecture and LLMOps Considerations From an LLMOps perspective, this implementation represents several important production considerations: **Scalability and Performance**: The system must handle diverse data source APIs while maintaining consistent response times for filesystem operations. Each synthetic filesystem command translates to potentially multiple API calls across different platforms. **State Management**: Unlike traditional filesystems, the synthetic filesystem must handle dynamic content that changes frequently across different platforms, requiring careful cache invalidation and consistency strategies. **Error Handling**: Agents must gracefully handle situations where files or directories become unavailable due to permissions changes, deletions, or API failures across different underlying systems. **Integration Complexity**: The system requires maintaining connectors to multiple enterprise platforms (Notion, Slack, Google Drive, GitHub, etc.), each with different API patterns, rate limits, and data models. ## Production Results and Impact The implementation transforms agent capabilities from sophisticated search engines into knowledge workers capable of contextual understanding and complex multi-step tasks. Agents develop the ability to discover relationships between information and tackle complex investigative workflows that require both broad exploration and deep investigation. The company notes that this shift represents a fundamental change in how AI systems interact with organizational knowledge. Rather than processing information in isolation, agents can now comprehend how organizations structure and use their knowledge, leading to more contextually aware and effective assistance. ## Critical Assessment While Dust presents this as a significant advancement, several considerations warrant attention: **Complexity Trade-offs**: The synthetic filesystem introduces additional abstraction layers that may increase system complexity and potential failure points. The benefits must be weighed against the operational overhead of maintaining these abstractions. **Platform Dependencies**: The system's effectiveness depends heavily on the stability and API consistency of multiple third-party platforms, potentially creating reliability challenges in production environments. **Learning Curve**: While Unix commands are familiar to technical users, the effectiveness of this approach may vary depending on the specific use cases and user familiarity with filesystem concepts. **Performance Implications**: The multi-step navigation workflows, while powerful, may introduce latency compared to direct semantic search approaches, particularly for simple information retrieval tasks. The case study represents an interesting approach to enterprise AI agent architecture, demonstrating how production LLM systems can be enhanced by recognizing and accommodating natural interaction patterns that emerge from agent behavior. However, as with any architectural decision, the long-term success will depend on how well the added complexity translates to measurable improvements in agent effectiveness and user satisfaction in real-world enterprise environments.

Start deploying reproducible AI workflows today

Enterprise-grade MLOps platform trusted by thousands of companies in production.