ZenML

Observational Memory: Human-Inspired Context Compression for Agent Systems

Mastra 2026

Mastra developed an observational memory system for LLM agents that compresses conversations 5-40x while maintaining temporal awareness and contextual relevance. The system uses two background agents (observer and reflector) to extract meaningful information from conversations while intelligently discarding noise, modeling how human memory retains what matters and lets details fade. The solution achieved 94.87% on the LongMemEval benchmark with GPT-5-mini and 84.23% with GPT-4o, outperforming existing approaches. Deployed in production across hiring and healthcare applications within the Mastra TypeScript agent framework, the system leverages prompt caching for cost efficiency and runs background compression to avoid blocking user interactions.

Industry

Tech

Overview

Mastra built an observational memory system designed to address one of the fundamental challenges in production LLM deployments: managing context windows efficiently while preserving the information that matters most. The system is modeled on human cognition, specifically how humans selectively retain important information while allowing less relevant details to fade over time. This approach represents a significant departure from traditional context management strategies like vector databases, graph databases, or wholesale compaction, instead relying on two lightweight background agents that continuously observe and refine conversational memory.

The system has been deployed in production across multiple domains including hiring and healthcare applications, and is integrated into the Mastra TypeScript agent framework. It achieved state-of-the-art results on LongMemEval, a challenging public benchmark designed to test memory systems across 500 questions spanning 57 million tokens from approximately 50 conversations. The benchmark specifically evaluates contradictions, cross-session recall, and temporal reasoning—all critical capabilities for production agent systems.

The Core Problem: Context Window Management at Scale

Production LLM systems face a critical tension: conversation history grows linearly with interaction length, but context windows have finite capacity. Traditional approaches either truncate history (losing important information), use vector/graph databases for retrieval (adding infrastructure complexity and latency), or perform wholesale compaction at the last moment (blocking user interactions and suffering from context drift). Mastra’s observational memory tackles this by compressing conversations early and continuously, achieving compression ratios of 5-40x while preserving meaning and temporal context.

The system is particularly effective in “noisy-context scenarios” where large volumes of data contain only small fragments of signal. Examples include deep search agents, tool-heavy workflows, and headless browser automation where massive outputs might contain only a few relevant facts. In these contexts, the ability to extract signal from noise while maintaining temporal relationships becomes critical for agent effectiveness.

Architecture: Observer and Reflector Agents

The observational memory system is built around two specialized agents that run in the background:

The Observer is a lightweight agent that activates when conversation history reaches a configurable token threshold (default: 30,000 tokens). It uses an efficient model like Gemini 2.5 Flash to review the recent chat history and condense it into timestamped observations. These observations capture decisions, outcomes, and key facts while discarding verbose tool outputs, intermediate reasoning steps, and clarifications that didn’t affect final results. Importantly, the observer runs asynchronously in the background at token-based intervals (roughly every 20% of the threshold by default), allowing the main agent to continue responding without interruption. Near the threshold, the interval halves to keep observations fresh, and when the threshold is reached, pre-buffered observations can usually be promoted instantly.
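The scheduling described above can be sketched as follows. The constant names, function names, and the exact interval rule are illustrative assumptions, not Mastra's actual API:

```typescript
// Sketch of the observer's token-based scheduling. Names and the exact
// interval rule are illustrative assumptions, not Mastra's implementation.
const OBSERVER_THRESHOLD = 30_000; // default activation threshold (tokens)
const SAFETY_FACTOR = 1.2;         // past this, processing is forced synchronous

// Background observations fire roughly every 20% of the threshold,
// halving to 10% once the history approaches the limit.
function bufferInterval(historyTokens: number): number {
  const nearLimit = historyTokens >= OBSERVER_THRESHOLD * 0.8;
  return OBSERVER_THRESHOLD * (nearLimit ? 0.1 : 0.2);
}

function observerMode(historyTokens: number): "idle" | "promote" | "force-sync" {
  if (historyTokens >= OBSERVER_THRESHOLD * SAFETY_FACTOR) return "force-sync";
  if (historyTokens >= OBSERVER_THRESHOLD) return "promote"; // swap in pre-buffered observations
  return "idle"; // keep buffering asynchronously
}
```

Because observations are pre-buffered at each interval, hitting the threshold usually promotes an existing buffer rather than blocking on a fresh summarization call.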

The Reflector addresses the fact that even compact observations accumulate over time. When the observation history itself grows too large (default: 40,000 tokens of observations), the reflector activates to reshape the entire observation history. It removes duplicated information across observations, strips away details no longer relevant, and merges related observations into tighter summaries. Where the observer appends new observations, the reflector fundamentally rewrites the observation history to maintain efficiency.
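The shape of that rewrite can be sketched minimally. The real reflector is an LLM call that also merges related observations into tighter summaries; the token heuristic and function names here are assumptions:

```typescript
// Toy sketch of the reflector's rewrite: dedupe observations and shrink the
// history back under a token budget. The real reflector is an LLM call that
// also merges related observations; this only shows the transformation shape.
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4); // rough ~4-chars-per-token heuristic
}

function reflect(observations: string[], budgetTokens: number): string[] {
  const deduped = [...new Set(observations)]; // drop exact duplicates
  // Drop the oldest observations until the history fits the budget again.
  while (
    deduped.length > 1 &&
    deduped.reduce((total, obs) => total + approxTokens(obs), 0) > budgetTokens
  ) {
    deduped.shift();
  }
  return deduped;
}
```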

Both agents are designed to be asynchronous and non-blocking by default, which is critical for production deployments where user experience depends on responsive interactions. The system includes safety thresholds (default: 1.2x the configured limits) that force synchronous operation if background processing falls behind, ensuring memory management never causes context overflow.

Prioritization and Intelligent Forgetting

A key innovation in the observational memory system is its priority-based compression strategy. Each observation carries a priority emoji that signals its importance: 🔴 marks high-priority facts, 🟢 marks low-priority details, and ✅ marks completed tasks.

The reflector uses these priorities as compression signals. At higher compression levels, low-priority observations (🟢) are dropped first, while high-priority observations (🔴) and completions (✅) are preserved. This creates a form of “intelligent forgetting” rather than random information loss, modeling how human memory prioritizes salient information.

For example, a coding session spanning dozens of messages and tool calls might compress to just a few prioritized observations capturing the app name, technology stack, authentication decisions, deployment status, and completion markers—discarding the verbose intermediate steps while retaining the narrative arc and key decisions.
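The dropping rule amounts to a filter over prioritized observations. In this sketch the emoji markers come from the article, but the `Observation` shape and function name are illustrative assumptions:

```typescript
// Illustrative priority-based forgetting: at higher compression levels,
// low-priority (🟢) observations are dropped first, while high-priority (🔴)
// and completion (✅) observations survive. Types are assumptions.
interface Observation {
  priority: "🔴" | "🟢" | "✅";
  text: string;
}

function compress(observations: Observation[], aggressive: boolean): Observation[] {
  if (!aggressive) return observations; // plenty of room: keep everything
  return observations.filter((obs) => obs.priority !== "🟢");
}
```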

Temporal Awareness and Context

LLMs have no innate sense of time, treating all memories as equally current by default. This creates serious problems for production agents that need to answer questions like “what did we decide last week?” or “has this been updated since the migration?” Mastra’s solution pre-computes temporal context that the model would otherwise struggle to infer: relative labels (such as “yesterday” or “last week”), markers for gaps between sessions, and precise timestamps attached to each observation.

This temporal scaffolding eliminates the need for the LLM to perform date math or infer temporal relationships, which LLMs notoriously struggle with. The model receives observations in a format that makes temporal reasoning straightforward, which proved critical for achieving high scores on LongMemEval’s temporal reasoning tests.
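A sketch of what pre-computing a relative label looks like; the exact label thresholds are illustrative assumptions:

```typescript
// Render each observation's timestamp as a relative label so the model never
// has to do date math itself. The specific label rules are assumptions.
function relativeLabel(observedAt: Date, now: Date): string {
  const days = Math.floor((now.getTime() - observedAt.getTime()) / 86_400_000);
  if (days === 0) return "today";
  if (days === 1) return "yesterday";
  if (days < 7) return `${days} days ago`;
  if (days < 14) return "last week";
  return observedAt.toISOString().slice(0, 10); // fall back to the exact date
}
```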

Cost Optimization: Prompt Caching and Dynamic Model Routing

While running two additional LLM agents might seem expensive, Mastra has optimized costs through several mechanisms:

Prompt caching is the primary cost savings driver. Because the observation block remains stable across turns, it qualifies for prompt caching with most providers. Each new user message hits the cache, delivering cheaper tokens and lower latency without additional implementation work. The cost savings from running a smaller, cached context on the main model far outweigh the cost of running the observer and reflector on cheap models.
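The caching win follows directly from prompt ordering: the observation block forms a stable prefix, so only the per-turn suffix misses the cache. A minimal sketch (the message shapes are assumptions):

```typescript
// Keep the slowly-changing observation block at the front of the prompt so
// providers can serve it from the prompt cache; only the recent-message
// suffix changes from turn to turn. Shapes here are illustrative.
function buildPrompt(observationBlock: string, recentMessages: string[]): string[] {
  return [
    observationBlock,  // stable across turns → prompt-cache hit
    ...recentMessages, // only this suffix changes each turn
  ];
}
```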

Dynamic model routing further optimizes costs by selecting models based on input size. The observer and reflector don’t use the same model for every call—instead, they route based on token count. For example, small inputs (up to 5,000 tokens) might use Ministral 8B, medium inputs (up to 20,000 tokens) use Mistral Small, large inputs use GPT-5.4-mini, and huge inputs (up to 1,000,000 tokens) use Gemini 3.1 Flash Lite. This ensures that simple observations get processed by fast, cheap models while more complex consolidations get more capable models.
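The routing rule reduces to a tiered lookup on input size. In this sketch the model strings are informal stand-ins for the models named above, and the cutoff for the “large” tier is an assumption the article does not state:

```typescript
// Size-based model routing for observer/reflector calls. Tiers up to 20k
// tokens follow the article's example; the 200k cutoff for the large tier
// is an assumption, as are the model identifier strings.
function routeModel(inputTokens: number): string {
  if (inputTokens <= 5_000) return "ministral-8b";
  if (inputTokens <= 20_000) return "mistral-small";
  if (inputTokens <= 200_000) return "gpt-5.4-mini";
  return "gemini-3.1-flash-lite"; // huge inputs, up to ~1M tokens
}
```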

The default models for both observer and reflector are set to Gemini 2.5 Flash, which offers an excellent balance of capability, speed, and cost for most production use cases.

Production Deployment and Framework Integration

Observational memory is deployed in production across hiring and healthcare applications, and is integrated as the default memory system in the Mastra TypeScript agent framework. From a developer's perspective the integration is minimal; enabling observational memory requires a single configuration option:

memory: new Memory({
  options: {
    observationalMemory: true,
  },
})

However, the system exposes extensive configuration options for production tuning, including token thresholds for the observer and reflector, buffering behavior, and model selection.
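As a hypothetical illustration, a tuned configuration might look like the following. Only `observationalMemory`, `shareTokenBudget`, and `bufferTokens` are named in this write-up; the threshold option names below are guesses, not Mastra's documented API:

```typescript
memory: new Memory({
  options: {
    observationalMemory: {
      // Option names below are illustrative guesses, not the documented API.
      observationThreshold: 30_000, // when the observer activates
      reflectionThreshold: 40_000,  // when the reflector rewrites history
    },
  },
})
```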

This configurability is essential for production deployments where different use cases have different memory requirements. A customer support agent might need aggressive compression to handle long sessions efficiently, while a legal research agent might need more conservative settings to preserve more detail.

Evaluation and Benchmarking

The system was evaluated on LongMemEval, described as “one of the hardest public memory benchmarks” with 500 questions across 57 million tokens from about 50 conversations. The benchmark specifically tests contradictions, cross-session recall, and temporal reasoning—all critical for production agent systems.

Results showed observational memory achieving 84.23% with GPT-4o (the benchmark’s stable reference model), beating the oracle baseline at 82.40% and a leading alternative at 81.60%. With GPT-5-mini, accuracy reached 94.87%. The authors note that GPT-4o is reported because it provides consistency for comparisons, even though more capable models are available.

However, the authors appropriately note that “benchmarks are indicators, not the full story” and emphasize that real-world performance in production use cases is what ultimately matters. The system’s success in production deployments across hiring and healthcare provides validation beyond benchmark scores.

Comparison to Alternative Approaches

The documentation explicitly compares observational memory to compaction systems like the one in Claude Code, highlighting fundamental architectural differences:

Compaction systems wait until context is nearly full, then block the conversation to produce a dense summary of everything that happened. By the time compaction fires, context drift and rot may have already occurred. The user then waits—sometimes a minute or more—while the system performs wholesale summarization.

Observational memory inverts this approach: it begins compressing early, at a much smaller token threshold, using a far smaller slice of the context window much more efficiently. The observer runs continuously in the background, scrutinizing every message to extract only the essence, effectively rewriting message history incrementally rather than replacing it wholesale at the last moment. The result is structured, timestamped, prioritized, and temporally annotated.

This comparison reveals an important tradeoff in production LLMOps: reactive vs. proactive memory management. Compaction is simpler to implement (one model call when needed) but creates user-facing latency and risks context degradation. Observational memory is more complex (two background agents with scheduling) but provides continuous compression without blocking user interactions.

Practical Considerations and Limitations

The documentation acknowledges several important limitations and considerations:

Intentional lossiness: The system is designed to be lossy—it keeps what matters and lets details fade. This is appropriate for most conversational agents but not all use cases. Some tasks require exact recall rather than summaries. Mastra is addressing this by building a lightweight retrieval layer that lets agents inspect raw message history only when needed, combining observational memory with selective retrieval.

Shared token budget limitation: When enabling shareTokenBudget: true (which allocates token budget between messages and observations), the system currently requires disabling async buffering (bufferTokens: false), which is described as a temporary limitation. This suggests ongoing development to resolve architectural constraints.
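Expressed as configuration, the constraint looks roughly like this (the option placement is an assumption based on the names quoted above):

```typescript
memory: new Memory({
  options: {
    observationalMemory: true,
    shareTokenBudget: true, // share one budget between messages and observations
    bufferTokens: false,    // currently required when sharing the budget
  },
})
```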

Experimental resource scoping: Resource-scoped observations (sharing memory across all threads for a user) are marked as experimental, indicating they may not be fully production-ready for all scenarios.

Framework dependencies: The current implementation is tightly integrated with Mastra’s TypeScript agent framework. While the authors mention plans to bring observational memory to other frameworks like Claude Code, OpenCode, and OpenClaw, this suggests the approach is not yet framework-agnostic.

LLMOps Insights and Production Lessons

This case study offers several valuable insights for production LLMOps:

Background processing architecture: Running memory management asynchronously in background threads is critical for user experience in production agents. The observer’s buffering strategy (running every 20% of threshold, increasing frequency near limits) represents thoughtful engineering for real-time systems.

Cost-performance tradeoffs: The combination of cheap models for memory management, prompt caching for the observation block, and dynamic model routing demonstrates sophisticated cost optimization. The key insight is that spending on efficient memory management (observer + reflector on cheap models) saves far more on the main model’s context consumption.

Temporal scaffolding: The elaborate temporal context system (relative labels, gap markers, precise timestamps) reveals how much work is needed to make LLMs reason about time effectively. This is a general lesson for production systems: LLMs need significant scaffolding for capabilities humans find trivial.

Evaluation beyond benchmarks: While the team achieved state-of-the-art LongMemEval scores, they appropriately emphasize production validation in real applications (hiring, healthcare). This reflects mature LLMOps practice where benchmark performance is necessary but not sufficient.

Intelligent lossy compression: The priority-based forgetting mechanism (using emoji indicators) is a clever solution to the fundamental tradeoff between context size and information preservation. Rather than trying to preserve everything or randomly truncating, the system makes principled decisions about what matters.

Framework integration design: The single-line enablement with extensive configuration options demonstrates good production API design—easy to start, flexible to tune. This pattern is common in mature LLMOps tools.

Future Direction and Evolution

The documentation indicates several areas of ongoing development: expanding retrieval capabilities to complement observational memory, bringing the approach to other frameworks and tools, and resolving current limitations like the shared token budget constraint. The authors position the future as “observational memory AND retrieval” rather than one or the other, suggesting a hybrid approach where lossy compression handles most memory needs while precise retrieval handles edge cases requiring exact recall.

The open-source nature of the implementation (with links to source code for observer and reflector, though not included in the provided text) suggests community-driven evolution and transparency about implementation details, which is important for production adoption where teams need to understand and potentially customize behavior.

Overall, this case study represents sophisticated production LLMOps engineering tackling a fundamental challenge—context window management—with an approach that balances performance, cost, user experience, and information preservation through human-inspired selective memory and intelligent forgetting.
