ZenML

Terminal-Native AI Coding Agent with Multi-Model Architecture and Adaptive Context Management

OpenDev 2026

OpenDev is an open-source, command-line AI coding agent written in Python that addresses the fundamental challenges of building production-ready autonomous software engineering systems. The agent tackles three critical problems: managing finite context windows over long sessions, preventing destructive operations while maintaining developer productivity, and extending capabilities without overwhelming token budgets. The solution employs a compound AI system architecture with per-workflow LLM binding, dual-agent separation of planning from execution, adaptive context compaction that progressively reduces older observations, lazy tool discovery via Model Context Protocol (MCP), and a defense-in-depth safety architecture. Results demonstrate approximately 54% reduction in peak context consumption, session lengths extending from 15-20 turns to 30-40 turns without emergency compaction, and a robust framework for terminal-first AI assistance that operates where developers manage source control, execute builds, and deploy environments.

Overview

OpenDev represents a comprehensive technical exploration of building production-grade LLM-based coding agents that operate in terminal environments. Published in March 2026 as an arXiv paper (arXiv:2603.05344v3), the work presents itself as the first full technical report for an open-source, terminal-native, interactive coding agent. The system addresses the emerging paradigm shift from IDE-integrated AI assistants to versatile command-line agents that operate where developers actually work: managing source control, executing builds, and deploying environments.

The fundamental insight driving OpenDev’s architecture is that effective autonomous coding assistance requires solving three interconnected problems that cannot be addressed through prompt engineering alone: managing finite context windows over sessions that routinely exceed model token budgets, preventing destructive operations when agents can execute arbitrary shell commands, and extending capabilities without overwhelming the prompt budget. The system is explicitly designed as a “compound AI system” following Zaharia et al.’s framework, meaning it is not a single monolithic LLM but a structured ensemble of agents and workflows, each independently bound to user-configured models.

Core Architecture and LLMOps Principles

Multi-Model Routing and Workload Optimization

OpenDev implements per-workflow LLM configurability where different execution phases independently select models based on their specific requirements. The system defines five specialized model roles: an action model for primary execution with tool access, a thinking model for extended reasoning without tool distraction, a critique model for self-evaluation (implementing Reflexion-inspired patterns), a vision model for processing screenshots and images, and a compact model for fast summarization during context compression. Each role has fallback chains, so if a specialized model is unavailable, the system gracefully degrades to more general-purpose alternatives.
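Role-based routing with graceful degradation can be sketched as a simple fallback walk. This is an illustrative reconstruction, not OpenDev's actual API; the model names and the availability check are hypothetical.

```python
# Hypothetical fallback chains for the five model roles described above.
# Model names are placeholders; the real mapping comes from user configuration.
ROLE_FALLBACKS = {
    "action":   ["action-model", "general-model"],
    "thinking": ["reasoning-model", "general-model"],
    "critique": ["critique-model", "reasoning-model", "general-model"],
    "vision":   ["vision-model", "general-model"],
    "compact":  ["fast-small-model", "general-model"],
}

def resolve_model(role: str, available: set[str]) -> str:
    """Walk the role's fallback chain and return the first available model."""
    for model in ROLE_FALLBACKS[role]:
        if model in available:
            return model
    raise RuntimeError(f"no model available for role {role!r}")
```

Because roles resolve independently, swapping providers or tuning cost per workflow is a configuration change rather than a code change.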

This architecture embodies the compound AI systems principle that state-of-the-art results come from composing multiple models rather than relying on a single model call. The practical implication is that OpenDev is model-agnostic by construction: switching providers or optimizing cost requires only configuration changes, not code changes. The system’s capabilities continuously upgrade as better models emerge, without requiring architectural modifications.

Extended ReAct Execution Loop

The reasoning engine extends the standard ReAct (Reason-Act-Observe) pattern with explicit thinking and self-critique phases that separate deliberation from action. The execution pipeline runs four phases per turn: automatic context compaction when token budget nears exhaustion, an optional thinking phase for pre-action reasoning at configurable depth (OFF, LOW, MEDIUM, HIGH), an optional self-critique phase (automatically included at HIGH level), and the standard action phase with full tool access.

The key architectural decision is that thinking and action are separate API calls. When tools are available, models tend to act quickly rather than think deeply. By providing a thinking phase with no tool schemas in the prompt, OpenDev forces the model to deliberate without the pressure of available actions. The thinking trace is then injected into the action phase context as a system reminder, making the reasoning visible to the action model. At the HIGH thinking level, a critique model evaluates the initial reasoning trace and the thinking model refines its output based on feedback, implementing a self-improvement loop.
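The four-phase turn described above can be sketched as follows. This is a minimal illustration, assuming caller-supplied `llm_call` and `compact` stubs; the message shapes and function names are hypothetical, not OpenDev's internal API.

```python
# Sketch of one turn: compaction -> thinking -> critique -> action.
def run_turn(messages, token_pressure, thinking_level, llm_call, compact):
    phases = []
    # Phase 1: automatic compaction when the token budget nears exhaustion.
    if token_pressure >= 0.99:
        messages = compact(messages)
        phases.append("compact")
    trace = None
    # Phase 2: thinking call with NO tool schemas, forcing deliberation
    # without the pressure of available actions.
    if thinking_level != "OFF":
        trace = llm_call(messages, tools=None)
        phases.append("think")
        # Phase 3: self-critique is automatically included at HIGH — a
        # critique pass evaluates and refines the initial trace.
        if thinking_level == "HIGH":
            trace = llm_call(messages + [trace], tools=None)
            phases.append("critique")
    # Phase 4: action call with full tool access; the thinking trace is
    # injected as a system reminder so the action model can see it.
    if trace is not None:
        messages = messages + [{"role": "user",
                                "content": f"<system-reminder>{trace}</system-reminder>"}]
    phases.append("act")
    llm_call(messages, tools=["...full tool schemas..."])
    return phases
```

The separation into distinct API calls is the point: the thinking call cannot act, and the action call sees the deliberation as recent context.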

Adaptive Context Compaction

Context management is treated as a first-class engineering concern rather than an afterthought. As agents operate within ReAct loops, tool observations (file contents, command outputs, search results) accumulate and quickly dominate the context window, frequently consuming 70-80% of the available token budget. OpenDev’s Adaptive Context Compaction (ACC) framework monitors token usage incrementally using API-reported prompt token counts as calibration anchors, then applies five graduated reduction strategies at progressive pressure thresholds: warning at 70%, observation masking at 80%, fast pruning at 85%, aggressive masking at 90%, and full LLM-based compaction at 99%.

The innovation is that cheaper strategies (masking old tool outputs with reference pointers, pruning outputs beyond the recency window) often reclaim sufficient space to avoid expensive LLM summarization. Quantitative results show ACC reduces peak context consumption by approximately 54% and extends typical session length from 15-20 turns (before emergency compaction) to 30-40 turns without requiring full conversation summarization. An artifact index tracks all files touched and operations performed, which is serialized into compaction summaries to ensure the agent remembers what it has worked with even after history compression. The full conversation is archived to a scratch file, making compaction effectively non-lossy since the agent can recover any detail by reading the archive.

Dual-Memory Architecture

For the thinking phase, which requires strategic context but cannot consume unbounded history, OpenDev implements a cognitive-science-inspired dual-memory architecture. Episodic memory provides an LLM-generated summary of the full conversation history, capturing strategic long-range context (decisions made, overall goals, key findings, important file paths) while preserving actionable identifiers. This summary is regenerated periodically (every 5 messages) rather than on every turn, amortizing cost and preventing summary drift that would occur from iteratively compressing a compression. Working memory retains the last several message pairs (6 exchanges by default) verbatim, providing fine-grained operational details needed for immediate decision-making. The combined injection provides both strategic overview and tactical precision while keeping the thinking token budget bounded regardless of conversation length.
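The dual-memory injection can be sketched as a small class. This is an illustrative reconstruction under the stated parameters (refresh every 5 messages, 6 verbatim exchanges); the class and method names are hypothetical.

```python
# Sketch of episodic + working memory for the thinking phase.
class DualMemory:
    def __init__(self, summarize, refresh_every=5, working_pairs=6):
        self.summarize = summarize          # LLM summarizer callback
        self.refresh_every = refresh_every  # regenerate episodic summary periodically
        self.working_pairs = working_pairs  # verbatim recent exchanges
        self.episodic = ""
        self._since_refresh = 0

    def build_context(self, history):
        # Episodic memory: regenerate every N messages to amortize cost and
        # avoid the drift caused by re-compressing a compression.
        self._since_refresh += 1
        if self._since_refresh >= self.refresh_every or not self.episodic:
            self.episodic = self.summarize(history)  # summarize the FULL history
            self._since_refresh = 0
        # Working memory: last K exchanges verbatim, for tactical precision.
        recent = history[-2 * self.working_pairs:]
        return {"episodic": self.episodic, "working": recent}
```

The key property is that the injected context stays bounded regardless of conversation length: one summary plus a fixed-size verbatim tail.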

Event-Driven System Reminders

A fundamental reliability problem in long-running sessions is attention decay: as conversations grow, model attention drifts away from initial system prompt instructions, leading to silent failures like premature task completion, abandoned error recovery, and unchecked exploration spirals. OpenDev addresses this with context-aware system reminders: short, single-purpose messages injected exactly when the agent needs them, right before the decision point where it would otherwise fail.

Twenty-four named reminders organized into six categories (phase control, task lifecycle, todo enforcement, error recovery, behavioral correction, JSON retry) are triggered by event detectors that examine conversation state after each iteration. For example, when the agent calls task_complete with incomplete todos, a reminder is injected listing outstanding items. When the agent hits a file-editing error, a classified recovery template provides specific guidance (“the file has changed since you last read it; re-read and retry”) rather than generic retry instructions. Crucially, reminders use role:user rather than role:system because user messages at maximum recency receive higher salience than system context buried under tool results.

Reminder frequency is capped per type using guardrail counters (2 nudges max for todo enforcement, 3 attempts max for error recovery) to prevent reminders from becoming background noise the model learns to ignore. This injection timing represents a sophisticated approach to behavioral steering over long horizons, treating prompt influence as a signal-to-noise engineering problem.
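Capped, role:user reminder injection can be sketched as follows. The caps for todo enforcement and error recovery come from the paper; the reminder-kind names and message format are illustrative.

```python
from collections import Counter

# Per-type guardrail caps from the paper; other kinds are uncapped here.
REMINDER_CAPS = {"todo_enforcement": 2, "error_recovery": 3}

class ReminderEngine:
    def __init__(self):
        self.fired = Counter()

    def maybe_inject(self, kind: str, text: str):
        """Return a role:user reminder message, or None once the cap is hit.

        role:user is deliberate: a user message at maximum recency gets
        higher salience than system context buried under tool results.
        """
        cap = REMINDER_CAPS.get(kind)
        if cap is not None and self.fired[kind] >= cap:
            return None
        self.fired[kind] += 1
        return {"role": "user",
                "content": f"<system-reminder>{text}</system-reminder>"}
```

The cap is what keeps reminders informative: a nudge the model sees twice is a signal, one it sees twenty times is noise it learns to ignore.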

Tool System and Extensibility

Comprehensive Built-In Tools

OpenDev provides 35 built-in tools organized into 12 handler categories: file operations (read, write, edit with 9-pass fuzzy matching, list, search), process execution (shell commands with auto-background for servers, process management), web interaction (browser-engine fetching via Crawl4AI, privacy-respecting search, screenshots), semantic code analysis (6 LSP-based tools for symbol resolution, renaming, references), user interaction and task management (structured multi-choice questions, kanban-style todo tracking), visual analysis (screenshot capture, image analysis via vision models), notebooks (Jupyter cell editing), planning (plan presentation and approval), completion signaling, MCP tool discovery, batch execution, subagent delegation, and skill loading.

The file editing tool is particularly sophisticated, implementing a chain-of-responsibility pattern with nine replacer classes that address LLM formatting drift: exact match, line-trimmed, block-anchor with sequence matching, whitespace-normalized, indentation-flexible, escape-normalized, trimmed-boundary, context-aware, and multi-occurrence. This design absorbs LLM imprecision as a first-class property: the agent’s intent is usually correct even when its literal reproduction of target text has minor formatting variations. Each replacer returns the actual substring found in the file (not the search query) to preserve original formatting.
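A minimal sketch of the chain-of-responsibility matcher, showing the first two of the nine passes; the remaining passes and the function names are illustrative, not the codebase's actual implementation.

```python
def exact_match(content: str, target: str):
    """Pass 1: literal substring match."""
    i = content.find(target)
    return content[i:i + len(target)] if i != -1 else None

def line_trimmed_match(content: str, target: str):
    """Pass 2: compare line by line with surrounding whitespace trimmed."""
    lines = content.splitlines(keepends=True)
    t_lines = [l.strip() for l in target.splitlines()]
    for start in range(len(lines) - len(t_lines) + 1):
        window = lines[start:start + len(t_lines)]
        if [l.strip() for l in window] == t_lines:
            return "".join(window)  # actual file text, original formatting kept
    return None

REPLACER_CHAIN = [exact_match, line_trimmed_match]  # ...seven more in the paper

def find_edit_target(content: str, target: str):
    """Try each replacer in order; the first hit wins."""
    for replacer in REPLACER_CHAIN:
        found = replacer(content, target)
        if found is not None:
            return found
    return None
```

Returning the window from the file rather than the model's query is what preserves original indentation and trailing whitespace through the edit.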

Language Server Protocol Integration

For semantic code analysis, OpenDev integrates Language Server Protocol via a four-layer abstraction: an agent-facing tool layer (6 tools), a symbol retriever with unified API and pattern matching, an LSP server wrapper handling language detection (30+ file extensions) and server lifecycle management, and a low-level protocol handler managing JSON-RPC 2.0 communication over stdio. Each language server maintains a two-level cache keyed by file content hash: Level 1 caches raw LSP responses, Level 2 caches processed symbol trees. When files haven’t changed, queries return from Level 2 without contacting the server; when files change but response schema hasn’t, only Level 2 recomputes from cached Level 1 data.
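The two-level cache keyed by content hash can be sketched as follows; the class name, the stubbed server call, and the choice of hash are illustrative assumptions.

```python
import hashlib

class SymbolCache:
    def __init__(self, query_server, process):
        self.query_server = query_server  # raw LSP round-trip (expensive)
        self.process = process            # raw response -> symbol tree
        self.l1 = {}                      # content hash -> raw LSP response
        self.l2 = {}                      # content hash -> processed symbol tree

    def symbols(self, file_text: str):
        key = hashlib.sha256(file_text.encode()).hexdigest()
        if key in self.l2:                # unchanged file: no server contact
            return self.l2[key]
        raw = self.l1.get(key)
        if raw is None:                   # changed file: one LSP round-trip
            raw = self.query_server(file_text)
            self.l1[key] = raw
        self.l2[key] = self.process(raw)  # recompute tree from cached raw data
        return self.l2[key]
```

The split matters when the processing logic changes but files do not: Level 2 can be rebuilt from cached Level 1 responses without touching the server.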

Lazy Tool Discovery via MCP

A critical innovation for context efficiency is lazy discovery of external tools via Model Context Protocol. The core problem: a system with 100 external tools, each schema averaging 200 tokens, consumes 20,000 tokens purely for tool definitions before the first user message. OpenDev’s solution maintains a set of discovered tools, initially empty. When the agent calls search_tools with a query (e.g., “database query tools”), a keyword-based scorer ranks all registered MCP tools, returns top matches with names and descriptions, and marks matched tools as discovered so their schemas are included in subsequent LLM calls. Direct invocation of an MCP tool by qualified name auto-discovers it without prior search. This approach reduced baseline MCP integration overhead from 40% of context to under 5%, growing only as capabilities are actually used.
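The discovery flow can be sketched with a toy keyword scorer; the registry class and the scoring heuristic are simplified stand-ins for what the paper describes.

```python
class LazyToolRegistry:
    def __init__(self, tools):            # {tool_name: description}
        self.tools = tools
        self.discovered = set()           # initially empty: no schemas in prompt

    def search_tools(self, query: str, top_k: int = 3):
        """Keyword-score all registered tools and mark top hits discovered."""
        terms = set(query.lower().split())
        overlap = lambda n: len(terms & set(self.tools[n].lower().split()))
        ranked = sorted(self.tools, key=lambda n: -overlap(n))
        hits = [n for n in ranked if overlap(n) > 0][:top_k]
        self.discovered.update(hits)      # schemas included from now on
        return hits

    def invoke(self, name: str):
        self.discovered.add(name)         # direct invocation auto-discovers
        return f"called {name}"

    def active_schemas(self):
        """Only discovered tools contribute schemas to subsequent LLM calls."""
        return sorted(self.discovered)
```

The prompt cost therefore starts near zero and grows only with capabilities actually used, which is the mechanism behind the 40%-to-under-5% overhead reduction.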

Subagent Orchestration

OpenDev supports spawning specialized subagents with filtered tool registries and independent conversation contexts. Eight built-in subagent types address distinct roles: Code-Explorer (read-only navigation), Planner (read + write for plans), PR-Reviewer (code review with diff analysis), Security-Reviewer (vulnerability scanning), Web-Clone (website replication), Web-Generator (site creation from specifications), Project-Init (scaffold generation), and Ask-User (structured surveys). Each subagent is an instance of the same MainAgent class but with an allowed_tools parameter restricting which tools appear in its schema.

A key design property is automatic parallelization: when the main agent emits multiple spawn_subagent calls in the same LLM response, the SubAgentManager executes them concurrently via asyncio.gather, enabling natural fan-out work patterns without explicit concurrency management. The dual-agent architecture that separates planning from execution exemplifies schema-level safety: the Planner subagent cannot write files because write tools are absent from its schema, not because runtime checks block attempts. This makes violations structurally impossible rather than relying on permission enforcement.
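The fan-out pattern reduces to a gather over coroutines. A minimal sketch: the subagent loop is stubbed to a single line, and the function names are illustrative.

```python
import asyncio

async def run_subagent(agent_type: str, task: str) -> str:
    # Stand-in for a full subagent reasoning loop with a filtered tool registry.
    await asyncio.sleep(0)
    return f"{agent_type}: {task} done"

async def spawn_all(calls):
    # All spawn_subagent calls emitted in one LLM response run concurrently;
    # gather preserves the order of the original calls in its results.
    return await asyncio.gather(*(run_subagent(t, task) for t, task in calls))

results = asyncio.run(spawn_all([
    ("Code-Explorer", "map the module layout"),
    ("Security-Reviewer", "scan for injection risks"),
]))
```

No explicit concurrency management is needed from the model's side: emitting several spawn calls in one response is sufficient to trigger parallel execution.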

Safety Architecture

Defense-in-Depth Approach

Because the agent can execute arbitrary shell commands, overwrite files, and spawn persistent processes, OpenDev employs a five-layer defense-in-depth safety architecture where each layer operates independently. Layer 1 provides prompt-level guardrails with security policy, action safety rules, read-before-edit requirements, git workflow conventions, and error recovery guidance embedded in the system prompt. Layer 2 implements schema-level tool restrictions: a plan-mode whitelist removes write tools entirely from the schema, per-subagent allowed_tools filters apply, and MCP discovery is gated. Layer 3 provides a runtime approval system with three autonomy levels (Manual, Semi-Auto, Auto), pattern/command/prefix/danger matching rules, and persistent permissions that survive session restarts. Layer 4 implements tool-level validation with a DANGEROUS_PATTERNS blocklist for catastrophic commands (rm -rf /, fork bombs, curl|bash pipes), stale-read detection preventing overwrites of concurrent edits, output truncation, and timeouts. Layer 5 provides lifecycle hooks where external scripts can hard-block tool calls (exit code 2), mutate arguments transparently, or observe events for auditing.

The critical architectural insight is that schema gating (making unsafe tools invisible rather than blocked) is fundamentally more robust than runtime permission checks. When a tool is absent from the agent’s schema, the model cannot reason about invoking it, argue for exceptions, or probe for bypass conditions. This represents safety through architectural constraints rather than behavioral enforcement.
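The Layer-4 blocklist check is conceptually a handful of regexes. The patterns below are illustrative of the catastrophic-command classes the paper names (recursive root deletion, fork bombs, curl|bash pipes), not OpenDev's actual list.

```python
import re

# Hypothetical patterns for the command classes the paper cites as hardcoded.
DANGEROUS_PATTERNS = [
    r"\brm\s+-rf\s+/(\s|$)",           # recursive delete of the filesystem root
    r":\(\)\s*\{\s*:\|\:&\s*\}\s*;:",  # the classic shell fork bomb
    r"curl\s+[^|]*\|\s*(ba)?sh",       # piping a remote script straight into a shell
]

def is_blocked(command: str) -> bool:
    """Tool-level validation: reject commands matching any dangerous pattern."""
    return any(re.search(p, command) for p in DANGEROUS_PATTERNS)
```

Note the deliberate precision: `rm -rf /` on the root is blocked while `rm -rf ./build` is not, leaving the approval layers above to handle the merely risky cases.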

Approval Persistence and Doom-Loop Detection

The approval system persists user decisions to disk (user-global and project-scoped JSON stores) so that when a user marks a command as “always allow,” that decision survives session restarts. This prevents approval fatigue that would lead to blanket auto-approval defeating the safety system. Dangerous patterns (rm -rf /, sudo, dd to device files) are hardcoded at priority 100 and cannot be overridden by user configuration.

To prevent stuck loops where agents repeatedly call the same tool with identical arguments, OpenDev implements fingerprint-based doom-loop detection with two-tier escalation. Each tool call is fingerprinted as MD5(tool_name, args) and tracked in a sliding window of 20 recent calls. If any fingerprint appears 3+ times, the system injects a warning message into the conversation. If repetition continues, it escalates to an approval-based pause presenting “Allow / Break?” to the user. This catches stuck loops within 3 repetitions, much faster than coarse-grained iteration caps.
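The detector can be sketched directly from that description. The window size, repeat threshold, and MD5 fingerprinting follow the paper; the class shape and return values are illustrative.

```python
import hashlib
import json
from collections import deque

class DoomLoopDetector:
    WINDOW, WARN_AT = 20, 3  # sliding window of 20 calls, warn at 3 repeats

    def __init__(self):
        self.recent = deque(maxlen=self.WINDOW)
        self.warned = set()

    def record(self, tool_name: str, args: dict) -> str:
        # Fingerprint = MD5 over tool name plus canonicalized arguments.
        fp = hashlib.md5(
            (tool_name + json.dumps(args, sort_keys=True)).encode()
        ).hexdigest()
        self.recent.append(fp)
        if self.recent.count(fp) >= self.WARN_AT:
            if fp in self.warned:
                return "pause"   # tier 2: approval-based "Allow / Break?" pause
            self.warned.add(fp)
            return "warn"        # tier 1: inject a warning into the conversation
        return "ok"
```

Sorting the argument keys before hashing makes the fingerprint insensitive to dict ordering, so semantically identical calls always collide.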

Memory and Persistence

Agentic Context Engineering (ACE)

OpenDev implements an experience-driven memory pipeline that accumulates project-specific knowledge across sessions. The system maintains a playbook: a collection of natural-language bullets tagged with effectiveness counters (helpful, harmful, neutral) and timestamps. A four-stage pipeline keeps the playbook current. The Selection stage uses BulletSelector to score every bullet by a weighted combination of effectiveness (0.5), recency decay (0.3), and semantic similarity to the current query via cosine embeddings (0.2); top-ranked bullets are injected into the system prompt. The Episodic Reflection stage runs every 5 messages: a Reflector analyzes accumulated experience, producing reasoning traces, error identification, root-cause analysis, and correct approaches. In the Curation stage, a Curator reads the reflections and plans concrete playbook mutations (add, update, tag, remove bullets), emitted as delta batches. The Application stage applies these mutations to the bullet table and persists them to session-scoped JSON.
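The selection score is a straightforward weighted sum. The weights (0.5 / 0.3 / 0.2) come from the paper; the effectiveness normalization, the exponential decay with a 7-day half-life, and the bullet field names are assumptions for the sketch, and `query_sim` stands in for cosine similarity over embeddings.

```python
import math
import time

W_EFFECTIVE, W_RECENCY, W_SIMILARITY = 0.5, 0.3, 0.2  # weights from the paper

def score_bullet(bullet, query_sim, now=None, half_life_days=7.0):
    """Score one playbook bullet for injection into the system prompt."""
    now = now or time.time()
    h, m = bullet["helpful"], bullet["harmful"]
    effectiveness = h / (h + m) if h + m else 0.5      # neutral prior (assumed)
    age_days = (now - bullet["updated_at"]) / 86400
    recency = math.exp(-age_days / half_life_days)     # assumed decay shape
    return (W_EFFECTIVE * effectiveness
            + W_RECENCY * recency
            + W_SIMILARITY * query_sim)
```

Bullets that were recently confirmed helpful and match the current query float to the top; stale or harmful ones sink without needing explicit deletion.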

This closed-loop system enables the agent to learn from tool outcomes within and across sessions without hardcoding strategies into prompts. The playbook evolves based on what actually works in the specific project context.

Session Storage and Undo

Each conversation is stored as two files: JSON metadata (session ID, timestamps, working directory, title, summary, cost tracking) and JSONL transcript (one message per line with role, content, timestamps, tool calls, token counts). Splitting metadata from messages means listing sessions requires reading only small metadata files. Sessions auto-save every 5 turns with atomic file operations (write to temp, flock, rename) to prevent data loss under concurrent access.

For undo capability, OpenDev maintains shadow git repositories: bare repositories that share no history with the user’s actual repository but use the same working directory. At every agent step that modifies files, the system runs git add . && git write-tree against the shadow repository’s object store, recording a tree hash in session metadata. The /undo command computes git diff between current tree and snapshot tree, identifies changed files, and restores them via git checkout. This leverages git’s content-addressable storage for perfect file-level restoration without interfering with the user’s version control workflow.
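The shadow-repository trick can be sketched with git plumbing via subprocess. This assumes `git` is on PATH; the helper names are illustrative, and the undo here is simplified (it restores snapshotted files but does not delete files created afterward, whereas the real /undo diffs trees).

```python
import os
import subprocess

def _shadow_env(shadow_dir: str, worktree: str) -> dict:
    # Point git at the shadow object store while reusing the user's worktree;
    # the shadow repo shares no history with the user's actual repository.
    return dict(os.environ, GIT_DIR=shadow_dir, GIT_WORK_TREE=worktree)

def snapshot(shadow_dir: str, worktree: str) -> str:
    """Stage the worktree into the shadow index and return a tree hash."""
    env = _shadow_env(shadow_dir, worktree)
    subprocess.run(["git", "add", "-A", "."], cwd=worktree, env=env, check=True)
    out = subprocess.run(["git", "write-tree"], cwd=worktree, env=env,
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()  # recorded in session metadata at each step

def undo(shadow_dir: str, worktree: str, tree_hash: str) -> None:
    """Restore worktree files from a previously recorded snapshot tree."""
    env = _shadow_env(shadow_dir, worktree)
    subprocess.run(["git", "read-tree", tree_hash], cwd=worktree, env=env,
                   check=True)
    subprocess.run(["git", "checkout-index", "-a", "-f"], cwd=worktree,
                   env=env, check=True)
```

Because snapshots live in the shadow object store, the user's own `.git` directory, branches, and staging area are never touched.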

Production Deployment Considerations

Configuration Hierarchy

Configuration follows a four-tier hierarchy ensuring reasonable defaults while allowing customization: built-in defaults, environment variables (API credentials only, never from files), user-global settings (~/.opendev/settings.json), and project-local settings (.opendev/settings.json). Each later tier overrides the earlier ones. Context window limits are derived automatically from model capabilities fetched from an external catalog API and cached locally with 24-hour TTL using a stale-while-revalidate strategy.
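The tiered override behaves like a recursive dict merge. A minimal sketch; the merge helper and the example keys are illustrative, while the file paths match the hierarchy above.

```python
def merge_settings(*tiers: dict) -> dict:
    """Later tiers override earlier ones; nested dicts merge recursively."""
    merged: dict = {}
    for tier in tiers:
        for key, value in tier.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_settings(merged[key], value)
            else:
                merged[key] = value
    return merged

effective = merge_settings(
    {"model": "default-model", "thinking": {"level": "MEDIUM"}},  # built-in defaults
    {"thinking": {"level": "HIGH"}},   # ~/.opendev/settings.json (user-global)
    {"model": "project-model"},        # .opendev/settings.json (project-local)
)
```

Note that credentials deliberately sit outside this merge: they come only from environment variables, never from settings files.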

Dual Interface Support

OpenDev supports both terminal UI (TUI built on Textual with blocking modal approvals) and web UI (FastAPI/WebSockets with asynchronous polling approvals). Both implement a shared UICallback contract, keeping the agent layer UI-agnostic. The dual-path input dispatch at the REPL boundary routes slash commands (session management, mode switching, model selection, MCP configuration) through deterministic command handlers, while natural language queries enter the agent reasoning loop. This separation ensures system-level operations remain fast and predictable.

Provider Abstraction and Caching

Each model selection triggers lazy initialization of provider-specific API clients (only models actually used are initialized), reducing startup latency. Model capabilities (context length, vision support, reasoning features) are cached locally with TTL refresh. For providers supporting prompt caching (currently Anthropic), the system splits assembled prompts into stable (cacheable) and dynamic parts, with the stable block carrying cache_control headers. Since system prompts are re-sent on every LLM call and the stable portion comprises 80-90% of the total, caching yields approximately 88% reduction in input token cost for the cached portion over multi-turn sessions.
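A back-of-envelope check on that savings claim, assuming Anthropic's published multipliers where a cache write costs about 1.25x and a cache read about 0.1x the base input rate; the function is a hypothetical helper for the arithmetic.

```python
def cached_cost_ratio(read_multiplier=0.10, write_multiplier=1.25, turns=20):
    """Average per-turn cost of the stable prompt portion, vs 1.0 uncached.

    The first turn writes the cache; every subsequent turn reads it, so the
    write premium amortizes away as the session lengthens.
    """
    total = write_multiplier + read_multiplier * (turns - 1)
    return total / turns
```

For long multi-turn sessions the ratio approaches the 0.10 read multiplier, i.e. close to a 90% reduction on the cached portion, consistent with the ~88% figure above.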

Lessons Learned and Design Principles

The paper synthesizes five cross-cutting design tensions that shaped the architecture. Context Pressure as Central Constraint recognizes that context is consumed by both system (prompts, tool schemas) and agent actions (tool outputs, history), with tool outputs consuming 70-80% of context in typical sessions. Key lessons include treating context as a budget not a buffer, implementing graduated reduction stages, offloading large outputs to filesystem, and calibrating from API-reported token counts not local estimates.

Behavioral Steering Over Long Horizons acknowledges that system prompt influence decays as conversations grow. Lessons include injecting reminders at decision points using role:user for maximum salience, separating thinking from action in distinct API calls, encoding explicit decision trees for tool selection, and using provider-conditional prompt sections.

Safety Through Architectural Constraints emphasizes that schema gating (removing tools from available set) is fundamentally more robust than runtime permission checks. Lessons include making unsafe tools invisible not blocked, implementing defense-in-depth with independent layers, persisting approval decisions, and prioritizing modal UI during interrupts.

Designing for Approximate Outputs recognizes that LLMs reliably produce approximately-correct outputs. Lessons include designing tools to absorb LLM imprecision (9-pass fuzzy matching), adapting recovery hints to the agent’s available tool set, auto-promoting server-like commands to background execution, and auto-installing missing dependencies.

Lazy Loading and Bounded Growth notes that eager loading fails at scale. Lessons include loading metadata indexes at startup while deferring full content to the point of use, bounding every resource that grows with session length (iteration limits, undo history, concurrent tools, nudge budgets), implementing self-healing indexes, and routing deterministic operations outside the agent loop.

Benchmark Context and Future Directions

While OpenDev is a production system rather than a research benchmark submission, the paper positions its design decisions within the broader evaluation ecosystem. Terminal-Bench findings show frontier agents resolve fewer than 65% of curated CLI tasks, while LongCLI-Bench reports pass rates below 20% for multi-category programming tasks, suggesting substantial room for improvement in context management and multi-step reasoning. Future work identified includes quantitative evaluation on SWE-bench and Terminal-Bench, adaptive resource allocation that adjusts thresholds based on task complexity, cross-project knowledge transfer in memory systems, structured code representations (dependency graphs, call graphs, ontologies), richer multi-agent coordination patterns beyond hierarchical delegation, learned system reminder optimization via reinforcement learning, and hybrid CLI-IDE integration using the shared callback protocol.

Significance for LLMOps

OpenDev represents a comprehensive blueprint for building production-ready LLM-based agents that operate in real development environments. The explicit treatment of scaffolding (pre-runtime agent assembly) versus harness (runtime orchestration wrapping the reasoning loop) provides a clean architectural pattern for agent lifecycle management. The compound AI systems approach with per-workflow model routing demonstrates how to optimize cost-quality-latency tradeoffs without architectural coupling to specific models. The five-layer safety architecture and defense-in-depth principles offer concrete patterns for preventing catastrophic failures in agents with destructive capabilities. The adaptive context engineering framework with graduated compaction, dual-memory, event-driven reminders, and experience-driven playbooks addresses the central challenge of maintaining agent effectiveness over long sessions. The comprehensive documentation of design tensions, failed approaches, and lessons learned makes this work valuable not just as a coding agent implementation but as a reference architecture for production LLM systems more broadly.
