Google DeepMind developed Deep Research, a feature that acts as an AI research assistant using Gemini to help users learn about any topic in depth. The system takes a query, browses the web for about 5 minutes, and outputs a comprehensive research report that users can review and ask follow-up questions about. It relies on iterative planning, a transparent research process, and a sophisticated orchestration backend to manage long-running autonomous research tasks.
Google DeepMind’s Deep Research is one of the first production agentic systems that autonomously conducts multi-minute web research on behalf of users. The product, built on Gemini 1.5 Pro with specialized post-training, represents a significant departure from traditional synchronous AI interactions. This case study comes from a podcast conversation with the PM and Tech Lead of the Deep Research team, providing rare insight into the engineering decisions, evaluation strategies, and operational challenges of deploying a long-running AI agent at scale.
Deep Research addresses a common user pain point: queries that have multiple facets and traditionally require opening 50-60 browser tabs over a weekend of research. Rather than returning instant answers, the system takes approximately 5 minutes to browse the web, synthesize information from multiple sources, and produce a comprehensive research report with citations.
One of the key architectural decisions was implementing an “editable Chain of Thought” approach. Before the agent begins its work, it produces a research plan that the user can review and modify. This serves multiple purposes: it provides transparency about what the agent will do, gives users an opportunity to steer the research direction, and helps users understand topics they might not know enough about to specify precisely.
The team deliberately chose to present a proposed plan rather than asking users direct follow-up questions. This design decision stemmed from the observation that users often don’t know what questions to ask. By presenting what the agent would do by default, users gain insight into the topic’s facets while having the opportunity to refine the approach.
In practice, the team found that most users simply hit “start” without editing the plan—similar to Google’s “I’m Feeling Lucky” button. However, the transparency mechanism still provides value by helping users understand why they receive the report they get.
The agent primarily uses two tools: search and deep browsing within specific web pages. Execution follows an iterative pattern in which the agent searches, reads selected results in depth, and then revises its remaining plan based on what it has found.
This iterative planning capability was one of the hardest technical challenges. The team emphasized doing this in a generalizable way without requiring domain-specific training for each type of query or use case. They achieved this through careful post-training that balanced learning new behaviors without losing pre-training knowledge—essentially avoiding overfitting.
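The search/browse/re-plan loop described above can be sketched in a few lines. This is an illustrative reconstruction, not DeepMind's implementation: `search`, `browse`, and `replan` stand in for the real tools and model calls, and the top-3-results heuristic is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class ResearchState:
    """Accumulates the evolving plan and findings across loop iterations."""
    plan: list[str]
    findings: list[str] = field(default_factory=list)

def run_research(state: ResearchState, search, browse, replan, max_steps: int = 10):
    """Iterative plan-act-observe loop: take the next planned step, search,
    browse a few results in depth, then let the model revise the remaining plan."""
    for _ in range(max_steps):
        if not state.plan:
            break
        step = state.plan.pop(0)
        results = search(step)                   # tool 1: web search
        for url in results[:3]:                  # assumed heuristic: top 3 hits
            state.findings.append(browse(url))   # tool 2: deep page browsing
        state.plan = replan(state)               # model rewrites remaining steps
    return state.findings
```

The key property this sketch captures is that the plan is data, not code: every iteration the model can rewrite it, which is what distinguishes this from a static workflow.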
The system processes web content through both raw HTML and markdown transformation. While newer generation models are increasingly capable of native HTML understanding, markdown conversion helps reduce noise from JavaScript, CSS, and extraneous markup. The decision of which representation to use depends on the specific content and access method—for embedded snippets that are inherently HTML, the system preserves that format.
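As a rough illustration of the markdown-style cleanup described above, the sketch below strips script/style noise from HTML and keeps headings and text. It uses only the standard-library parser; the real pipeline is not public, and the heading-to-`#` mapping is an assumption.

```python
from html.parser import HTMLParser

class NoiseStripper(HTMLParser):
    """Drops <script>/<style> content and collapses markup to plain text,
    approximating the noise-reduction step described above (illustrative only)."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in ("h1", "h2", "h3"):
            self.parts.append("#" * int(tag[1]) + " ")  # markdown-ish heading

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in ("p", "h1", "h2", "h3", "li"):
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = NoiseStripper()
    parser.feed(html)
    return "".join(parser.parts).strip()
```

For example, `html_to_text("<h1>Title</h1><script>var x=1;</script><p>Body</p>")` yields `"# Title\nBody"`, keeping the content while discarding the script noise a model would otherwise have to wade through.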
Currently, the system does not use vision capabilities despite models improving significantly at visual question answering. The tradeoff between added latency from rendering and the value gained doesn’t favor vision for most queries—the cases where vision would help represent a small portion of the tail distribution.
Deep Research leverages Gemini’s million-to-two-million token context window, keeping all browsed content in context across conversation turns. This enables fast follow-up responses when information has already been collected. However, the team also implements RAG as a fallback when context limits are exceeded.
The decision framework for context versus RAG involves several considerations, chiefly whether the collected content still fits within the context window and how quickly follow-up questions need to be answered.
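The fits-in-context-or-fall-back-to-RAG decision can be expressed as a small routing check. This is a sketch under stated assumptions: the ~4-characters-per-token estimate is a rough heuristic, not Gemini's tokenizer, and the 1M budget reflects the context size mentioned above.

```python
def choose_memory_strategy(documents, context_budget_tokens=1_000_000,
                           estimate_tokens=lambda text: len(text) // 4):
    """Keep all browsed content in the model context while it fits; fall back
    to retrieval (RAG) over the collected documents once the budget is exceeded.
    The token estimator is a crude stand-in for a real tokenizer."""
    total = sum(estimate_tokens(d) for d in documents)
    if total <= context_budget_tokens:
        return {"strategy": "full_context", "tokens": total}
    return {"strategy": "rag", "tokens": total}
```

Keeping everything in context makes follow-up turns fast because nothing needs to be re-fetched; the RAG branch trades that speed for the ability to exceed the window.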
The team described evaluation as one of their most significant challenges. The high entropy of the output space—comprehensive text reports that vary significantly based on queries—makes automated evaluation difficult.
The team tracks distributional metrics across a development set.
When a new model version shows large distribution changes in these metrics, it serves as an early signal that something has changed (for better or worse).
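A minimal version of this drift signal compares a candidate model's metric distribution against a baseline. This is a sketch, not the team's actual tooling: the mean-shift-in-baseline-standard-deviations test and the 0.5 threshold are illustrative choices, and the metric itself (report length, source count, etc.) is whatever is collected per report.

```python
import statistics

def distribution_shift(baseline: list[float], candidate: list[float],
                       threshold: float = 0.5):
    """Flag a candidate whose per-report metric distribution drifts from the
    baseline by more than `threshold` baseline standard deviations.
    An early-warning signal, not a quality verdict."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    shift = abs(statistics.mean(candidate) - mu) / (sigma or 1.0)
    return shift > threshold, shift
```

A triggered flag says only "something changed"; whether the change is an improvement or a regression still requires human review, as the team emphasizes below.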
Despite the existence of auto-raters, the team relies heavily on human evaluation. They defined specific attributes they care about in collaboration with product leadership: comprehensiveness, completeness, and groundedness.
The PM noted candidly: “Sometimes you just have to have your PM review examples.”
Rather than organizing evaluations around verticals like travel or shopping, the team developed an ontology based on underlying research behavior types.
This ontology-based approach to evaluation ensures coverage across different user needs without over-optimizing for any single vertical.
Deep Research required building a new asynchronous execution platform, a significant departure from Google’s typically synchronous chat interactions.
The team didn’t reveal a public name for this orchestration system, but noted it differs from traditional workflow engines like Apache Airflow or AWS Step Functions because those are designed for static execution graphs, while agentic systems require dynamic adaptation based on the agent’s decisions.
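The contrast with static workflow engines can be made concrete: in the sketch below the next step is chosen at runtime by a `decide_next` function (standing in for the model's decisions), and state is checkpointed to disk after every step so a long-running job can survive a restart. This is an illustrative reconstruction, not the unnamed Google system.

```python
import json
import pathlib

def run_resumable(job_id: str, decide_next, execute, state_dir: str = "."):
    """Long-running agent job whose next step is decided dynamically rather
    than read from a static DAG, with state checkpointed after every step.
    On restart, the job reloads its checkpoint and continues where it left off."""
    path = pathlib.Path(state_dir) / f"{job_id}.json"
    state = json.loads(path.read_text()) if path.exists() else {"done": [], "results": {}}
    while (step := decide_next(state)) is not None:
        state["results"][step] = execute(step)
        state["done"].append(step)
        path.write_text(json.dumps(state))  # checkpoint: safe to crash after this
    return state
```

Because `decide_next` sees the accumulated state, the execution graph is effectively rebuilt at every step, which is exactly what tools like Airflow or Step Functions, with their predeclared graphs, are not designed for.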
The team discovered a counterintuitive finding about latency: users actually value the time the agent spends working. This contradicts typical Google product metrics where improved latency correlates with higher satisfaction and retention.
Initially, the team was very concerned about the 5-minute execution time. They even built two versions—a “hardcore mode” taking 15 minutes and a shipped version taking 5 minutes—with a hard cap to prevent exceeding 10 minutes. However, users have responded positively to longer execution times, perceiving them as evidence of thorough work.
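The soft-target-plus-hard-cap behavior described above can be sketched as a budgeted loop. The 5- and 10-minute figures mirror the shipped configuration; `steps` as an iterable of callables and the stop-once-we-have-something rule at the soft target are assumptions for illustration.

```python
import time

def research_with_cap(steps, soft_budget_s: float = 300.0, hard_cap_s: float = 600.0):
    """Run research steps until a soft latency target (~5 min), never exceeding
    a hard wall-clock cap (~10 min). Illustrative sketch of the shipped limits."""
    start = time.monotonic()
    results = []
    for step in steps:
        elapsed = time.monotonic() - start
        if elapsed >= hard_cap_s:
            break  # hard cap: stop unconditionally
        if elapsed >= soft_budget_s and results:
            break  # soft target: stop once something useful has been gathered
        results.append(step())
    return results
```

The same loop structure would accommodate a "hardcore mode" simply by raising both budgets.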
From an engineering perspective, the latency budget represents a tradeoff between exploration (being more complete) and verification (checking things the model might already know). The optimal balance likely varies by query type—some queries demand higher verification (e.g., historical financial data) while others allow more latitude.
Deep Research runs on a specialized version of Gemini 1.5 Pro that has undergone specific post-training. This explains why the feature couldn’t simply toggle to using Gemini 2.0 Flash—the post-training work is model-specific.
The team noted they don’t have special API access that external developers lack, but emphasized that significant effort goes into extracting strong performance from the model through post-training and product integration.
The team discussed several areas for future development.
The team expressed particular interest in thinking models not just for better reasoning, but for their potential to better balance internal knowledge use versus grounding in external sources—a key tension in research agents.
The Deep Research team noted learning from the Notebook LM team, particularly their approach of picking a focused problem and doing it exceptionally well rather than trying to build a general platform. They share learnings about extracting optimal performance from Gemini models, especially for the “last mile” of product development where subtle model behaviors significantly impact user experience.
This case study illustrates how building production agentic systems requires innovations across multiple dimensions: novel UX patterns for transparency and steering, sophisticated orchestration for long-running async execution, careful evaluation strategies that go beyond benchmarks, and thoughtful engineering of the context management and tool use layer.
Notion AI, serving over 100 million users with multiple AI features including meeting notes, enterprise search, and deep research tools, demonstrates how rigorous evaluation and observability practices are essential for scaling AI product development. The company uses Braintrust as its evaluation platform to manage the complexity of supporting multilingual workspaces, switching models rapidly, and maintaining product polish while shipping at the pace of the AI industry. Their approach holds that 90% of AI development time should be spent on evaluation and observability rather than prompting, with specialized data specialists creating targeted datasets and custom LLM-as-a-judge scoring functions to ensure consistent quality across their diverse AI product suite.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Stripe, processing approximately 1.3% of global GDP, has evolved from traditional ML-based fraud detection to deploying transformer-based foundation models for payments that process every transaction in under 100ms. The company built a domain-specific foundation model treating charges as tokens and behavior sequences as context windows, ingesting tens of billions of transactions to power fraud detection, improving card-testing detection from 59% to 97% accuracy for large merchants. Stripe also launched the Agentic Commerce Protocol (ACP) jointly with OpenAI to standardize how agents discover and purchase from merchant catalogs, complemented by internal AI adoption reaching 8,500 employees daily using LLM tools, with 65-70% of engineers using AI coding assistants and achieving significant productivity gains like reducing payment method integrations from 2 months to 2 weeks.