ZenML

AI-Powered Email Search Assistant with Advanced Cognitive Architecture

Superhuman 2024
View original source

Superhuman developed Ask AI to solve the challenge of inefficient email and calendar searching, where users spent up to 35 minutes weekly trying to recall exact phrases and sender names. They evolved from a single-prompt RAG system to a sophisticated cognitive architecture with parallel processing for query classification and metadata extraction. The solution achieved sub-2-second response times and reduced user search time by 14% (5 minutes per week), while maintaining high accuracy through careful prompt engineering and systematic evaluation.

Industry

Tech

Technologies

Overview

Superhuman, known for its premium email client, developed Ask AI as an AI-powered search assistant designed to transform how users navigate their inboxes and calendars. The product addresses a real productivity pain point: users were spending up to 35 minutes per week trying to recall exact phrases and sender names using traditional keyword search in their email clients. The Ask AI feature delivers context-aware answers to complex natural language queries like “When did I meet the founder of that series A startup for lunch?” — questions that would be nearly impossible to answer with traditional keyword-based search.

The case study provides valuable insights into the evolution of their LLM architecture, the challenges encountered in production, and the techniques employed to overcome them. It is worth noting that this case study is published by LangChain, which may have some inherent promotional bias, though the technical details shared appear credible and instructive.

Initial Architecture and Its Limitations

Superhuman’s first iteration of Ask AI used a relatively simple single-prompt LLM architecture that performed retrieval augmented generation (RAG). The flow was straightforward: the system would generate retrieval parameters using JSON mode, pass them through hybrid search and heuristic reranking, and then have the LLM produce an answer.

However, this single-prompt design encountered several production challenges:

These limitations are common in production LLM systems and highlight the gap between prototype performance and production-ready reliability. The Superhuman team’s experience validates that single-prompt RAG systems, while quick to build, often lack the sophistication needed for real-world use cases with diverse query types.

Evolved Multi-Agent Architecture

The team transitioned to a more sophisticated cognitive architecture that could better understand user intent and provide more accurate responses. The new design features parallel processing and task-specific handling, representing a significant architectural maturation.

Query Classification and Parameter Generation

When a user submits a query, two parallel processes occur:

Tool Classification: The system classifies the query based on user intent to determine which tools or data sources to activate. The classifier identifies whether the query requires email search only, email plus calendar event search, checking availability, scheduling an event, or a direct LLM response without tools. This classification ensures that only relevant tools are invoked, improving response quality and efficiency.

Metadata Extraction: Simultaneously, the system extracts relevant tool parameters such as time filters, sender names, or relevant attachments. These extracted parameters are used in retrieval to narrow the scope of search and improve accuracy.

This parallel processing approach is a sensible design choice for production systems, as it reduces latency by not serializing independent operations while also providing better separation of concerns in the architecture.

Task-Specific Tool Use

Once the query is classified, the appropriate tools are invoked. For search tasks, the system employs hybrid search combining semantic and keyword-based approaches, with reranking algorithms to prioritize the most relevant information. This hybrid approach is notable because it acknowledges that pure semantic search is not always superior — sometimes exact keyword matches are exactly what the user needs.

Response Generation

Based on the classification from the first step, the system selects different prompts and preferences. These prompts contain context-specific instructions with query-specific examples and encoded user preferences. The LLM then synthesizes information to generate a tailored response.

Rather than relying on one large, all-encompassing prompt, the Ask AI agent uses task-specific guidelines during post-processing. This modular approach to prompting allows the agent to maintain consistent quality across diverse tasks — a practical technique for production systems that need to handle multiple use cases with varying requirements.

Prompt Engineering Techniques

The case study reveals some interesting prompt engineering practices that Superhuman developed through iteration:

Structured Prompts: They organized their prompts by adding chatbot rules to define system behavior, task-specific guidelines, and semantic few-shot examples to guide the LLM. This nesting of rules helped the LLM more reliably follow instructions.

Double Dipping Instructions: Perhaps the most interesting technique mentioned is “double dipping” — repeating key instructions in both the initial system prompt and the final user message. This dual reinforcement ensures that essential guidelines are rigorously followed, addressing a common production issue where LLMs sometimes lose sight of instructions in longer contexts. While this technique may seem redundant, it is a pragmatic solution to real-world reliability challenges.

Evaluation and Launch Strategy

Superhuman’s evaluation approach demonstrates a mature understanding of how to validate AI products before and during production deployment:

Static Dataset Testing: The team first tested against a static dataset of questions and answers, measuring retrieval accuracy and comparing how changes to their prompt impacted accuracy. This provides a baseline and enables regression testing.

Staged Rollout: They adopted a “launch and learn” approach with systematic expansion:

This four-month testing process allowed the team to identify the most pressing user needs and prioritize improvements accordingly. The staged approach is a best practice for production LLM systems, as it limits blast radius while gathering real-world feedback that static test sets cannot capture.

User Feedback Integration: The use of simple thumbs up/down feedback from users provides a continuous signal for identifying issues and measuring satisfaction, though it is worth noting that such binary feedback has limitations in diagnosing specific failure modes.

Production Performance Requirements

The case study mentions specific production requirements that shaped the architecture:

The emphasis on response speed is particularly relevant for conversational AI products where latency directly impacts user experience and adoption.

User Experience Considerations

The team put significant thought into how Ask AI integrates into the existing product:

The decision to offer both search bar integration and a chat interface came from user feedback and testing, demonstrating the importance of letting real usage patterns inform product decisions rather than making assumptions about how users will interact with AI features.

Results

According to the case study, users have cut search time by 5 minutes every week since Ask AI’s release, representing a 14% time savings compared to the previous 35 minutes spent on email and calendar search. While these metrics are self-reported by the company and should be viewed with appropriate skepticism, they provide a concrete measure of the product’s claimed value.

Key LLMOps Takeaways

This case study illustrates several important LLMOps principles for production systems:

The evolution from a single-prompt RAG system to a multi-agent architecture with parallel processing and task-specific handling demonstrates that production AI systems often need more sophistication than initial prototypes suggest. The architectural patterns employed — query classification, parallel metadata extraction, hybrid search with reranking, and task-specific prompt selection — represent a mature approach to handling diverse user intents reliably.

The prompt engineering techniques, particularly the “double dipping” of instructions, show pragmatic solutions to real-world LLM reliability challenges. The staged evaluation and rollout process demonstrates how to systematically validate AI products while limiting risk and gathering actionable feedback.

Overall, the case study provides a useful example of how a consumer productivity product evolved its LLM architecture to meet production requirements around reliability, speed, and diverse query handling.

More Like This

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.

healthcare fraud_detection customer_support +90

Large-Scale Personalization and Product Knowledge Graph Enhancement Through LLM Integration

DoorDash 2025

DoorDash faced challenges in scaling personalization and maintaining product catalogs as they expanded beyond restaurants into new verticals like grocery, retail, and convenience stores, dealing with millions of SKUs and cold-start scenarios for new customers and products. They implemented a layered approach combining traditional machine learning with fine-tuned LLMs, RAG systems, and LLM agents to automate product knowledge graph construction, enable contextual personalization, and provide recommendations even without historical user interaction data. The solution resulted in faster, more cost-effective catalog processing, improved personalization for cold-start scenarios, and the foundation for future agentic shopping experiences that can adapt to real-time contexts like emergency situations.

customer_support question_answering classification +64

Reinforcement Learning for Code Generation and Agent-Based Development Tools

Cursor 2025

This case study examines Cursor's implementation of reinforcement learning (RL) for training coding models and agents in production environments. The team discusses the unique challenges of applying RL to code generation compared to other domains like mathematics, including handling larger action spaces, multi-step tool calling processes, and developing reward signals that capture real-world usage patterns. They explore various technical approaches including test-based rewards, process reward models, and infrastructure optimizations for handling long context windows and high-throughput inference during RL training, while working toward more human-centric evaluation metrics beyond traditional test coverage.

code_generation code_interpretation data_analysis +61