**Company:** Superhuman
**Title:** AI-Powered Email Search Assistant with Advanced Cognitive Architecture
**Industry:** Tech
**Year:** 2024

**Summary:** Superhuman developed Ask AI to solve the challenge of inefficient email and calendar searching, where users spent up to 35 minutes weekly trying to recall exact phrases and sender names. They evolved from a single-prompt RAG system to a sophisticated cognitive architecture with parallel processing for query classification and metadata extraction. The solution achieved sub-2-second response times and reduced user search time by 14% (5 minutes per week), while maintaining high accuracy through careful prompt engineering and systematic evaluation.
## Overview

Superhuman, known for its premium email client, developed Ask AI as an AI-powered search assistant designed to transform how users navigate their inboxes and calendars. The product addresses a real productivity pain point: users were spending up to 35 minutes per week trying to recall exact phrases and sender names using traditional keyword search in their email clients. The Ask AI feature delivers context-aware answers to complex natural language queries like "When did I meet the founder of that series A startup for lunch?" — questions that would be nearly impossible to answer with traditional keyword-based search.

The case study provides valuable insights into the evolution of their LLM architecture, the challenges encountered in production, and the techniques employed to overcome them. It is worth noting that this case study is published by LangChain, which may have some inherent promotional bias, though the technical details shared appear credible and instructive.

## Initial Architecture and Its Limitations

Superhuman's first iteration of Ask AI used a relatively simple single-prompt LLM architecture that performed retrieval augmented generation (RAG). The flow was straightforward: the system would generate retrieval parameters using JSON mode, pass them through hybrid search and heuristic reranking, and then have the LLM produce an answer.
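As a rough sketch of what such a single-prompt pipeline looks like (the `llm.complete` and `index.hybrid_search` calls below are hypothetical stand-ins, not Superhuman's actual API):

```python
import json

def answer_query(llm, index, query: str) -> str:
    """Single-prompt RAG: one JSON-mode call plans retrieval, one call answers."""
    # Step 1: generate retrieval parameters in JSON mode.
    params = json.loads(llm.complete(
        system="Return JSON with keys: keywords, sender, after, before.",
        user=query,
        response_format="json",
    ))
    # Step 2: hybrid (keyword + semantic) search, then heuristic reranking.
    hits = index.hybrid_search(
        keywords=params.get("keywords"),
        sender=params.get("sender"),
        date_range=(params.get("after"), params.get("before")),
    )
    top_hits = sorted(hits, key=lambda h: h.score, reverse=True)[:10]
    # Step 3: a single prompt synthesizes the final answer from retrieved emails.
    context = "\n\n".join(h.snippet for h in top_hits)
    return llm.complete(
        system="Answer the question using only the emails provided.",
        user=f"Emails:\n{context}\n\nQuestion: {query}",
    )
```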
However, this single-prompt design encountered several production challenges:

- The LLM did not always follow task-specific instructions reliably, leading to inconsistent behavior
- The system struggled to reason about dates accurately, which is critical for calendar-related queries involving deadlines and scheduling
- Performance varied significantly by search type — the system handled certain queries well (finding flights, summarizing company updates) but struggled with others (calendar availability, complex multi-step searches)

These limitations are common in production LLM systems and highlight the gap between prototype performance and production-ready reliability. The Superhuman team's experience validates that single-prompt RAG systems, while quick to build, often lack the sophistication needed for real-world use cases with diverse query types.

## Evolved Multi-Agent Architecture

The team transitioned to a more sophisticated cognitive architecture that could better understand user intent and provide more accurate responses. The new design features parallel processing and task-specific handling, representing a significant architectural maturation.

### Query Classification and Parameter Generation

When a user submits a query, two parallel processes occur:

**Tool Classification:** The system classifies the query based on user intent to determine which tools or data sources to activate. The classifier identifies whether the query requires email search only, email plus calendar event search, checking availability, scheduling an event, or a direct LLM response without tools. This classification ensures that only relevant tools are invoked, improving response quality and efficiency.

**Metadata Extraction:** Simultaneously, the system extracts relevant tool parameters such as time filters, sender names, or relevant attachments. These extracted parameters are used in retrieval to narrow the scope of search and improve accuracy.

This parallel processing approach is a sensible design choice for production systems, as it reduces latency by not serializing independent operations while also providing better separation of concerns in the architecture.
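A minimal sketch of this parallel step, assuming a hypothetical async LLM client (`acomplete`) and illustrative intent labels:

```python
import asyncio
import json

INTENTS = ["email_search", "email_and_calendar_search",
           "check_availability", "schedule_event", "direct_answer"]

async def classify_intent(llm, query: str) -> str:
    # Focused prompt: choose exactly one intent label.
    return await llm.acomplete(
        system=f"Classify the query as one of: {', '.join(INTENTS)}.",
        user=query,
    )

async def extract_metadata(llm, query: str) -> dict:
    # Focused prompt: pull out time filters, senders, attachment hints.
    raw = await llm.acomplete(
        system="Extract JSON with keys: time_filter, senders, has_attachment.",
        user=query,
        response_format="json",
    )
    return json.loads(raw)

async def plan(llm, query: str):
    # Run both calls concurrently so neither adds to the other's latency.
    intent, metadata = await asyncio.gather(
        classify_intent(llm, query),
        extract_metadata(llm, query),
    )
    return intent, metadata
```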
### Task-Specific Tool Use

Once the query is classified, the appropriate tools are invoked. For search tasks, the system employs hybrid search combining semantic and keyword-based approaches, with reranking algorithms to prioritize the most relevant information. This hybrid approach is notable because it acknowledges that pure semantic search is not always superior — sometimes exact keyword matches are exactly what the user needs.
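A common way to implement such a blend is simple score fusion; in the toy sketch below, the `alpha` weight, the recency tiebreak, and the scoring callables are assumptions rather than details from the case study:

```python
def hybrid_rank(query, docs, keyword_score, semantic_score, alpha=0.5, top_k=10):
    """Blend keyword and semantic relevance, then apply a heuristic rerank."""
    scored = [
        (alpha * keyword_score(query, d) + (1 - alpha) * semantic_score(query, d), d)
        for d in docs
    ]
    # Heuristic rerank: break score ties in favor of more recent messages,
    # a plausible email-specific signal.
    scored.sort(key=lambda pair: (pair[0], pair[1].get("timestamp", 0)), reverse=True)
    return [d for _, d in scored[:top_k]]
```

Pushing `alpha` toward 1.0 recovers pure keyword search, which is exactly what some queries need.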
### Response Generation

Based on the classification from the first step, the system selects different prompts and preferences. These prompts contain context-specific instructions with query-specific examples and encoded user preferences. The LLM then synthesizes information to generate a tailored response.

Rather than relying on one large, all-encompassing prompt, the Ask AI agent uses task-specific guidelines during post-processing. This modular approach to prompting allows the agent to maintain consistent quality across diverse tasks — a practical technique for production systems that need to handle multiple use cases with varying requirements.

## Prompt Engineering Techniques

The case study reveals some interesting prompt engineering practices that Superhuman developed through iteration:

**Structured Prompts:** They organized their prompts by adding chatbot rules to define system behavior, task-specific guidelines, and semantic few-shot examples to guide the LLM. This nesting of rules helped the LLM more reliably follow instructions.

**Double Dipping Instructions:** Perhaps the most interesting technique mentioned is "double dipping" — repeating key instructions in both the initial system prompt and the final user message. This dual reinforcement ensures that essential guidelines are rigorously followed, addressing a common production issue where LLMs sometimes lose sight of instructions in longer contexts. While this technique may seem redundant, it is a pragmatic solution to real-world reliability challenges.
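The technique is easy to picture in code; in the sketch below, the rule text and message structure are invented for illustration:

```python
CRITICAL_RULES = (
    "Cite the specific email or event you used. "
    "If the inbox does not contain the answer, say so instead of guessing."
)

def build_messages(task_guidelines: str, few_shot: list, context: str, query: str) -> list:
    messages = [{
        "role": "system",
        # First dip: the rules appear in the system prompt.
        "content": f"You are an inbox assistant.\n{task_guidelines}\nRules: {CRITICAL_RULES}",
    }]
    messages.extend(few_shot)  # semantic few-shot examples for this task type
    messages.append({
        "role": "user",
        # Second dip: the same rules repeat next to the question, where a
        # long retrieved context is least likely to dilute them.
        "content": f"{context}\n\nQuestion: {query}\n\nRemember: {CRITICAL_RULES}",
    })
    return messages
```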
## Evaluation and Launch Strategy

Superhuman's evaluation approach demonstrates a mature understanding of how to validate AI products before and during production deployment:

**Static Dataset Testing:** The team first tested against a static dataset of questions and answers, measuring retrieval accuracy and comparing how changes to their prompt impacted accuracy. This provides a baseline and enables regression testing.
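A regression harness in this spirit can be quite small; the dataset format and top-k hit metric below are assumptions rather than Superhuman's actual setup:

```python
STATIC_DATASET = [
    {"question": "When is my flight to SFO?", "expected_doc_id": "email_123"},
    # ... more curated question/answer pairs ...
]

def retrieval_accuracy(search_fn, dataset, k: int = 10) -> float:
    """Fraction of questions whose expected document appears in the top-k results."""
    hits = 0
    for case in dataset:
        results = search_fn(case["question"], top_k=k)
        if case["expected_doc_id"] in (r["id"] for r in results):
            hits += 1
    return hits / len(dataset)

# Re-run after every prompt change and diff against the stored baseline:
# a drop in accuracy flags a regression before it reaches users.
```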
**Staged Rollout:** They adopted a "launch and learn" approach with systematic expansion:

- Internal pod stakeholders with thumbs up/down feedback
- Company-wide launch with the same feedback mechanism
- Dedicated AI beta group
- Community champions
- Beta waitlist
- General availability

This four-month testing process allowed the team to identify the most pressing user needs and prioritize improvements accordingly. The staged approach is a best practice for production LLM systems, as it limits blast radius while gathering real-world feedback that static test sets cannot capture.

**User Feedback Integration:** The use of simple thumbs up/down feedback from users provides a continuous signal for identifying issues and measuring satisfaction, though it is worth noting that such binary feedback has limitations in diagnosing specific failure modes.

## Production Performance Requirements

The case study mentions specific production requirements that shaped the architecture:

- **Sub-2-second response times** to maintain a smooth user experience — a critical constraint that influenced architectural decisions around parallelization
- **Reduced hallucinations** through post-processing layers and brief follow-up validation

The emphasis on response speed is particularly relevant for conversational AI products where latency directly impacts user experience and adoption.

## User Experience Considerations

The team put significant thought into how Ask AI integrates into the existing product:

- Integration within the search bar with a toggle between traditional search and Ask AI
- A separate chat-like interface for follow-up questions and conversation history
- User control over choosing between semantic or regular search
- Validation of uncertain results with the user before providing a final answer

The decision to offer both search bar integration and a chat interface came from user feedback and testing, demonstrating the importance of letting real usage patterns inform product decisions rather than making assumptions about how users will interact with AI features.

## Results

According to the case study, users have cut search time by 5 minutes every week since Ask AI's release, representing a 14% time savings compared to the previous 35 minutes spent on email and calendar search. While these metrics are self-reported by the company and should be viewed with appropriate skepticism, they provide a concrete measure of the product's claimed value.

## Key LLMOps Takeaways

This case study illustrates several important LLMOps principles for production systems:

The evolution from a single-prompt RAG system to a multi-agent architecture with parallel processing and task-specific handling demonstrates that production AI systems often need more sophistication than initial prototypes suggest. The architectural patterns employed — query classification, parallel metadata extraction, hybrid search with reranking, and task-specific prompt selection — represent a mature approach to handling diverse user intents reliably. The prompt engineering techniques, particularly the "double dipping" of instructions, show pragmatic solutions to real-world LLM reliability challenges. The staged evaluation and rollout process demonstrates how to systematically validate AI products while limiting risk and gathering actionable feedback.

Overall, the case study provides a useful example of how a consumer productivity product evolved its LLM architecture to meet production requirements around reliability, speed, and diverse query handling.