Glean tackles enterprise search by combining traditional information retrieval techniques with modern LLMs and embeddings. Rather than relying solely on AI techniques, they emphasize the importance of rigorous ranking algorithms, personalization, and hybrid approaches that combine classical IR with vector search. The company has achieved unicorn status and serves major enterprises by focusing on holistic search solutions that include personalization, feed recommendations, and cross-application integrations.
Glean is an enterprise search company founded in 2019 by a team of former Google engineers, including Deedy Das, who previously served as a tech lead on Google Search. The company reached unicorn status with a Series C led by Sequoia at a $1 billion valuation in 2022. Its product serves as an AI-powered internal search engine and employee portal for enterprises, with customers including Databricks, Canva, Confluent, Duolingo, Samsara, and various Fortune 50 companies.
The genesis of Glean came from a pain point familiar to many ex-Googlers: the loss of internal tools like Google’s Moma, which indexes everything used inside Google and allows employees to search across all company resources with proper permissions handling. When these engineers left Google and joined other companies, they realized how difficult it was to function without being able to efficiently find documents, presentations, and information created by colleagues.
One of the most significant technical insights from this case study is Glean's deliberate choice not to rely solely on vector search or the latest AI techniques. Instead, they employ a hybrid approach that combines classical information retrieval, such as keyword-based ranking, with vector-based semantic search.
This hybrid approach reflects a mature understanding that cutting-edge AI alone does not guarantee a better user experience. The team observed that many enterprise search competitors lean heavily into marketing around "AI-powered, LLM-powered vector search" as buzzwords, but the actual user experience improvements from these techniques can be difficult for users to evaluate.
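The hybrid idea described above can be sketched as a weighted blend of a classical lexical relevance score and an embedding similarity score. The functions below are a minimal, self-contained illustration (a toy keyword-overlap score standing in for BM25/TF-IDF, and plain cosine similarity over precomputed vectors); they are not Glean's actual implementation, and the blending weight `alpha` is a hypothetical parameter.

```python
from collections import Counter
import math

def lexical_score(query: str, doc: str) -> float:
    """Toy keyword-overlap score standing in for a real BM25/TF-IDF ranker."""
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    return float(sum(min(q[t], d[t]) for t in q))

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(query: str, doc: str,
                 q_vec: list[float], d_vec: list[float],
                 alpha: float = 0.5) -> float:
    """Blend classical IR relevance with embedding similarity.
    alpha trades off lexical precision against semantic recall."""
    return alpha * lexical_score(query, doc) + (1 - alpha) * cosine(q_vec, d_vec)
```

In practice the two score distributions would need normalization before blending, and `alpha` would itself be tuned against relevance judgments, which is exactly the kind of painstaking ranking work the interview emphasizes.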
A critical differentiator for Glean is their personalization system, which tailors ranking to the individual user and their context within the company.
This personalization layer sits on top of the hybrid retrieval system and is described as a key factor in making their search “good” rather than just technologically sophisticated.
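A personalization layer of this kind can be pictured as a re-ranking pass over the hybrid retriever's output. The sketch below is purely illustrative: the signal names (`author_team`, `recently_viewed_by`) and boost weights are hypothetical stand-ins, not Glean's actual feature set.

```python
def personalize(results: list[dict], user: dict,
                team_boost: float = 0.3, recency_boost: float = 0.2) -> list[dict]:
    """Re-rank base retrieval scores with user-specific boosts.
    Each result dict carries 'score' plus illustrative personalization
    signals; the boosts here are made-up values for the sketch."""
    def adjusted(r: dict) -> float:
        s = r["score"]
        if r.get("author_team") == user["team"]:
            s += team_boost          # favor documents from the user's own team
        if user["id"] in r.get("recently_viewed_by", []):
            s += recency_boost       # favor documents the user touched recently
        return s
    return sorted(results, key=adjusted, reverse=True)
```

The point of the sketch is that a document with a lower base relevance score can legitimately outrank a higher-scoring one once user context is applied, which is how personalization makes results feel "good" rather than merely relevant.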
The team emphasizes that effective search comes from “the rigor and intellectual honesty that you put into tuning the ranking algorithm” rather than algorithm complexity. This is described as a painstaking, long-term, and slow process. According to Das, Google Search itself ran without much “real AI” until around 2017-2018, relying instead on carefully tuned ranking components that each solved specific sub-problems extremely well.
An important product insight from this case study is that pure search functionality is not compelling enough to drive sustained user engagement. Users might use a search tool once and forget about it. To achieve retention, Glean evolved into a broader employee portal with features such as personalized feed recommendations and cross-application integrations.
While the interview was conducted in April 2023, the discussion of LLMs and their integration into enterprise search remains instructive for LLMOps practitioners.
When asked about “Glean Chat,” Das indicated they were experimenting with various LLM-powered technologies but would launch what users respond to best. This suggests a user-centric approach to LLM feature development rather than technology-first thinking.
The conversation also includes a thoughtful analysis of where LLMs excel versus where traditional search remains superior.
The interview discusses retrieval-augmented generation (RAG) as a technique for combining search with LLM generation. The key insight is that RAG-style approaches still fundamentally require search in the backend to provide context, meaning the quality of the underlying search system remains critical even in LLM-augmented products.
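The dependency of RAG on search quality is visible even in a minimal pipeline: the generation step can only ever see what the retriever surfaces. The sketch below uses a deliberately simple keyword-overlap retriever as a stand-in for a production search system; the function names and prompt wording are illustrative assumptions.

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Stand-in retriever: rank documents by keyword overlap with the query.
    In a real system this would be the full hybrid search stack."""
    q_tokens = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda d: len(q_tokens & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_rag_prompt(query: str, corpus: list[str]) -> str:
    """Assemble an LLM prompt whose context comes entirely from search
    results. If retrieval surfaces the wrong documents, no amount of
    generation quality can recover -- hence search quality stays critical."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, corpus))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The prompt would then be sent to whatever LLM the product uses; the point of the sketch is that the LLM never sees the corpus directly, only the retriever's top-k slice of it.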
A significant limitation of LLMs discussed is handling fresh information. LLMs cannot be retrained quickly or cost-efficiently enough to absorb newly created data while simultaneously serving it. This makes traditional search or RAG approaches necessary for any enterprise application requiring current information.
Das provides valuable perspective on AI infrastructure economics, noting that engineers at large companies like Google are completely abstracted from cost considerations. At a startup, understanding infrastructure costs is essential because it directly impacts unit economics. He advocates for more transparency around training costs in research papers and has done analysis estimating training costs for various models (approximately $4 million for LLaMA, $27 million for PaLM).
He also notes the distinction between the cost of the final training run versus the total cost including experimentation, hyperparameter tuning, architecture exploration, and debugging failed runs—which can be approximately 10x the final training cost.
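Das's rule of thumb turns into simple arithmetic: if the total cost including experimentation runs roughly 10x the final training run, his cited final-run estimates imply much larger all-in budgets. The multiplier below is the interview's approximation, not a precise figure.

```python
def total_training_cost(final_run_cost: float, overhead_multiplier: float = 10.0) -> float:
    """Rule of thumb from the interview: hyperparameter tuning, architecture
    exploration, and failed runs make the all-in cost roughly 10x the
    final training run's cost."""
    return final_run_cost * overhead_multiplier

# Das's approximate final-run estimates from the interview:
llama_final = 4_000_000    # ~$4M for LLaMA
palm_final = 27_000_000    # ~$27M for PaLM
```

Under that multiplier, LLaMA's all-in development cost would land around $40M and PaLM's around $270M, which is why he argues startups cannot afford to be abstracted from infrastructure economics the way big-company engineers are.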
An often overlooked but highly practical application discussed is using LLMs to generate synthetic training data for smaller, specialized models. For example, using GPT-4 to generate training data for a named entity recognition (NER) task, then either training a traditional model on this data or using low-rank adaptation (LoRA) to distill the large model into a smaller, faster, more cost-effective model. This approach is described as transforming work that previously took dedicated teams years into something achievable in weeks.
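The synthetic-data pattern described above boils down to: prompt a large model for labeled examples in a machine-readable format, parse them, and filter out malformed ones before training the small model. The sketch below is an assumed shape for that loop, not Glean's code; `llm_complete` is a hypothetical callable (e.g. a wrapper around a GPT-4 API call), stubbed so the example stays self-contained.

```python
import json

def synthesize_ner_examples(llm_complete, entity_type: str, n: int = 50):
    """Ask a large model for labeled NER examples as JSON, then parse them
    into (sentence, spans) pairs a small model can be trained on.
    `llm_complete` is any callable mapping a prompt string to a text
    completion; the prompt format here is an illustrative assumption."""
    prompt = (
        f"Generate {n} sentences containing {entity_type} entities. "
        'Return a JSON list of {"text": ..., "entities": [[start, end], ...]}.'
    )
    raw = llm_complete(prompt)
    examples = json.loads(raw)
    # Validation matters: LLM output is unreliable, so keep only examples
    # whose entity spans actually fall inside the generated text.
    return [
        (ex["text"], ex["entities"])
        for ex in examples
        if all(0 <= s < e <= len(ex["text"]) for s, e in ex["entities"])
    ]
```

The resulting pairs could then feed a conventional NER model or a LoRA fine-tune of a smaller base model, which is the distillation step the interview describes as compressing years of dedicated-team effort into weeks.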
The case study touches on the difficulties of selling productivity tools to enterprises. Unlike customer support tools where ROI can be calculated directly (20% improvement in ticket resolution = $X cost savings), search and productivity tools require softer arguments about employee time savings and efficiency. Buyers often default to “we work fine without it” unless they experience the product directly.
While Glean has achieved significant commercial success, it is worth remembering that the account here comes from an interview with a company insider, so several of its claims warrant balanced consideration.
Overall, this case study provides valuable insights into building production search systems that balance cutting-edge AI techniques with proven information retrieval methods, emphasizing that rigorous engineering and user-centric product development often matter more than adopting the latest AI trends.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Glean implements enterprise search and RAG systems by developing custom embedding models for each customer. They tackle the challenge of heterogeneous enterprise data by using a unified data model and fine-tuning embedding models through continued pre-training and synthetic data generation. Their approach combines traditional search techniques with semantic search, achieving a 20% improvement in search quality over 6 months through continuous learning from user feedback and company-specific language adaptation.
Notion AI, serving over 100 million users with multiple AI features including meeting notes, enterprise search, and deep research tools, demonstrates how rigorous evaluation and observability practices are essential for scaling AI product development. The company uses Braintrust as its evaluation platform to manage the complexity of supporting multilingual workspaces, rapid model switching, and maintaining product polish while building at the speed of AI industry innovation. Their approach emphasizes that 90% of AI development time should be spent on evaluation and observability rather than prompting, with specialized data specialists creating targeted datasets and custom LLM-as-a-judge scoring functions to ensure consistent quality across their diverse AI product suite.