ZenML

Fine-tuning Custom Embedding Models for Enterprise Search

Glean 2023

Glean implements enterprise search and RAG systems by developing custom embedding models for each customer. They tackle the challenge of heterogeneous enterprise data by using a unified data model and fine-tuning embedding models through continued pre-training and synthetic data generation. Their approach combines traditional search techniques with semantic search, achieving a 20% improvement in search quality over 6 months through continuous learning from user feedback and company-specific language adaptation.

Industry

Tech

Overview

This case study is drawn from a technical talk by Manov, a software engineer at Glean who has been with the company for approximately three years, working primarily on semantic search and ML systems for search ranking and assistant quality. The presentation was part of a course on systematically improving RAG applications, hosted by Jason (likely Jason Liu, author of the Instructor library). Glean is an enterprise AI company that aggregates data from various enterprise applications (Google Drive, GitHub, Jira, Confluence, Slack, etc.) to power search and AI assistant capabilities. The talk focuses specifically on how Glean approaches embedding model fine-tuning to achieve high-quality enterprise search, which serves as the foundation for their RAG systems.

The Core Problem

Enterprise AI search faces fundamentally different challenges compared to internet search. While web search benefits from power-law distributions where most queries target popular websites and common information sources like Wikipedia or Stack Overflow, enterprise data is highly heterogeneous. Companies use a diverse array of applications including document stores (Google Docs, Confluence, Notion), messaging platforms (Slack), code repositories (GitHub, GitLab), meeting systems, and various specialized tools. Each of these has different data structures, and importantly, each company develops its own internal dialect with project names, acronyms, and domain-specific terminology that generic embedding models simply cannot understand.

Glean’s guiding philosophy is that search quality is the foundation of enterprise AI. Without reliable, high-quality search, RAG systems will pull irrelevant context, leading to hallucinations and poor user experiences that create significant business costs. This makes the embedding layer critical infrastructure for the entire system.

Technical Architecture and Approach

Unified Data Model

A key architectural decision at Glean is the creation of a unified data model that maps diverse enterprise applications into a consistent document schema. Rather than building federated search across disparate systems, Glean creates a single unified index. This requires careful mapping of each application’s data structures to the standardized format. For example, Slack messages, which are naturally short and lack titles, require special handling—Glean models conversations (threads or temporally-grouped messages) as documents rather than individual messages, using the first message as a proxy for a title and the rest as the body.
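The Slack mapping described above can be sketched as follows. The schema fields and the function are illustrative only, not Glean's actual data model:

```python
from dataclasses import dataclass

# Hypothetical unified document schema; field names are illustrative.
@dataclass
class Document:
    source: str      # e.g. "slack", "gdrive", "jira"
    title: str
    body: str
    updated_at: float

def slack_thread_to_document(messages: list[dict]) -> Document:
    """Map a Slack thread (messages sorted by timestamp) into the unified
    schema: the first message stands in for the title, the rest form the
    body, mirroring the conversation-as-document idea from the talk."""
    first, rest = messages[0], messages[1:]
    return Document(
        source="slack",
        title=first["text"],
        body="\n".join(m["text"] for m in rest),
        updated_at=max(m["ts"] for m in messages),
    )

doc = slack_thread_to_document([
    {"text": "How do we rotate the API keys?", "ts": 1.0},
    {"text": "Use the vault CLI, see the runbook.", "ts": 2.0},
])
```

Modeling the thread, rather than each message, gives the embedding model a title-plus-body unit consistent with documents from every other source.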

This unified data model is essential not just for search but for scalable ML training. It allows the ML team to work with consistent abstractions across all customer data sources, making it feasible to build training pipelines that work across hundreds of enterprise deployments.

Custom Embedding Models Per Customer

Rather than using a single large, general-purpose embedding model, Glean builds custom embedding models for each customer. The rationale is that enterprise-specific models, even if smaller, significantly outperform generic models when fine-tuned to the specific domain and task. This represents a significant operational challenge—with hundreds of customers, there are hundreds of models to train, manage, evaluate, and deploy.

Multi-Stage Training Pipeline

The training process involves several stages:

Stage 1: Masked Language Modeling for Domain Adaptation

The process begins with a base model built on a BERT-based architecture. Despite newer architectures being available, BERT remains tried and true for this application. The first phase uses masked language modeling (MLM) to adapt the general language model to the company's specific domain: sentences from the customer's corpus have certain words masked, and the model is trained to predict them. The speaker mentions techniques that preferentially mask domain-relevant words to accelerate learning of company-specific terminology.

The advantage of MLM is that it requires no labeled data—every document in the enterprise becomes training data. This means even small customers with limited activity still have sufficient data for this domain adaptation phase.
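One way to bias mask selection toward domain-specific words is a frequency-ratio heuristic: prefer tokens common in the customer corpus but rare in general text. The talk does not specify Glean's exact technique, so this is a minimal sketch under that assumption:

```python
import random
from collections import Counter

def domain_weighted_mask(tokens, domain_freq, general_freq, mask_rate=0.15,
                         mask_token="[MASK]", seed=0):
    """Mask ~mask_rate of tokens, biased toward words that are frequent in
    the customer corpus but rare in general text -- a simple proxy for
    company-specific terminology. The scoring heuristic is illustrative."""
    rng = random.Random(seed)
    # Score each token by its domain-to-general frequency ratio (add-one smoothed).
    scores = [(domain_freq.get(t, 0) + 1) / (general_freq.get(t, 0) + 1)
              for t in tokens]
    n_mask = max(1, int(len(tokens) * mask_rate))
    # Mask the highest-scoring positions, breaking ties randomly.
    order = sorted(range(len(tokens)), key=lambda i: (-scores[i], rng.random()))
    masked, labels = list(tokens), {}
    for i in order[:n_mask]:
        labels[i] = masked[i]
        masked[i] = mask_token
    return masked, labels

domain = Counter({"atlas": 50, "rollout": 40, "the": 500})
general = Counter({"the": 10000, "rollout": 200})
masked, labels = domain_weighted_mask(
    ["the", "atlas", "rollout", "shipped"], domain, general, mask_rate=0.25)
```

Here the internal project name "atlas" scores far above common English words, so it is the token masked and learned first.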

Stage 2: Contrastive Learning for Embedding Quality

MLM alone produces a language model, not an embedding model optimized for search. The next phase converts this domain-adapted language model into a high-quality embedding model through contrastive learning on pairs of semantically related content, drawing on several sources to generate these pairs.
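The contrastive objective commonly used for this conversion can be illustrated with a toy in-batch-negatives (InfoNCE-style) loss over plain Python lists; a production system would run this over a deep encoder with large GPU batches:

```python
import math

def info_nce_loss(query_vecs, doc_vecs, temperature=0.05):
    """In-batch-negative contrastive loss: each query's positive is the
    document at the same index; every other document in the batch serves
    as a negative. Vectors are plain Python lists in this toy version."""
    def dot(a, b): return sum(x * y for x, y in zip(a, b))
    def cos(a, b): return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

    loss = 0.0
    for i, q in enumerate(query_vecs):
        logits = [cos(q, d) / temperature for d in doc_vecs]
        log_z = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_z)   # -log softmax of the positive pair
    return loss / len(query_vecs)

# Correctly aligned pairs should score a lower loss than shuffled ones.
qs = [[1.0, 0.0], [0.0, 1.0]]
ds = [[0.9, 0.1], [0.1, 0.9]]
aligned = info_nce_loss(qs, ds)
shuffled = info_nce_loss(qs, list(reversed(ds)))
```

Aligned pairs yield a lower loss than mismatched ones; that gap is the gradient signal that pulls semantically related content together in embedding space.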

Handling Heterogeneous Enterprise Data

The talk emphasizes that naive application of these techniques to all data sources fails; each application has nuances that must be understood.

The speaker emphasizes that there’s no substitute for understanding your data and talking to customers. Due to security constraints, engineers often cannot directly view customer data, making customer interviews and behavior understanding even more critical.

Synthetic Data Generation

For smaller customers or corpora with limited data, synthetic data generation using LLMs becomes valuable. The approach involves generating question-answer pairs from documents using LLMs. However, naive prompting produces poor results: LLMs tend to generate questions that closely mirror the exact phrasing of the source text, which doesn't reflect how real users formulate queries.

Effective synthetic data generation therefore requires prompting and filtering strategies that produce realistic, paraphrased user queries rather than restatements of the source text.
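One cheap guard against the echo failure mode is a lexical-overlap filter on generated questions. The helper names and the threshold below are illustrative choices, not values from the talk:

```python
def lexical_overlap(question: str, passage: str) -> float:
    """Fraction of question words that appear verbatim in the passage."""
    q = question.lower().split()
    p = set(passage.lower().split())
    return sum(w in p for w in q) / max(len(q), 1)

def keep_synthetic_pair(question: str, passage: str, max_overlap=0.6) -> bool:
    """Reject generated questions that mostly copy the source wording --
    a simple guard against the failure mode where the LLM echoes the
    passage instead of writing a realistic user query."""
    return lexical_overlap(question, passage) <= max_overlap

passage = "The Atlas rollout is paused until the Q3 security review completes."
echoed = "Is the Atlas rollout paused until the Q3 security review completes?"
natural = "Why can't we ship Atlas yet?"
```

The echoed question is rejected while the naturally phrased one survives, nudging the synthetic distribution toward real query styles.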

Continuous Learning from User Feedback

Glean operates a search product alongside their assistant, which provides valuable user feedback through query-click pairs. When users search and click on results, this provides positive training signal for the embedding model. After six months of continuous learning, Glean reports approximately 20% improvement in search quality.
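Turning click logs into positive training pairs might look like the following sketch; the event schema and the dwell-time threshold are assumptions for illustration, not details from the talk:

```python
def clicks_to_training_pairs(events, min_dwell_s=30.0):
    """Convert search logs into (query, doc_id) positive pairs for
    embedding fine-tuning. Skimmed clicks (short dwell time) are dropped
    as noisy signal; the threshold is an illustrative choice."""
    pairs = []
    for e in events:
        if e["clicked"] and e["dwell_s"] >= min_dwell_s:
            pairs.append((e["query"], e["doc_id"]))
    return pairs

events = [
    {"query": "vpn setup", "doc_id": "d1", "clicked": True,  "dwell_s": 95.0},
    {"query": "vpn setup", "doc_id": "d2", "clicked": True,  "dwell_s": 3.0},
    {"query": "okrs q3",   "doc_id": "d7", "clicked": False, "dwell_s": 0.0},
]
pairs = clicks_to_training_pairs(events)
```

These pairs feed directly into the same contrastive objective used in the earlier training stage, which is what makes the feedback loop continuous.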

For RAG-only settings without explicit search, gathering feedback is more challenging. The speaker acknowledges this as an open problem, with only weaker proxy signals available in place of direct click data.

Models are retrained monthly. The speaker notes that major new concepts rarely become critically important within a week, so monthly retraining balances freshness with operational overhead. Importantly, when models are updated, all vectors must be re-indexed—there’s no way to avoid this if embedding quality matters.
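The re-indexing requirement follows because vectors produced by different model versions are not mutually comparable. A minimal sketch of the full re-embed step, where the `embed` callable stands in for the newly trained model:

```python
def reindex(corpus, embed):
    """Re-embed every document with the new model version. Mixing vectors
    from different model versions in one index breaks similarity search,
    so an embedding-model update forces a full re-index, as the talk notes."""
    return {doc_id: embed(text) for doc_id, text in corpus.items()}

# Toy embedder: one-dimensional "vector" of the word count, for illustration.
new_index = reindex(
    {"d1": "hello", "d2": "hello world"},
    embed=lambda text: [float(len(text.split()))],
)
```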

Evaluation Strategy

Evaluating enterprise search is inherently complex: every customer's corpus and vocabulary differ, and security constraints limit how directly engineers can inspect the data.

Glean’s approach involves:

Online Metrics: Session satisfaction (do users find relevant documents based on click patterns?), time spent on clicked documents, upvote/downvote ratios, and usage trends.

Unit Testing for Models: Rather than trying to evaluate everything holistically, Glean builds targeted eval sets for specific desirable behaviors and capabilities.

Each custom model is evaluated against target benchmarks for these capabilities. Models underperforming on specific tests can be investigated and hyperparameter-tuned.
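A behavior-level eval harness along these lines might compute recall@k per named capability, so a regression can be traced to a specific behavior rather than buried in an average. The suite structure and toy search function below are illustrative, not Glean's harness:

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant docs that appear in the top-k results."""
    top = set(ranked_ids[:k])
    return sum(d in top for d in relevant_ids) / len(relevant_ids)

def run_behavior_suite(search_fn, suite, k=10):
    """Score a model against named, behavior-specific eval sets and
    return per-behavior recall@k."""
    return {
        name: sum(recall_at_k(search_fn(q), rel, k) for q, rel in cases) / len(cases)
        for name, cases in suite.items()
    }

# Toy search function: substring keyword match over a tiny corpus.
corpus = {"d1": "okr planning atlas", "d2": "vpn setup guide"}
def toy_search(query):
    return [d for d, text in corpus.items() if any(w in text for w in query.split())]

suite = {"keyword_lookup": [("vpn", ["d2"]), ("atlas", ["d1"])]}
scores = run_behavior_suite(toy_search, suite)
```

A model that dips below its target on one named behavior can then be investigated and hyperparameter-tuned in isolation, as the text describes.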

LLM-as-Judge: For comparing system versions, Glean uses A/B testing with LLM judges evaluating response quality across dimensions like factuality and relevance. The speaker notes LLM judges have noise and brittleness (small output changes can flip judgments), but they enable scalable evaluation when combined with multiple quality dimensions.
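Two standard mitigations for the noise and position brittleness of LLM judges are order-swapping and majority voting. This sketch assumes a pluggable `judge` callable (a stub here, an LLM call in practice) and is not a description of Glean's exact setup:

```python
def judged_preference(judge, query, answer_a, answer_b, n_trials=5):
    """Compare two answers with a judge, swapping presentation order on
    alternating trials to cancel position bias, and majority-voting over
    trials to damp judgment noise. `judge` returns 'first' or 'second'."""
    votes_a = 0
    for t in range(n_trials):
        if t % 2 == 0:
            votes_a += judge(query, answer_a, answer_b) == "first"
        else:
            votes_a += judge(query, answer_b, answer_a) == "second"
    return "A" if votes_a * 2 > n_trials else "B"

# Stub judge for demonstration: prefers whichever answer is longer.
def stub_judge(query, first, second):
    return "first" if len(first) >= len(second) else "second"

winner = judged_preference(stub_judge, "q", "a detailed grounded answer", "short")
```

Real deployments would also score multiple quality dimensions (factuality, relevance) per trial, as the talk mentions.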

Key Insights and Philosophy

The speaker emphasizes several guiding principles: search quality is the foundation of enterprise AI, there is no substitute for understanding your data and talking to customers, and the embedding layer deserves the same engineering rigor as any core infrastructure.

Access Control

Access control is handled at query time via their search engine, filtering results based on user-level ACL information. However, training also considers permissions—training preferentially on widely-accessible documents both respects privacy and provides more useful signal (widely-accessed documents affect more users).
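Query-time ACL filtering reduces to intersecting each result's permitted groups with the querying user's groups. In practice the filter is pushed down into the search engine rather than applied after ranking as in this illustrative sketch:

```python
def filter_by_acl(results, user_groups):
    """Keep only documents whose ACL intersects the querying user's groups.
    A post-retrieval sketch; production systems apply this as a query-time
    filter inside the search engine so ranking never sees forbidden docs."""
    return [r for r in results if r["acl"] & user_groups]

results = [
    {"doc_id": "d1", "acl": {"eng", "all"}},
    {"doc_id": "d2", "acl": {"finance"}},
]
visible = filter_by_acl(results, user_groups={"eng"})
```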

Operational Considerations

The talk touches on several operational realities of running embedding systems at scale, from training and evaluating hundreds of per-customer models to monthly retraining and full re-indexing.

This represents a mature LLMOps operation where the embedding layer is treated as core infrastructure requiring rigorous engineering practices around training, evaluation, and deployment.
