Gong developed "Deal Me", a natural language question-answering feature for sales conversations that allows users to query vast amounts of sales interaction data. The system processes thousands of emails and calls per deal, providing quick responses within 5 seconds. After initial deployment, they discovered that 70% of user queries matched existing structured features, leading to a hybrid approach combining direct LLM-based QA with guided navigation to pre-computed insights.
Gong is a revenue intelligence platform that serves sales teams by aggregating and analyzing all the data that flows through B2B sales deals. The core value proposition is that salespeople typically spend 80% of their time on non-selling activities—searching for leads on LinkedIn, preparing for meetings, summarizing calls, sending follow-up emails, updating CRMs, and reporting to managers. By automating and streamlining these activities, Gong aims to increase productive selling time significantly.
The platform serves users across the entire sales organization hierarchy: from SDRs (Sales Development Representatives) who identify leads, to account executives who run deals, to managers who coach their teams, and up to executives who need aggregate insights on sales performance and deal health.
This case study focuses on a feature called “Ask Me” (referred to as “Deal Me” in the presentation), which is a natural language Q&A system built on top of Gong’s deal intelligence product. The feature was developed by Gong’s Speech and NLP research team, starting approximately 18 months before the presentation (around early 2023), inspired by the initial ChatGPT experience.
B2B enterprise sales deals are fundamentally different from consumer transactions. A single deal can accumulate thousands of emails and calls, along with a steady stream of CRM updates, over its lifetime.
Simply aggregating all this information in one place was already a differentiator for Gong compared to competitors. However, the challenge was extracting actionable insights from this massive corpus of unstructured data. When a sales manager wants to review a deal with their representative in a weekly one-on-one meeting, they previously had to either listen to recordings or wait for verbal summaries. The Ask Me feature aimed to let managers (and sellers) get immediate answers to any question about a deal.
The first major technical challenge was unifying all deal data into a format that could be fed to an LLM. This required creating a unified schema that could represent calls, emails, and CRM metadata in a single, consistent format.
This was described as the first time Gong needed to build a unified API to bring all these different data types together into a single queryable format.
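A unified record of this kind can be sketched as a single Python dataclass plus one rendering function; the field names below are illustrative assumptions, not Gong's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Literal

@dataclass
class DealItem:
    """One item of deal data, whatever its original source."""
    deal_id: str
    kind: Literal["call", "email", "crm_event"]
    timestamp: datetime
    participants: list[str]
    text: str                      # transcript, email body, or event note
    metadata: dict = field(default_factory=dict)

def to_prompt_context(item: DealItem) -> str:
    """Render any item type as uniform text an LLM can consume."""
    who = ", ".join(item.participants)
    return f"[{item.kind} | {item.timestamp:%Y-%m-%d} | {who}]\n{item.text}"
```

The point of the single type is that downstream retrieval, chunking, and prompting code never needs to know whether an item started life as a call recording or a CRM row.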
A critical trade-off emerged around context length. Longer context provides richer, more nuanced answers, but models (especially circa 2023) struggled with very long contexts, often “losing” information in the middle of long documents. The team experimented extensively with different chunking strategies.
Their current optimal approach operates at approximately the conversation level—feeding the query plus one conversation (or a batch of emails) at a time, rather than trying to stuff entire deal histories into single prompts. This granular approach improved accuracy while keeping context manageable.
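The batching side of this can be sketched as a greedy packer: one prompt per call transcript, and short emails packed together until a token budget is hit. The budget value and the whitespace token approximation are assumptions for illustration.

```python
MAX_TOKENS = 3000  # illustrative per-request budget, not Gong's actual value

def approx_tokens(text: str) -> int:
    # Crude whitespace approximation; a real system would use the
    # model's tokenizer.
    return len(text.split())

def batch_emails(emails: list[str], budget: int = MAX_TOKENS) -> list[list[str]]:
    """Greedily pack short emails into batches that each fit one prompt."""
    batches, current, used = [], [], 0
    for email in emails:
        cost = approx_tokens(email)
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(email)
        used += cost
    if current:
        batches.append(current)
    return batches
```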
The system relies on prompt engineering that goes well beyond naively forwarding the user's question to the model.
Since answers may need to synthesize information from multiple sources (calls, emails, metadata), the system implements an aggregation chain: each source is queried in parallel, and the partial answers are then combined into a single response.
This parallel processing architecture was necessary to meet latency requirements.
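The fan-out/aggregate pattern described above can be sketched as follows; `ask_llm` is a stand-in for a real model call, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def answer_question(question: str, chunks: list[str], ask_llm) -> str:
    """Query each chunk in parallel, then synthesize one final answer."""
    def per_chunk(chunk: str) -> str:
        return ask_llm(f"Context:\n{chunk}\n\nQuestion: {question}")

    with ThreadPoolExecutor(max_workers=8) as pool:
        partials = list(pool.map(per_chunk, chunks))

    joined = "\n".join(p for p in partials if p.strip())
    return ask_llm(f"Combine these partial answers into one response:\n{joined}")
```

Because the per-chunk calls dominate latency and run concurrently, end-to-end time is roughly one chunk call plus one aggregation call, which is what makes a tight latency target feasible at all.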
The team set a target of approximately 5 seconds for end-to-end response time. This constraint was described as “very limiting” and required careful optimization of retrieval, chunking, and the parallel calls in the aggregation chain.
A significant focus was placed on preventing hallucinations, which are particularly problematic in a sales context. The example given was a manager asking “What happened when Michael Jordan joined the deal?”—if no such person exists in the deal, the system should not fabricate an answer.
The solution involved adding a validation layer after generation that verifies claims against source data before returning responses.
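One narrow slice of such a validation layer can be sketched as a check that people named in a generated answer actually appear in the deal's participant list; the regex heuristic below is a simplifying assumption, and real claim verification against source spans would be much richer.

```python
import re

def validate_people(answer: str, participants: set[str]) -> str:
    """Refuse answers that name people absent from the deal data."""
    # Naive "First Last" pattern; purely illustrative.
    named = re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", answer)
    unknown = [name for name in named if name not in participants]
    if unknown:
        return f"I couldn't find {', '.join(unknown)} in this deal."
    return answer
```

On the Michael Jordan example, the check fires because the name matches no known participant, so the fabricated answer never reaches the user.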
A critical feature that was NOT in v1 but was added in v2 (based on customer demand) was the ability to trace answers back to their source. The current system attaches references to the underlying calls and emails so users can verify where each answer came from.
This explainability feature was described as “very, very, very important” to customers.
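One way to support this kind of traceability is to thread source identifiers through the aggregation step, so the final answer carries the union of its inputs' citations; this is a sketch under assumed names, not Gong's implementation.

```python
from dataclasses import dataclass

@dataclass
class Sourced:
    """An answer fragment tagged with the items it was derived from."""
    text: str
    source_ids: list[str]

def aggregate(partials: list[Sourced]) -> Sourced:
    """Merge per-chunk answers, preserving the union of their sources."""
    text = " ".join(p.text for p in partials if p.text)
    ids = sorted({sid for p in partials for sid in p.source_ids})
    return Sourced(text=text, source_ids=ids)
```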
With approximately 4,000 customer companies, each containing thousands of users who might make multiple queries per day, costs quickly became a critical concern. The presentation explicitly mentioned reaching “very, very high” cost figures that couldn’t be disclosed. Cost optimization became a major workstream.
The team implemented HyDE (Hypothetical Document Embeddings) to improve retrieval quality. The technique uses an LLM to generate a hypothetical answer to the user's query, embeds that hypothetical document, and retrieves real content by similarity to that embedding rather than to the short query itself.
This was described as one of the approaches that “really worked” for bringing queries and relevant content closer together in embedding space.
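A minimal HyDE sketch looks like this; `generate` and `embed` are stand-ins for a real LLM and embedding model, and the toy cosine ranking replaces a proper vector store.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hyde_retrieve(query: str, docs: dict[str, list[float]],
                  generate, embed, k: int = 3) -> list[str]:
    """Rank documents by similarity to a hypothetical answer, not the query."""
    hypothetical = generate(f"Write a plausible answer to: {query}")
    qvec = embed(hypothetical)          # embed the fake answer, not the query
    ranked = sorted(docs, key=lambda d: cosine(qvec, docs[d]), reverse=True)
    return ranked[:k]
```

The intuition is that a hypothetical answer, even if factually wrong, is phrased like the documents being searched, so it lands closer to them in embedding space than a terse question does.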
A significant operational challenge emerged around model versioning. The team noted that prompts that worked well on a June version of GPT-3.5 might not work on a November version. This necessitated building evaluation and regression-testing infrastructure to re-validate prompts whenever the underlying model changed.
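One common guard against this kind of drift is a prompt regression harness: a small golden set of question/expected-fragment pairs replayed against each new model version. The questions, fragments, and `ask` pipeline below are hypothetical.

```python
GOLDEN_SET = [
    ("Who attended the kickoff call?", "jane"),
    ("What is the deal stage?", "negotiation"),
]

def run_regression(ask, golden=GOLDEN_SET) -> list[str]:
    """Return the questions whose answers no longer contain the
    expected fragment after a model upgrade."""
    failures = []
    for question, must_contain in golden:
        if must_contain not in ask(question).lower():
            failures.append(question)
    return failures
```

A non-empty failure list blocks the rollout of the new model version until the affected prompts are re-tuned.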
Perhaps the most valuable insight from the deployment came from analyzing actual user queries at scale. The team discovered that approximately 70% of questions users asked were for information that Gong already provided through specialized, purpose-built features with pre-computed, readily available results.
Common queries covered ground these purpose-built features already handled.
This led to a pivot in the product strategy. Instead of treating Ask Me purely as a free-form Q&A system, the current version implements intelligent routing: questions that map to an existing feature are directed to those pre-computed insights, while genuinely novel questions fall through to the free-form LLM pipeline.
This approach reduces cost and latency while steering users toward the richer, purpose-built experiences the product already offers.
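The routing decision itself can be as simple as intent matching in front of the LLM path; the keywords and endpoints below are invented for illustration.

```python
# Map of known intents to existing structured-feature endpoints
# (hypothetical paths, not Gong's API).
STRUCTURED_INTENTS = {
    "next steps": "/deals/{deal_id}/next-steps",
    "summary": "/deals/{deal_id}/summary",
    "pricing": "/deals/{deal_id}/pricing-mentions",
}

def route(query: str):
    """Send recognized intents to pre-computed features; everything
    else goes to the slower, costlier free-form QA pipeline."""
    q = query.lower()
    for keyword, endpoint in STRUCTURED_INTENTS.items():
        if keyword in q:
            return ("structured", endpoint)
    return ("freeform_qa", None)
```

With roughly 70% of traffic matching an existing feature, even this crude router would divert the majority of queries away from per-request LLM spend.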
The presenter concluded with key takeaways for building AI products.
The case study represents a mature perspective on LLM deployment, moving past initial hype to a pragmatic understanding of where LLMs add value versus where traditional approaches remain superior.
This case study examines Cursor's implementation of reinforcement learning (RL) for training coding models and agents in production environments. The team discusses the unique challenges of applying RL to code generation compared to other domains like mathematics, including handling larger action spaces, multi-step tool calling processes, and developing reward signals that capture real-world usage patterns. They explore various technical approaches including test-based rewards, process reward models, and infrastructure optimizations for handling long context windows and high-throughput inference during RL training, while working toward more human-centric evaluation metrics beyond traditional test coverage.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
A comprehensive overview of ML infrastructure evolution and LLMOps practices at major tech companies, focusing on Doordash's approach to integrating LLMs alongside traditional ML systems. The discussion covers how ML infrastructure needs to adapt for LLMs, the importance of maintaining guard rails, and strategies for managing errors and hallucinations in production systems, while balancing the trade-offs between traditional ML models and LLMs in production environments.