New Computer improved their AI assistant Dot's memory retrieval system using LangSmith for testing and evaluation. By implementing synthetic data testing, comparison views, and prompt optimization, they achieved 50% higher recall and 40% higher precision in their dynamic memory retrieval system compared to their baseline implementation.
New Computer is the company behind Dot, described as “the first personal AI designed to truly understand its users.” The core value proposition of Dot centers around a long-term memory system that learns user preferences over time by observing verbal and behavioral cues. This case study focuses on how New Computer leveraged LangSmith, part of the LangChain ecosystem, to systematically improve their memory retrieval systems in production. The reported results include 50% higher recall and 40% higher precision compared to a previous baseline implementation, though it should be noted that these metrics are self-reported and the baseline comparison details are not fully specified.
Unlike traditional RAG (Retrieval-Augmented Generation) systems that rely on static document collections, New Computer built what they describe as an “agentic memory” system. This represents an interesting architectural pattern where documents are dynamically created or pre-calculated specifically for later retrieval, rather than being ingested from existing sources. This creates unique challenges for LLMOps because the system must structure information appropriately during memory creation to ensure effective retrieval as the memory store grows over time.
The memory system includes several sophisticated features that go beyond simple vector similarity search. Memories in Dot’s system include optional “meta-fields” such as status indicators (e.g., COMPLETED or IN PROGRESS) and datetime fields like start or due dates. These meta-fields enable more targeted retrieval for high-frequency query patterns, such as task management queries like “Which tasks did I want to get done this week?” This structured, multi-field approach to memory representation creates additional complexity when optimizing retrieval performance.
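To make the meta-field pattern concrete, here is a minimal sketch of a memory record carrying the optional status and datetime fields described above, plus a pre-filter for the task-management query pattern. All names (`Memory`, `Status`, `tasks_due_this_week`) are illustrative assumptions, not Dot's actual schema or API.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional

class Status(Enum):
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"

@dataclass
class Memory:
    text: str                        # natural-language content of the memory
    status: Optional[Status] = None  # optional meta-field: task state
    due: Optional[datetime] = None   # optional meta-field: due date

def tasks_due_this_week(memories: list[Memory],
                        week_start: datetime,
                        week_end: datetime) -> list[Memory]:
    """Pre-filter by meta-fields before any similarity ranking is applied."""
    return [
        m for m in memories
        if m.status == Status.IN_PROGRESS
        and m.due is not None
        and week_start <= m.due < week_end
    ]
```

In a setup like this, meta-field filtering narrows the candidate set cheaply, after which semantic or keyword ranking can be applied to the survivors.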
One of the most significant LLMOps challenges New Computer faced was the need to iterate quickly across multiple retrieval methods while maintaining user privacy. Their solution involved creating synthetic data by generating a cohort of synthetic users with LLM-generated backstories. This approach allowed them to test performance without exposing real user data, which is a critical consideration for any personal AI system handling sensitive information.
The synthetic data generation process involved an initial conversation to seed the memory database for each synthetic user. The team then stored queries (messages from synthetic users) along with the complete set of available memories in a LangSmith dataset. This created a reproducible test environment that could be used across multiple experiment iterations.
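A workflow like this can be sketched with the LangSmith Python SDK. The payload-building logic below is a guess at the shape of such a dataset (query plus available memories as inputs, labeled relevant IDs as outputs); the dataset name and record fields are assumptions, while `Client.create_dataset` and `Client.create_examples` are real SDK entry points.

```python
def build_examples(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Turn (query, memories, relevant_ids) records into LangSmith example payloads."""
    inputs = [{"query": r["query"], "memories": r["memories"]} for r in records]
    outputs = [{"relevant_ids": r["relevant_ids"]} for r in records]
    return inputs, outputs

def upload(records: list[dict], dataset_name: str = "dot-synthetic-retrieval") -> None:
    # Lazy import: requires `pip install langsmith` and a LANGCHAIN_API_KEY.
    from langsmith import Client

    client = Client()
    dataset = client.create_dataset(dataset_name=dataset_name)
    inputs, outputs = build_examples(records)
    client.create_examples(inputs=inputs, outputs=outputs, dataset_id=dataset.id)
```

Storing the full set of available memories alongside each query is what makes the experiments reproducible: every retrieval method is evaluated against the same frozen memory store.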
Using an in-house tool connected to LangSmith, the team labeled relevant memories for each query and defined standard information retrieval evaluation metrics including precision, recall, and F1 score. This labeling infrastructure enabled them to create ground truth datasets against which they could measure retrieval performance across different methods and configurations.
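The standard IR metrics mentioned here are straightforward to compute once ground-truth labels exist. A minimal reference implementation, treating retrieved and relevant memories as ID sets:

```python
def retrieval_metrics(retrieved: set, relevant: set) -> dict:
    """Precision, recall, and F1 for one query against labeled ground truth."""
    tp = len(retrieved & relevant)  # true positives: relevant items we retrieved
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Aggregating these per-query scores (e.g., macro-averaging across the dataset) yields the experiment-level numbers that the comparison views visualize.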
The New Computer team tested a diverse range of retrieval methods, including semantic search, keyword matching, BM25 (a probabilistic ranking function commonly used in information retrieval), and meta-field filtering. Their baseline system used simple semantic search that retrieved a fixed number of the most relevant memories per query.
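For readers unfamiliar with BM25, here is a self-contained sketch of the Okapi BM25 scoring function over a toy corpus, with deliberately naive whitespace tokenization. This illustrates the keyword-ranking baseline in general, not New Computer's implementation.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency: number of docs containing each term.
    df = Counter(t for d in tokenized for t in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Unlike embedding similarity, BM25 rewards exact term overlap weighted by term rarity, which is why the two methods tend to win on different query types.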
A key insight from their experimentation was that different query types performed better with different retrieval methods. In some cases, similarity search or keyword methods like BM25 worked better, while in others, these methods required pre-filtering by meta-fields to perform effectively. This finding highlights the importance of query-aware retrieval routing or hybrid approaches in production LLM systems.
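A query-aware router of the kind this finding motivates can be sketched as a simple dispatch function. The heuristics and strategy names below are purely hypothetical illustrations, not New Computer's actual routing logic.

```python
import re

def route(query: str) -> str:
    """Pick a retrieval strategy based on surface features of the query."""
    q = query.lower()
    # Task-like queries benefit from meta-field pre-filtering (status, due date).
    if re.search(r"\b(tasks?|todos?|due|deadline)\b", q):
        return "metafield_prefilter"
    # Short keyword-style queries suit lexical ranking like BM25.
    if len(q.split()) <= 3:
        return "bm25"
    # Default: embedding-based semantic similarity.
    return "semantic"
```

In practice such routing rules would themselves be tuned against the labeled dataset, since a misrouted query bounds the achievable recall regardless of how good each individual method is.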
The case study notes that running multiple retrieval methods in parallel can lead to a “combinatorial explosion of experiments,” which emphasizes the importance of having robust tooling for rapid experimentation. LangSmith’s SDK and Experiments UI reportedly enabled New Computer to run, evaluate, and inspect experiment results efficiently. The platform’s comparison view allowed them to visualize F1 performance across different experimental configurations, facilitating data-driven decisions about which approaches to incorporate into production.
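The "combinatorial explosion" can be made concrete as a configuration grid swept with the LangSmith `evaluate` entry point. The method names, `top_k` values, and the `retrieve`/`f1_evaluator` callables are assumed placeholders; `langsmith.evaluate` with `data`, `evaluators`, and `experiment_prefix` is the real SDK interface.

```python
from itertools import product

METHODS = ["semantic", "bm25", "hybrid"]
TOP_KS = [5, 10, 20]

def experiment_grid() -> list[tuple[str, int]]:
    """Enumerate every (method, top_k) configuration to test."""
    return list(product(METHODS, TOP_KS))

def run_all(dataset_name: str, retrieve, f1_evaluator) -> None:
    # Lazy import: requires `pip install langsmith` and a LANGCHAIN_API_KEY.
    from langsmith import evaluate

    for method, top_k in experiment_grid():
        evaluate(
            lambda inputs: retrieve(inputs["query"], method=method, top_k=top_k),
            data=dataset_name,
            evaluators=[f1_evaluator],
            experiment_prefix=f"{method}-k{top_k}",
        )
```

Even this small grid yields nine experiments per change; each additional dimension (chunking, filtering, re-ranking) multiplies the count, which is why tooling that tracks and compares runs matters.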
Beyond memory retrieval, the case study describes how New Computer applied similar experimentation methodology to their conversational prompt system. Dot’s responses are generated by what they call a “dynamic conversational prompt” that incorporates relevant memories, tool usage (such as search results), and contextual behavioral instructions.
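A prompt assembled from those three variable ingredients might look like the sketch below. The section headers and structure are assumptions for illustration, not Dot's actual prompt template.

```python
def build_prompt(user_message: str,
                 memories: list[str],
                 tool_results: list[str],
                 behaviors: list[str]) -> str:
    """Assemble a dynamic conversational prompt from variable components."""
    sections = []
    if memories:
        sections.append("Relevant memories:\n" + "\n".join(f"- {m}" for m in memories))
    if tool_results:
        sections.append("Tool results:\n" + "\n".join(f"- {t}" for t in tool_results))
    if behaviors:
        sections.append("Behavioral instructions:\n" + "\n".join(f"- {b}" for b in behaviors))
    sections.append(f"User: {user_message}")
    return "\n\n".join(sections)
```

Because each section is conditionally included, the effective prompt varies widely across queries, which is exactly why a change tuned on one query shape can regress another.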
This type of highly variable prompt system presents a classic LLMOps challenge: changes that improve performance on one type of query can cause regressions on others. The New Computer team addressed this by again using synthetic users to generate queries spanning a wide range of intents, creating a diverse test set to catch potential regressions.
LangSmith’s experiment comparison view proved valuable for identifying regressed runs resulting from prompt changes. The team could visually inspect the global effects of prompt modifications across their entire test set, making it easier to spot where improvements in one area caused degradation in another. When failures were identified, the built-in prompt playground allowed engineers to adjust prompts directly within the LangSmith UI without context-switching to other tools, reportedly improving iteration speed.
The case study mentions that New Computer recently launched Dot and achieved notable conversion metrics, with more than 45% of users converting to the paid tier after hitting the free message limit. While this is a business metric rather than a technical one, it suggests that the improvements in memory retrieval and conversation quality may have contributed to user retention and willingness to pay.
The ongoing use of LangSmith appears to be central to New Computer’s development workflow, with the case study indicating that the partnership with LangChain “will remain pivotal” to how the team develops new AI features. This suggests a sustained commitment to observation and experimentation tooling rather than a one-time optimization effort.
While the reported improvements (50% higher recall and 40% higher precision) are significant, several details would strengthen the case study. The baseline comparison is described as a “previous baseline implementation of dynamic memory retrieval,” but the specific configuration and limitations of this baseline are not fully specified. The absolute values of recall and precision are not provided, only relative improvements, making it difficult to assess overall system performance.
The synthetic data approach, while valuable for privacy preservation, also raises questions about how well synthetic user queries and memories reflect real-world usage patterns. The case study does not address whether the improvements observed on synthetic data translated proportionally to improvements on real user interactions.
Additionally, the case study is published on LangChain’s blog and specifically highlights LangSmith’s features, so readers should consider that the perspective may emphasize the positive aspects of using these specific tools. That said, the technical approach of systematic experimentation with labeled datasets and standardized metrics represents sound LLMOps practice regardless of the specific tooling used.
The New Computer case study illustrates several important LLMOps patterns for teams building production LLM applications. First, the use of synthetic data generation for privacy-preserving evaluation is a practical approach for systems handling sensitive user information. Second, the multi-method retrieval experimentation demonstrates the value of not assuming a single retrieval approach will work best across all query types. Third, the emphasis on regression detection through comparison views highlights the interconnected nature of LLM systems where local improvements can cause global regressions.
The integration of prompt experimentation directly into the development workflow, including the ability to modify prompts and immediately evaluate changes, represents an evolution toward more mature LLMOps practices where experimentation and deployment are more tightly coupled. This approach can accelerate iteration cycles but also requires robust evaluation infrastructure to catch issues before they reach production users.