Paradigm (YC24) built an AI-powered spreadsheet platform that runs thousands of parallel agents for data processing tasks. The team used LangChain for rapid agent development and iteration, and LangSmith for monitoring, operational insight, and usage-based pricing. This enabled them to build task-specific agents for schema generation, sheet naming, task planning, and contact lookup while maintaining high performance and cost efficiency.
Paradigm is a Y Combinator 2024 (YC24) startup that has developed what they describe as “the first generally intelligent spreadsheet.” The core innovation is integrating AI agents into a traditional spreadsheet interface, enabling users to trigger hundreds or thousands of individual agents that perform data processing tasks on a per-cell basis. This case study, published by LangChain in September 2024, details how Paradigm leveraged LangChain and LangSmith to build, iterate, monitor, and optimize their multi-agent system in production.
It’s worth noting that this case study is published on LangChain’s own blog, so there is an inherent promotional angle to consider. However, the technical details provided offer genuine insights into how a startup approaches LLMOps challenges when running complex agent systems at scale.
Paradigm’s architecture centers on deploying numerous task-specific agents that work together to gather, structure, and process data within their spreadsheet product. The company uses LangChain as their primary framework for building these agents, taking advantage of its abstractions for structured outputs and rapid iteration capabilities.
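The case study does not show Paradigm's code, but the fan-out pattern it describes, one agent invocation per spreadsheet cell, can be sketched with the standard library. The function names and the stub agent below are illustrative assumptions; a real implementation would invoke a LangChain runnable inside `run_cell_agent`.

```python
from concurrent.futures import ThreadPoolExecutor

def run_cell_agent(row: int, column: str, prompt: str) -> str:
    """Stand-in for a task-specific agent; a real version would call
    a LangChain runnable with the column-specific prompt."""
    return f"value for ({row}, {column})"

def fill_sheet(rows: int, columns: dict[str, str], max_workers: int = 32) -> dict:
    """Fan out one agent call per cell and collect results keyed by
    (row, column), mirroring the per-cell agent model."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(run_cell_agent, r, col, prompt): (r, col)
            for r in range(rows)
            for col, prompt in columns.items()
        }
        for future, key in futures.items():
            results[key] = future.result()
    return results

sheet = fill_sheet(3, {"company": "Find the company name",
                       "email": "Find a contact email"})
```

At Paradigm's scale the pool would be replaced by a distributed queue, but the shape of the problem, thousands of independent cell-level tasks, is the same.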
The case study highlights several specific agents that Paradigm developed using LangChain:
Schema Agent: This agent takes a prompt as context and generates a set of columns along with column-specific prompts that instruct downstream spreadsheet agents on how to gather relevant data. This represents a meta-level agent that configures the behavior of other agents in the system.
Sheet Naming Agent: A micro-agent responsible for automatically naming each sheet based on the user’s prompt and the data contained within the sheet. This is an example of how smaller, focused agents can handle auxiliary tasks throughout the product.
Plan Agent: This agent organizes tasks into stages based on the context of each spreadsheet row. The purpose is to enable parallelization of research tasks, reducing latency without sacrificing accuracy. This agent essentially acts as an orchestrator that optimizes the execution order of other agents.
Contact Info Agent: Performs lookups to find contact information from unstructured data sources.
The agents leverage LangChain’s structured output capabilities to ensure data is generated in the correct schema, which is critical for a spreadsheet application where data consistency and proper formatting are essential.
One of the key LLMOps themes in this case study is rapid iteration. LangChain facilitated fast development cycles for Paradigm, allowing the team to refine prompts, model selection, and system parameters before deploying agents to production.
The ability to quickly iterate on these parameters and test changes is a fundamental aspect of LLMOps, as production AI systems often require continuous refinement based on real-world performance data.
The most detailed LLMOps content in this case study relates to monitoring and observability through LangSmith. Given that Paradigm’s product can trigger thousands of individual agents simultaneously, traditional debugging and monitoring approaches would be insufficient. The complexity of these operations necessitated a sophisticated system to track and optimize agent performance.
LangSmith provided Paradigm with what the case study describes as “full context behind their agent’s thought processes and LLM usage.” This granular observability let the team trace individual agent executions, understand their reasoning, and track LLM usage across thousands of parallel runs.
A particularly interesting use case mentioned is analyzing and refining the dependency system for column generation. When generating data for multiple columns in a spreadsheet, certain columns may depend on data from other columns. Paradigm needed to optimize the order in which columns are processed, prioritizing tasks that require less context before moving on to more complex jobs that depend on previously generated data.
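The ordering problem described here is a topological one: columns with no dependencies run first, and each later stage depends only on columns already generated. A minimal sketch of such staging (assuming a simple dependency map; Paradigm's actual dependency system is not shown in the case study) is:

```python
def stage_columns(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group columns into stages: stage 0 holds columns with no
    dependencies; each later stage depends only on earlier stages.
    Columns within a stage can be generated in parallel."""
    remaining = {col: set(d) for col, d in deps.items()}
    stages = []
    while remaining:
        ready = sorted(c for c, d in remaining.items() if not d)
        if not ready:
            raise ValueError("cyclic column dependencies")
        stages.append(ready)
        for c in ready:
            del remaining[c]
        for d in remaining.values():
            d.difference_update(ready)
    return stages

stages = stage_columns({
    "company": set(),
    "website": {"company"},
    "email": {"company", "website"},
})
# stages == [["company"], ["website"], ["email"]]
```

Each stage is a parallelization opportunity: cheap, context-free columns complete first and their outputs feed the more expensive dependent columns, which is exactly the latency/accuracy trade the Plan Agent is described as optimizing.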
The case study describes a concrete workflow where the team could change the structure of the dependency system, re-run the same spreadsheet job, and then use LangSmith to assess which system configuration led to the most clear and concise agent traces. This type of A/B testing for agent system architecture is a sophisticated LLMOps practice that requires robust observability tooling.
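The comparison itself can be mechanized once trace summaries are exported. The scoring criteria below (step count, then total tokens) are an assumption, as is the trace shape; in practice these figures would be pulled from LangSmith run data rather than hand-built dicts:

```python
def score_trace(trace: list[dict]) -> tuple[int, int]:
    """Score a trace by step count and total tokens; lower means a
    shorter, more concise agent run."""
    steps = len(trace)
    tokens = sum(step.get("tokens", 0) for step in trace)
    return (steps, tokens)

def best_config(traces_by_config: dict[str, list[dict]]) -> str:
    """Pick the dependency-system configuration whose re-run of the
    same spreadsheet job produced the most concise trace."""
    return min(traces_by_config, key=lambda c: score_trace(traces_by_config[c]))

winner = best_config({
    "flat": [{"tokens": 900}, {"tokens": 700}, {"tokens": 800}],
    "staged": [{"tokens": 600}, {"tokens": 500}],
})
# winner == "staged"
```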
A notable aspect of this case study is how Paradigm used LangSmith’s monitoring capabilities to implement a precise usage-based pricing model. This represents a business-critical application of LLMOps observability that goes beyond pure technical optimization.
LangSmith provided context on input and output token counts and historical tool usage for each job. This visibility enabled Paradigm to accurately calculate the cost of different tasks, which vary considerably in the tools and tokens they consume, and to build a nuanced usage-based pricing model.
By diving into historical tool usage and analyzing input/output tokens per job, Paradigm could better understand how to shape both their pricing structure and their tool architecture going forward. This is a practical example of how LLMOps insights translate directly into business model decisions.
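The arithmetic behind this kind of cost attribution is straightforward once per-call token counts are available. The per-million-token prices below are hypothetical placeholders, and the call records are hand-built stand-ins for what LangSmith actually logs:

```python
# Hypothetical (input, output) prices per million tokens; real values
# come from the model provider's pricing page.
PRICE_PER_M = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}

def job_cost(calls: list[dict]) -> float:
    """Sum LLM cost over every agent call in a spreadsheet job,
    using the input/output token counts recorded per call."""
    total = 0.0
    for call in calls:
        in_price, out_price = PRICE_PER_M[call["model"]]
        total += call["input_tokens"] / 1e6 * in_price
        total += call["output_tokens"] / 1e6 * out_price
    return total

cost = job_cost([
    {"model": "gpt-4o", "input_tokens": 200_000, "output_tokens": 50_000},
    {"model": "gpt-4o-mini", "input_tokens": 1_000_000, "output_tokens": 100_000},
])
# cost == 1.21
```

Aggregating this per job type is what lets a usage-based price reflect the true cost spread between cheap lookups and expensive multi-tool research tasks.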
The case study mentions that Paradigm has “a multitude of tools and APIs integrated into their backend that the agents can call to do certain tasks.” While specific integrations are not detailed, this highlights the reality of production agent systems that must interface with external services for data retrieval, verification, and other operations.
The monitoring of these tool calls through LangSmith suggests that Paradigm is tracking not just LLM inference costs but the entire operational footprint of their agent system, including external API usage.
While this case study provides useful insights into how a startup approaches multi-agent LLMOps, it has several limitations: it is published on LangChain’s own blog, it omits quantitative performance and cost figures, and it does not detail the specific tools and APIs the agents integrate with.
Despite these limitations, the case study offers valuable patterns for teams building similar multi-agent systems, particularly around the importance of observability at scale and using operational data to inform both technical and business decisions. The specific examples of agent types and the dependency system optimization workflow provide concrete inspiration for LLMOps practitioners.
The Paradigm case study illustrates several important LLMOps principles. First, when running agents at scale, comprehensive observability becomes essential rather than optional. The ability to trace individual agent executions and understand their reasoning is crucial for debugging and optimization. Second, rapid iteration capabilities during development allow teams to refine prompts, model selection, and system parameters before committing changes to production. Third, granular monitoring of token usage and tool calls enables accurate cost attribution, which is particularly important for usage-based pricing models. Finally, the dependency system optimization example shows how teams can use trace analysis to compare different architectural approaches and select the configuration that produces the best results.