Benchling, a 14-year-old platform for life sciences R&D data management, launched Benchling AI six months ago to bring intelligent agents to scientific workflows. The problem scientists face is the time-consuming nature of drug discovery, from initial experiments to FDA submissions, involving manual data entry, analysis, and report writing. Benchling AI addresses this through a chat-based agent interface that leverages their extensive historical data repository to help scientists find relevant experiments, design new tests, analyze results, and generate regulatory reports. The system uses multiple model families in parallel for critical tasks like data entry, employs custom-built harnesses tailored to scientific workflows rather than coding-focused architectures, and integrates agent skills that function like standard operating procedures. Early results suggest the potential to reduce drug discovery timelines by 2x through eliminating workflow bottlenecks and enabling more efficient experimental design.
Benchling represents a compelling case study in building production AI agents for life sciences, demonstrating how domain-specific data platforms can create significant advantages when layering intelligence on top of existing infrastructure. The company has operated for 14 years as a platform where life science R&D organizations store and manage experimental data, physical lab materials, instrument automation, and data analysis. Six months ago, they launched Benchling AI, an agent-based system that sits atop this data repository to accelerate scientific workflows.
Overall Architecture and Design Philosophy
The fundamental architecture follows what Nick Stone describes as a “pretty standard agent loop” with an LLM calling tools, but the interesting aspects lie in tool design, skills integration, and context management. The system presents as a chat interface where scientists can ask questions ranging from simple data retrieval to complex multi-step tasks. Task horizons vary dramatically, from five-second queries to fifteen-to-twenty-minute operations like generating regulatory reports. The agent can interact with objects within Benchling, search historical experiments, design new experiments, analyze data, and compile comprehensive documentation.
A critical architectural decision involves not using standard coding agent harnesses. Stone argues that coding agent harnesses are optimized for tasks with clear specifications and verifiable end states, which don’t map well to scientific discovery. He references research like Google’s co-scientist paper that used tournament-style LLM competitions to generate novel hypotheses, a radically different approach from specification-driven coding. While Benchling’s agents do write code for tasks like file manipulation and data analysis, this represents only a fraction of use cases. Many queries involve no code at all, reflecting the broader scope of scientific inquiry beyond computational tasks.
The Data Advantage
One of Benchling’s most significant competitive advantages stems from having accumulated 14 years of structured life sciences data before building agents. Stone emphasizes that when scientists first tried ChatGPT, they asked domain-specific questions and received poor answers due to hallucinations or insufficient context. By contrast, when models operate on top of the right data, answer quality increases dramatically. This data-first approach enables faster responses and higher quality outputs compared to generic agents that might use protocols like Model Context Protocol to query external systems. While Benchling both implements MCP as a server and client, they note its limitations compared to having direct internal access to comprehensive data repositories.
The company has actively adapted their platform to better serve agents. Before AI, they had a standard search platform, but they’ve since modified chunking strategies, changed backend storage, and optimized for embedding search. These improvements benefit both agents and human users. The existence of a dedicated search team allows the agent team to focus on agent architecture while collaborating closely on search optimization, demonstrating the organizational advantages of an established company.
Multi-Model Strategy and Routing
Benchling employs a sophisticated multi-model approach, with individual agent runs potentially hitting seven or eight different models, though most runs use only one or two. They maintain one main driver model for the core harness but call out to other models for specific tasks. The system leverages what Stone calls “spiky intelligence,” recognizing that different model families have different strengths. Claude, GPT, and Gemini each make different types of errors and excel at different tasks. For instance, Gemini shows particular strength with image-related tasks.
For critical operations like data entry, Benchling routes the same problem to multiple model families simultaneously and cross-compares results. When two models disagree, it typically signals an error or something requiring human review. When they agree, the data quality is usually sufficient. This approach arose from recognizing that bringing unstructured data from PDFs, Excel spreadsheets, and instrument outputs into structured tables is foundational—if this step fails, everything downstream becomes irrelevant. Life sciences organizations are highly sensitive to data quality, making this multi-model validation approach worthwhile despite increased token costs.
The routing strategy doesn’t yet involve sophisticated taxonomies of which models excel at which specific tasks. Instead, they route to multiple models for critical operations and select specific model families for known strengths like image processing. Speed and cost considerations also influence routing decisions, but for performance improvement on important tasks, they prioritize running multiple models despite higher token consumption.
Skills and Memory
Benchling has adopted the agent skills specification, allowing both user-created skills and built-in native skills developed by Benchling. Users can create personal skills or share them with teams and the broader organization using the platform’s existing permissioning infrastructure. The agent can create and update skills autonomously, though it’s prompted to be conservative and typically waits for user initiation. Stone describes skills as analogous to Standard Operating Procedures in laboratory settings—not guaranteed to produce correct results but designed to specify exact steps for consistent execution.
Skills represent an early form of memory in the system. When a user completes a task, the agent can create a skill to perform similar tasks faster and with higher fidelity in the future. Benchling is also developing background agents that could analyze recent chat sessions and proactively suggest skill creation, though they remain cautious about generating unhelpful skills.
Looking forward, Benchling envisions two types of background agents: schedule-based agents that generate regular reports for recurring meetings, and event-based agents triggered by platform activities. The event-based approach is particularly powerful given Benchling’s rich existing infrastructure. For example, when an instrument in the lab takes measurements and that data automatically flows to Benchling, an agent could trigger analysis so scientists return from the lab to find not just cleaned data but completed analysis with recommendations for next steps.
Evaluation and Production Monitoring
Evaluation presents significant challenges in life sciences because scientific questions are highly specific to each organization’s research focus. Stone acknowledges this difficulty directly, noting that while they do build evaluations for verifiable tasks like data retrieval and data entry, many scientific tasks resist standardized evaluation because ground truth requires running expensive, time-consuming physical experiments.
The team relies heavily on production trace analysis, with everyone on the team reviewing traces. They’ve instituted a weekly fire chief rotation where someone addresses issues and brings interesting traces to weekly technical operations meetings. Users provide thumbs up/thumbs down feedback that signals which traces warrant closer examination. Engineers and product managers working on specific features examine production traces after releases or beta deployments to understand actual usage patterns.
This production-centric evaluation approach reflects the reality that building realistic evaluations for novel scientific discovery is extremely difficult. Stone mentions they’re working on releasing better evaluations that more accurately represent problems scientists face in industry rather than academic benchmarks, but observational learning from real usage remains central to their improvement cycle.
Tool Design and Context Management
Stone identifies tool design and context engineering as key factors differentiating good agents from mediocre ones. This remains challenging despite longer context windows, which help but don’t solve the problem entirely. Benchling has adopted some patterns from popular coding agent harnesses, like virtual file systems that mimic local development environments, but they also lean heavily on their existing SQL infrastructure.
The models prove quite capable at writing SQL queries, so Benchling has implemented techniques like embedding table names and descriptions to create faster paths for agents to query appropriate data. This represents a balance between following industry trends toward file-based paradigms and leveraging 14 years of architectural investment. The approach combines keeping up with latest tools where models excel while capitalizing on established infrastructure.
Context management becomes particularly important given the complexity and volume of scientific data. While specific implementation details aren’t fully elaborated, the emphasis on this aspect suggests ongoing experimentation with how to present relevant context without overwhelming model capabilities or exceeding practical limits.
User Experience and Education
Benchling invests significantly in user education, recognizing that scientists don’t necessarily follow every development in AI technology. Stone estimates that only about half of scientists have even heard of skills as a concept, and fewer have experience building them. The company frames skills as “SOPs for agents,” leveraging familiar laboratory concepts to make new AI capabilities comprehensible.
The platform has evolved from initially offering multiple separate agents to consolidating into a single interface to reduce user confusion about which agent to use for which task. Under the hood, they maintain mostly one agent with sub-agents or sub-workflows for specific tasks, though they continue experimenting with different harnesses for harder scientific problems.
Benchling has maintained a large service organization since its founding because life sciences work is highly complex and varied—no two companies pursue identical research. This forward-deployed approach extends naturally to AI implementation, where the team spends considerable time understanding customer workflows and bottlenecks rather than simply providing tools and expecting immediate productivity gains. The service-oriented approach helps translate complex AI capabilities into practical scientific applications.
Verification and Human-in-the-Loop Patterns
Most scientific tasks are less verifiable than data entry. Even data entry, while more verifiable than hypothesis generation, becomes difficult for humans to fully verify when dealing with large datasets. However, verification remains easier than manual entry followed by verification, so the AI still provides value. For high-stakes tasks like FDA report generation, human verification always occurs before submission, but AI dramatically accelerates first draft creation—what historically took months can now be accomplished in minutes to hours.
An emerging pattern involves using one AI agent to perform work and another fresh agent session to verify it, creating an automated review layer before human inspection. This multi-stage verification approach adds rigor while maintaining efficiency gains.
Some tasks are fundamentally unverifiable in silico because ground truth requires running physical experiments costing hundreds of thousands of dollars and taking weeks or months. Scientists use AI to suggest experiments based on historical data and their intuition, but confirmation requires actual laboratory work. This represents a fundamental difference from coding where test suites can provide rapid verification.
Charging Model and Cost Considerations
Benchling employs a usage-based pricing model rather than per-seat licensing. Customers receive monthly free credit allotments based on their seat count, with the ability to purchase additional credits as needed. This approach accommodates the spiky nature of scientific work—baseline usage for routine tasks complemented by intensive periods like regulatory submission preparation when report generation consumes substantial resources.
Interestingly, Benchling doesn’t expose underlying model selection to users. Unlike software engineers who might follow every model release and develop intuitions about model-specific capabilities, scientists shouldn’t need to care which model handles their request. Benchling considers it their responsibility to select appropriate models, build evaluations, and monitor production traffic to ensure quality. This abstraction shields users from infrastructure complexity while allowing Benchling to optimize model selection for specific task types.
Intelligence-Limited vs. System-Limited Tasks
A key insight from Stone involves distinguishing intelligence-limited tasks from system-limited ones. Many scientific tasks are intelligence-limited—smarter models will genuinely perform better. This differs from domains where the primary challenge involves system integration or workflow orchestration. For intelligence-limited tasks, the strategy involves spending more tokens through techniques like multi-model parallel execution, because the investment yields qualitatively better answers.
Stone draws explicit parallels between working with LLMs and conducting biological research. Both involve probing black box systems that aren’t fully understood, observing responses, and iterating. This contrasts with traditional software engineering’s deterministic mindset. He suggests that people effective with LLMs may resemble scientists and biologists more than software engineers, emphasizing product-minded engineers who deeply understand user problems and can merge that understanding with knowledge of where LLMs excel and struggle.
Domain-Specific Model Training Considerations
Despite being intelligence-limited in a specialized domain, Benchling hasn’t pursued extensive fine-tuning or custom model training. Several factors drive this decision. First, their customers have strict security and privacy requirements that preclude cross-customer training. Second, attempts by various organizations to fine-tune models specifically for biology haven’t demonstrated clear advantages over state-of-the-art frontier models. The loss of broader model capabilities typically outweighs gains from domain specialization.
Stone acknowledges this might change, potentially with task-specific models for narrow applications while a frontier model serves as the main driver, but current frontier models provide sufficient general intelligence. The difference from other organizations pursuing fine-tuning often relates to cost optimization—those organizations have solved their intelligence problem and now face scaling costs. Benchling remains intelligence-constrained for many tasks, making frontier model capabilities more valuable than cost reduction through smaller specialized models.
The challenge of making models smarter at biology stems partly from verification difficulties. Reinforcement learning and other post-training techniques work best with clear feedback signals. Biology’s complexity and the expense of experimental validation make this more difficult than domains with rapid, cheap verification.
Timeline and Impact Expectations
Benchling aims to reduce the time from initial drug discovery to patient delivery by 2x. This acceleration comes from two mechanisms. First, eliminating white space in research workflows—moments when scientists wait for data transfer, statistical analysis, or other bottlenecks that cascade into longer delays. Second, improving experimental efficiency through structured approaches like design of experiments methodology, where AI helps scientists plan factorial experiments that yield statistical power more quickly than sequential single-variable modifications.
However, Stone is measured about expectations for fully autonomous drug discovery. He predicts agents will make drug discovery cheaper and faster but sees agent-discovered drugs as years away. The models don’t yet understand biology well enough, and humanity’s overall knowledge base has gaps that require physical experimentation to fill. Even dramatically improved models will need to interface with the real world to generate data, and robust multi-purpose lab automation remains difficult. The realistic near-term future involves humans and agents working in concert.
Future Directions and Experimentation
Benchling continues experimenting with harness design, particularly for intelligence-limited tasks where spending more tokens makes sense. Current experiments extend beyond parallel multi-model execution to explore entirely different harness architectures. Stone mentions potential harnesses optimized for novelty—addressing the misconception that LLMs can’t produce novel outputs. While many coding tasks involve well-scoped specifications, scientific research requires making unexpected connections across organizational research data and generating genuinely new ideas.
A vision emerges of AI scientists that proactively suggest ideas by understanding the totality of organizational science, potentially through background agents monitoring new experimental results and proposing novel hypotheses. This shifts scientific work from task execution toward reviewing AI-generated ideas, though Stone notes that scientists typically prefer thinking about hard problems and experimental design over tedious data entry or code writing, making them natural reviewers rather than experiencing the review fatigue reported in some software engineering contexts.
The company continues balancing standardization with customization. Some concepts like laboratory notebooks work well for both humans and agents—planning what to do, recording what actually happened, and implementing review processes. Other aspects benefit from agent-generated custom interfaces like HTML dashboards. The key involves maintaining continuity rather than interfaces that change constantly, which most humans resist.
Broader Industry Context
Stone observes that life sciences may emerge as the second major domain for AI agents after coding, attributing this to both the universal importance of human health and food security, and the fact that many AI researchers have life sciences backgrounds. The parallel between working with LLMs and conducting biological research creates natural synergy—both involve probing complex systems, building assays to understand behavior, and iterating based on observations. This methodological alignment may give life sciences advantages in adopting agent-based approaches compared to domains that expect deterministic engineering outcomes.
The case study illustrates how established platforms with rich historical data can create sustainable advantages in the agent era, how multi-model strategies can enhance reliability for critical tasks, and how domain-specific considerations should drive architectural decisions rather than blindly adopting patterns from other domains. Benchling’s measured approach to evaluation, emphasis on production monitoring, and commitment to user education offer valuable patterns for others deploying agents in specialized professional domains.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Stripe, processing approximately 1.3% of global GDP, has evolved from traditional ML-based fraud detection to deploying transformer-based foundation models for payments that process every transaction in under 100ms. The company built a domain-specific foundation model treating charges as tokens and behavior sequences as context windows, ingesting tens of billions of transactions to power fraud detection, improving card-testing detection from 59% to 97% accuracy for large merchants. Stripe also launched the Agentic Commerce Protocol (ACP) jointly with OpenAI to standardize how agents discover and purchase from merchant catalogs, complemented by internal AI adoption reaching 8,500 employees daily using LLM tools, with 65-70% of engineers using AI coding assistants and achieving significant productivity gains like reducing payment method integrations from 2 months to 2 weeks.
Moody's Analytics, a century-old financial institution serving over 1,500 customers across 165 countries, transformed their approach to serving high-stakes financial decision-making by evolving from a basic RAG chatbot to a sophisticated multi-agent AI system on AWS. Facing challenges with unstructured financial data (PDFs with complex tables, charts, and regulatory documents), context window limitations, and the need for 100% accuracy in billion-dollar decisions, they architected a serverless multi-agent orchestration system using Amazon Bedrock, specialized task agents, custom workflows supporting up to 400 steps, and intelligent document processing pipelines. The solution processes over 1 million tokens daily in production, achieving 60% faster insights and 30% reduction in task completion times while maintaining the precision required for credit ratings, risk intelligence, and regulatory compliance across credit, climate, economics, and compliance domains.