Cambrium uses LLMs and AI to design and generate novel proteins for sustainable materials, starting with vegan human collagen for cosmetics. The company developed a protein programming language that turns protein design into a mathematical optimization problem, enabling efficient search through massive protein sequence spaces, and is now layering LLM-era generative techniques on top. This combination of traditional protein engineering with modern ML allowed them to bring a biotech product to market in under two years.
Cambrium is a three-year-old biotech startup based in Berlin that uses AI to design proteins which become sustainable materials. Founded in September 2020, the company has achieved what is reportedly the fastest biotech product commercialization in EU history—bringing their first product, NovaColl (a vegan human collagen for cosmetics), to market in under two years. Pierre, the Head of Engineering at Cambrium (and employee number two), shares insights into how the company applies LLMs and generative AI in a production biotechnology context, offering a fascinating case study of LLMOps in a domain very different from typical software applications.
Cambrium’s engineering team operates under constraints quite different from typical ML/AI companies. Unlike teams that must optimize for low-latency serving or high QPS (queries per second), Cambrium primarily serves internal scientists—researchers they work alongside and “drink beer with.” This fundamentally changes the operational requirements. Their services don’t need to handle millions of customers or maintain 24/7 uptime with millisecond response times.
However, they face a unique constraint that dominates their engineering priorities: the extraordinarily high cost of biological experiments. Every data point is precious because DNA handling, laboratory equipment, consumables, and the experiments themselves are extremely expensive. This creates an unusual LLMOps priority: data integrity and pipeline reliability are paramount. As Pierre explains, if a data pipeline fails and data “disappears in the cloud limbo,” it’s actually quite dramatic because recreating that experimental data is prohibitively costly.
The cost of prediction errors in their ML models is relatively low compared to domains like healthcare diagnostics—if a protein doesn’t perform as expected, the material simply doesn’t work as intended, resulting in financial loss but no human safety issues. This risk profile allows for more experimental approaches to model development while still demanding extreme care with data handling.
Cambrium has deployed an LLM-based assistant built on a LangChain stack with retrieval-augmented generation (RAG) capabilities. This internal tool serves as a sophisticated search engine that can query across multiple internal data sources.
The assistant handles requests ranging from simple tasks, like generating LinkedIn posts about recent products, to more complex queries, such as identifying the latest high-performing protein sequences produced in the lab. When asked about the company’s latest product, it correctly retrieves that the product is NovaColl and when it was released.
Pierre describes this as a “glorified search engine” but emphasizes its practical utility for researchers who need quick access to information scattered across multiple internal systems. Notably, Pierre expresses frustration with the market hype around similar RAG products, suggesting that while the LangChain stack is well-engineered, the business value of such applications is sometimes overstated by companies raising significant funding for essentially the same technical approach.
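As a rough sketch of what such an assistant can look like on a LangChain stack, the snippet below indexes internal notes and answers questions over them. The file path, model names, chunking, and retrieval settings are assumptions for illustration, not Cambrium’s actual configuration:

```python
# Minimal RAG sketch (illustrative only; Cambrium's actual stack is not public).
# Assumes langchain, langchain-openai, langchain-community, and faiss-cpu are installed.
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA

# Load internal notes exported from lab systems (hypothetical file path).
docs = TextLoader("lab_notes/novacoll_runs.txt").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# Index the chunks so the assistant can retrieve relevant context at query time.
index = FAISS.from_documents(chunks, OpenAIEmbeddings())

# Wire retrieval into an LLM question-answering chain.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    retriever=index.as_retriever(search_kwargs={"k": 4}),
)

print(qa.invoke({"query": "What is our latest product and when was it released?"}))
```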
Before the LLM boom, Cambrium developed what they call a “protein programming language”: a formal, compiled language for specifying protein designs. This represents an early approach to bridging language-like abstractions with computational protein design. The language lets scientists express design constraints and target properties in a machine-readable form that downstream tooling can optimize against.
This language transforms protein design into a mathematical optimization problem, enabling the team to sift through approximately 10^18 possible protein sequences to identify top candidates for laboratory testing. The NovaColl product was developed using this approach, and Pierre credits this computational capability with enabling them to “one-shot” regulatory feasibility checks and scalability validations.
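The talk does not expose the language’s internals, but the core idea—compiling design intent into a scoring function and searching sequence space for high scorers—can be sketched roughly. Everything below (the motif, the hydrophobicity target, the weights, the sampling strategy) is an invented stand-in, not Cambrium’s actual method:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AVILMFWY")

def score(seq: str) -> float:
    """Toy objective combining hypothetical design constraints:
    a required motif, a target hydrophobic fraction, and a target length."""
    motif_bonus = 1.0 if "GPP" in seq else 0.0          # e.g. a collagen-like Gly-Pro-Pro repeat
    hydro_frac = sum(aa in HYDROPHOBIC for aa in seq) / len(seq)
    hydro_term = -abs(hydro_frac - 0.35)                # prefer ~35% hydrophobic residues
    length_term = -abs(len(seq) - 60) / 60              # prefer ~60 residues
    return motif_bonus + hydro_term + length_term

def random_seq(n: int) -> str:
    return "".join(random.choice(AMINO_ACIDS) for _ in range(n))

# Naive search: sample candidates and keep the best scorers.
# A real system searching a ~10^18 space needs far smarter solvers than this.
candidates = [random_seq(random.randint(50, 70)) for _ in range(10_000)]
top = sorted(candidates, key=score, reverse=True)[:5]
for seq in top:
    print(f"{score(seq):+.3f}  {seq}")
```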
For their next products (details confidential), Cambrium is using more advanced LLM-era techniques including diffusion models for protein generation. Pierre describes being able to “see the protein coming out of like a blur of atoms and folding together”—using diffusion-style generative approaches where proteins gradually take shape with properties defined by mathematical loss functions.
This represents a shift from the constraint-based programming language approach to true generative AI for proteins. The underlying principle is that proteins, like natural language, are sequences of discrete units (amino acids vs. letters) with complex “grammar” and “syntax” that determine whether a sequence has meaningful function. Random amino acid sequences typically produce non-functional “spaghetti ball” structures, but understanding the structural and functional grammar allows for generating sequences with specific properties.
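As a rough illustration of the diffusion idea, the toy loop below runs standard DDPM-style ancestral sampling over a continuous “sequence embedding” with a placeholder noise predictor. A real protein diffusion model (and whatever Cambrium runs internally) replaces that placeholder with a trained network and decodes the result back into amino acids:

```python
import numpy as np

# Toy reverse-diffusion sampler over a (length, embedding_dim) array.
# The noise predictor is a stand-in; a trained network would encode the
# learned "grammar" of functional sequences and structures.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

rng = np.random.default_rng(0)

def predict_noise(x, t):
    """Placeholder denoiser that gently nudges samples toward zero."""
    return x * 0.1

x = rng.standard_normal((60, 16))          # start from pure noise ("blur of atoms")
for t in reversed(range(T)):               # standard DDPM ancestral sampling loop
    eps = predict_noise(x, t)
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (x - coef * eps) / np.sqrt(alphas[t])
    noise = rng.standard_normal(x.shape) if t > 0 else 0.0
    x = mean + np.sqrt(betas[t]) * noise

print("refined embedding shape:", x.shape)  # downstream: decode to an amino-acid sequence
```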
Cambrium deals with two primary data source types, each presenting different challenges:
Robot/Machine Data: Lab instruments produce “ugly CSVs” that need parsing—a relatively straightforward engineering challenge.
Scientist Experimental Notes: This is more challenging. Biology curricula have been slow to adopt data-driven practices and digitalization, so scientists often don’t naturally think in data-first terms. Training them to approach experimentation with a data-driven mindset is essential but comes with a non-trivial learning curve.
The company addresses this through training and, as discussed below, by hiring scientists with data-driven mindsets from the start.
Given the high cost of experimental data, pipeline reliability is critical. Data loss is treated as a serious incident because unlike web applications where you might just lose some user analytics, losing experimental data means potentially having to repeat expensive laboratory work. This creates a strong emphasis on robust data engineering practices even though the team doesn’t need the scale-focused infrastructure of consumer-facing applications.
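A minimal sketch of that posture, applied to the “ugly CSVs” coming off lab instruments: fail loudly rather than silently drop rows. The column names, file layout, and schema below are hypothetical:

```python
import pandas as pd

REQUIRED_COLUMNS = {"sample_id", "timestamp", "yield_mg_per_l"}  # hypothetical schema

def ingest_instrument_csv(path: str) -> pd.DataFrame:
    """Parse an instrument export and refuse to continue if anything is missing.
    Losing a row here could mean repeating days of expensive lab work."""
    # skiprows / encoding are placeholders for whatever junk a given instrument emits
    df = pd.read_csv(path, skiprows=3, encoding="latin-1")
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"{path}: missing expected columns {sorted(missing)}")
    if df["sample_id"].isna().any() or df["yield_mg_per_l"].isna().any():
        raise ValueError(f"{path}: null measurements detected; refusing to ingest silently")

    return df
```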
Pierre emphasizes that Cambrium’s position as a startup building “something from nothing” gave them an advantage—they could hire the right people with data-driven mindsets from the start. Larger, established biotech companies face the enormous challenge of digitalizing decades of paper notebooks and retraining hundreds of scientists with established habits. This is a significant competitive advantage for smaller, newer companies in the biotech space.
A key insight shared is the deep similarity between protein sequences and natural language, which is what makes LLM-style approaches applicable to protein design: both are sequences of discrete tokens governed by an implicit grammar that separates functional combinations from meaningless ones.
This analogy isn’t just theoretical—it’s driving practical applications at Cambrium where they’re applying generative AI techniques originally developed for language to create novel proteins with designed properties.
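The analogy is easy to demonstrate with publicly available protein language models such as ESM-2, which score amino-acid sequences much as a text LM scores sentences; under such a model, a random “spaghetti ball” sequence tends to look less natural than an evolved, repeat-structured one. The snippet below is illustrative and unrelated to Cambrium’s internal tooling:

```python
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

# Small public protein language model; not affiliated with Cambrium.
name = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(name)
model = EsmForMaskedLM.from_pretrained(name).eval()

def pseudo_log_likelihood(seq: str) -> float:
    """Mask each residue in turn and score the model's ability to recover it,
    analogous to per-token perplexity for a text language model."""
    ids = tokenizer(seq, return_tensors="pt")["input_ids"]
    total, n = 0.0, ids.shape[1] - 2           # exclude BOS/EOS special tokens
    for i in range(1, ids.shape[1] - 1):
        masked = ids.clone()
        masked[0, i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(input_ids=masked).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        total += log_probs[ids[0, i]].item()
    return total / n

collagen_like = "GPPGPAGPPGPAGPPGPAGPPGPA"   # repeat-structured sequence
arbitrary = "WQHKDYCMRVTLNEAIGFSPWQHK"        # arbitrary residues
print(pseudo_log_likelihood(collagen_like), pseudo_log_likelihood(arbitrary))
```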
The broader strategic context shapes Cambrium’s LLMOps approach. They started with cosmetics because a high-value specialty ingredient like vegan collagen offers a commercially viable entry point, in contrast to the lower-margin commodity materials they ultimately want to address.
Their long-term vision includes working toward sustainable materials that could eventually replace plastics and other commodity chemicals. The computational capabilities—including LLM-based approaches—are essential to de-risking the expensive experimental pathway from high-value specialty chemicals to lower-margin commodity materials.
This case study offers several valuable perspectives for LLMOps practitioners.
The emphasis on data pipeline reliability, scientist workflow integration, and experimental de-risking through AI represents a distinctive LLMOps pattern for research-intensive organizations where the cost of data generation dominates other operational concerns.
OpenAI's Forward Deployed Engineering (FDE) team, led by Colin Jarvis, embeds with enterprise customers to solve high-value problems using LLMs and deliver production-grade AI applications. The team focuses on problems worth tens of millions to billions in value, working with companies across industries including finance (Morgan Stanley), manufacturing (semiconductors, automotive), telecommunications (T-Mobile, Klarna), and others. By deeply understanding customer domains, building evaluation frameworks, implementing guardrails, and iterating with users over months, the FDE team achieves 20-50% efficiency improvements and high adoption rates (98% at Morgan Stanley). The approach emphasizes solving hard, novel problems from zero-to-one, extracting learnings into reusable products and frameworks (like Swarm and Agent Kit), then scaling solutions across the market while maintaining strategic focus on product development over services revenue.
This panel discussion at AWS re:Invent features three companies deploying AI models in production across different industries: Somite AI using machine learning for computational biology and cellular control, Upstage developing sovereign AI with proprietary LLMs and OCR for document extraction in enterprises, and Rambler AI building vision language models for industrial task verification. All three leverage AMD GPU infrastructure (MI300 series) for training and inference, emphasizing the importance of hardware choice, open ecosystems, seamless deployment, and cost-effective scaling. The discussion highlights how smaller, domain-specific models can achieve enterprise ROI where massive frontier models failed, and explores emerging areas like physical AI, world models, and data collection for robotics.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.