Doctolib is transforming their healthcare data platform from a reporting-focused system to an AI-enabled unified platform. The company is implementing a comprehensive LLMOps infrastructure as part of their new architecture, including features for model training, inference, and GenAI assistance for data exploration. The platform aims to support both traditional analytics and advanced AI capabilities while ensuring security, governance, and scalability for healthcare data.
Doctolib, a prominent European healthcare technology company, has published a detailed architectural blueprint for transforming their data platform from a monolithic, centralized system into a modern “Unified Healthcare Data Platform” capable of supporting AI and machine learning use cases at production scale. This case study is notable because it describes an aspirational architecture rather than a completed implementation: the company is transparent about their current limitations and their planned solutions. That distinction matters when evaluating the claims, since many of the capabilities described are intended rather than proven in production.
The primary driver for this transformation is Doctolib’s ambition to evolve from operating a reporting-focused platform to becoming a leader in AI for healthcare. Their existing platform, while effective for business intelligence and analytics, was not designed for training and deploying machine learning models, particularly large language models (LLMs), in a healthcare context where data sensitivity and governance are paramount.
Over the past 4-5 years, Doctolib’s data team grew from a small startup team to over a hundred members. During this growth, they adopted a pragmatic but ultimately limiting approach: a single Git repository, single AWS account, single Redshift data warehouse, and single Airflow orchestrator. This monolithic architecture created several challenges that became particularly acute when attempting to support AI and ML use cases:
The centralized Git repository with a single daily release cycle led to CI pipelines taking 30-40 minutes, slowing development velocity. The shared Airflow instance struggled with event-driven workflows essential for ML pipelines, and all DAGs sharing the same IAM role created security vulnerabilities unacceptable for healthcare data. The monolithic Redshift warehouse meant all users had administrative rights, making it impossible to enforce fine-grained access controls needed for sensitive healthcare data used in AI training.
Perhaps most critically, the architecture lacked the foundation for supporting ML workloads: no vector databases for embeddings, no model registry, no feature store, and no infrastructure for deploying and monitoring models in production.
The new platform architecture includes explicit components for LLMOps, which Doctolib describes as providing “the infrastructure, workflows, and management capabilities necessary to operationalize large language models (LLMs) in production.” The key components include:
The architecture explicitly calls out LLMOps tooling as a component of their ML Training Platform. This includes tools for model fine-tuning, deployment, monitoring, versioning, prompt optimization, and cost management. While the article does not name the tools they plan to use (e.g., LangChain, LlamaIndex, or proprietary solutions), the functional requirements are clearly articulated. The inclusion of prompt optimization as a first-class concern suggests they anticipate significant investment in prompt engineering practices.
As part of their ML Storage layer, Doctolib plans to implement a vector database “optimized for storing, indexing, and searching high-dimensional vector data, enabling efficient similarity searches for AI applications.” This is a critical component for any LLM-based system that relies on retrieval-augmented generation (RAG) or semantic search. The vector database will work alongside their traditional Lakehouse architecture, which combines data lake storage with data warehouse governance.
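To make the idea of similarity search concrete, here is a minimal sketch of what a vector database does at its core: store vectors and return the entries closest to a query. This is not Doctolib's implementation, and the article names no product. The class `TinyVectorIndex`, the document ids, and the three-dimensional vectors are all hypothetical; a real system would use learned embeddings with hundreds of dimensions and an approximate nearest-neighbor index rather than a linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorIndex:
    """Hypothetical in-memory index: brute-force scan, no persistence."""

    def __init__(self):
        self._items = []  # list of (doc_id, vector) pairs

    def add(self, doc_id, vector):
        self._items.append((doc_id, list(vector)))

    def search(self, query, k=2):
        # Score every stored vector against the query, then keep the k best.
        scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc_id for _, doc_id in scored[:k]]


# Illustrative toy vectors, not real embeddings.
index = TinyVectorIndex()
index.add("appointment-faq", [0.9, 0.1, 0.0])
index.add("billing-faq", [0.1, 0.9, 0.0])
index.add("privacy-policy", [0.0, 0.2, 0.9])

top = index.search([0.8, 0.2, 0.0], k=2)
print(top)
```

In a RAG setup, the ids returned by `search` would be used to fetch the matching text passages, which are then placed in the prompt sent to the LLM.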
The Inference Platform includes several components essential for production LLM deployment:
The Model Inference Engine’s multi-backend support is particularly relevant for LLM deployment, where GPU optimization is crucial for acceptable latency and cost management.
Within their Data Exploration and Reporting layer, Doctolib includes a “GenAI Assistant” described as a “conversational AI tool enabling natural language data exploration for non-technical users.” This is an internal application of LLM technology to democratize data access, following a common pattern in which organizations apply LLMs to their own internal workflows before exposing them to customers.
Several other components in the architecture indirectly support LLMOps but are essential for production-grade deployments:
The feature store serves as a “centralized repository for managing, storing, and serving features used in machine learning models.” For LLM applications, this could include pre-computed embeddings, user context features, or structured data used to augment prompts.
The model registry provides “centralized management of machine learning model lifecycles, ensuring governance, traceability, and streamlined deployment.” For LLMs, this becomes particularly important given the size and versioning complexity of these models, especially when fine-tuning is involved.
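The lifecycle and traceability role of a registry can be sketched in a few lines. The example below is an assumption-laden illustration, not Doctolib's design: the class `ModelRegistry`, the stage names (`staging`, `production`, `archived`), the model name, and the `s3://example-bucket/...` URIs are all hypothetical. It shows two properties that matter for fine-tuned LLMs: each version records the version it was derived from, and only one version is in production at a time.

```python
class ModelRegistry:
    """Hypothetical in-memory registry tracking versions, lineage, and stage."""

    def __init__(self):
        self._versions = {}  # model name -> list of version records

    def register(self, name, artifact_uri, parent=None):
        """Record a new version; `parent` is the version it was fine-tuned from, if any."""
        versions = self._versions.setdefault(name, [])
        record = {
            "version": len(versions) + 1,
            "artifact_uri": artifact_uri,
            "parent": parent,
            "stage": "staging",
        }
        versions.append(record)
        return record["version"]

    def promote(self, name, version):
        """Move one version to production and archive the previous production version."""
        versions = self._versions[name]
        if not 1 <= version <= len(versions):
            raise ValueError(f"unknown version {version} for model {name!r}")
        for record in versions:
            if record["stage"] == "production":
                record["stage"] = "archived"
        versions[version - 1]["stage"] = "production"

    def production(self, name):
        """Return the record currently in production, or None."""
        for record in self._versions.get(name, []):
            if record["stage"] == "production":
                return record
        return None


# Hypothetical model name and artifact locations.
registry = ModelRegistry()
base = registry.register("summarizer-llm", "s3://example-bucket/summarizer/base")
tuned = registry.register("summarizer-llm", "s3://example-bucket/summarizer/ft-1", parent=base)
registry.promote("summarizer-llm", tuned)
print(registry.production("summarizer-llm"))
```

Keeping the `parent` link on every record is what lets a deployed fine-tune be traced back to its base model, which is the governance property the quoted description emphasizes.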
The experiment tracking capabilities help “data scientists and ML engineers log, organize, and compare experiments,” recording metadata such as hyperparameters, model architectures, datasets, evaluation metrics, and results. For LLM work, this would extend to tracking prompt variations, fine-tuning runs, and evaluation benchmarks.
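As a rough illustration of how experiment tracking extends to prompt work, the sketch below logs each run's parameters (including the prompt template) alongside its metrics and then picks the best run by a chosen metric. The names `Run` and `ExperimentLog`, the templates, and the metric values are hypothetical placeholders chosen for the example; they are not results from the article or from any real evaluation.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One logged experiment: a name, its parameters, and its measured metrics."""
    name: str
    params: dict
    metrics: dict = field(default_factory=dict)


class ExperimentLog:
    """Hypothetical in-memory tracker for comparing runs."""

    def __init__(self):
        self.runs = []

    def log_run(self, name, params, metrics):
        # Copy the dicts so later mutation by the caller does not alter the log.
        run = Run(name, dict(params), dict(metrics))
        self.runs.append(run)
        return run

    def best(self, metric, higher_is_better=True):
        """Return the run with the best value for `metric`, or None if no run has it."""
        candidates = [r for r in self.runs if metric in r.metrics]
        if not candidates:
            return None
        pick = max if higher_is_better else min
        return pick(candidates, key=lambda r: r.metrics[metric])


# Placeholder values for illustration only.
log = ExperimentLog()
log.log_run(
    "prompt-v1",
    {"prompt_template": "Summarize: {text}", "temperature": 0.0},
    {"exact_match": 0.61, "avg_latency_s": 1.9},
)
log.log_run(
    "prompt-v2",
    {"prompt_template": "Summarize in two sentences: {text}", "temperature": 0.0},
    {"exact_match": 0.68, "avg_latency_s": 2.3},
)

best = log.best("exact_match")
print(best.name, best.params["prompt_template"])
```

Treating the prompt template as just another logged parameter means prompt variations can be compared with the same tooling as hyperparameter sweeps or fine-tuning runs.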
The data governance layer is particularly important for healthcare AI applications. Components include:
The emphasis on healthcare ontologies and standards (HL7, FHIR, OMOP, DICOM) suggests they plan to leverage structured medical knowledge in their AI applications, potentially for semantic search or knowledge-grounded responses.
Doctolib describes four teams within their Data and Machine Learning Platform organization, with the ML Platform team explicitly responsible for “implementing all platform components that allow data scientists and ML engineers to explore, train, deploy, and serve models that can be integrated into Doctolib’s products at a production-grade level.”
This clear ownership model is important for LLMOps maturity. The separation between the ML Platform team and other teams (Data Engineering Platform, Data Ingestion & Output, Data Tools) with well-defined interfaces helps prevent the common anti-pattern of unclear ownership that often plagues ML systems in production.
It’s important to note several caveats when evaluating this case study:
This is primarily an architectural vision rather than a proven implementation. The article explicitly states this is a planned rebuild, and subsequent posts will detail actual technical choices. The claims about LLMOps capabilities represent intentions rather than demonstrated results.
The article does not provide specific details about LLM use cases they plan to support. Beyond the GenAI Assistant for internal data exploration, there’s no discussion of customer-facing LLM applications, which might be intentional given the sensitivity of healthcare AI.
There’s no discussion of specific evaluation frameworks, testing strategies for LLM outputs, or approaches to handling hallucinations, all of which are critical concerns for healthcare applications where accuracy is paramount.
Cost management for LLM inference, while mentioned as part of LLMOps tooling, is not elaborated upon despite being a significant operational concern.
Doctolib’s architectural blueprint represents a thoughtful approach to building infrastructure capable of supporting LLMOps at scale in a healthcare context. The explicit inclusion of LLMOps tooling, vector databases, model serving infrastructure, and governance frameworks demonstrates awareness of the unique requirements of production LLM systems. However, as this represents planned rather than implemented architecture, the true test will come in subsequent publications that detail actual implementations and lessons learned. The emphasis on data governance and security is appropriate for healthcare AI, though the absence of discussion around LLM-specific challenges like evaluation, hallucination mitigation, and content safety leaves some important questions unanswered.