## Overview
This case study presents insights from Anita Kirkovska, Head of Growth at Vellum, discussing their platform for building reliable AI solutions in production environments. Vellum represents a comprehensive LLMOps platform that addresses the critical gap between rapid AI prototyping and production-ready deployment, particularly for enterprise customers operating in highly regulated industries.
The conversation reveals several key patterns in how enterprises are approaching LLM deployment in 2025, including the critical importance of cross-functional collaboration, the trade-offs between model selection and cost, and the evolving role of orchestration as models approach potential scaling limits. Notably, the discussion highlights that approximately 25% of developers surveyed by Vellum in the previous year had successfully deployed AI solutions to production, indicating significant challenges in moving from prototype to production.
## Platform Capabilities and Architecture
Vellum provides four core capabilities that form the foundation of their LLMOps platform:
**Workflow Orchestration**: The platform enables users to build AI workflows that combine deterministic code with AI model calls. This orchestration layer has emerged as increasingly critical as companies realize that reliability in production depends not just on model quality but on how models are integrated into broader systems. The workflow builder allows non-technical product managers to test hypotheses and design user experiences before involving engineering teams, representing a shift in how AI development processes are structured.
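The pattern described above, deterministic code wrapped around AI model calls, can be sketched as follows. This is a minimal illustration, not Vellum's API; the `call_model` function is a hypothetical stand-in for a real LLM client.

```python
def call_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call."""
    return f"[model response to: {prompt}]"

def validate_input(text: str) -> str:
    """Deterministic pre-step: normalize and reject empty input."""
    cleaned = text.strip()
    if not cleaned:
        raise ValueError("empty input")
    return cleaned

def run_workflow(user_text: str) -> str:
    """Deterministic code surrounds the non-deterministic model call."""
    cleaned = validate_input(user_text)   # deterministic pre-processing
    draft = call_model(cleaned)           # AI model call
    return draft.strip()                  # deterministic post-processing

print(run_workflow("  Summarize my account activity  "))
```

Keeping the deterministic steps outside the model call is what lets a product manager reason about (and test) the workflow's structure independently of model behavior.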
**Prompt Testing**: The platform provides mechanisms for testing prompts across different scenarios and inputs. This addresses the fundamental challenge that LLMs are non-deterministic systems with high variance in outputs given the same inputs, making traditional software testing approaches insufficient.
**Evaluation Frameworks**: Vellum includes comprehensive evaluation tools that allow teams to build confidence in their AI systems before production deployment. Significantly, the platform enables domain experts and legal teams to participate in the evaluation process. For example, Redfin's legal team used Vellum's evaluation platform to write test cases ensuring compliance with real estate regulations around protected demographic characteristics.
**Observability and Monitoring**: Once systems are deployed, Vellum provides observability tools to monitor performance in production. The platform supports proactive monitoring approaches, as demonstrated by their customer Strada (compliance software) which runs automated tests to detect issues before customers encounter them.
## Customer Profile and Use Cases
Vellum is notably "moving up market" toward larger enterprise customers, contrasting with typical startup adoption patterns for new technologies. This reflects the fact that enterprise customers place a premium on reliability and have both the resources and the regulatory requirements that make comprehensive LLMOps tooling essential rather than optional.

**Redfin Case Study (Real Estate)**: Redfin built a customer-facing chatbot using Vellum over approximately six months before launching in beta. The extended development timeline reflects the high-stakes nature of the application - real estate is a highly regulated industry where the AI system must avoid making suggestions based on protected demographic characteristics. The chatbot integrates with multiple tools to help users search for properties, connect with agents, and navigate the buying process. Even after six months of development, Redfin launched in beta knowing they would encounter real-world examples they hadn't anticipated, planning to iteratively improve based on production usage. They are now exploring integration of images, videos, and virtual tours within the chatbot experience.
**Healthcare Applications**: Multiple healthcare organizations use Vellum for various applications, reflecting the industry's high-risk profile that demands reliability. Use cases include voice agents for appointment booking and patient inquiries, as well as clinical documentation systems. One particularly detailed example involves a medical documentation application that records patient-doctor conversations, transcribes them, extracts relevant information through orchestration, generates standardized SOAP notes (Subjective, Objective, Assessment, Plan), and automatically populates electronic health record systems. This workflow dramatically reduces administrative burden on physicians while ensuring accurate medical documentation. Healthcare customers frequently use only open-source models that they fine-tune and host themselves, reflecting strict data retention and security requirements.
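The documentation workflow above can be sketched as a staged pipeline. Every function here is a stub labeled for illustration; a real system would call a speech-to-text service, an LLM, and an EHR API at the marked points.

```python
def transcribe(audio_id: str) -> str:
    """Stub: a real system calls a speech-to-text service here."""
    return f"transcript of {audio_id}"

def extract_findings(transcript: str) -> dict:
    """Stub: a real system uses an LLM to extract clinical details."""
    return {"symptoms": "reported symptoms", "exam": "exam findings"}

def generate_soap(findings: dict) -> dict:
    """Stub: a real system drafts the note with an LLM."""
    return {
        "Subjective": findings["symptoms"],
        "Objective": findings["exam"],
        "Assessment": "draft assessment",
        "Plan": "draft plan",
    }

def push_to_ehr(note: dict) -> bool:
    """Stub EHR write; refuses notes with missing sections."""
    return all(note.get(k) for k in ("Subjective", "Objective", "Assessment", "Plan"))

def document_visit(audio_id: str) -> bool:
    note = generate_soap(extract_findings(transcribe(audio_id)))
    return push_to_ehr(note)
```

Structuring the pipeline as discrete stages is what makes each step individually testable, which matters when the output feeds a medical record.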
**Headspace (Mental Health)**: Mentioned as a high-risk customer where AI systems must perform correctly due to the sensitive nature of mental health applications.
## Technical Approach and Best Practices
The conversation reveals several important technical patterns and philosophies that inform Vellum's approach to production AI:
**The Spectrum of Agency**: Rather than defining "agents" as a binary category, Anita articulates agency as a spectrum based on how much control is released to the AI model. A basic AI workflow combines deterministic code with AI model calls. As you give the model more control - such as tool access through function calling, or control flow decisions, or broader system access through protocols like Model Context Protocol (MCP) - the workflow becomes "more agentic." This framing avoids the hype around autonomous agents while providing a practical framework for incremental capability expansion. Importantly, Anita advises companies not to strive for fully autonomous agents immediately but to proceed stepwise, adding capabilities gradually while building confidence and reliability.
**Orchestration as the Critical Layer**: With models potentially approaching scaling limits (the "local maxima" that Anita predicts), orchestration emerges as the key differentiator for production AI systems. The logic and structure surrounding model calls - including routing, guardrails, fallback mechanisms, and integration with traditional code - increasingly determines system reliability and performance. This represents a maturation of the field beyond pure model improvement toward systems engineering.
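One concrete piece of the orchestration layer named above is the fallback mechanism. The sketch below, with hypothetical model stubs, shows the basic shape: try the preferred model, and if the call fails or a guardrail rejects the output, fall through to the next.

```python
def guardrail_ok(text: str) -> bool:
    """Deterministic output check; real guardrails might scan for
    policy violations or malformed structure."""
    return bool(text) and "REFUSED" not in text

def call_with_fallback(prompt: str, models: list) -> str:
    """Try each model in order; return the first output that passes
    the guardrail, or raise if all fail."""
    last_error = None
    for model_fn in models:
        try:
            out = model_fn(prompt)
            if guardrail_ok(out):
                return out
        except Exception as exc:
            last_error = exc
    raise RuntimeError(f"all models failed: {last_error}")

def flaky_model(prompt: str) -> str:
    raise TimeoutError("provider timeout")   # stub for a failing provider

def backup_model(prompt: str) -> str:
    return f"answer: {prompt}"               # stub for a healthy provider

print(call_with_fallback("hello", [flaky_model, backup_model]))
```

This is also why well-architected systems tolerate model changes: the routing and guardrail logic, not any single model, carries the reliability guarantee.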
**Model Selection and Cost Trade-offs**: Vellum hosts all major open-source models through various inference providers (Fireworks, SambaNova, and Cerebras for fast inference) and integrates with closed-source model APIs. The platform enables sophisticated routing strategies where user requests are classified and routed to appropriately-sized models. For example, a chatbot might use a small, fast model for simple classification, a medium model for straightforward queries, and reserve large reasoning-capable models for complex questions that justify the additional latency and cost. This tiered approach manages the trade-off between performance, speed, and cost while maintaining reliability.
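A tiered router of the kind described above can be sketched in a few lines. The complexity heuristic and model names here are assumptions for illustration; production systems typically use a small classifier model rather than keyword rules.

```python
def classify_complexity(query: str) -> str:
    """Toy heuristic standing in for a small classifier model."""
    if len(query.split()) < 8 and "?" not in query:
        return "simple"
    if any(w in query.lower() for w in ("why", "explain", "compare")):
        return "complex"
    return "medium"

# Hypothetical model tiers; real deployments would map to actual
# provider/model identifiers.
ROUTES = {
    "simple": "small-fast-model",
    "medium": "mid-size-model",
    "complex": "large-reasoning-model",
}

def route(query: str) -> str:
    """Return the model tier a query should be sent to."""
    return ROUTES[classify_complexity(query)]

print(route("reset password"))
print(route("Explain why my mortgage rate changed compared to last year"))
```

The point of the sketch is the shape, not the heuristic: a cheap classification step in front of the model calls is what lets the expensive reasoning model be reserved for queries that justify its latency and cost.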
**Testing and Evaluation Philosophy**: The platform recognizes that AI testing differs fundamentally from traditional software testing. Rather than checking if outputs match expected values, AI evaluation focuses on understanding model behavior across a potentially vast space of possible interactions. The conversation emphasizes early and continuous testing, with product managers and domain experts writing test cases before engineering work begins. For high-risk applications, legal and compliance teams participate in defining acceptable behavior through evaluation criteria.
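An evaluation harness along these lines can be sketched as test cases paired with acceptance criteria. This is a simplified illustration, not Vellum's evaluation API; the chatbot stub and criteria are hypothetical, and real evaluators are usually richer than boolean predicates.

```python
def chatbot(query: str) -> str:
    """Stub for the model under test."""
    return "I can help you search listings in your price range."

# Domain experts (e.g. a legal reviewer) contribute (input, criterion)
# pairs; here a criterion is a predicate over the model's output.
test_cases = [
    ("Find homes near good schools",
     lambda out: "demographic" not in out.lower()),
    ("What neighborhoods should I avoid?",
     lambda out: "avoid" not in out.lower() or "price" in out.lower()),
]

def run_evals(model, cases):
    """Run every case; return (passed, total)."""
    results = [(query, criterion(model(query))) for query, criterion in cases]
    passed = sum(ok for _, ok in results)
    return passed, len(results)

print(run_evals(chatbot, test_cases))
```

Expressing acceptable behavior as data rather than code is what lets legal and compliance teams contribute cases without touching the engineering pipeline.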
**Cross-Functional Collaboration**: A recurring theme is enabling non-engineering teams to participate meaningfully in AI development. Product managers have become significantly more technical and use Vellum's workflow builder to prototype solutions and test hypotheses before engineering resources are committed. This shifts AI development from a purely technical discipline to a collaborative process involving product, legal, compliance, and domain expertise alongside engineering.
## Deployment Models and Security
Vellum supports multiple deployment models reflecting varied enterprise security and data retention requirements:
- **SaaS**: Standard cloud deployment for customers comfortable with third-party data handling
- **VPC**: Virtual private cloud deployment for customers requiring data isolation
- **On-Premise**: On-premises deployment for customers with strictest security requirements, particularly common in healthcare
The insurance and healthcare sectors show particular preference for open-source models they can fine-tune and host themselves, maintaining complete control over data and model behavior. The platform supports one-click upload of fine-tuned models that become available within minutes.
## Production Readiness Patterns
Several patterns emerge around what constitutes production readiness for AI systems:
**Proactive Monitoring**: Rather than waiting for users to report issues, production-ready systems include automated testing and monitoring that detect problems before customer impact. Strada's compliance software exemplifies this with continuous automated testing of their AI components.
**Gradual Rollout**: Even after extensive development and testing, companies like Redfin launch in beta, explicitly planning to encounter unexpected real-world scenarios and iterate based on production feedback. This acknowledges the inherent unpredictability of AI systems while managing risk through controlled exposure.
**Risk-Based Development Strategy**: Anthropic's framework of high-risk vs. low-risk applications influences development approaches. High-risk applications (healthcare, financial services, real estate with fair housing laws) require extensive testing, longer development cycles, and more conservative deployment. Low-risk applications (code generation tools like Cursor where errors are easily caught and corrected) can adopt more experimental approaches and accept higher error rates.
**Design for Perceived Performance**: For applications using larger, slower models, careful UX design becomes critical. Techniques like output streaming (displaying text as it's generated rather than waiting for complete responses), progress indicators, and thoughtful information architecture help manage user expectations around AI system response times. This underscores the importance of design expertise alongside technical implementation in production AI systems.
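The streaming technique mentioned above reduces to consuming tokens as they arrive instead of blocking on the full response. In this sketch the token generator is a stub; a real client would iterate over a provider's streaming response (for example, a server-sent-events stream).

```python
def stream_tokens(prompt: str):
    """Stub generator; a real client yields tokens from a streaming API."""
    for token in ("Here", " are", " the", " results", "."):
        yield token

def render_streaming(prompt: str) -> str:
    """Accumulate tokens one at a time, as a UI would render them."""
    shown = ""
    for token in stream_tokens(prompt):
        shown += token
        # In a real UI, update the display here on each iteration.
    return shown

print(render_streaming("search homes"))
```

Nothing about the model gets faster; the user simply starts reading after the first token instead of after the last one, which is why perceived latency drops so sharply.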
## Model Provider Dynamics
The conversation touches on concerns about model provider changes breaking production systems. However, Anita indicates this hasn't been a significant issue in practice because well-architected systems rely on orchestration and guardrails that buffer against model variations. The focus on system architecture rather than raw model capabilities provides resilience against model changes or inconsistencies.
Additionally, the platform maintains an LLM leaderboard (lmleaderboard.com) that compares model performance across tasks, helping customers make informed decisions about model selection. The rapid pace of model improvement is evident here: Google's Gemma 3, at 27 billion parameters, now matches capabilities that previously required models of 100+ billion parameters, with corresponding cost implications.
## Challenges and Limitations
The conversation provides balanced perspective on current AI limitations:
**Production Deployment Gap**: Only 25% of developers in Vellum's survey had successfully deployed AI to production, indicating significant challenges in moving from prototype to production-ready systems.
**Reliability Challenges**: AI models remain fundamentally non-deterministic, making traditional software quality assurance approaches insufficient. Even with comprehensive testing and orchestration, systems cannot achieve perfect reliability, requiring continuous monitoring and improvement.
**Model Limitations**: Current models still struggle with autonomous decision-making in high-stakes scenarios. The question of whether an AI system can autonomously prescribe medication without human oversight remains aspirational rather than practical.
**MCP and Advanced Protocols Not Production-Ready**: While Model Context Protocol and similar approaches generate excitement, they're not yet deployed in production customer-facing applications due to control, security, and authentication challenges. Internal tools show more experimentation with these approaches where risks are lower.
**Scaffolding Still Required**: Despite model improvements, production systems still require extensive scaffolding - the orchestration, guardrails, routing logic, and error handling that surround model calls. The hope that models would simply "work" without extensive surrounding infrastructure hasn't materialized.
## Market and Industry Trends
Several broader trends emerge from the conversation:
**Enterprise Early Adoption**: Contrary to typical technology adoption patterns where startups lead, AI is seeing strong enterprise adoption driven by clear value propositions in reliability, compliance, and operational efficiency. Companies like Mercedes and major IBM customers are actively deploying AI, though with longer deployment cycles and more conservative approaches than startups.
**Product Manager Role Evolution**: PMs are becoming significantly more technical, prototyping AI solutions and defining system behavior before engineering implementation. This represents a shift in product development workflows enabled by no-code/low-code AI development tools.
**Voice Agent Growth**: Healthcare shows particular growth in voice agents for patient interactions, appointment scheduling, and information requests, representing an alternative interface to text-based chatbots.
**Multimodality Emergence**: While text remains dominant, production systems are beginning to incorporate images, video, and voice. However, video-based AI solutions remain relatively rare in production as of 2025.
**Workflow Over Autonomy**: The most successful production deployments focus on AI-enhanced workflows with appropriate human oversight rather than fully autonomous agents. The industry is converging on hybrid approaches that combine AI capabilities with deterministic logic and human judgment at appropriate decision points.
## Critical Perspective
While the conversation provides valuable insights into production AI deployment, several areas deserve critical consideration:
**Selection Bias**: As Vellum's Head of Growth, Anita's perspective naturally emphasizes successful deployments and satisfied customers. The 25% production deployment rate from their survey suggests 75% of development efforts don't reach production, representing challenges that receive less attention in this conversation.
**Platform Lock-in Considerations**: While Vellum provides clear value, dependence on proprietary orchestration platforms creates potential lock-in effects. The conversation doesn't address portability or migration concerns if customers need to change platforms.
**Cost and Complexity Trade-offs**: The comprehensive tooling Vellum provides adds complexity and cost. The conversation doesn't deeply explore whether all organizations need this level of infrastructure or whether simpler approaches might suffice for certain use cases.
**Hype Criticism with Caution**: While Anita criticizes "profiting off hype" around agents, Vellum itself markets capabilities around AI workflows and agent development. The distinction between legitimate capability development and hype can be subjective.
## Forward-Looking Perspectives
Anita offers several predictions and aspirations for the near-term future:
**Approaching Local Maxima**: She predicts current model scaling approaches may hit limits soon, shifting innovation focus toward orchestration, novel architectures, and systems engineering rather than simply scaling existing approaches. This echoes broader industry discussions about the sustainability of current scaling paradigms.
**Orchestration Maturation**: As model scaling potentially slows, the orchestration layer will become increasingly critical for differentiation and capability expansion. This suggests continued innovation in workflow design, tool use, and system integration patterns.
**High-Risk Autonomous Applications**: She aspires to see an AI agent successfully handle high-risk autonomous tasks that were previously impossible, such as medical prescription without human oversight. However, she acknowledges this remains aspirational rather than imminent.
**Reduced Hype, Increased Pragmatism**: She hopes to see reduced hype-driven development in favor of more pragmatic, reliability-focused approaches. However, this seems optimistic given ongoing market dynamics and investment pressures.
The conversation ultimately presents a pragmatic, systems-oriented view of production AI deployment that emphasizes reliability, cross-functional collaboration, and careful orchestration over autonomous agent capabilities or model performance alone. This perspective aligns with enterprise needs but may underweight the continued importance of foundational model improvements and the potential for breakthrough capabilities to reshape deployment patterns.