## Overview
This case study documents presentations from six startups at Anthropic's Code with Claude conference, each demonstrating production deployments of Claude-powered applications across dramatically different use cases. The companies span from development tools (Tempo Labs, Zencoder, Bito, Create) to creative applications (Gamma, Diffusion), collectively showcasing the breadth of LLM deployment patterns and operational considerations when building on frontier models. A recurring theme across all presentations is the transformative impact of specific model releases—particularly Claude 3.5 Sonnet and Claude 3.7 Sonnet—and how new model capabilities like tool use, web search, and extended context windows unlocked entirely new product categories and user experiences.
## Tempo Labs: Visual IDE for Non-Engineers
Tempo Labs positioned their product as "Cursor for PMs and designers," building a visual integrated development environment that feels more like Figma than traditional code editors. Their core value proposition addresses a fundamental collaboration gap in software development: enabling non-engineers to directly contribute to codebases without requiring deep programming expertise.
The technical architecture runs on cloud-based Docker containers rather than local execution, enabling collaborative editing similar to Figma's multiplayer experience. Users can share links to running applications and collaboratively code together, with all changes persisting in a shared environment. The interface presents three primary tabs—Product (PRD), Design, and Code—allowing users to work across different abstraction layers while Claude handles the underlying code generation.
From an LLMOps perspective, Tempo's most interesting aspect is the tight integration between visual manipulation and code generation. Users can drag-and-drop components, adjust spacing and layout properties through visual controls, and delete elements through a DOM tree view, with all actions translating to actual source code modifications in real-time. This bidirectional synchronization between visual interface and code representation requires careful prompt engineering to ensure Claude generates idiomatic, maintainable code that maps cleanly to visual operations.
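Tempo hasn't published how this mapping works, but one can imagine each visual gesture being serialized into a constrained edit instruction for the model. The Python sketch below is purely illustrative—the `VisualAction` structure and `to_edit_prompt` helper are hypothetical, not Tempo's actual API:

```python
from dataclasses import dataclass

# Hypothetical sketch: translating a visual canvas action into a code-edit prompt.
# Names and structure are illustrative, not Tempo's implementation.

@dataclass
class VisualAction:
    kind: str       # e.g. "set_spacing", "delete_element", "move_component"
    selector: str   # DOM path or component identifier from the visual tree
    payload: dict   # action-specific data, e.g. {"padding": "16px"}

def to_edit_prompt(action: VisualAction, source_file: str, source_code: str) -> str:
    """Turn a UI gesture into a constrained, reviewable edit request for the model."""
    return (
        f"You are editing {source_file}. Apply exactly one change:\n"
        f"- Operation: {action.kind}\n"
        f"- Target element: {action.selector}\n"
        f"- Parameters: {action.payload}\n"
        "Return the full updated file. Preserve formatting, imports, and all "
        "unrelated code so the visual canvas and the source stay in sync.\n\n"
        f"```tsx\n{source_code}\n```"
    )

# Example: a designer drags a padding handle on the hero section.
prompt = to_edit_prompt(
    VisualAction("set_spacing", "section#hero > div.cta", {"padding": "16px"}),
    "src/components/Hero.tsx",
    "<existing file contents>",
)
```

Constraining the model to a single, well-scoped change per gesture is one plausible way to keep the generated diffs small enough that the visual view and the source never drift apart.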
The company reports significant production impact: approximately 10-15% of front-end pull requests are now being opened directly by designers without engineering involvement, and roughly 60% of pull requests contain substantial front-end code generated by designers, PMs, and Claude that proves useful for accelerating production engineering work. These metrics suggest their LLMOps implementation successfully balances code quality with accessibility, though the presentation doesn't detail their evaluation framework or quality assurance processes.
One notable operational consideration is Tempo's approach to version control integration. The demo showed committing changes directly to GitHub, suggesting they've built infrastructure to manage Git operations through their collaborative cloud environment. This likely requires careful handling of authentication, branching strategies, and merge conflict resolution when multiple users collaborate simultaneously.
## Zencoder: Full Software Development Lifecycle Automation
Andrew Filev from Zencoder brought a broader perspective on AI-assisted development, positioning their solution not just as a coding assistant but as a comprehensive platform spanning the entire software development lifecycle (SDLC). His background building and selling software businesses for over $2 billion, with teams exceeding 1,000 people, informed a key insight: only 2-5% of ideas come to life in large organizations because most time is consumed by routine work. Zencoder's mission centers on automating 90% of that routine to enable 10x faster development.
The presentation outlined three generational shifts in AI coding assistance. The first generation involved simple code completion—convenient but not transformative. The second generation emerged with Claude 3.5 Sonnet in October 2024, enabling true coding agents within IDEs and causing usage to skyrocket 10-100x. The critical technical capabilities enabling this shift included robust tool and environment support, transition from coding-focused models to software engineering-focused models, and larger context windows to handle substantial codebases.
Zencoder is now positioning themselves for a third generation centered on verification and computer use. The emphasis on verification as "key to scaling AI" and "delivering more fully autonomous cycles" reflects a mature understanding of production LLM challenges. Without verification mechanisms, fully autonomous agents can drift or produce incorrect outputs at scale. The mention of computer use capabilities—allowing AI to interact with running applications—suggests they're building feedback loops where agents can test their own work.
A major announcement during the presentation was Zen Agents, extending beyond coding agents to custom agents deployable across the entire SDLC. These agents support the Model Context Protocol (MCP) with specialized coding tools, enabling organizations to deploy agents from PRD development through coding, verification, and code review. From an LLMOps perspective, this represents a significant operational challenge: maintaining consistent agent behavior across different SDLC phases, managing context and state across handoffs, and ensuring agents can effectively communicate and coordinate.
Zencoder also announced their own MCP registry with approximately 100 MCP servers available while waiting for Anthropic's official registry. They're building a community aspect with an MIT-licensed GitHub repository for sharing agents, suggesting they understand that LLM applications benefit from ecosystems and reusable components rather than purely proprietary implementations. This community-driven approach could help with the prompt engineering and agent configuration challenges that typically require extensive iteration.
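As a rough illustration of what a shareable agent tool might look like, here is a minimal MCP server written with the MCP Python SDK's `FastMCP` helper; the `search_codebase` tool and its grep-based implementation are assumptions for the sketch, not Zencoder's actual servers:

```python
# Minimal MCP server sketch using the official MCP Python SDK (pip install mcp).
# The tool below is illustrative; real registries host far richer integrations.
import subprocess
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("code-search")

@mcp.tool()
def search_codebase(pattern: str, path: str = ".") -> str:
    """Search the repository for a regex pattern and return matching lines."""
    result = subprocess.run(
        ["grep", "-rn", "--include=*.py", pattern, path],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout[:10_000] or "No matches found."

if __name__ == "__main__":
    # Serves over stdio so any MCP-capable agent can discover and call the tool.
    mcp.run()
```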
The operational infrastructure implied by their offering is substantial: they need to orchestrate multiple agents, manage tool access and permissions, handle authentication across various development platforms, maintain context across long-running workflows, and provide monitoring and observability for agent actions. While the presentation didn't detail these operational concerns, they're critical for production deployment at scale.
## Gamma: AI-Powered Presentation Generation
Jordan from Gamma presented a focused case study on how specific model improvements directly impacted their key metrics. Gamma builds AI-powered tools for creating presentations, documents, websites, and social media content from natural language prompts. Their LLMOps story centers on two moments where model upgrades significantly moved the needle on user satisfaction for deck generation: the release of Claude 3.5 Sonnet and Claude 3.7 Sonnet.
The most striking metric Jordan shared was an 8% increase in user satisfaction with the 3.7 Sonnet release—an improvement they had spent hundreds of hours attempting to achieve through prompt engineering without success. This observation highlights a critical LLMOps insight: model quality often dominates optimization efforts. Teams can invest enormous resources in prompt engineering, retrieval augmentation, or architectural improvements, but fundamental model capabilities frequently provide larger gains.
The specific feature driving Gamma's improvement was built-in web search in Claude 3.7 Sonnet. The live demonstration powerfully illustrated the difference: generating a presentation about "Code with Claude Conference 2025" without web search produced completely fabricated information (wrong dates, wrong speakers, wrong duration), while the web-search-enabled version correctly identified dates, locations, schedule details, and real technical sessions.
From an LLMOps perspective, Gamma's workflow involves multiple model calls with different responsibilities. First, Claude searches the web and creates an outline based on the user's prompt and search results. Then Claude takes that outline and generates a full presentation with appropriate details, layout, and design. The presentation mentioned using custom themes (they demonstrated an Anthropic theme), suggesting they've built a template system that Claude can work within.
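A minimal sketch of such a two-phase flow, assuming the Anthropic Python SDK and the server-side web search tool; the model aliases, prompts, and theme handling here are illustrative rather than Gamma's production code:

```python
# Sketch of an outline-then-generate pipeline grounded by server-side web search.
import anthropic

client = anthropic.Anthropic()

def build_outline(topic: str) -> str:
    """Phase 1: let Claude search the web and draft a grounded outline."""
    response = client.messages.create(
        model="claude-3-7-sonnet-latest",
        max_tokens=2000,
        tools=[{"type": "web_search_20250305", "name": "web_search", "max_uses": 5}],
        messages=[{
            "role": "user",
            "content": f"Research '{topic}' and produce a slide-by-slide outline "
                       "with one line of sourced facts per slide.",
        }],
    )
    # Web search responses interleave tool blocks with text; keep only the text.
    return "".join(block.text for block in response.content if block.type == "text")

def build_deck(outline: str, theme: str = "anthropic") -> str:
    """Phase 2: expand the outline into full slide content within a theme."""
    response = client.messages.create(
        model="claude-3-7-sonnet-latest",
        max_tokens=4000,
        system=f"Generate slides for the '{theme}' theme. Keep titles short; use bullets.",
        messages=[{"role": "user", "content": f"Expand this outline into slides:\n\n{outline}"}],
    )
    return response.content[0].text

deck = build_deck(build_outline("Code with Claude Conference 2025"))
```

Splitting research from generation keeps the fact-finding step cheap and auditable before the more expensive layout-and-design pass runs.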
The operational challenge Gamma faces is managing user expectations around accuracy. Their demo acknowledged that generated content "won't be perfect" but should provide a good starting point with correct information. This reflects a pragmatic approach to LLM deployment: positioning the AI as an assistant that accelerates creation rather than a fully autonomous system that requires no human review.
Gamma's decision to rely on Claude's native web search rather than integrating a third-party service simplifies their architecture and reduces operational overhead. Third-party integrations introduce additional failure modes, latency, API rate limits, and costs. By leveraging built-in model capabilities, they can focus on their core product experience rather than infrastructure plumbing.
One aspect not discussed but critical for their LLMOps is prompt engineering for visual design. Generating presentations isn't just about content accuracy—layout, typography, color schemes, image placement, and overall aesthetic quality all matter for user satisfaction. Their mention of spending "hundreds of hours" on prompt engineering suggests substantial investment in getting these elements right, even before the model upgrade provided additional gains.
## Bito: AI Code Review at Scale
Amar Goel from Bito presented a compelling case for AI-powered code review as the necessary counterpart to AI-powered code generation. His thesis: as developers use tools like Cursor, Windsurf, and Claude to write 10x more code over the next few years, the code review process becomes the bottleneck. "Vibe coding does not equal vibe engineering"—generated code needs to be scalable, reliable, performant, and architecturally consistent, requirements that code review addresses but that won't scale for 10x code volume.
Bito's platform integrates with GitHub, GitLab, and Bitbucket, supporting over 50 languages. Their focus on Claude Sonnet reflects a strategic choice: prioritizing model quality for human-like code reviews that focus on critical issues rather than generating noise. The presentation emphasized "more signal and less noise" and "actionable important suggestions" as core differentiators.
The live demonstration showcased several LLMOps capabilities that distinguish sophisticated code review from simple static analysis:
The system automatically generates PR summaries without requiring documentation or manual comments, analyzing diffs and code to understand changes. This summary capability requires the model to understand code semantics, identify the purpose of changes, and communicate them clearly to human reviewers.
Bito provides an overview of actionable suggestions, categorizing issues by severity and type. The demo showed three suggestions: missing resource cleanup, non-thread-safe cache implementation, and a class cast exception. The categorization and prioritization of issues demonstrates evaluation logic that determines which findings matter most.
The change list feature provides a hierarchical view of modifications, helping reviewers understand the structure of changes without reading every diff. This requires the model to identify logical groupings and dependencies between changes.
Most impressively, Bito demonstrates deep codebase understanding through cross-file analysis. The class cast exception example showed the model tracing through multiple files: identifying a NetworkDataFetcher class being cast to a LinkedList, following the code path to a DataProcessor constructor that casts to an ArrayList, and recognizing the incompatibility. The presentation noted "this is probably an error that most humans wouldn't even find," highlighting how comprehensive codebase understanding enables catching subtle bugs.
From an LLMOps perspective, this cross-file analysis capability requires substantial infrastructure. Bito mentioned using "abstract syntax trees" and a "symbol index" to crawl and understand codebases. This suggests they've built or integrated parsing infrastructure that extracts structured representations of code, enabling the model to reason about relationships and dependencies beyond what's visible in a single file or diff.
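As a simplified illustration of the idea (not Bito's implementation, which isn't public), a symbol index can be built from parsed source files and consulted at review time to pull cross-file definitions into the model's context. The sketch below uses Python's standard `ast` module:

```python
# Simplified symbol index: map each class/function name to where it is defined,
# so definitions referenced in a diff can be retrieved and added to the review context.
import ast
from collections import defaultdict
from pathlib import Path

def build_symbol_index(repo_root: str) -> dict[str, list[dict]]:
    index: dict[str, list[dict]] = defaultdict(list)
    for path in Path(repo_root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except SyntaxError:
            continue  # skip unparsable files; a real system would log these
        for node in ast.walk(tree):
            if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
                index[node.name].append({
                    "file": str(path),
                    "line": node.lineno,
                    "kind": type(node).__name__,
                })
    return index

# At review time, symbols mentioned in the diff are looked up and their definitions
# injected into the prompt, letting the model check a cast in one file against a
# constructor defined in another.
index = build_symbol_index(".")
print(index.get("DataProcessor", "not defined in this repo"))
```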
The model's reasoning capabilities are crucial for this use case. The demo showed the system explaining why an issue matters, what the consequences are, and how to fix it. This requires not just pattern matching but understanding programming language semantics, runtime behavior, concurrency implications, and architectural patterns.
Bito also offers IDE integration, allowing developers to request reviews of local changes or staged commits before pushing. This "shift left" approach catches issues earlier in the development cycle when they're cheaper to fix. The operational challenge is maintaining consistency between IDE and CI/CD code review—the same agent should produce similar findings regardless of where it runs.
The impact metrics Bito shared are striking: PRs close in one-tenth the time (50 hours to 5 hours), and Bito provides approximately 80% of the feedback a PR receives, with the AI providing feedback in 3-4 minutes versus 1-2 days for human review. These metrics are based on "hundreds of customers" and "hundreds of engineers," suggesting substantial production deployment.
However, these metrics deserve careful interpretation. The dramatic reduction in PR closure time could result from faster feedback loops rather than fewer total reviewer hours. The 80% figure for AI-generated feedback doesn't specify whether this feedback is accepted, acted upon, or found valuable—just that it's provided. High-volume low-value suggestions could inflate this metric while actually harming productivity. That said, the speed advantage is undeniable and likely drives much of the value.
From an operational standpoint, Bito must handle several LLMOps challenges: managing costs for analyzing every PR across hundreds of customers (prompt caching likely helps significantly), ensuring model availability and response times meet SLAs since slow reviews defeat the purpose, handling false positives and maintaining trust so developers don't ignore suggestions, and keeping up with language and framework evolution since code patterns and best practices change over time.
## Diffusion: Generative Music with AI Lyrics
Hike from Diffusion presented a case study outside the traditional software development domain, showcasing Claude's application in creative content generation. Diffusion trains frontier music generation models from scratch—specifically diffusion transformers for producing high-quality, diverse, controllable music. The company claims to have "the most creative music model in the world," a bold assertion that's difficult to verify but speaks to their ambition.
An interesting technical detail: Diffusion compresses 30 seconds of music into a small "square of pixels" in their latent space, representing the extreme compression achieved by their diffusion model. This compression enables efficient generation and manipulation while preserving musical quality.
While Diffusion's core technology is their proprietary music model, they use Claude for song lyric generation through a tool they call "Ghost Writer." The presentation acknowledged that "current LLMs are good at very many things, but writing good song lyrics, they're still pretty cringy"—but Claude is "the best for sure." This candid assessment reflects the reality that even frontier models struggle with certain creative tasks that require specific cultural knowledge, emotional resonance, and artistic sensibility.
Ghost Writer has been used "tens of millions of times" to write song lyrics, indicating substantial production deployment at scale. From an LLMOps perspective, this volume requires careful attention to cost management, latency, and consistency. The presentation mentioned focusing on "diversity, humor, taste, flowing with the music itself," suggesting they've developed evaluation criteria for lyric quality beyond simple grammatical correctness.
The live demo showed users entering high-level concepts like "experimental indie trip hop about the feeling of getting better after being really sick," with the system generating complete songs including lyrics that match the genre and theme. Diffusion's platform includes deep editing workflows for remixing, extending, replacing sections, swapping stems, and even capturing "vibes" (short audio snippets used as prompts instead of text).
The mention of an "iterative process of thinking about the concept of a song, ideating about actually the context of the genre" reveals important LLMOps considerations. Different musical genres have dramatically different lyrical conventions—drum and bass lyrics differ substantially from folk storytelling. This genre-specific knowledge needs to be encoded in prompts or through few-shot examples, requiring careful prompt engineering and potentially fine-tuning.
The challenge of "getting something that actually fits with the music" suggests they're doing multimodal reasoning, coordinating between the generated music (from their proprietary model) and the generated lyrics (from Claude). This coordination likely requires analyzing the music's tempo, mood, structure, and then crafting prompts that guide Claude toward appropriate lyrical content.
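A hypothetical sketch of how analyzed track attributes might be folded into a lyric prompt follows; the `TrackAnalysis` fields and prompt wording are assumptions, not Diffusion's pipeline:

```python
# Hypothetical sketch: conditioning lyric generation on attributes of the generated music.
from dataclasses import dataclass

@dataclass
class TrackAnalysis:
    genre: str            # e.g. "experimental indie trip hop"
    bpm: int              # tempo constrains syllable density per bar
    mood: str             # e.g. "hopeful, convalescent"
    structure: list[str]  # e.g. ["intro", "verse", "chorus", "verse", "chorus", "outro"]

def lyric_prompt(concept: str, track: TrackAnalysis) -> str:
    sections = ", ".join(track.structure)
    return (
        f"Write song lyrics about: {concept}\n"
        f"Genre: {track.genre} (follow its lyrical conventions, not generic pop phrasing)\n"
        f"Tempo: {track.bpm} BPM -- keep lines short enough to sit on the beat\n"
        f"Mood: {track.mood}\n"
        f"Sections, in order: {sections}\n"
        "Avoid cliches; favor concrete imagery and a conversational register."
    )

prompt = lyric_prompt(
    "the feeling of getting better after being really sick",
    TrackAnalysis("experimental indie trip hop", 86, "hopeful, convalescent",
                  ["intro", "verse", "chorus", "verse", "chorus", "outro"]),
)
```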
One particularly interesting capability demonstrated was adding features like "a spoken word intro in French" to existing lyrics, showing the system can handle multilingual content and specific artistic directions. This flexibility requires robust prompt engineering and potentially multiple rounds of generation and refinement.
From an operational perspective, supporting "tens of millions" of lyric generations requires infrastructure for request queuing, rate limiting, caching of common patterns, and fallback strategies when the API is unavailable. The integration between Claude and their music generation pipeline needs to be seamless to provide good user experience.
While the presentation focused on the creative and product aspects, the operational maturity implied by their scale is significant. They've clearly invested in making Claude a reliable component of their production system, handling failures gracefully and maintaining consistent quality across millions of generations.
## Create: No-Code Mobile and Web App Builder
Drew from Create presented an AI-powered no-code platform for building complete software products from natural language prompts. Create started with web apps and recently launched a mobile app builder in beta, positioning themselves as democratizing software development for non-technical users. Claude powers much of their code generation, particularly for the agentic workflows that take prompts end-to-end to working applications.
The live demo showed creating an iOS app for a "family memory app" from a single sentence prompt. Create's agent begins by generating an outline of core pages and backend functionality, then builds out the frontend interface, defines database schemas, deploys a full database, and connects all the functions. This end-to-end automation represents significant LLMOps complexity: orchestrating multiple generation steps, maintaining consistency across frontend and backend, ensuring generated code follows platform conventions, and handling errors at any stage.
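A high-level sketch of what such staged orchestration could look like, with shared state carried between phases and failures recorded for resumption; the stage names and stub implementations are illustrative, not Create's actual code:

```python
# Sketch of staged app generation: each stage reads the accumulated state so the
# frontend, schema, and wiring stay consistent with the original outline.
from typing import Callable

def generate_outline(state: dict) -> dict:   return {"pages": ["Feed", "Upload", "Family"]}
def generate_frontend(state: dict) -> dict:  return {"files": ["Feed.tsx", "Upload.tsx"]}
def generate_schema(state: dict) -> dict:    return {"tables": ["memories", "family_members"]}
def provision_database(state: dict) -> dict: return {"db_url": "postgres://example"}
def connect_functions(state: dict) -> dict:  return {"bindings": 6}

def run_pipeline(prompt: str) -> dict:
    state: dict = {"prompt": prompt}
    stages: list[tuple[str, Callable[[dict], dict]]] = [
        ("outline", generate_outline),
        ("frontend", generate_frontend),
        ("schema", generate_schema),
        ("database", provision_database),
        ("wiring", connect_functions),
    ]
    for name, stage in stages:
        try:
            state[name] = stage(state)
        except Exception as exc:
            # Record where the run failed so completed artifacts can be reused on retry.
            state["failed_stage"], state["error"] = name, str(exc)
            break
    return state

result = run_pipeline("a family memory app for iOS")
```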
A notable technical detail is that Create "comes built in with backends and frontends and everything you need from the database to the actual core auth." This suggests they've built substantial scaffolding and template infrastructure that Claude populates and customizes based on user prompts. Rather than generating everything from scratch, they likely have architectural patterns and boilerplate that ensure generated apps follow best practices and work reliably.
The ability to "fully submit from create" directly to the App Store represents a significant operational achievement. Mobile app submission involves code signing, provisioning profiles, build configuration, asset management, and compliance with App Store guidelines. Automating this process while ensuring generated apps meet Apple's requirements demonstrates sophisticated understanding of the full deployment pipeline.
The demo referenced Draw Daily, an app built in one day using Create that generates AI images from drawings, now available in the App Store. This rapid development timeline showcases the potential of their platform but also raises questions about testing, quality assurance, and maintenance. Apps built in a day may work initially but face challenges with edge cases, performance, security, and updates.
Create reports "hundreds of thousands" of non-technical users building apps on their platform, indicating substantial market traction. The presentation showcased several examples from a recent demo day:
- A memory app for storing meaningful connections and details about people's lives, useful for sales calls and relationship management
- A scholarship app for automatically filling out grants and applications built by a Berkeley student
- A basketball coaching app replacing spreadsheets and paper drills with digital lesson plans and animated drill demonstrations
- A personal AI money coach for Gen Z with full RAG providing personalized financial recommendations based on monthly income
These diverse applications demonstrate the flexibility of Create's platform and the range of use cases their LLMOps infrastructure must support. Each application domain has different requirements: the scholarship app needs form-filling and document processing, the basketball app needs animation and media handling, the finance app needs RAG for knowledge retrieval and data analysis.
The mention of "full RAG" and "Claude-powered assistant" in the finance app example suggests Create provides higher-level AI primitives beyond basic code generation. They likely offer components for adding conversational interfaces, retrieval-augmented generation, and domain-specific agents to generated applications.
From an LLMOps perspective, Create faces the challenge of generating production-quality code that non-technical users can maintain and extend. Generated code needs to be clean, well-structured, documented, and follow platform conventions. When users inevitably want to customize beyond what the AI can generate, they need to be able to understand and modify the codebase.
The presentation mentioned using "prompt caching, tool calling, and a lot of the core primitives that Anthropic makes available" to achieve success. Prompt caching is particularly important for Create's use case—they likely cache common architectural patterns, component templates, and framework-specific knowledge, significantly reducing cost and latency when generating similar apps.
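As a concrete illustration of the pattern, the Anthropic API lets a large, stable prompt segment (such as scaffolding conventions) be marked with `cache_control` so repeated generations reuse it; the file name and model alias below are assumptions, not Create's setup:

```python
# Sketch of prompt caching: the large, rarely-changing scaffolding guide is cached,
# so only the short per-user prompt is billed at the full input rate on later calls.
import anthropic

client = anthropic.Anthropic()

# Hypothetical template document describing architectural conventions and components.
SCAFFOLD_GUIDE = open("scaffold_conventions.md").read()

response = client.messages.create(
    model="claude-3-7-sonnet-latest",
    max_tokens=4000,
    system=[
        {
            "type": "text",
            "text": SCAFFOLD_GUIDE,
            # Marks this block as cacheable across requests.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Build a family memory app for iOS."}],
)
```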
Tool calling enables Create's agents to interact with external services: deploying databases, configuring authentication, integrating third-party APIs, and managing version control. Orchestrating these tool calls reliably requires careful error handling, retry logic, and state management.
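A minimal sketch of a tool-use loop with retries around a flaky external call, using the Anthropic Python SDK; the `deploy_database` tool and its behavior are hypothetical stand-ins for Create's real integrations:

```python
# Sketch of agentic tool calling with simple exponential-backoff retries.
import time
import anthropic

client = anthropic.Anthropic()

def deploy_database(schema: dict) -> dict:
    """Placeholder for a real provisioning call (e.g. to a hosted database API)."""
    return {"status": "created", "tables": list(schema)}

TOOLS = [{
    "name": "deploy_database",
    "description": "Create a hosted database from a JSON schema of tables and columns.",
    "input_schema": {"type": "object", "properties": {"schema": {"type": "object"}},
                     "required": ["schema"]},
}]

def call_tool_with_retry(name: str, args: dict, attempts: int = 3) -> dict:
    for attempt in range(attempts):
        try:
            if name == "deploy_database":
                return deploy_database(**args)
            raise ValueError(f"unknown tool {name}")
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # back off before retrying the external call

messages = [{"role": "user", "content": "Create the backend for a family memory app."}]
response = client.messages.create(model="claude-3-7-sonnet-latest", max_tokens=2000,
                                  tools=TOOLS, messages=messages)

if response.stop_reason == "tool_use":
    messages.append({"role": "assistant", "content": response.content})
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            result = call_tool_with_retry(block.name, block.input)
            tool_results.append({"type": "tool_result", "tool_use_id": block.id,
                                 "content": str(result)})
    # Feed results back in a follow-up messages.create call so the agent continues.
    messages.append({"role": "user", "content": tool_results})
```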
## Cross-Cutting LLMOps Themes
Several operational patterns and challenges emerge across these case studies:
**Model version sensitivity**: Multiple companies (Gamma, Zencoder) explicitly cited specific Claude releases as inflection points. This sensitivity to model updates creates operational challenges—companies need to test new versions thoroughly, manage gradual rollouts, and potentially support multiple model versions simultaneously to handle regressions.
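A common mitigation is sticky, percentage-based routing between model versions so quality metrics can be compared before a full cutover; the sketch below is illustrative, with model aliases and thresholds chosen for the example:

```python
# Sketch of a gradual model rollout: a fixed fraction of users hit the candidate model,
# and the assignment is deterministic per user so sessions remain comparable.
import hashlib

ROLLOUT = {
    "candidate": "claude-3-7-sonnet-latest",
    "stable": "claude-3-5-sonnet-latest",
    "candidate_fraction": 0.10,
}

def pick_model(user_id: str) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    if bucket < ROLLOUT["candidate_fraction"] * 100:
        return ROLLOUT["candidate"]
    return ROLLOUT["stable"]

# Satisfaction metrics can then be compared between buckets before raising the fraction.
print(pick_model("user-1234"))
```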
**Native model capabilities vs. third-party integrations**: Gamma's preference for Claude's built-in web search over third-party services reflects a broader principle. Native model capabilities reduce architectural complexity, failure modes, and operational overhead. However, they also create vendor lock-in and dependency on model provider roadmaps.
**Tool use and orchestration**: Zencoder, Create, and Bito all leverage tool calling to interact with external systems (GitHub, databases, App Stores, build systems). Managing tool reliability, permissions, error handling, and state consistency across tool calls represents a significant operational challenge.
**Prompt caching**: Create explicitly mentioned using prompt caching, and it's likely critical for all these applications given their scale. Caching common patterns, architectural knowledge, and framework-specific information dramatically reduces cost and latency for repetitive tasks.
**Context management**: Bito's cross-file code analysis and Zencoder's full SDLC agents require managing substantial context—entire codebases, conversation history, prior agent decisions. Strategies for prioritizing relevant context, summarizing when approaching limits, and maintaining coherence across long interactions are crucial.
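One widely used strategy is to summarize older turns once the history approaches the window limit while keeping recent turns verbatim; the thresholds, rough token heuristic, and summarization prompt in this sketch are assumptions rather than any of these companies' actual approaches:

```python
# Sketch of context compaction: compress old conversation turns into a summary
# when the accumulated history gets close to the model's context window.
import anthropic

client = anthropic.Anthropic()
MAX_HISTORY_TOKENS = 150_000  # leave headroom below the context window

def approx_tokens(messages: list[dict]) -> int:
    # Rough heuristic (~4 characters per token); production systems count properly.
    return sum(len(str(m["content"])) for m in messages) // 4

def compact(messages: list[dict], keep_recent: int = 10) -> list[dict]:
    if approx_tokens(messages) < MAX_HISTORY_TOKENS:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = client.messages.create(
        model="claude-3-5-haiku-latest",  # a cheaper model is sufficient for compression
        max_tokens=1000,
        messages=[{"role": "user",
                   "content": "Summarize the key decisions and open issues in this "
                              f"agent transcript:\n\n{old}"}],
    ).content[0].text
    return [{"role": "user", "content": f"Summary of earlier work:\n{summary}"}] + recent
```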
**Evaluation and quality assurance**: While metrics were shared (Gamma's 8% satisfaction increase, Bito's 80% feedback contribution, Tempo's 10-15% PR contribution), the underlying evaluation frameworks weren't detailed. Production LLM systems require rigorous evaluation strategies covering accuracy, helpfulness, safety, and domain-specific quality criteria.
**Human-in-the-loop vs. full automation**: The companies take different approaches to autonomy. Tempo enables designers to directly commit code but likely has review processes. Gamma explicitly positions their output as a "starting point" requiring human refinement. Bito provides automated suggestions but humans make final decisions. Create generates full applications but users need to test and potentially modify them. Calibrating the right level of automation for each use case is a key LLMOps decision.
**Cost management at scale**: Supporting millions or tens of millions of operations (Diffusion's lyrics, Create's users) requires careful cost optimization. Strategies likely include prompt caching, request batching, model selection (using smaller/faster models where appropriate), and user-based rate limiting.
These six companies collectively demonstrate the maturity and diversity of Claude deployments in production. They're not running toy demos or proofs-of-concept—they're serving hundreds of thousands to millions of users, generating measurable business value, and navigating the operational complexities of production LLM systems. Their experiences provide valuable insights into the current state of LLMOps and the practical considerations for deploying frontier models at scale.