ZenML

Building Production-Grade Generative AI Applications with Comprehensive LLMOps

Block (Square) 2023

Block (Square) implemented a comprehensive LLMOps strategy across multiple business units using a combination of retrieval augmentation, fine-tuning, and pre-training approaches. They built a scalable architecture using Databricks' platform that allowed them to manage hundreds of AI endpoints while maintaining operational efficiency, cost control, and quality assurance. The solution enabled them to handle sensitive data securely, optimize model performance, and iterate quickly while maintaining version control and monitoring capabilities.

Industry

Tech

Overview

This case study captures insights from a joint presentation between Databricks and Block (the parent company of Square, Cash App, TIDAL, and TBD) at a Databricks conference. The presentation featured Ina Koleva from Databricks’ Product Team and Bradley Axen from Block’s Engineering and Machine Learning Team. The discussion provides a comprehensive look at how Block built and operationalized generative AI applications at enterprise scale, emphasizing the importance of treating LLM applications as versioned models and building a flexible yet robust platform infrastructure.

Block’s situation is particularly instructive because they operate multiple distinct business units—Square (payments and point-of-sale), Cash App (peer-to-peer payments and banking), TIDAL (music streaming), and TBD (blockchain technologies)—each with different Gen AI requirements. This diversity necessitated a platform approach that could generalize across use cases while avoiding over-engineering in a rapidly evolving space.

The Platform Philosophy

Block’s approach to LLMOps was driven by two key tensions: the need for a flexible platform that could support diverse AI solutions across business units, versus the reality that the generative AI space evolves so rapidly that over-building risks creating infrastructure that becomes obsolete within months. Their solution was to invest in platform capabilities that enable quick iteration and deployment while maintaining operational control.

A critical insight from Block’s experience was recognizing early that without treating the entire LLM stack as a versioned model, teams quickly lose track of performance across different prompt versions and model configurations. This realization led them to standardize on MLflow for experiment tracking and model versioning, allowing them to perform A/B testing and compare versions systematically.
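The discipline of treating the entire stack—prompt, model choice, and parameters—as one versioned artifact can be sketched with a minimal, hypothetical registry. The names `ChainVersion` and `ChainRegistry` are illustrative only; Block standardized on MLflow for this rather than hand-rolled code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChainVersion:
    """One immutable snapshot of the full LLM stack: prompt + model + params."""
    version: int
    model_name: str
    prompt_template: str
    temperature: float = 0.0

class ChainRegistry:
    """Tracks every configuration as a versioned model, MLflow-style,
    so any two versions can be compared or A/B tested later."""
    def __init__(self):
        self._versions = []

    def register(self, model_name, prompt_template, temperature=0.0):
        v = ChainVersion(len(self._versions) + 1, model_name,
                         prompt_template, temperature)
        self._versions.append(v)
        return v

    def get(self, version):
        return self._versions[version - 1]

    def latest(self):
        return self._versions[-1]

registry = ChainRegistry()
registry.register("gpt-4", "Extract fields from: {text}")
registry.register("gpt-4-turbo", "Return JSON for: {text}", temperature=0.2)
```

Because each snapshot is immutable and numbered, "which prompt was live last Tuesday?" becomes a lookup instead of an archaeology exercise.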

Architecture and Infrastructure

Model Serving and Endpoint Management

Block’s infrastructure grew to encompass approximately two hundred serving endpoints, creating significant operational challenges around visibility, cost management, and rate limiting. Their architecture separates the application-specific chains (which handle prompts, validation, and orchestration) from the foundational models that power them. This separation is crucial because while they have many application endpoints, they don’t want to host separate GPU instances for each one—that would be prohibitively expensive.

The solution involves using MLflow AI Gateway as a central routing layer. Multiple serving endpoints call out to the AI Gateway, which then routes requests to various foundational models—whether self-hosted Llama, OpenAI’s ChatGPT, Anthropic’s Claude, or other vendors. This gives the platform team centralized visibility, cost management, and rate limiting across all of those endpoints.
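The routing pattern—many application endpoints sharing a few foundation-model backends—can be sketched in a few lines of Python. The class and endpoint names here are illustrative, not Block's or MLflow's actual API:

```python
class AIGateway:
    """Central routing layer: many application endpoints share a few
    foundation-model backends instead of one GPU deployment each."""
    def __init__(self):
        self._backends = {}   # backend name -> callable (prompt -> str)
        self._routes = {}     # application endpoint -> backend name
        self.calls = {}       # per-backend call counts, for cost visibility

    def add_backend(self, name, fn):
        self._backends[name] = fn
        self.calls[name] = 0

    def bind(self, app_endpoint, backend_name):
        self._routes[app_endpoint] = backend_name

    def query(self, app_endpoint, prompt):
        backend = self._routes[app_endpoint]
        self.calls[backend] += 1   # single place to meter and rate-limit
        return self._backends[backend](prompt)

gateway = AIGateway()
gateway.add_backend("self-hosted-llama", lambda p: f"[llama] {p}")
gateway.add_backend("openai-gpt4", lambda p: f"[gpt4] {p}")
# Hundreds of application endpoints can share just these two backends:
gateway.bind("square-support-chat", "self-hosted-llama")
gateway.bind("cashapp-actions", "openai-gpt4")
```

Swapping an application from one vendor to another becomes a one-line `bind` change, while metering and rate limits live in one place instead of two hundred.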

Handling Sensitive Data with Self-Hosted Models

Given Block’s position in financial services, data security is paramount. For their most sensitive data, they recognized early that relying on self-hosted models was essential rather than sending data to third-party APIs. However, naively self-hosting large models using standard PyTorch containers resulted in dramatically slower inference compared to optimized vendor solutions.

The breakthrough came from optimized GPU serving endpoints (leveraging technology from Databricks’ acquisition of MosaicML). This optimization was described as “night and day”—without it, self-hosted latency was simply too high for production use cases. The optimized serving reduced latency per token to acceptable levels, enabling secure, self-hosted inference for sensitive financial data.

Use Case Examples

Action-Based Conversations (Structured Output Generation)

One of Block’s use cases involves converting natural language input into structured JSON to trigger backend actions—essentially replacing form-filling with conversational interfaces. The implementation progression illustrates common LLMOps patterns:

The initial approach used a powerful model (GPT-4) with few-shot examples and field descriptions. However, GPT-4 wasn’t 100% accurate, so a validation step was added with retry logic and fallback options. Performance concerns led to experimenting with GPT-4 Turbo, new prompts, and alternative models. Within a short time, the team had iterated through six or seven versions of the chain.
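The validate-retry-fallback loop described above can be sketched as follows. The stub model, field names, and fallback payload are hypothetical; in production the `llm` callable would be a GPT-4-class endpoint behind the gateway:

```python
import json

def generate_action(prompt, llm, required_fields=("action", "amount"),
                    max_retries=2):
    """Ask the model for structured JSON; validate, retry, then fall back.

    `llm` is any callable prompt -> str."""
    for attempt in range(max_retries + 1):
        raw = llm(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: retry
        if all(f in parsed for f in required_fields):
            return parsed  # valid structured output triggers the backend action
    # Fallback: route the user to the traditional form instead of failing.
    return {"action": "fallback_to_form"}

# Stub model that fails once, then answers correctly.
responses = iter(['not json at all', '{"action": "refund", "amount": 5}'])
flaky_llm = lambda prompt: next(responses)
result = generate_action("Refund $5 to the customer", flaky_llm)
```

Because the whole loop lives inside the chain, swapping GPT-4 for GPT-4 Turbo or tightening the validation rules produces a new chain version without touching the calling application.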

This rapid iteration underscored the importance of versioning. Without tracking each configuration as a versioned model, comparing performance becomes impossible. By implementing proper MLflow tracking, teams could conduct A/B tests and make data-driven improvements while applications only needed to know how to call the stable serving endpoint.

Customer-Facing Chat with RAG

Block’s customer-facing chat implementations rely heavily on retrieval-augmented generation (RAG) to ensure responses reference accurate, company-specific content rather than the base model’s training data.

Early implementations stored vector embeddings within the model serving container itself. While this worked for small scale, it created operational headaches because any context update required model redeployment. By decoupling vector search into a separate endpoint, Block achieved two independent iteration cycles: one for prompts and model selection, another for adding new context and documents.
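The decoupling described above can be sketched as two independent components: a standalone retrieval endpoint and a generation chain that calls it. Everything here—class names, the toy two-dimensional embeddings, the sample documents—is illustrative:

```python
import math

class VectorSearchEndpoint:
    """Stands alone so documents can be added without redeploying the model."""
    def __init__(self):
        self._docs = []  # list of (embedding, text) pairs

    def add(self, embedding, text):
        self._docs.append((embedding, text))

    def search(self, query_emb, k=1):
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)
        ranked = sorted(self._docs, key=lambda d: cos(query_emb, d[0]),
                        reverse=True)
        return [text for _, text in ranked[:k]]

def rag_chain(question, query_emb, retriever, llm):
    """The generation chain calls retrieval over the network boundary,
    so prompts/models and documents iterate on separate cycles."""
    context = " ".join(retriever.search(query_emb))
    return llm(f"Context: {context}\nQuestion: {question}")

retriever = VectorSearchEndpoint()
retriever.add([1.0, 0.0], "Square fees are 2.6% + 10c per tap.")
retriever.add([0.0, 1.0], "Cash App supports direct deposit.")
answer = rag_chain("What are the fees?", [0.9, 0.1], retriever, lambda p: p)
```

Adding a new help-center article is now a `retriever.add(...)` call, not a model redeployment.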

Quality Assurance and Safety

Block implemented pre- and post-processing steps to ensure quality and safety in customer-facing applications.

These safeguards are implemented as additional steps in their Python-based model chains, benefiting from the same versioning and A/B testing infrastructure. Teams can add new requirements, compare performance with and without various safeguards, and iterate confidently.
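A minimal sketch of the wrapping pattern: safeguard steps run before and after the model call inside the same Python chain, so they version and A/B test along with everything else. The specific filters below are toy placeholders, not Block's actual safeguards:

```python
def safe_chat(user_input, llm,
              blocked_terms=("ssn",),  # illustrative pre-filter only
              refusal="Sorry, I can't help with that."):
    """Wrap the model call with pre- and post-processing safeguard steps."""
    # Pre-processing: screen the input before it ever reaches the model.
    if any(term in user_input.lower() for term in blocked_terms):
        return refusal
    raw = llm(user_input)
    # Post-processing: validate the model output before the customer sees it.
    if not raw.strip():
        return refusal
    return raw

echo_llm = lambda p: f"Answer: {p}"
ok = safe_chat("How do I issue a refund?", echo_llm)
blocked = safe_chat("What is this customer's SSN?", echo_llm)
```

Because the safeguards are ordinary chain steps, a team can ship version N with a new output filter, A/B it against version N-1, and measure the quality cost of the extra check.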

Fine-Tuning Strategy

Block’s fine-tuning efforts focus primarily on latency reduction and cost optimization rather than improving raw model capability. The insight is that even complex chains spend the vast majority of their latency budget and cost on the generation step—the large foundational models.

Their approach: if a 13-billion parameter model works adequately, fine-tuning might enable equivalent performance from a 3-billion parameter model, dramatically reducing inference costs and latency. Block has prior experience successfully building BERT-based chat models under 1 billion parameters, giving them confidence that smaller fine-tuned models can work—they just needed operationally scalable tooling.
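The economics behind this approach follow from generation latency and cost scaling roughly linearly with parameter count. A back-of-envelope calculation, with all constants purely illustrative (not Block's or any vendor's real numbers):

```python
def generation_cost(params_billions, tokens,
                    ms_per_token_per_billion=0.5,
                    dollars_per_billion_params_per_m_tokens=0.1):
    """First-order model: latency and cost scale ~linearly with parameters.
    Both rate constants are made-up illustrative values."""
    latency_ms = params_billions * ms_per_token_per_billion * tokens
    cost = (params_billions * dollars_per_billion_params_per_m_tokens
            * tokens / 1e6)
    return latency_ms, cost

# A 200-token response from a 13B model vs. a fine-tuned 3B replacement:
big_latency, big_cost = generation_cost(13, tokens=200)
small_latency, small_cost = generation_cost(3, tokens=200)
savings = 1 - small_cost / big_cost  # ~77% cheaper under these assumptions
```

Under this linear assumption, dropping from 13B to 3B parameters cuts both latency and cost by roughly 77%—which is why Block treats fine-tuning primarily as an efficiency lever rather than a capability lever.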

Observability and Monitoring

Block’s monitoring strategy centers on inference tables—logging model inputs and outputs to structured data stores. This data is then joined with customer feedback (logged through Kafka pipelines to Delta tables) to measure actual business outcomes.

The emphasis on unified monitoring across data and ML systems (Lakehouse Monitoring in Databricks terminology) addresses a common failure mode: when data quality monitoring is separate from ML monitoring, lineage breaks and root cause analysis becomes impossible. If a served endpoint produces bad results due to upstream data issues, connected lineage allows engineers to trace back to which table had unexpected data patterns.
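The inference-table pattern reduces to a join keyed on a request identifier. A toy sketch with made-up rows (in Block's stack the tables would be Delta tables fed by serving logs and Kafka, not Python lists):

```python
# Model I/O logged at serving time ("inference table").
inference_table = [
    {"request_id": "r1", "input": "fee question", "output": "2.6% + 10c",
     "model_version": 3},
    {"request_id": "r2", "input": "refund help", "output": "tap Refund",
     "model_version": 3},
]
# Customer feedback arrives later via a separate pipeline.
feedback_table = [
    {"request_id": "r1", "helpful": True},
    {"request_id": "r2", "helpful": False},
]

def join_outcomes(inferences, feedback):
    """Join model I/O logs with feedback by request_id, so model versions
    can be scored on actual business outcomes."""
    fb = {row["request_id"]: row["helpful"] for row in feedback}
    return [dict(row, helpful=fb.get(row["request_id"])) for row in inferences]

joined = join_outcomes(inference_table, feedback_table)
helpful_rate = sum(1 for r in joined if r["helpful"]) / len(joined)
```

Carrying `model_version` through the join is what lets a helpful-rate regression be pinned on a specific chain version rather than on "the chatbot".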

Key Takeaways and Lessons Learned

Block’s experience yields several practical insights for production LLM systems.

The maturity curve Block and Databricks recommend—from prompt engineering to RAG to fine-tuning to pre-training—isn’t strictly linear. These techniques complement each other: a RAG application might use fine-tuned embedding models for retrieval and a fine-tuned LLM for generation. The key is establishing evaluation baselines and systematically measuring the impact of each enhancement.
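Establishing an evaluation baseline, the prerequisite for that systematic measurement, can be as simple as scoring each chain version against one shared labeled set. The evaluation set and stub chains below are invented for illustration:

```python
def evaluate(chain, labeled_set):
    """Exact-match accuracy of a chain over a shared labeled evaluation set."""
    correct = sum(1 for q, expected in labeled_set if chain(q) == expected)
    return correct / len(labeled_set)

labeled_set = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]

# Two hypothetical chain versions, stubbed as lookup tables.
baseline_chain = lambda q: {"2+2": "4"}.get(q, "unknown")
improved_chain = lambda q: {"2+2": "4",
                            "capital of France": "Paris"}.get(q, "unknown")

baseline_acc = evaluate(baseline_chain, labeled_set)   # 1/3
improved_acc = evaluate(improved_chain, labeled_set)   # 2/3
```

With a fixed evaluation set, each enhancement on the maturity curve—new prompt, added retrieval, fine-tuned model—gets a number instead of an impression.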
