DoorDash's Summer 2025 intern program showcased several sophisticated LLMOps implementations that demonstrate how a large-scale food delivery platform can leverage language models to solve operational challenges. The case studies present practical applications of LLMs in production environments, focusing on the automation of manual processes and on infrastructure for conversational AI systems.
## Never-Delivered Order Feature Extraction System
The first major project tackled a significant operational challenge at DoorDash's scale of approximately 2 billion annual orders. When customers report never-delivered (ND) orders despite Dashers marking them as delivered, the company faces a manual review bottleneck that is both expensive and slow. The traditional process required human agents to analyze conversations between Dashers and consumers to determine the root cause of delivery issues, severely limiting the number of cases that could be processed daily.
The technical solution involved developing a custom in-house language model for automated feature extraction from customer-Dasher conversation data. The team conducted a comprehensive model comparison that offers valuable insight into LLMOps decision-making. They evaluated Meta's Llama 3 models in both 70B and 8B parameter variants, in base and fine-tuned forms, against a custom DistilBERT model trained for sequence classification.
The results reveal important considerations for production LLM deployment. While general-purpose reasoning models like Llama 3 demonstrated powerful text understanding capabilities, they proved less effective for the specific binary classification task of determining whether certain events occurred in conversations. The fine-tuned DistilBERT model ultimately achieved superior performance with an F1 score of 0.8289 compared to the best Llama variant's 0.6087, while also delivering significantly lower latency at 0.0936 seconds versus 0.3275 seconds for the fine-tuned Llama 3-70B model.
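As a rough illustration of the fine-tuning side of this comparison, the sketch below shows how a DistilBERT binary classifier for conversation events might be trained with Hugging Face Transformers. The checkpoint name, toy dataset, and hyperparameters are assumptions for illustration, not DoorDash's actual pipeline.

```python
# Illustrative sketch (not DoorDash's actual code): fine-tuning DistilBERT
# for a binary "did this event occur in the conversation?" classifier.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical labeled data: conversation text plus a 0/1 event label.
train = Dataset.from_dict({
    "text": ["Consumer: please leave it at the side door. Dasher: ok!"],
    "label": [1],
})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=512)

train = train.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="nd-classifier",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train,
)
trainer.train()
```

A compact encoder like this is a natural fit for the task described above: the model only has to decide whether a specific event occurred, not reason over open-ended instructions, which is consistent with the reported F1 and latency advantages.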
A critical aspect of this LLMOps implementation was addressing data imbalance challenges. The team discovered that certain classification questions had extremely skewed distributions, with some positive cases representing only 1% of the dataset. Without proper handling, models would default to always predicting the majority class, effectively ignoring rare but important scenarios. The solution involved oversampling rare positive cases and using F1 score as the primary evaluation metric to better balance precision and recall considerations.
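A minimal sketch of both techniques, assuming a plain NumPy/scikit-learn setup: the `oversample` helper (a hypothetical name) duplicates rare positives up to a target ratio, and the F1 comparison shows why accuracy is misleading at a 1% positive rate.

```python
# Illustrative sketch: oversampling a ~1%-positive class and scoring with F1
# rather than accuracy, so a model can't win by always predicting "no".
import numpy as np
from sklearn.metrics import f1_score
from sklearn.utils import resample

def oversample(texts, labels, target_ratio=0.5, seed=42):
    """Duplicate rare positive examples until they reach target_ratio."""
    texts, labels = np.asarray(texts), np.asarray(labels)
    pos, neg = texts[labels == 1], texts[labels == 0]
    n_pos_needed = int(target_ratio * len(neg) / (1 - target_ratio))
    pos_upsampled = resample(pos, replace=True, n_samples=n_pos_needed,
                             random_state=seed)
    X = np.concatenate([neg, pos_upsampled])
    y = np.concatenate([np.zeros(len(neg)), np.ones(len(pos_upsampled))])
    return X, y

# A trivial all-negative model scores 99% accuracy on a 1%-positive set,
# but F1 on the positive class exposes it as useless.
y_true = np.array([0] * 99 + [1])
y_pred_trivial = np.zeros(100)
print(f1_score(y_true, y_pred_trivial))  # 0.0 despite 99% accuracy
```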
The production system demonstrates practical feature extraction capabilities, processing conversation dialogues in near real-time with sub-second inference times. For example, the model can analyze a conversation where a consumer requests delivery to a different address and correctly classify this scenario in 0.159 seconds. This automation enables rapid backfilling of historical data and improves the downstream ND attribution model's performance over time.
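A hedged sketch of what single-conversation inference might look like, reusing the hypothetical `nd-classifier` checkpoint from the fine-tuning sketch; the dialogue and measured latency are illustrative only.

```python
# Illustrative sketch: classifying one conversation and timing the call,
# mirroring the sub-second inference example described above.
import time
from transformers import pipeline

# "nd-classifier" is the hypothetical fine-tuned checkpoint from earlier.
classifier = pipeline("text-classification", model="nd-classifier")

dialogue = ("Consumer: Can you deliver to my office next door instead? "
            "Dasher: Sure, send me the address.")

start = time.perf_counter()
result = classifier(dialogue)
latency = time.perf_counter() - start
print(result, f"{latency:.3f}s")  # e.g. [{'label': 'LABEL_1', ...}] 0.1xx s
```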
## Chatbot-as-a-Service RAG Infrastructure
The second major project addressed infrastructure scalability challenges in DoorDash's conversational AI ecosystem. As chatbot adoption increased across teams for use cases including FAQ responses, support issue resolution, and information surfacing, two significant limitations emerged: teams lacked a centralized, scalable platform for managing knowledge bases, and there was no unified API infrastructure for easily creating and deploying new chatbots.
The legacy approach involved teams storing article embeddings in vector databases while managing metadata separately in spreadsheets, creating a fragmented and error-prone workflow. Teams had to locate articles in spreadsheets and then use associated IDs to query vector databases, making the system unscalable and difficult to maintain.
The solution involved building a comprehensive Retrieval-Augmented Generation (RAG) framework with two main components: a Knowledge Base Management Service and a generalized Chatbot-as-a-Service platform. The architecture demonstrates sophisticated LLMOps principles including automated ingestion, embedding generation, and retrieval of knowledge base articles, with isolated collections and customizable embedding models per bot.
The Knowledge Base Management Service provides several key APIs that showcase mature LLMOps practices. The UpsertArticle API handles automatic embedding generation using specified embedding model providers while storing comprehensive metadata, including embedding configurations, in the vector database. The ReembedCollectionWithNewModel API addresses a critical production challenge: the ability to migrate existing knowledge bases to new embedding models without data loss, demonstrating forward-thinking infrastructure design for evolving model capabilities.
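The article does not publish these API signatures, so the following is a hypothetical Python sketch of how the two operations could fit together; `KnowledgeBaseService`, its constructor arguments, and the vector-database methods are all assumed names.

```python
# Hypothetical interface sketch of the knowledge base management APIs;
# names mirror the article, signatures and internals are assumptions.
from dataclasses import dataclass

@dataclass
class Article:
    article_id: str
    text: str
    embedding_model: str   # provider/model used to embed this article
    embedding: list[float]

class KnowledgeBaseService:
    def __init__(self, vector_db, embedders):
        self.vector_db = vector_db  # per-bot isolated collections
        self.embedders = embedders  # model name -> embedding function

    def upsert_article(self, collection: str, article_id: str, text: str,
                       embedding_model: str) -> None:
        """UpsertArticle: embed with the specified provider and store the
        vector with its embedding configuration as metadata."""
        vector = self.embedders[embedding_model](text)
        self.vector_db.upsert(collection, Article(
            article_id, text, embedding_model, vector))

    def reembed_collection_with_new_model(self, collection: str,
                                          new_model: str) -> None:
        """ReembedCollectionWithNewModel: migrate a collection to a new
        embedding model without losing the underlying articles."""
        for article in self.vector_db.scan(collection):
            self.upsert_article(collection, article.article_id,
                                article.text, new_model)
```

The key design point is that each article carries its embedding configuration as metadata, which is what makes a wholesale re-embedding migration possible without touching the source text.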
The GetRagResult API represents the core of the chatbot-as-a-service offering, executing the full RAG workflow by retrieving relevant knowledge base articles, combining them with user conversation history and customizable prompt templates, and generating context-aware responses. The API returns both generated responses and lists of referenced articles, providing transparency and enabling features like guardrails and audit logging.
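A minimal sketch of that workflow under the same assumptions as the previous snippet; the function name mirrors the API, but the helper methods and return shape are illustrative.

```python
# Hypothetical sketch of the GetRagResult workflow: retrieve, assemble the
# prompt from a template plus conversation history, generate, and return
# both the answer and the referenced articles for auditability.
def get_rag_result(kb, llm, collection, history, query,
                   prompt_template, k=5):
    query_vector = kb.embed_for(collection, query)  # assumed helper
    articles = kb.vector_db.search(collection, query_vector, top_k=k)
    context = "\n\n".join(a.text for a in articles)
    prompt = prompt_template.format(context=context,
                                    history="\n".join(history),
                                    question=query)
    answer = llm.generate(prompt)
    # Returning references enables guardrails and audit logging downstream.
    return {"response": answer,
            "referenced_articles": [a.article_id for a in articles]}
```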
The platform's onboarding workflow demonstrates practical LLMOps implementation patterns. Teams create isolated vector database collections ensuring schema flexibility and model customization capabilities, upload content through automated ingestion processes, define prompt behavior for both successful retrievals and fallback scenarios, and register with conversation teams using intent-based lookup systems.
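Tying the steps together, a hypothetical onboarding script might look like the following; every identifier here (`chatbot_service`, `conversation_platform`, the model and source names) is an assumption used to make the workflow concrete.

```python
# Hypothetical onboarding sketch: each call maps to one step of the
# workflow described above. All names are illustrative assumptions.
bot = chatbot_service.create_bot(
    name="dasher-support-faq",
    collection="dasher_support_kb",            # isolated vector DB collection
    embedding_model="text-embedding-3-small",  # customizable per bot
)
bot.ingest_articles(source="confluence://dasher-support")  # automated ingestion
bot.set_prompts(
    on_retrieval="Answer using only the articles below:\n{context}",
    on_fallback="Sorry, I couldn't find an answer; routing you to an agent.",
)
conversation_platform.register_intent("dasher_support", handler=bot)
```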
## Server-Driven UI and Annotation Pipeline Projects
While the document includes additional projects focusing on server-driven UI and annotation pipeline redesigns, these represent more traditional software engineering approaches with limited direct LLM integration. The server-driven UI project using DoorDash's Mosaic framework enables dynamic interface updates without frontend code changes, and the annotation pipeline redesign improves tag quality through Kafka-based ingestion and Cadence workflow orchestration.
## LLMOps Lessons and Production Considerations
These case studies illustrate several important LLMOps principles for enterprise deployments. The model comparison methodology demonstrates the importance of evaluating models based on specific use case requirements rather than general capabilities. Specialized models like DistilBERT can outperform larger general-purpose models for focused classification tasks while offering better latency characteristics for production deployment.
The data balancing techniques highlight critical considerations for production LLM systems dealing with real-world data distributions. The use of F1 score as a primary metric rather than accuracy shows sophisticated understanding of evaluation approaches for imbalanced datasets.
The RAG infrastructure design demonstrates scalable approaches to knowledge management and conversational AI deployment. The emphasis on isolated collections, customizable embedding models, and transparent article referencing shows mature thinking about multi-tenant LLM systems and operational requirements like auditability and model evolution.
The automation focus throughout both projects reflects practical LLMOps priorities: reducing manual processes, enabling faster iteration cycles, and allowing specialized teams to focus on higher-value activities rather than repetitive tasks. The sub-second inference times and real-time processing capabilities show attention to production performance requirements.
These implementations represent mature LLMOps practices within a large-scale commercial environment, demonstrating how language models can be effectively integrated into existing operational workflows while maintaining scalability, reliability, and performance standards required for enterprise deployment.