Discord shares their comprehensive approach to building and deploying LLM-powered features, from ideation to production. They detail their process of identifying use cases, defining requirements, prototyping with commercial LLMs, evaluating prompts using AI-assisted evaluation, and ultimately scaling through either hosted or self-hosted solutions. The case study emphasizes practical considerations around latency, quality, safety, and cost optimization while building production LLM applications.
Discord, the popular communication platform serving hundreds of millions of users, published a detailed engineering blog post describing their methodology for developing and deploying generative AI features. This case study provides a practical framework for how a large-scale consumer technology company approaches LLMOps, from initial ideation through full production deployment. The post, authored by Shannon Phu, a Senior Machine Learning Engineer on Discord’s Applied Machine Learning team, offers insights into the tradeoffs and decision-making processes involved in bringing LLM-powered features to production.
The case study is notable for its transparency about the iterative nature of LLM development and its balanced treatment of commercial versus self-hosted model options. While it doesn’t detail specific features that were built using this process, it provides a generalizable framework that reflects real production experience at scale.
Discord begins by emphasizing that generative AI should be treated as a tool that must be carefully matched to appropriate problems. They highlight three categories of challenges where generative AI is particularly well-suited:
This framing is important from an LLMOps perspective because it establishes clear criteria for when to pursue LLM-based solutions versus other approaches. The emphasis on unstructured content and scale suggests Discord is likely applying these techniques to content moderation, user support, or similar text-heavy operations.
Before prototyping, Discord defines several key dimensions that influence model selection and system design:
Latency covers how quickly the system must respond to user input. For a real-time communication platform like Discord, this is critical for user-facing features.
Task Complexity involves assessing the level of understanding required from the LLM and whether the input context is highly domain-specific. This influences whether off-the-shelf models will suffice or fine-tuning is necessary.
Prompt Length determines how much context must be provided, which impacts token costs and model selection based on context window requirements.
Quality establishes the acceptable level of accuracy for generated content, which must be balanced against latency requirements.
Safety addresses the importance of sanitizing user input and preventing harmful content generation or prompt injection attacks—particularly critical for a consumer platform like Discord.
Language Support considers which languages the application must support, relevant for Discord’s global user base.
Estimated QPS (queries per second) forecasts the throughput requirements for full-scale deployment, which directly impacts infrastructure planning and cost projections.
Discord explicitly acknowledges the tension between quality and latency, noting that more capable models typically operate more slowly. They present this as a tradeoff that can be addressed either by accepting higher costs (more compute capacity) or by using smaller models with some quality degradation.
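One way to make these requirement dimensions concrete is to record them as data and screen candidate models against them. The sketch below is illustrative only: the class names, fields, and thresholds are assumptions made for this example and do not come from Discord's post.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FeatureRequirements:
    """Hypothetical record of the requirement dimensions for one feature."""
    max_latency_ms: int          # latency budget for a response
    max_prompt_tokens: int       # longest prompt the feature must support
    min_quality_score: float     # acceptable quality on an internal eval, 0..1
    languages: Tuple[str, ...]   # languages the feature must support
    estimated_qps: float         # forecast queries per second at full launch


@dataclass(frozen=True)
class ModelCandidate:
    """Hypothetical measurements for one candidate model."""
    name: str
    p95_latency_ms: int
    context_window_tokens: int
    quality_score: float


def shortlist(candidates: List[ModelCandidate],
              req: FeatureRequirements) -> List[str]:
    """Return names of candidates meeting latency, context, and quality needs."""
    return [
        m.name
        for m in candidates
        if m.p95_latency_ms <= req.max_latency_ms
        and m.context_window_tokens >= req.max_prompt_tokens
        and m.quality_score >= req.min_quality_score
    ]
```

A large model that scores well on quality but misses the latency budget is filtered out here, which mirrors the quality-versus-latency tradeoff described above. A real selection process would also weigh cost, safety, and language coverage.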
Discord’s prototyping approach favors starting with advanced commercial LLMs like GPT-4 to quickly validate ideas before investing in infrastructure. The rationale is pragmatic: if state-of-the-art foundation models cannot adequately solve a problem, the problem may not be addressable with current generative AI technology. This “fail fast” approach prevents wasted engineering effort on infrastructure for problems that aren’t tractable.
The core technical challenge at the prototyping stage is developing effective prompts. Discord describes an iterative process of refining prompt wording, but notes that after many adjustments it becomes difficult to tell whether changes are improving results. This motivates their use of systematic evaluation.
A key contribution of this case study is Discord’s description of their AI-assisted evaluation methodology. This technique uses a best-in-class LLM (such as GPT-4) to automatically evaluate the outputs of the inference model. Discord draws an analogy to the actor-critic algorithm from reinforcement learning, where the “critic” (GPT-4) evaluates how well the “actor” (the inference model) performed.
The evaluation setup involves two separate prompts:
To streamline evaluation, Discord designs critic prompts that request simple yes/no answers or numerical ratings, making the process more automatable and reducing the need for manual review of every output.
This approach allows rapid iteration on prompt design with quantitative feedback. AI-assisted evaluation has its own limitations, however: the critic model’s judgments may not always align with human preferences, and the technique may miss subtle quality issues that humans would catch.
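The actor-critic evaluation loop can be sketched as follows. This is a minimal illustration, not Discord's implementation: the critic prompt wording, the `parse_verdict` and `pass_rate` helpers, and the convention of passing the models in as plain callables are all assumptions made for the example. In practice, `actor` and `critic` would wrap calls to the inference model and to a stronger model such as GPT-4.

```python
from typing import Callable, Iterable, Optional

# Hypothetical critic prompt that asks for a simple yes/no answer.
CRITIC_TEMPLATE = (
    "You are grading another model's answer.\n"
    "Task: {task}\n"
    "Answer: {answer}\n"
    "Does the answer complete the task correctly? Reply with only YES or NO."
)


def parse_verdict(raw: str) -> Optional[bool]:
    """Map the critic's reply to True/False, or None if it is unparseable."""
    text = raw.strip().upper()
    if text.startswith("YES"):
        return True
    if text.startswith("NO"):
        return False
    return None


def pass_rate(tasks: Iterable[str],
              actor: Callable[[str], str],
              critic: Callable[[str], str]) -> float:
    """Fraction of gradable actor outputs that the critic marks as correct."""
    passed = graded = 0
    for task in tasks:
        answer = actor(task)
        verdict = parse_verdict(
            critic(CRITIC_TEMPLATE.format(task=task, answer=answer))
        )
        if verdict is None:
            continue  # skip replies that are neither YES nor NO
        graded += 1
        passed += int(verdict)
    return passed / graded if graded else 0.0
```

Comparing `pass_rate` across two versions of the actor prompt on the same task set gives a quantitative signal about whether a wording change helped. Spot-checking a sample of critic verdicts by hand remains advisable.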
Once prompt quality reaches acceptable levels, Discord deploys features through limited releases (A/B tests) to gather real-world performance data. The metrics they track fall into several categories:
User satisfaction metrics include direct user feedback, engagement measures, and identification of problematic responses, such as incorrect answers or hallucinations that cause strange user experiences.
System health metrics encompass response latency, throughput (tokens per second), and error rates. Discord notes that LLMs sometimes struggle to generate consistently structured output, and that consistent structure is crucial for downstream parsing and service integration.
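A common way to cope with inconsistently structured output is to validate each response and retry on failure, counting the retries as an error-rate signal. The sketch below assumes the model is asked to return a JSON object. The required keys, the `llm` callable, and the retry limit are hypothetical choices for illustration, not details from Discord's post.

```python
import json
from typing import Callable, Optional, Tuple

# Hypothetical schema: keys the downstream service expects in every response.
REQUIRED_KEYS = {"label", "confidence"}


def parse_structured(raw: str) -> Optional[dict]:
    """Return the parsed object if it is valid JSON with all required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    return data


def generate_with_retry(llm: Callable[[str], str],
                        prompt: str,
                        max_attempts: int = 3) -> Tuple[Optional[dict], int]:
    """Call the model until its output parses; return (result, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        parsed = parse_structured(llm(prompt))
        if parsed is not None:
            return parsed, attempt
    return None, max_attempts
```

Logging the number of attempts per request makes malformed-output rates visible alongside latency and throughput.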
Cost metrics focus on token consumption (tokens per second) during the limited release, which enables forecasting of full-scale launch costs.
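The forecasting step amounts to simple arithmetic. The sketch below assumes that token consumption scales linearly with the share of traffic exposed to the feature and that pricing is per 1,000 tokens. The function name, the linear-scaling assumption, and every number in the usage note are hypothetical.

```python
def forecast_monthly_cost(observed_tokens_per_second: float,
                          rollout_fraction: float,
                          price_per_1k_tokens: float,
                          seconds_per_month: int = 30 * 24 * 3600) -> float:
    """Extrapolate limited-release token usage to a full-launch monthly cost.

    Assumes usage grows linearly with the fraction of traffic exposed.
    """
    if not 0 < rollout_fraction <= 1:
        raise ValueError("rollout_fraction must be in (0, 1]")
    full_scale_tps = observed_tokens_per_second / rollout_fraction
    monthly_tokens = full_scale_tps * seconds_per_month
    return monthly_tokens / 1000 * price_per_1k_tokens
```

For example, 50 tokens per second observed at a 5% rollout extrapolates to about 1,000 tokens per second at full launch. At a hypothetical price of $0.002 per 1,000 tokens, that comes to roughly $5,184 per 30-day month.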
This combination of user-facing and operational metrics provides a comprehensive view of whether a feature is production-ready. Discord emphasizes that if metrics indicate issues, the team iterates on the system rather than proceeding to full deployment.
Discord describes their high-level architecture for LLM applications, which consists of several key components:
Prompt preparation converts inputs into prompts that have been tested and evaluated against a robust set of examples.
The LLM inference server is the core component responsible for generating responses. For prototyping, this may be a commercial API like OpenAI’s GPT endpoints.
Content safety filters are applied to inference server outputs before they reach users. Discord works with Legal and Safety teams to implement mitigations and can leverage both in-house and third-party trust and safety ML models to detect inappropriate content.
This architecture reflects Discord’s emphasis on user experience, privacy, and safety—essential considerations for a consumer platform serving diverse communities including minors.
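The three components described above can be wired together as a short pipeline. This is a sketch under stated assumptions rather than Discord's architecture in code: the template, the input-length cap, the fallback message, and the `infer` and `is_safe` callables are all hypothetical stand-ins for an inference server and a trust and safety classifier.

```python
from typing import Callable

MAX_INPUT_CHARS = 2000  # hypothetical cap on user input length
FALLBACK_MESSAGE = "Sorry, I can't help with that."  # hypothetical fallback


def prepare_prompt(user_input: str, template: str) -> str:
    """Trim and truncate user input, then insert it into a tested template."""
    cleaned = user_input.strip()[:MAX_INPUT_CHARS]
    return template.format(user_input=cleaned)


def run_pipeline(user_input: str,
                 template: str,
                 infer: Callable[[str], str],
                 is_safe: Callable[[str], bool]) -> str:
    """Prompt preparation -> LLM inference -> content safety filter."""
    prompt = prepare_prompt(user_input, template)
    output = infer(prompt)
    # Only outputs that pass the safety check reach the user.
    return output if is_safe(output) else FALLBACK_MESSAGE
```

Keeping `infer` behind a plain callable means the same pipeline can point at a commercial API during prototyping and at a self-hosted server later without changing the surrounding code.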
Discord provides a thoughtful treatment of the tradeoffs between commercial LLM APIs and self-hosted solutions. Commercial LLMs offer access to state-of-the-art models without the burden of setting up infrastructure, but costs can scale quickly. Privacy considerations may also favor in-house processing.
For self-hosted solutions, Discord mentions open-source models like Llama and Mistral as capable of producing high-quality results even for complex tasks. However, they note that domain-specific or complex tasks may still require fine-tuning to achieve excellent performance. Their recommendation is to start with smaller models and scale up only if quality requirements demand it.
The infrastructure requirements for self-hosted LLMs include:
Discord notes that optimal server configuration is task-specific, depending on models, input/output token lengths, and batching strategies for maximizing throughput.
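A back-of-the-envelope capacity estimate shows how the QPS forecast and measured per-server throughput combine. The formula and the headroom factor below are assumptions made for illustration. Real throughput depends on the model, on token lengths, and on batching, so the per-server figure should come from load tests on the actual configuration.

```python
import math


def servers_needed(qps: float,
                   avg_output_tokens: float,
                   tokens_per_second_per_server: float,
                   headroom: float = 0.7) -> int:
    """Estimate how many inference servers are needed for a target QPS.

    headroom is the fraction of measured peak throughput to plan around,
    leaving spare capacity for traffic spikes (an assumed safety margin).
    """
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    required_tps = qps * avg_output_tokens
    usable_tps = tokens_per_second_per_server * headroom
    return math.ceil(required_tps / usable_tps)
```

With hypothetical figures of 20 QPS, 150 output tokens per request, and 1,000 tokens per second per server, the required 3,000 tokens per second divided by 700 usable tokens per second per server rounds up to 5 servers.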
While Discord’s framework is well-structured and reflects genuine production experience, several aspects merit critical consideration:
The case study is light on specific quantitative results or concrete examples of features built using this methodology. Without knowing what Discord actually shipped using these processes, it’s difficult to assess how well the framework works in practice.
The discussion of AI-assisted evaluation is valuable but doesn’t address potential limitations such as evaluator model bias, the cost of using GPT-4 for evaluation at scale, or cases where critic judgments diverge from human preferences.
The treatment of self-hosted models acknowledges tradeoffs but doesn’t provide specifics on when Discord chooses one approach over the other, or what cost savings they’ve actually achieved.
Despite these limitations, the case study provides a useful template for organizations developing their own LLMOps practices, with particular value in its systematic approach to requirements definition, evaluation, and the staged progression from prototype to production deployment.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Grab developed an AI Gateway to provide centralized, secure access to multiple GenAI providers (including OpenAI, Azure, AWS Bedrock, and Google VertexAI) for their internal developers. The gateway handles authentication, cost management, auditing, and rate limiting while providing a unified API interface. Since its launch in 2023, it has enabled over 300 unique use cases across the organization, from real-time audio analysis to content moderation, while maintaining security and cost efficiency through centralized management.
A panel discussion features leaders from multiple enterprises sharing their experiences implementing LLMs in production. The discussion covers key challenges including data privacy, security, cost management, and enterprise integration. Speakers from Box discuss content management challenges, Glean covers enterprise search implementations, Tyace shares content generation experiences, Security AI addresses data safety, and Citibank provides a CIO perspective on enterprise-wide AI deployment. The panel emphasizes the importance of proper data governance, security controls, and the need for a systematic approach to moving from POCs to production.