Company: GoDaddy
Title: From Mega-Prompts to Production: Lessons Learned Scaling LLMs in Enterprise Customer Support
Industry: E-commerce
Year: 2024
Summary (short): GoDaddy has implemented large language models across their customer support infrastructure, particularly in their Digital Care team, which handles over 60,000 customer contacts daily through messaging channels. Their journey implementing LLMs for customer support revealed several key operational insights: the need for both broad and task-specific prompts, the importance of structured outputs with proper validation, the challenges of prompt portability across models, the necessity of AI guardrails for safety, the handling of model latency and reliability issues, the complexity of memory management in conversations, the benefits of adaptive model selection, the nuances of implementing RAG effectively, the value of optimizing RAG data with techniques like Sparse Priming Representations, and the critical importance of comprehensive testing. Their experience demonstrates both the potential and the challenges of operationalizing LLMs in a large-scale enterprise environment.
## Overview

GoDaddy, a major domain registrar and web hosting company, shares extensive lessons learned from operationalizing Large Language Models (LLMs) in their customer support infrastructure. The company's Digital Care team uses LLMs to handle customer interactions across messaging channels including SMS, WhatsApp, and web, processing over 60,000 customer contacts daily. This case study is a candid, practitioner-focused account of the challenges and solutions encountered when deploying LLMs at scale in a production environment.

The article, authored by Richard Clayton, a Director of Engineering at GoDaddy, draws on the team's experience since ChatGPT's release in December 2022. The team acknowledges that while LLMs outperform older natural language understanding systems, operationalizing them is far from effortless. This makes the case study particularly valuable: it doesn't oversell the technology, instead offering a balanced view of both the potential and the practical difficulties.

## Prompt Architecture Evolution

One of the most significant lessons GoDaddy learned relates to prompt architecture. Their initial approach used what they term a "mega-prompt": a single prompt designed to handle all user interactions. Their AI Assistant was designed to classify conversations into one of twenty support topics, ask topic-specific questions, and route conversations to appropriate support queues.

As they added more topics and questions, problems emerged. The prompt grew to over 1,500 tokens by their second experiment, leading to high ambient costs and occasionally exceeding token limits during lengthy conversations. Response accuracy declined as more instructions and context were incorporated, and memory management became increasingly critical once they introduced Retrieval Augmented Generation (RAG) by incorporating associated articles into prompts.

The team recognized that task-oriented prompts, focused on single tasks like "collect a coffee order", could achieve greater efficiency in complicated conversational flows. These prompts use fewer tokens, enhance accuracy, and give authors better control over outputs, since the range of viable answers is smaller. However, task-oriented prompts aren't suitable for general, open-ended conversations.

Their mature approach drew inspiration from Salesforce's multi-agent work, specifically the BOLAA paper. They shifted toward a multi-prompt architecture using the Controller-Delegate pattern, where a mega-prompt serves as a controller that passes conversations to task-oriented delegate prompts. Early results show this approach simplified their codebase while enhancing chatbot capability. The team predicts this type of prompt architecture will become commonplace until models become more precise and large-context models become more affordable.

## Structured Outputs and Validation

GoDaddy encountered significant reliability challenges when requesting structured outputs (JSON or code) from LLMs. Before OpenAI introduced function calling, their initial trials with ChatGPT 3.5 Turbo required building a custom parser to handle four to five common failure patterns. Even with ChatGPT functions, they see invalid output on approximately 1% of ChatGPT 3.5 requests and 0.25% of ChatGPT 4 requests.
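To make the failure handling concrete, the sketch below shows one way to defend against malformed structured output, in the spirit of the custom parser and bounded retries described above. The `call_model` wrapper, the routing schema, and all names are illustrative assumptions, not GoDaddy's actual code.

```python
import json

def call_model(messages, temperature=0.0):
    # Hypothetical wrapper around the provider's chat completion API.
    # Stubbed here so the example runs; a real implementation would call the SDK.
    return '{"topic": "domains", "queue": "billing", "confidence": 0.82}'

REQUIRED_FIELDS = {"topic", "queue", "confidence"}  # illustrative routing schema

def parse_routing_output(raw):
    """Return validated routing arguments, or None if the output is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Common failure mode: prose mixed with JSON. Try to salvage the JSON object.
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end <= start:
            return None
        try:
            data = json.loads(raw[start:end + 1])
        except json.JSONDecodeError:
            return None
    if not isinstance(data, dict) or not REQUIRED_FIELDS.issubset(data):
        return None
    return data

def route_with_retry(messages, max_attempts=2):
    """Low temperature plus a bounded retry; the caller escalates to a human on None."""
    for _ in range(max_attempts):
        result = parse_routing_output(call_model(messages, temperature=0.0))
        if result is not None:
            return result
    return None  # better to hand off than to act on malformed output
```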
They developed several strategies to improve structured output reliability: minimizing prompt temperature to make token selection more predictable by reducing randomness; using more advanced (albeit costlier) models for tasks involving structured content; and recognizing that models designed to respond to user queries often produce mixed outputs containing both plain text and structured formats. For models without native structured responses, or when using more affordable models, they recommend deploying two parallel prompts: one for generating structured responses and another for user communication.

## Model Portability Challenges

A critical finding is that prompts are not portable across models. Different models (Titan, LLaMA, ChatGPT) and even different versions of the same model (ChatGPT 3.5 0603 versus ChatGPT 3.5 1106) display noticeable performance differences with identical prompts.

GoDaddy ran experiments comparing ChatGPT 3.5 Turbo and ChatGPT 4.0 for their AI assistant. Using identical prompts for both, they had to discontinue the first experiment after three days because ChatGPT 3.5's performance was subpar and sometimes counterproductive in managing support cases, with failures to transfer customers and misdiagnosed problems. In subsequent attempts with prompts tuned for each model, they observed improved performance. When they upgraded to the November 2023 releases (gpt-3.5-turbo-1106), the performance gap between 3.5 and 4.0 narrowed noticeably even without modifying prompts. The conclusion is clear: teams must continuously fine-tune and test prompts to validate performance across model versions.

## AI Guardrails Implementation

GoDaddy emphasizes that LLM outputs are probabilistic, and prompts that performed well in thousands of tests can fail unexpectedly in production. A critical early mistake was allowing models to decide when to transfer to humans without providing an escape hatch for users, sometimes leaving customers stuck with an LLM that refused to transfer.

Their guardrail implementations include:

- controls that check for personally identifiable information and offensive content in AI responses, user messages, and prompt instructions
- deterministic methods for deciding when to transfer conversations to humans, relying on code-identified stop phrases rather than model judgment
- limits on bot-customer chat interactions to prevent indefinite loops
- approvals through external channels for sensitive actions
- defaulting to human intervention when situations are uncertain

## Reliability and Latency Challenges

The team reports that approximately 1% of chat completions fail at the model provider level. Latency is also a significant concern: ChatGPT 4.0 averages 3-5 seconds for completions under 1,000 tokens, with performance degrading significantly as token counts increase (some calls last up to 30 seconds before the client times out). They note with concern that newer models tend to be slower than previous generations.

Standard industry practices like retry logic help mitigate reliability issues, though retries compound latency problems. Their system was particularly susceptible because their upstream communication provider imposed a 30-second timeout on integration calls. This is pushing them toward asynchronous responses: acknowledging requests and sending messages to customers via APIs rather than responding synchronously. They also recommend adopting streaming APIs from LLM providers for a better user experience, despite the implementation complexity.
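The retry-plus-asynchronous-fallback approach can be sketched roughly as follows. The 30-second ceiling mirrors the upstream constraint mentioned above, but the per-attempt budget, helper names, and `send_via_messaging_api` fallback are assumptions for illustration rather than GoDaddy's implementation.

```python
import time

UPSTREAM_TIMEOUT_S = 30.0   # hard limit imposed by the upstream messaging provider
PER_CALL_TIMEOUT_S = 8.0    # illustrative per-attempt budget
MAX_ATTEMPTS = 3

class ModelCallError(Exception):
    """Raised by call_model on provider failures or per-call timeouts."""

def call_model(messages, timeout_s):
    # Placeholder for the provider SDK call; raises ModelCallError on failure.
    return "stubbed completion"

def send_via_messaging_api(conversation_id, text):
    # Placeholder: deliver the reply asynchronously (SMS/WhatsApp/web) once ready.
    pass

def complete_within_deadline(conversation_id, messages):
    """Retry failed completions without blowing past the upstream timeout.

    When the remaining budget is too small for another synchronous attempt,
    acknowledge the request and deliver the answer asynchronously instead.
    """
    deadline = time.monotonic() + UPSTREAM_TIMEOUT_S - 2.0  # headroom to respond
    attempt = 0
    while time.monotonic() + PER_CALL_TIMEOUT_S < deadline and attempt < MAX_ATTEMPTS:
        attempt += 1
        try:
            return call_model(messages, timeout_s=PER_CALL_TIMEOUT_S)
        except ModelCallError:
            time.sleep(min(0.5 * attempt, 2.0))  # small backoff between retries
    # Asynchronous fallback: acknowledge now, reply through the messaging API later.
    send_via_messaging_api(conversation_id, "Thanks! Give me a moment to look into that.")
    return None
```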
## Memory Management Strategies

Managing LLM context is described as one of the toughest challenges in building conversational AI. While large-context models exist (OpenAI GPT up to 32,000 tokens, Anthropic Claude up to 100,000 tokens), their use can be cost-prohibitive at scale. Additionally, more context isn't always better: it may cause models to fixate on repeated concepts or prioritize recent tokens inappropriately.

GoDaddy references LangChain's various memory management techniques: buffers (keeping the last N messages or tokens), summarization, entity recognition, knowledge graphs, dynamic retrieval by relevancy via vector stores, and combinations thereof. For short conversations, retaining the entire conversation works best; premature summarization can degrade accuracy. For longer conversations, summarizing earlier parts while tracking named entities and retaining recent messages has served them well.

A specific ChatGPT insight: removing the outcomes of tool usage (function messages) after the model responds can be beneficial, as retaining them sometimes leads to unpredictability and fixation on results. For their multi-agent architecture, they're exploring "stacks" to implement memory, providing ephemeral working memory to delegate prompts and then reaping and summarizing results when conversation focus returns to the controller.

## Adaptive Model Selection

GoDaddy experienced a multi-hour ChatGPT outage that rendered their chatbots inoperable. This highlighted the need for dynamic model selection to address reliability and cost concerns; ideally, they would have been able to switch providers and continue operations with degraded capability.

Less dramatic scenarios include switching to higher-context models when conversations approach memory limits (e.g., from ChatGPT 3.5 Turbo 4k context to 32k context). They're exploring this approach for agent tool usage that returns excessive data. The same concept could minimize support costs during product outages that cause contact surges, or bring in more accurate models for dissatisfied customers. While not yet implemented, adaptive model selection is expected to become increasingly important as LLM implementations mature and companies seek to improve effectiveness and economics.

## RAG Implementation Insights

Initial RAG implementations, which executed a query on every prompt invocation based on the user's messages, proved ineffective. GoDaddy found that understanding a customer's problem typically requires three or four messages, since initial messages are often pleasantries. Retrieving documents prematurely decreased generation accuracy by focusing the model on the wrong content.

Subsequent implementations switched to specialized RAG prompts after determining conversation intent, but this proved inflexible, requiring multiple prompts and a state machine. They then discovered the LLM agent pattern with tools: a prompt paired with actions the model can invoke with parameters (e.g., `getWeatherFor('90210')`), with results provided back to the model as new messages.

They identified two essential RAG patterns: including dynamic content to customize prompt behavior (such as voice and tone instructions from Conversation Designers, or support questions updatable by operations), and providing content relevant to individual conversations via agent-controlled searches. Using the model to craft search queries improved Knowledge Base search relevancy.
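A compressed sketch of that agent-with-tools loop, assuming a generic tool-calling interface, is shown below; the `search_knowledge_base` tool, its schema, and the stubbed `call_model` helper are illustrative rather than GoDaddy's actual integration.

```python
import json

# Illustrative tool definition: the model decides when to search and writes the query itself.
TOOLS = [{
    "name": "search_knowledge_base",
    "description": "Search support articles once the customer's problem is clear.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

def search_knowledge_base(query):
    # Placeholder for a vector-store lookup over support articles.
    return ["article snippet 1", "article snippet 2"]

def call_model(messages, tools):
    # Placeholder for the provider SDK. Returns either
    # {"content": "..."} or {"tool_call": {"name": "...", "arguments": "{...}"}}.
    return {"content": "stubbed reply"}

def agent_turn(messages):
    """One turn of the tool-using agent: answer directly, or search and then answer."""
    response = call_model(messages, tools=TOOLS)
    tool_call = response.get("tool_call")
    if tool_call is None:
        return response["content"]  # ordinary reply (e.g., still exchanging pleasantries)
    args = json.loads(tool_call["arguments"])
    snippets = search_knowledge_base(args["query"])
    # Feed the results back as a new message so the model can ground its next answer.
    messages.append({"role": "tool", "name": tool_call["name"],
                     "content": json.dumps(snippets)})
    return call_model(messages, tools=TOOLS)["content"]
```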
## Data Optimization with Sparse Priming Representations

Documents contain flowery language and redundant information that increases token usage and can hurt prediction performance. GoDaddy is refining content using Sparse Priming Representations (SPRs): having LLMs summarize document content into representations optimized for models. The SPR versions are stored in vector stores for RAG. Early tests show over a 50% reduction in token usage, though additional experiments are needed to confirm performance improvements.

They're also addressing the problem of similar content in knowledge bases: queries may return hundreds of documents covering the same topic. Given short model contexts, only a few documents can be used, and those few will likely be very similar, arbitrarily narrowing the knowledge space. They're experimenting with document clustering to bucket content and applying SPR to reduce each bucket to a single document, aiming to reduce duplication and widen the knowledge space.

## Testing Challenges

GoDaddy's final and most important lesson is that testing is often more difficult and labor-intensive than building the LLM integration itself. Minor prompt changes can significantly impact performance. Since natural language inputs are effectively infinite, automated tests beyond initial interactions are nearly impossible, and using LLMs to test other LLMs seems cost-prohibitive when running thousands of tests multiple times daily from CI pipelines.

Their recommendations include building reporting systems that aggregate LLM outputs for QA team review, and team swarming: having developers, writers, product managers, business analysts, and QA review transcripts together during the first few days after major releases. This multidisciplinary approach allows rapid detection and fixing of problems.

## Conclusion

This case study provides an unusually candid look at LLM operationalization challenges from a major tech company. The lessons span architecture (multi-agent patterns), reliability (guardrails, fallbacks), performance (latency, memory management), and process (testing, monitoring). While some specific metrics are provided (1% completion failures, 3-5 second latency, 50% token reduction with SPR), many insights are qualitative but grounded in real production experience. The emphasis on continuous testing, human oversight, and realistic expectations about current AI capabilities provides a balanced perspective valuable for any organization deploying LLMs at scale.
