GoDaddy set out to improve a product categorization system that used Meta Llama 2 to generate categories for 6 million products but suffered from incomplete or mislabeled categories and high costs. Working with AWS, they implemented a new solution built on Amazon Bedrock's batch inference capabilities with Claude and Llama 2 models, achieving 97% category coverage (exceeding their 90% target), 80% faster processing, and an 8% cost reduction while maintaining high-quality categorization as verified by subject matter experts.
GoDaddy, a major domain registrar and web hosting company serving over 21 million customers, sought to improve the product categorization system that generates metadata categories (such as color, material, category, subcategory, season, price range, product line, gender, and year of first sale) based solely on product names. Their existing solution, an out-of-the-box Meta Llama 2 model, produced incomplete or mislabeled categories across their catalog of 6 million products, and invoking the LLM once per product was proving cost-prohibitive at scale.
The collaboration with AWS’s Generative AI Innovation Center resulted in a batch inference solution using Amazon Bedrock that significantly improved accuracy, latency, and cost-effectiveness. This case study provides valuable insights into production LLM deployment patterns for high-volume categorization tasks.
The solution architecture leverages several AWS services working together in a streamlined pipeline. Product data is stored in JSONL format in Amazon S3, which triggers an AWS Lambda function when files are uploaded. This Lambda function initiates an Amazon Bedrock batch processing job, passing the S3 file location. The Amazon Bedrock endpoint reads the product name data, generates categorized output, and writes results to another S3 location. A second Lambda function monitors the batch processing job status and handles cleanup when processing completes.
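The trigger step of this pipeline can be sketched as a small Lambda handler. This is a hypothetical illustration, not code from the case study: bucket names, role ARN, and model ID are assumptions, and the actual Bedrock call is shown only in a comment.

```python
# Hypothetical sketch of the first Lambda in the pipeline: it fires on the
# S3 upload event and would start a Bedrock batch job for the uploaded file.

def input_uri_from_event(event: dict) -> str:
    """Recover the s3:// URI of the uploaded JSONL file from an S3 event."""
    s3 = event["Records"][0]["s3"]
    return f"s3://{s3['bucket']['name']}/{s3['object']['key']}"

def handler(event, context=None):
    input_uri = input_uri_from_event(event)
    # In production this would use boto3's Bedrock client, roughly:
    # bedrock.create_model_invocation_job(
    #     jobName="categorize-products",                              # assumed name
    #     roleArn="arn:aws:iam::123456789012:role/bedrock-batch",     # assumed role
    #     modelId="anthropic.claude-instant-v1",
    #     inputDataConfig={"s3InputDataConfig": {"s3Uri": input_uri}},
    #     outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://output-bucket/results/"}},
    # )
    return {"inputUri": input_uri}
```

The second Lambda would poll the returned job ARN and clean up once the job reaches a terminal state.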
The batch inference approach uses Amazon Bedrock’s CreateModelInvocationJob API, which accepts a JSONL input file and returns a job ARN for monitoring. The job progresses through states including Submitted, InProgress, Failed, Completed, or Stopped. Output files include both summary statistics (processed record count, success/error counts, token counts) and the actual model outputs with categorization results.
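Each line of that JSONL input pairs a `recordId` with a model-specific request body under `modelInput`. A minimal sketch of building the input file, with illustrative prompt wording and product names:

```python
import json

def make_record(record_id: str, prompt: str) -> str:
    """One JSONL line for a Bedrock batch inference input file."""
    return json.dumps({
        "recordId": record_id,
        "modelInput": {"prompt": prompt},  # body fields depend on the chosen model
    })

# Example: two hypothetical products, one record per line.
products = ["Red Ceramic Mug", "Linen Summer Dress"]
jsonl = "\n".join(
    make_record(f"rec-{i}", f"Categorize this product: {name}")
    for i, name in enumerate(products)
)
```

Bedrock echoes each `recordId` back in the output file, which is what lets the results be joined back to the source products.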
The team evaluated multiple models including Meta Llama 2 (13B and 70B variants) and Anthropic’s Claude (Instant and v2). This comparative analysis is particularly valuable for practitioners making model selection decisions.
For Llama 2 models, they used temperature 0.1, top_p 0.9, and max_gen_len of 2,048 tokens. For Anthropic’s Claude models, they used temperature 0.1, top_k 250, top_p 1, and max_tokens_to_sample of 4,096.
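Expressed as the request-body fragments each model family expects on Amazon Bedrock, those configurations look like this:

```python
# Inference parameters from the case study, keyed by the field names the
# respective Bedrock model request schemas use.
LLAMA2_PARAMS = {"temperature": 0.1, "top_p": 0.9, "max_gen_len": 2048}
CLAUDE_PARAMS = {"temperature": 0.1, "top_k": 250, "top_p": 1, "max_tokens_to_sample": 4096}
```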
The results clearly favored Anthropic’s Claude-Instant with zero-shot prompting as the best performer across latency, cost, and accuracy metrics. Claude-Instant achieved 98.5% parsing recall and 96.8% final content coverage on the 5,000-sample test set, processing the 5,000 products in approximately 12 minutes (723 seconds), well within GoDaddy’s requirement of one hour for 5,000 products.
Llama 2 models showed several limitations: they required full format instructions in zero-shot prompts, the 13B variant had worse content coverage and formatting issues (particularly with character escaping), and both variants showed performance instability when varying packing numbers. Claude models demonstrated better generalizability, requiring only partial format instructions and maintaining stable performance across different packing configurations.
A key optimization technique was “n-packing”: combining multiple SKU product names into a single LLM query so that the prompt instructions are shared across SKUs. This approach significantly reduced both cost and latency.
The team tested packing values of 1, 5, and 10 products per query. Llama 2 models have an output length limit of 2,048 tokens, fitting approximately 20 products maximum, while Claude models have higher limits allowing for even more aggressive packing.
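The packing step itself is simple batching of product names under one shared instruction block. A minimal sketch, with illustrative instruction text:

```python
def pack_prompts(product_names, n):
    """Yield one prompt per group of up to n products, sharing one
    instruction block across the whole group (the n-packing idea)."""
    instructions = (
        "For each numbered product name below, output a JSON object with its "
        "categories (color, material, category, subcategory, and so on)."
    )
    for start in range(0, len(product_names), n):
        group = product_names[start:start + n]
        numbered = "\n".join(f"{i + 1}. {name}" for i, name in enumerate(group))
        yield f"{instructions}\n\n{numbered}"
```

With 10-packing, the instruction tokens are paid for once per 10 products instead of once per product, which is where the cost and latency savings come from.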
The results showed that increasing from 1-packing to 10-packing reduced latency significantly for Claude models (from 45s to 27s for Claude-Instant with zero-shot on 20 test samples) while maintaining stable accuracy. Llama 2 showed clear performance drops and instability when packing numbers changed, indicating poorer generalizability.
When scaling from 5,000 to 100,000 samples, a 20x increase in data, only about 8x more computation time was needed; performing an individual LLM call per product instead would have increased inference time approximately 40x relative to batch processing.
The prompt engineering work was divided into two phases: output generation and format parsing.
For output generation, best practices included providing simple, clear, and complete instructions, using separator characters consistently (newline character), explicitly handling default output values to avoid N/A or missing values, and testing for good generalization on hold-out test sets.
Few-shot prompting (in-context learning) was tested with 0, 2, 5, and 10 examples. For Claude models, techniques included enclosing examples in XML tags, using Human and Assistant annotations, and guiding the assistant prompt. For Llama 2, examples were enclosed in [INST] tags.
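The Claude few-shot layout described above can be sketched roughly as follows. The tag names, instruction text, and guided-Assistant prefix here are illustrative assumptions, not the exact prompts from the case study:

```python
def build_claude_prompt(examples, query):
    """Assemble a Claude-style few-shot prompt: XML-tagged examples inside
    the Human turn, plus a pre-seeded Assistant prefix to guide the output."""
    shots = "\n".join(
        f"<example>\n<input>{inp}</input>\n<output>{out}</output>\n</example>"
        for inp, out in examples
    )
    return (
        f"\n\nHuman: Categorize the product name.\n{shots}\n"
        f"<input>{query}</input>\n\nAssistant: <output>"
    )
```

Ending the prompt inside the Assistant turn (here with an opening `<output>` tag) nudges the model to continue directly with the structured answer.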
For format parsing, the team used refinement techniques including role assumption (having the model act as a “Product Information Manager, Taxonomist, and Categorization Expert”), prompt specificity with detailed instructions, and explicit output format descriptions with JSON schema examples.
The team discovered that few-shot example formatting was critical: LLMs were sensitive to subtle formatting differences, and several iterations on the formatting substantially reduced parsing time. For Claude, a shorter pseudo-example JSON format was sufficient, while Llama 2 required the full JSON schema instruction.
Output parsing used LangChain’s PydanticOutputParser with a defined schema for the categorization fields. The team created a CCData class containing optional string fields for product_name, brand, color, material, price, category, sub_category, product_line, gender, year_of_first_sale, and season. Since n-packing was used, they wrapped the schema with a List type (List_of_CCData).
An OutputFixingParser was employed to handle situations where initial parsing attempts failed, providing resilience against formatting inconsistencies in model outputs.
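The schema plus fix-up parsing can be approximated with the standard library alone. The case study used LangChain's PydanticOutputParser and OutputFixingParser; the sketch below is a simplified stand-in whose fallback just extracts the first JSON array from noisy output:

```python
import json
import re
from dataclasses import dataclass, fields
from typing import List, Optional

@dataclass
class CCData:
    """Optional string fields per the case study's categorization schema."""
    product_name: Optional[str] = None
    brand: Optional[str] = None
    color: Optional[str] = None
    material: Optional[str] = None
    price: Optional[str] = None
    category: Optional[str] = None
    sub_category: Optional[str] = None
    product_line: Optional[str] = None
    gender: Optional[str] = None
    year_of_first_sale: Optional[str] = None
    season: Optional[str] = None

def parse_output(text: str) -> List[CCData]:
    """Parse a (possibly n-packed) model output into a list of CCData,
    with a crude fix-up pass if strict JSON parsing fails."""
    try:
        items = json.loads(text)
    except json.JSONDecodeError:
        match = re.search(r"\[.*\]", text, re.DOTALL)
        if not match:
            return []
        items = json.loads(match.group(0))
    keys = {f.name for f in fields(CCData)}
    return [CCData(**{k: v for k, v in d.items() if k in keys}) for d in items]
```

LangChain's OutputFixingParser goes further than this: on failure it sends the malformed output back to an LLM with instructions to repair it, rather than relying on a regex heuristic.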
The evaluation framework included five metrics: content coverage (measuring portions of missing values in output generation), parsing coverage (measuring portions of missing samples in format parsing), parsing recall on product name (exact match as lower bound for parsing completeness), parsing precision on product name, and final coverage (combining output generation and parsing steps).
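Two of these metrics can be sketched directly; the exact definition of a "missing" value below is an assumption for illustration:

```python
# Illustrative metric implementations: content coverage as the fraction of
# non-missing field values across parsed records, and parsing coverage as
# the fraction of submitted records that parsed at all.
MISSING = {None, "", "N/A"}

def content_coverage(records):
    """records: list of dicts mapping field name -> value (or None)."""
    values = [v for record in records for v in record.values()]
    if not values:
        return 0.0
    return sum(v not in MISSING for v in values) / len(values)

def parsing_coverage(num_parsed, num_submitted):
    return num_parsed / num_submitted if num_submitted else 0.0
```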
Human evaluation was also conducted by GoDaddy subject matter experts, focusing on holistic quality assessment including accuracy, relevance, and comprehensiveness of the generated categories.
The ground truth dataset consisted of 30 labeled data points (generated by llama2-7b and verified by human SMEs) and 100,000 unlabeled test data points. The team noted that some ground truth fields had N/A or missing values, which was suboptimal for GoDaddy’s downstream predictive modeling needs where higher coverage provides more business insights.
The final recommended solution (Claude-Instant with zero-shot) achieved 97% category coverage on both 5,000 and 100,000 test sets, exceeding GoDaddy’s 90% requirement. Processing 5,000 products took approximately 12 minutes (723 seconds), 80% faster than the 1-hour requirement. The solution was 8% more affordable than the existing Llama 2-13B proposal while providing 79% more coverage.
Scaling from 5,000 to 100,000 samples showed that coverage accuracy remained stable and cost scaled approximately linearly. Throughput worked out to approximately 2 seconds per product, enabling near-real-time inference.
The team observed that using separate LLMs for different steps could be advantageous - Llama 2 didn’t perform well in format parsing but was relatively capable in output generation. For fair comparison, they required the same LLM in both calls, but noted that mixing models (Llama 2 for generation, Claude for parsing) could be suitable for certain use cases.
Format parsing was significantly improved through prompt engineering, particularly JSON format instructions. For Claude-Instant, latency was reduced by approximately 77% (from 90 seconds to 20 seconds) through prompt optimization, eliminating the need for a JSON fine-tuned model variant.
The case study concludes with recommendations for future work including: dataset enhancement with more ground truth examples and normalized product names, increased human evaluations, fine-tuning exploration when more training data becomes available, automatic prompt engineering techniques, dynamic few-shot selection, and knowledge integration through connecting LLMs to internal or external databases to reduce hallucinations.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
This panel discussion features three AI-native companies—Delphi (personal AI profiles), Seam AI (sales/marketing automation agents), and APIsec (API security testing)—discussing their journeys building production LLM systems over three years. The companies address infrastructure evolution from single-shot prompting to fully agentic systems, the shift toward serverless and scalable architectures, managing costs at scale (including burning through a trillion OpenAI tokens), balancing deterministic workflows with model autonomy, and measuring ROI through outcome-based metrics rather than traditional productivity gains. Key technical themes include moving away from opinionated architectures to let models reason autonomously, implementing state machines for high-confidence decisions, using tools like Pydantic AI and Logfire for instrumentation, and leveraging Pinecone for vector search at scale.
Coinbase, a cryptocurrency exchange serving millions of users across 100+ countries, faced challenges scaling customer support amid volatile market conditions, managing complex compliance investigations, and improving developer productivity. They built a comprehensive Gen AI platform integrating multiple LLMs through standardized interfaces (OpenAI API, Model Context Protocol) on AWS Bedrock to address these challenges. Their solution includes AI-powered chatbots handling 65% of customer contacts automatically (saving ~5 million employee hours annually), compliance investigation tools that synthesize data from multiple sources to accelerate case resolution, and developer productivity tools where 40% of daily code is now AI-generated or influenced. The implementation uses a multi-layered agentic architecture with RAG, guardrails, memory systems, and human-in-the-loop workflows, resulting in significant cost savings, faster resolution times, and improved quality across all three domains.