LyricLens, developed by Music Smatch, is a production AI system that extracts semantic meaning, themes, entities, cultural references, and sentiment from music lyrics at scale. The platform analyzes over 11 million songs using Amazon Bedrock's Nova family of foundation models to provide real-time insights for brands, artists, developers, and content moderators. By migrating from a previous provider to Amazon Nova models, Music Smatch achieved over 30% cost savings while maintaining accuracy, processing over 2.5 billion tokens. The system employs a multi-level semantic engine with knowledge graphs, supports content moderation with granular PG ratings, and enables natural language queries for playlist generation and trend analysis across demographics, genres, and time periods.
LyricLens represents a comprehensive production deployment of large language models for analyzing music lyrics at scale. Developed by Music Smatch, one of Italy’s leading scale-ups, the platform demonstrates sophisticated LLMOps practices including model selection, evaluation frameworks, prompt optimization, and cost management. The presentation was delivered by Eduardo Randazzo from AWS and Bruno Zambolin, Director of Innovation at Music Smatch, at an AWS event.
The core problem LyricLens addresses is extracting deep semantic meaning from music lyrics beyond simple text analysis. Music serves as a universal cultural language that connects generations, geographies, and identities, and the lyrics contain rich information about themes, moods, cultural references, social movements, and emotional states. By analyzing millions of songs using AI and correlating this data with listening trends and demographics, LyricLens enables actionable insights for multiple stakeholders: brands wanting to speak the language of their communities, developers building music applications, and artists understanding their audience’s emotional pulse.
The platform processes over 11 million songs (a number that continues to grow daily) through a multi-level semantic engine. At its core, LyricLens employs Amazon Bedrock's Nova family of foundation models, adopted for this use case after a thorough evaluation period. The selection of Amazon Nova models was driven by several key factors that align with production LLMOps requirements.
Amazon Bedrock provides a fully serverless infrastructure where Music Smatch doesn't need to manage underlying hardware or scaling concerns. The Nova model family includes several variants optimized for different use cases: Nova Micro (text-only, fastest and most economical), Nova Lite (multimodal with a 300K token context window), Nova Pro (higher accuracy with a 300K context window, comparable to Claude 3.5 Sonnet), and Nova Premier (highest accuracy with a 1M token context window, comparable to Claude 3.7 Sonnet).
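This tiering naturally suggests a routing layer that maps each analysis task to the cheapest model that can handle it. A minimal sketch of such a router is below; the model IDs follow Amazon Bedrock's public naming for Nova, but the routing rules themselves are illustrative, not Music Smatch's actual production configuration.

```python
# Hypothetical routing of analysis tasks to Nova tiers by complexity.
# The mapping logic is an illustration, not Music Smatch's configuration.
NOVA_TIERS = {
    "micro": "amazon.nova-micro-v1:0",      # text-only, fastest and cheapest
    "lite": "amazon.nova-lite-v1:0",        # multimodal, 300K context
    "pro": "amazon.nova-pro-v1:0",          # higher accuracy, 300K context
    "premier": "amazon.nova-premier-v1:0",  # highest accuracy, 1M context
}

def pick_model(task_complexity: str, needs_long_context: bool) -> str:
    """Map a task to a Nova model ID based on complexity and context needs."""
    if task_complexity == "high" and needs_long_context:
        return NOVA_TIERS["premier"]
    if task_complexity == "high":
        return NOVA_TIERS["pro"]
    if task_complexity == "medium":
        return NOVA_TIERS["lite"]
    return NOVA_TIERS["micro"]
```

With eight use cases of varying complexity (as described later), a router like this keeps the per-call model choice explicit and auditable.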
Music Smatch selected Amazon Nova Pro as one of the primary models for song analysis, indicating they prioritized the balance between accuracy and cost for complex document analysis tasks. The long context windows (300K to 1M tokens) proved particularly valuable, as demonstrated in the presentation where the entire Dante’s Inferno (all 34 cantos) was processed in a single request to extract specific information about Ulysses references.
A critical capability highlighted was Amazon Nova’s native video understanding without requiring frame sampling or pre-processing, though this appears less central to the LyricLens use case focused primarily on text analysis. More relevant was the multimodal capability to handle text with rich metadata.
One of the more sophisticated LLMOps techniques employed is model distillation, which addresses a common production challenge: balancing accuracy with cost and latency. The presentation explained how Music Smatch could use larger "teacher" models (like Nova Premier or Nova Pro) to transfer knowledge to smaller "student" models (like Nova Lite or Nova Micro) for specific use cases.
The distillation process involves providing the teacher model with comprehensive information about a specific use case, then using its outputs to train or guide the student model. This approach allows the system to achieve the accuracy of more performant models while maintaining the cost efficiency and response times of smaller models. For a platform processing billions of tokens, this optimization strategy is critical for sustainable scaling.
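The data-preparation side of this process can be sketched simply: run the teacher model over representative inputs and collect prompt/completion pairs to supervise the student. The sketch below assumes a JSON-lines record shape commonly used for fine-tuning jobs; the prompt wording and `teacher_analyze` callable are hypothetical stand-ins, not Music Smatch's actual pipeline.

```python
import json

def build_distillation_records(lyrics_batch, teacher_analyze):
    """Collect teacher-model outputs as supervised examples for a student.

    `teacher_analyze` stands in for a call to a larger model (e.g. Nova Pro);
    each record pairs the prompt with the teacher's answer, in a JSON-lines
    shape commonly used for fine-tuning jobs. Field names are illustrative.
    """
    records = []
    for lyrics in lyrics_batch:
        prompt = f"Extract themes, entities, and mood from these lyrics:\n{lyrics}"
        records.append({"prompt": prompt, "completion": teacher_analyze(prompt)})
    return "\n".join(json.dumps(r) for r in records)
```

At billions of tokens, even a modest shift of traffic from teacher to distilled student compounds into significant savings.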
The presentation noted that as Nova models continue to evolve, there’s potential to migrate certain workloads to even lower-tier models, further reducing costs, or to employ model distillation more extensively to improve accuracy of smaller models while maintaining cost advantages.
LyricLens functions as a multi-level semantic engine that performs several analytical tasks on each song:
The system scans lyrics and decomposes language into individual words, connecting them through a knowledge graph structure. This graph-based approach enables sophisticated relationship mapping between concepts, entities, and themes. For each song, the platform extracts metadata fields such as entities, themes, moods, sentiment, cultural references, and PG ratings.
The granularity of extraction is remarkable. Rather than simple binary classifications (explicit/clean), the system provides approximately 1,000 parameters for content rating, identifying specific types of sensitive content. This level of detail enables precise filtering: for example, creating playlists of "low-tempo Italian rap without profanity" or "pop hits about feminism from 2024."
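The interface between LLM extraction and graph storage can be sketched as a simple transformation from the model's structured JSON output into (subject, predicate, object) edges. The field names below mirror the metadata mentioned in the talk, but the schema itself is an assumption for illustration; the actual graph design was not disclosed.

```python
import json

def to_triples(song_id: str, analysis_json: str):
    """Turn one song's extracted metadata into knowledge-graph edges.

    Field names (themes, entities, moods, pg_rating) follow the metadata
    described in the presentation; the edge labels are illustrative.
    """
    data = json.loads(analysis_json)
    triples = []
    for theme in data.get("themes", []):
        triples.append((song_id, "HAS_THEME", theme))
    for entity in data.get("entities", []):
        triples.append((song_id, "MENTIONS", entity))
    for mood in data.get("moods", []):
        triples.append((song_id, "HAS_MOOD", mood))
    if "pg_rating" in data:
        triples.append((song_id, "RATED", data["pg_rating"]))
    return triples
```

Keeping this mapping explicit makes the graph schema easy to version as the extraction prompts evolve.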
One of the most interesting LLMOps aspects of the case study is the evaluation methodology employed during the model migration. When Music Smatch decided to evaluate Amazon Nova models as potential replacements for their existing provider, they needed a robust way to assess whether the new models could maintain or improve upon existing accuracy.
Rather than manually annotating thousands of lyrics to create ground truth data (a time-consuming and expensive process), Music Smatch employed an "LLM as judge" methodology. This approach involved:
Multi-model consensus building: Three frontier models were used as judges, each instructed to generate JSON outputs containing all the metadata fields LyricLens extracts (entities, themes, moods, PG ratings, etc.). The outputs from these three models were aggregated using a consensus algorithm to establish ground truth.
Tiered consensus approach: The evaluation dataset (tens of thousands of lyrics) was divided into three tiers according to how strongly the automated judges agreed on each example.
For Tier 3 cases where automated judges couldn’t reach consensus, Music Smatch leveraged their large community of users to conduct human reviews and surveys. This hybrid approach balanced automation speed with human judgment for edge cases.
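The consensus-and-escalation logic can be sketched as a per-field majority vote across the three judges. The talk did not detail the exact tier criteria, so the rule below (tier 1 = unanimous, tier 2 = majority, tier 3 = no agreement, escalated to humans) is an assumption about how such a scheme typically works.

```python
from collections import Counter

def consensus_tier(judge_labels):
    """Aggregate three judges' labels for one metadata field.

    Returns (label, tier): tier 1 = unanimous, tier 2 = majority,
    tier 3 = no agreement (label None, escalated to human review).
    The tier criteria are an assumption, not Music Smatch's exact rule.
    """
    counts = Counter(judge_labels)
    label, top = counts.most_common(1)[0]
    if top == len(judge_labels):
        return label, 1
    if top > len(judge_labels) // 2:
        return label, 2
    return None, 3
```

Only tier-3 disagreements reach the community review step, which keeps the human workload to a small fraction of the dataset.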
The presentation showed a confusion matrix demonstrating very high quality results, with some variance in PG rating edge cases (content on the boundary between rating levels). However, the system proved highly reliable for safety-critical applications - if creating content for children, filtering to PG level 2 and above provided strong guarantees.
Comparative evaluation: Beyond ground truth validation, Music Smatch used a more performant LLM as a judge to compare outputs from their previous provider’s models against Amazon Nova models. With similar input prompts, they could verify that outputs were comparable in quality, giving confidence to proceed with migration.
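A common refinement of this pairwise-judging setup, sketched below, is to query the judge in both presentation orders to reduce position bias; this swap technique is standard LLM-as-judge practice, not something the talk attributed to Music Smatch. The `judge` callable is a hypothetical stand-in for a call to the stronger model.

```python
def pairwise_verdict(judge, prompt, output_a, output_b):
    """Compare two providers' outputs for the same prompt with a judge model.

    `judge(prompt, first, second)` returns "first" or "second". Querying in
    both orders reduces position bias; inconsistent verdicts are treated as
    a tie (i.e., comparable quality).
    """
    v1 = judge(prompt, output_a, output_b)  # A shown first
    v2 = judge(prompt, output_b, output_a)  # B shown first
    if v1 == "first" and v2 == "second":
        return "A"
    if v1 == "second" and v2 == "first":
        return "B"
    return "tie"
```

For a migration decision, a high rate of ties is itself the desired signal: it indicates the new provider's outputs are of comparable quality.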
This evaluation framework demonstrates mature LLMOps practices, recognizing that model assessment for complex semantic tasks requires sophisticated approaches beyond simple accuracy metrics.
The migration to Amazon Nova models required significant prompt rewriting work, despite the relative simplicity of API integration. Each model family has specific characteristics, and the team needed to optimize prompts to leverage Nova’s particular strengths.
Amazon Bedrock’s prompt enhancer service played a crucial role in this process, helping create optimized prompts for Music Smatch’s specific use cases. The presentation covered eight total use cases - two of high complexity and six of medium to simple complexity. Each required its own prompt engineering work to map appropriately to the selected Nova model variants.
The prompt optimization process involved understanding how Nova models interpret instructions, format outputs (particularly structured JSON for metadata extraction), and handle the long context windows for processing complete song catalogs with rich metadata.
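One plausible shape for such a structured-extraction prompt is sketched below. The fields match the metadata LyricLens reportedly produces, but the wording, schema, and rating scale are illustrative assumptions, not Music Smatch's production prompt.

```python
def build_extraction_prompt(lyrics: str) -> str:
    """Build a structured-JSON extraction prompt for song lyrics.

    The instruction to emit only JSON, with an explicit key list, is a
    common pattern for reliable downstream parsing; the exact schema
    here is hypothetical.
    """
    return (
        "Analyze the song lyrics below and respond with only a JSON object "
        "containing these keys:\n"
        '  "themes": list of strings\n'
        '  "entities": list of strings\n'
        '  "moods": list of strings\n'
        '  "pg_rating": integer\n'
        '  "meaning": one-sentence summary of author intent\n'
        "Do not include any text outside the JSON object.\n\n"
        f"Lyrics:\n{lyrics}"
    )
```

Pinning the output schema in the prompt is what makes the later graph-ingestion step deterministic to parse.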
A key production feature of LyricLens is real-time semantic search and filtering across the 11+ million song database. Users can express queries in natural language with ChatGPT-like simplicity and receive results in seconds. The knowledge graph architecture enables virtually infinite combinations of filters across themes, moods, genres, languages, time periods, and content ratings.
The system supports aggregation queries to discover patterns: “which musical genres discuss sustainability most?” or “which years saw Italian rap use more regional slang versus English borrowings?” or “how does explicitness differ between American, French, and Italian rap?”
This query flexibility is enabled by the detailed semantic indexing created through LLM processing, stored in the knowledge graph structure for efficient retrieval, and the presentation demonstrated it through several worked examples.
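The combinable-filter idea can be sketched against an in-memory song index; in production this presumably compiles to graph queries, but the backing store and field names below are illustrative assumptions. A query like "pop hits about feminism from 2024" maps to a handful of keyword filters.

```python
def filter_songs(index, *, theme=None, genre=None, language=None,
                 max_pg=None, year_from=None):
    """Apply combinable semantic filters to a song index.

    `index` is a list of dicts standing in for the knowledge graph;
    field names and the PG scale are illustrative.
    """
    results = []
    for song in index:
        if theme and theme not in song.get("themes", []):
            continue
        if genre and song.get("genre") != genre:
            continue
        if language and song.get("language") != language:
            continue
        if max_pg is not None and song.get("pg_rating", 0) > max_pg:
            continue
        if year_from and song.get("year", 0) < year_from:
            continue
        results.append(song["id"])
    return results
```

The natural-language layer then only needs to translate a user query into these structured filter arguments.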
The presentation outlined multiple production applications for LyricLens:
Editorial and content insights: Media companies can analyze top charts to understand cultural trends. For example, analyzing the top 50 US songs might reveal extensive name-dropping of other artists and celebrities (legitimization strategy), frequent luxury brand and automobile mentions (status and ostentation themes), and references to beverages, drugs, and specific territories.
Trend analysis and forecasting: The system tracks temporal patterns, showing how themes evolve weekly or monthly. One example showed “freedom” spiking during summer months, while “love” peaked in February around Valentine’s Day, then returned to baseline. Brands can use this for campaign timing decisions based on historical consistency of trends.
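The temporal aggregation behind examples like "freedom peaks in summer" reduces to counting theme occurrences per period. The input shape below ((month, themes) pairs) is an illustrative assumption about how pre-extracted metadata might be rolled up.

```python
from collections import defaultdict

def theme_trend(song_records, theme):
    """Count how often a theme appears in songs per month.

    `song_records` is an iterable of (month, themes) pairs drawn from the
    pre-extracted metadata; the record shape is illustrative.
    """
    counts = defaultdict(int)
    for month, themes in song_records:
        if theme in themes:
            counts[month] += 1
    return dict(counts)
```

Comparing these per-month counts year over year is what lets brands judge whether a seasonal spike is historically consistent enough to time a campaign around.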
Content moderation at scale: Platforms distributing music need to protect audiences and manage reputational risk. LyricLens goes far beyond simple explicit/clean flags, providing granular topic analysis for violence, drugs, self-harm, profanity with current slang, religious content, and harassment. This enables precise filtering for different audience segments.
Playlist generation: The demonstration showed an API-based application for creating dynamic, theme-based playlists for children - “Space Adventures,” “Superheroes,” “Jungle World,” “Lullabies” - with automatic weekly updates, filtering out inappropriate content based on detailed PG ratings.
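Combining granular moderation with theme matching, a children's playlist builder can be sketched as below. The flag categories are a handful of the roughly 1,000 parameters mentioned earlier, and the zero-tolerance threshold semantics are an assumption for illustration.

```python
def safe_for_children(song_flags, blocked=("violence", "drugs", "profanity",
                                           "self_harm", "harassment")):
    """Granular moderation check: inspect per-category ratings rather than
    a binary explicit flag. Categories and thresholds are illustrative."""
    return all(song_flags.get(category, 0) == 0 for category in blocked)

def kids_playlist(candidates, theme):
    """Build a theme-based children's playlist, as in the 'Space Adventures'
    demo: match the theme, then drop anything with flagged content."""
    return [song["id"] for song in candidates
            if theme in song.get("themes", [])
            and safe_for_children(song.get("flags", {}))]
```

Because the flags are per-category, the same pipeline can serve different audience segments by simply changing the `blocked` tuple or thresholds.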
Enhanced listening experiences: A mobile application demo showed real-time lyric analysis as songs play. Using Spotify Connect integration, the app displays semantic insights overlaid on lyrics: content ratings, main themes, song meanings (author intent), and entity explanations with Wikipedia integration for cultural references users might not recognize.
Brand intelligence: Companies can discover when and how they’re mentioned in music, understanding which genres and regions cite them most, informing marketing and cultural positioning strategies.
Conversational agent interface: Music Smatch is developing a chat-based agent interface to LyricLens, allowing users to interact conversationally. Examples included asking for information about specific songs (“Apple” by Charli XCX), getting songwriter credits and publishing information, or requesting brand-specific playlists with natural language filters (“songs mentioning Gucci from 2020 onwards”).
The project timeline spanned approximately three months from initial discovery through implementation and evaluation. The phased deployment approach started with simpler use cases before gradually rolling out more complex analytical tasks. This de-risks production deployment and allows for incremental validation.
Key production metrics highlighted in the presentation include the 11+ million songs analyzed, the more than 2.5 billion tokens processed, and cost savings of over 30% compared to the previous provider.
The 30%+ cost reduction is particularly notable for an LLMOps case study. The presentation attributed this to Amazon Nova models being approximately 4x less expensive than comparable models from other providers on Bedrock, while maintaining similar accuracy levels. For applications processing millions of data points with millions of users, cost becomes a fundamental scaling constraint, and this optimization enables sustainable growth.
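The arithmetic behind such a claim is worth making concrete. The sketch below uses placeholder per-token prices, not actual Bedrock pricing; note that a 4x per-token price gap does not automatically translate into 4x overall savings, since blended workloads, different model tiers, and migration effort all narrow the realized figure toward the reported 30%+.

```python
def migration_savings(tokens, old_price_per_1k, new_price_per_1k):
    """Worked cost comparison for a token volume across two providers.

    Prices are hypothetical placeholders; returns (old_cost, new_cost,
    fractional savings).
    """
    old_cost = tokens / 1000 * old_price_per_1k
    new_cost = tokens / 1000 * new_price_per_1k
    return old_cost, new_cost, 1 - new_cost / old_cost
```

For example, 2.5 billion tokens at a hypothetical $0.004 vs. $0.001 per 1K tokens would cost $10,000 vs. $2,500, a 75% per-token saving before blended-workload effects.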
Music Smatch indicated ongoing work to reduce costs further through two strategies: migrating additional workloads to lower-tier models as they improve, and employing model distillation more extensively to improve smaller model accuracy while maintaining cost advantages.
A critical LLMOps practice highlighted was incorporating human feedback into the production loop. While automated evaluation using LLM judges provided confidence in model performance, the team recognized that true quality assessment requires validation that outputs are valuable not just to other LLMs but to actual end users.
The production system includes human reviewers who evaluate outputs to ensure results meet user needs. This feedback loop, combined with the community review process for Tier 3 consensus failures in the evaluation framework, demonstrates a mature approach to maintaining quality in production LLM applications where automated metrics alone are insufficient.
While the presentation showcases an impressive production deployment, several aspects warrant balanced consideration:
Evaluation methodology assumptions: The LLM-as-judge approach assumes that frontier models provide reliable ground truth. However, different LLMs may have systematic biases or blind spots. The three-model consensus approach mitigates this somewhat, but doesn’t eliminate the risk that all three models might share similar limitations compared to human judgment. The Tier 3 escalation to human community review is a good practice that acknowledges this limitation.
Cost claims: The 30%+ cost savings claim is significant but depends on the baseline comparison. The presentation indicates Music Smatch was previously using another provider, but doesn’t specify which models or pricing. Cost comparisons can vary significantly based on usage patterns, context lengths, and optimization strategies. Additionally, the development effort for prompt rewriting and model migration has its own costs not captured in the operational savings metric.
Model selection transparency: While the presentation mentions eight use cases with different complexity levels and indicates Nova Pro as one primary model, it doesn’t provide complete transparency about which specific models handle which tasks, or what the full model ensemble looks like in production. This makes it difficult to fully assess the architecture choices.
Knowledge graph details: The presentation frequently mentions the knowledge graph as a core component for storing semantic relationships and enabling complex queries, but provides limited technical detail about the graph schema, update mechanisms, or how LLM outputs are transformed into graph structures. The interface between LLM analysis and graph storage is a critical LLMOps concern not fully explored.
Accuracy vs. speed tradeoffs: While cost and accuracy are discussed, there’s less detail about latency characteristics and real-time requirements. Processing 11+ million songs suggests extensive batch processing, but real-time query response for user applications implies cached/pre-computed results rather than on-demand LLM inference. The architecture for balancing these modes isn’t fully detailed.
Multilingual considerations: The platform appears to handle multiple languages (Italian, English, French rap were mentioned), but the presentation doesn’t address how well Nova models perform across languages, whether language-specific prompts or models are used, or how evaluation was conducted for non-English content.
Prompt enhancer as solution: The presentation mentions using Amazon Bedrock’s prompt enhancer to optimize prompts for Nova models, positioning this as helpful tooling. However, heavy reliance on automated prompt optimization can sometimes obscure the need for deeper understanding of model behavior and might not always produce optimal results for highly specialized domains like music lyric analysis.
Despite these considerations, LyricLens represents a sophisticated production LLM deployment demonstrating mature LLMOps practices including systematic evaluation, cost optimization through model selection and distillation strategies, prompt engineering, human feedback integration, and real-world scaling to billions of tokens across millions of documents. The migration from an existing provider to Amazon Nova models, while achieving cost savings and maintaining quality, showcases the kind of model comparison and transition work that production LLMOps teams increasingly face as the LLM landscape evolves.