LyricLens, developed by Music Smatch, is a production AI system that extracts semantic meaning, themes, entities, cultural references, and sentiment from music lyrics at scale. The platform analyzes over 11 million songs using Amazon Bedrock's Nova family of foundation models to provide real-time insights for brands, artists, developers, and content moderators. By migrating from a previous provider to Amazon Nova models, Music Smatch achieved over 30% cost savings while maintaining accuracy, processing over 2.5 billion tokens. The system employs a multi-level semantic engine with knowledge graphs, supports content moderation with granular PG ratings, and enables natural language queries for playlist generation and trend analysis across demographics, genres, and time periods.
LyricLens represents a comprehensive production deployment of large language models for analyzing music lyrics at scale. Developed by Music Smatch, one of Italy’s leading scale-ups, the platform demonstrates sophisticated LLMOps practices including model selection, evaluation frameworks, prompt optimization, and cost management. The presentation was delivered by Eduardo Randazzo from AWS and Bruno Zambolin, Director of Innovation at Music Smatch, at an AWS event.
The core problem LyricLens addresses is extracting deep semantic meaning from music lyrics beyond simple text analysis. Music serves as a universal cultural language that connects generations, geographies, and identities, and the lyrics contain rich information about themes, moods, cultural references, social movements, and emotional states. By analyzing millions of songs using AI and correlating this data with listening trends and demographics, LyricLens enables actionable insights for multiple stakeholders: brands wanting to speak the language of their communities, developers building music applications, and artists understanding their audience’s emotional pulse.
The platform processes over 11 million songs (a number that continues to grow daily) through a multi-level semantic engine. At its core, LyricLens employs Amazon Bedrock's Nova family of foundation models, adopted for this use case after a thorough evaluation period. The selection of Amazon Nova models was driven by several key factors that align with production LLMOps requirements.
Amazon Bedrock provides a fully serverless infrastructure where Music Smatch doesn't need to manage underlying hardware or scaling concerns. The Nova model family includes several variants optimized for different use cases: Nova Micro (text-only, fastest and most economical), Nova Lite (multimodal with a 300K token context window), Nova Pro (higher accuracy with a 300K context window, comparable to Claude 3.5 Sonnet), and Nova Premier (highest accuracy with a 1M token context window, comparable to Claude 3.7 Sonnet).
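This tiering naturally suggests a routing layer that maps each analysis task to the cheapest model that can handle it. A minimal sketch of such a router is below; the model IDs follow Amazon Bedrock's public naming for Nova, but the routing rules themselves are illustrative, not Music Smatch's actual production configuration.

```python
# Hypothetical routing of analysis tasks to Nova tiers by complexity.
# The mapping logic is an illustration, not Music Smatch's configuration.
NOVA_TIERS = {
    "micro": "amazon.nova-micro-v1:0",      # text-only, fastest and cheapest
    "lite": "amazon.nova-lite-v1:0",        # multimodal, 300K context
    "pro": "amazon.nova-pro-v1:0",          # higher accuracy, 300K context
    "premier": "amazon.nova-premier-v1:0",  # highest accuracy, 1M context
}

def pick_model(task_complexity: str, needs_long_context: bool) -> str:
    """Map a task to a Nova model ID based on complexity and context needs."""
    if task_complexity == "high" and needs_long_context:
        return NOVA_TIERS["premier"]
    if task_complexity == "high":
        return NOVA_TIERS["pro"]
    if task_complexity == "medium":
        return NOVA_TIERS["lite"]
    return NOVA_TIERS["micro"]
```

With eight use cases of varying complexity (as described later), a router like this keeps the per-call model choice explicit and auditable.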
Music Smatch selected Amazon Nova Pro as one of the primary models for song analysis, indicating they prioritized the balance between accuracy and cost for complex document analysis tasks. The long context windows (300K to 1M tokens) proved particularly valuable, as demonstrated in the presentation where the entire Dante’s Inferno (all 34 cantos) was processed in a single request to extract specific information about Ulysses references.
A critical capability highlighted was Amazon Nova’s native video understanding without requiring frame sampling or pre-processing, though this appears less central to the LyricLens use case focused primarily on text analysis. More relevant was the multimodal capability to handle text with rich metadata.
One of the more sophisticated LLMOps techniques employed is model distillation, which addresses a common production challenge: balancing accuracy with cost and latency. The presentation explained how Music Smatch could use larger "teacher" models (like Nova Premier or Nova Pro) to transfer knowledge to smaller "student" models (like Nova Lite or Nova Micro) for specific use cases.
The distillation process involves providing the teacher model with comprehensive information about a specific use case, then using its outputs to train or guide the student model. This approach allows the system to achieve the accuracy of more performant models while maintaining the cost efficiency and response times of smaller models. For a platform processing billions of tokens, this optimization strategy is critical for sustainable scaling.
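The data-preparation side of this process can be sketched simply: run the teacher model over representative inputs and collect prompt/completion pairs to supervise the student. The sketch below assumes a JSON-lines record shape commonly used for fine-tuning jobs; the prompt wording and `teacher_analyze` callable are hypothetical stand-ins, not Music Smatch's actual pipeline.

```python
import json

def build_distillation_records(lyrics_batch, teacher_analyze):
    """Collect teacher-model outputs as supervised examples for a student.

    `teacher_analyze` stands in for a call to a larger model (e.g. Nova Pro);
    each record pairs the prompt with the teacher's answer, in a JSON-lines
    shape commonly used for fine-tuning jobs. Field names are illustrative.
    """
    records = []
    for lyrics in lyrics_batch:
        prompt = f"Extract themes, entities, and mood from these lyrics:\n{lyrics}"
        records.append({"prompt": prompt, "completion": teacher_analyze(prompt)})
    return "\n".join(json.dumps(r) for r in records)
```

At billions of tokens, even a modest shift of traffic from teacher to distilled student compounds into significant savings.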
The presentation noted that as Nova models continue to evolve, there’s potential to migrate certain workloads to even lower-tier models, further reducing costs, or to employ model distillation more extensively to improve accuracy of smaller models while maintaining cost advantages.
LyricLens functions as a multi-level semantic engine that performs several analytical tasks on each song:
The system scans lyrics and decomposes language into individual words, connecting them through a knowledge graph structure. This graph-based approach enables sophisticated relationship mapping between concepts, entities, and themes. For each song, the platform extracts metadata fields such as entities, themes, moods, sentiment, cultural references, and PG ratings.
The granularity of extraction is remarkable. Rather than simple binary classifications (explicit/clean), the system provides approximately 1,000 parameters for content rating, identifying specific types of sensitive content. This level of detail enables precise filtering: for example, creating playlists of "low-tempo Italian rap without profanity" or "pop hits about feminism from 2024."
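The interface between LLM extraction and graph storage can be sketched as a simple transformation from the model's structured JSON output into (subject, predicate, object) edges. The field names below mirror the metadata mentioned in the talk, but the schema itself is an assumption for illustration; the actual graph design was not disclosed.

```python
import json

def to_triples(song_id: str, analysis_json: str):
    """Turn one song's extracted metadata into knowledge-graph edges.

    Field names (themes, entities, moods, pg_rating) follow the metadata
    described in the presentation; the edge labels are illustrative.
    """
    data = json.loads(analysis_json)
    triples = []
    for theme in data.get("themes", []):
        triples.append((song_id, "HAS_THEME", theme))
    for entity in data.get("entities", []):
        triples.append((song_id, "MENTIONS", entity))
    for mood in data.get("moods", []):
        triples.append((song_id, "HAS_MOOD", mood))
    if "pg_rating" in data:
        triples.append((song_id, "RATED", data["pg_rating"]))
    return triples
```

Keeping this mapping explicit makes the graph schema easy to version as the extraction prompts evolve.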
One of the most interesting LLMOps aspects of the case study is the evaluation methodology employed during the model migration. When Music Smatch decided to evaluate Amazon Nova models as potential replacements for their existing provider, they needed a robust way to assess whether the new models could maintain or improve upon existing accuracy.
Rather than manually annotating thousands of lyrics to create ground truth data (a time-consuming and expensive process), Music Smatch employed an "LLM as judge" methodology. This approach involved:
Multi-model consensus building: Three frontier models were used as judges, each instructed to generate JSON outputs containing all the metadata fields LyricLens extracts (entities, themes, moods, PG ratings, etc.). The outputs from these three models were aggregated using a consensus algorithm to establish ground truth.
Tiered consensus approach: The evaluation dataset (tens of thousands of lyrics) was divided into three tiers according to how strongly the automated judges agreed on each example.
For Tier 3 cases where automated judges couldn’t reach consensus, Music Smatch leveraged their large community of users to conduct human reviews and surveys. This hybrid approach balanced automation speed with human judgment for edge cases.
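The consensus-and-escalation logic can be sketched as a per-field majority vote across the three judges. The talk did not detail the exact tier criteria, so the rule below (tier 1 = unanimous, tier 2 = majority, tier 3 = no agreement, escalated to humans) is an assumption about how such a scheme typically works.

```python
from collections import Counter

def consensus_tier(judge_labels):
    """Aggregate three judges' labels for one metadata field.

    Returns (label, tier): tier 1 = unanimous, tier 2 = majority,
    tier 3 = no agreement (label None, escalated to human review).
    The tier criteria are an assumption, not Music Smatch's exact rule.
    """
    counts = Counter(judge_labels)
    label, top = counts.most_common(1)[0]
    if top == len(judge_labels):
        return label, 1
    if top > len(judge_labels) // 2:
        return label, 2
    return None, 3
```

Only tier-3 disagreements reach the community review step, which keeps the human workload to a small fraction of the dataset.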
The presentation showed a confusion matrix demonstrating very high quality results, with some variance in PG rating edge cases (content on the boundary between rating levels). However, the system proved highly reliable for safety-critical applications - if creating content for children, filtering to PG level 2 and above provided strong guarantees.
Comparative evaluation: Beyond ground truth validation, Music Smatch used a more performant LLM as a judge to compare outputs from their previous provider’s models against Amazon Nova models. With similar input prompts, they could verify that outputs were comparable in quality, giving confidence to proceed with migration.
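A common refinement of this pairwise-judging setup, sketched below, is to query the judge in both presentation orders to reduce position bias; this swap technique is standard LLM-as-judge practice, not something the talk attributed to Music Smatch. The `judge` callable is a hypothetical stand-in for a call to the stronger model.

```python
def pairwise_verdict(judge, prompt, output_a, output_b):
    """Compare two providers' outputs for the same prompt with a judge model.

    `judge(prompt, first, second)` returns "first" or "second". Querying in
    both orders reduces position bias; inconsistent verdicts are treated as
    a tie (i.e., comparable quality).
    """
    v1 = judge(prompt, output_a, output_b)  # A shown first
    v2 = judge(prompt, output_b, output_a)  # B shown first
    if v1 == "first" and v2 == "second":
        return "A"
    if v1 == "second" and v2 == "first":
        return "B"
    return "tie"
```

For a migration decision, a high rate of ties is itself the desired signal: it indicates the new provider's outputs are of comparable quality.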
This evaluation framework demonstrates mature LLMOps practices, recognizing that model assessment for complex semantic tasks requires sophisticated approaches beyond simple accuracy metrics.
The migration to Amazon Nova models required significant prompt rewriting work, despite the relative simplicity of API integration. Each model family has specific characteristics, and the team needed to optimize prompts to leverage Nova’s particular strengths.
Amazon Bedrock’s prompt enhancer service played a crucial role in this process, helping create optimized prompts for Music Smatch’s specific use cases. The presentation covered eight total use cases - two of high complexity and six of medium to simple complexity. Each required its own prompt engineering work to map appropriately to the selected Nova model variants.
The prompt optimization process involved understanding how Nova models interpret instructions, format outputs (particularly structured JSON for metadata extraction), and handle the long context windows for processing complete song catalogs with rich metadata.
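One plausible shape for such a structured-extraction prompt is sketched below. The fields match the metadata LyricLens reportedly produces, but the wording, schema, and rating scale are illustrative assumptions, not Music Smatch's production prompt.

```python
def build_extraction_prompt(lyrics: str) -> str:
    """Build a structured-JSON extraction prompt for song lyrics.

    The instruction to emit only JSON, with an explicit key list, is a
    common pattern for reliable downstream parsing; the exact schema
    here is hypothetical.
    """
    return (
        "Analyze the song lyrics below and respond with only a JSON object "
        "containing these keys:\n"
        '  "themes": list of strings\n'
        '  "entities": list of strings\n'
        '  "moods": list of strings\n'
        '  "pg_rating": integer\n'
        '  "meaning": one-sentence summary of author intent\n'
        "Do not include any text outside the JSON object.\n\n"
        f"Lyrics:\n{lyrics}"
    )
```

Pinning the output schema in the prompt is what makes the later graph-ingestion step deterministic to parse.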
A key production feature of LyricLens is real-time semantic search and filtering across the 11+ million song database. Users can express queries in natural language with ChatGPT-like simplicity and receive results in seconds. The knowledge graph architecture enables virtually infinite combinations of filters across themes, moods, genres, languages, time periods, and content ratings.
The system supports aggregation queries to discover patterns: “which musical genres discuss sustainability most?” or “which years saw Italian rap use more regional slang versus English borrowings?” or “how does explicitness differ between American, French, and Italian rap?”
This query flexibility is enabled by the detailed semantic indexing created through LLM processing, stored in the knowledge graph structure for efficient retrieval, and the presentation demonstrated it through several worked examples.
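The combinable-filter idea can be sketched against an in-memory song index; in production this presumably compiles to graph queries, but the backing store and field names below are illustrative assumptions. A query like "pop hits about feminism from 2024" maps to a handful of keyword filters.

```python
def filter_songs(index, *, theme=None, genre=None, language=None,
                 max_pg=None, year_from=None):
    """Apply combinable semantic filters to a song index.

    `index` is a list of dicts standing in for the knowledge graph;
    field names and the PG scale are illustrative.
    """
    results = []
    for song in index:
        if theme and theme not in song.get("themes", []):
            continue
        if genre and song.get("genre") != genre:
            continue
        if language and song.get("language") != language:
            continue
        if max_pg is not None and song.get("pg_rating", 0) > max_pg:
            continue
        if year_from and song.get("year", 0) < year_from:
            continue
        results.append(song["id"])
    return results
```

The natural-language layer then only needs to translate a user query into these structured filter arguments.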
The presentation outlined multiple production applications for LyricLens:
Editorial and content insights: Media companies can analyze top charts to understand cultural trends. For example, analyzing the top 50 US songs might reveal extensive name-dropping of other artists and celebrities (legitimization strategy), frequent luxury brand and automobile mentions (status and ostentation themes), and references to beverages, drugs, and specific territories.
Trend analysis and forecasting: The system tracks temporal patterns, showing how themes evolve weekly or monthly. One example showed “freedom” spiking during summer months, while “love” peaked in February around Valentine’s Day, then returned to baseline. Brands can use this for campaign timing decisions based on historical consistency of trends.
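The temporal aggregation behind examples like "freedom peaks in summer" reduces to counting theme occurrences per period. The input shape below ((month, themes) pairs) is an illustrative assumption about how pre-extracted metadata might be rolled up.

```python
from collections import defaultdict

def theme_trend(song_records, theme):
    """Count how often a theme appears in songs per month.

    `song_records` is an iterable of (month, themes) pairs drawn from the
    pre-extracted metadata; the record shape is illustrative.
    """
    counts = defaultdict(int)
    for month, themes in song_records:
        if theme in themes:
            counts[month] += 1
    return dict(counts)
```

Comparing these per-month counts year over year is what lets brands judge whether a seasonal spike is historically consistent enough to time a campaign around.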
Content moderation at scale: Platforms distributing music need to protect audiences and manage reputational risk. LyricLens goes far beyond simple explicit/clean flags, providing granular topic analysis for violence, drugs, self-harm, profanity with current slang, religious content, and harassment. This enables precise filtering for different audience segments.
Playlist generation: The demonstration showed an API-based application for creating dynamic, theme-based playlists for children - “Space Adventures,” “Superheroes,” “Jungle World,” “Lullabies” - with automatic weekly updates, filtering out inappropriate content based on detailed PG ratings.
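Combining granular moderation with theme matching, a children's playlist builder can be sketched as below. The flag categories are a handful of the roughly 1,000 parameters mentioned earlier, and the zero-tolerance threshold semantics are an assumption for illustration.

```python
def safe_for_children(song_flags, blocked=("violence", "drugs", "profanity",
                                           "self_harm", "harassment")):
    """Granular moderation check: inspect per-category ratings rather than
    a binary explicit flag. Categories and thresholds are illustrative."""
    return all(song_flags.get(category, 0) == 0 for category in blocked)

def kids_playlist(candidates, theme):
    """Build a theme-based children's playlist, as in the 'Space Adventures'
    demo: match the theme, then drop anything with flagged content."""
    return [song["id"] for song in candidates
            if theme in song.get("themes", [])
            and safe_for_children(song.get("flags", {}))]
```

Because the flags are per-category, the same pipeline can serve different audience segments by simply changing the `blocked` tuple or thresholds.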
Enhanced listening experiences: A mobile application demo showed real-time lyric analysis as songs play. Using Spotify Connect integration, the app displays semantic insights overlaid on lyrics: content ratings, main themes, song meanings (author intent), and entity explanations with Wikipedia integration for cultural references users might not recognize.
Brand intelligence: Companies can discover when and how they’re mentioned in music, understanding which genres and regions cite them most, informing marketing and cultural positioning strategies.
Conversational agent interface: Music Smatch is developing a chat-based agent interface to LyricLens, allowing users to interact conversationally. Examples included asking for information about specific songs (“Apple” by Charli XCX), getting songwriter credits and publishing information, or requesting brand-specific playlists with natural language filters (“songs mentioning Gucci from 2020 onwards”).
The project timeline spanned approximately three months from initial discovery through implementation and evaluation. The phased deployment approach started with simpler use cases before gradually rolling out more complex analytical tasks. This de-risks production deployment and allows for incremental validation.
Key production metrics highlighted in the presentation include the 11+ million songs analyzed, the more than 2.5 billion tokens processed, and cost savings of over 30% compared to the previous provider.
The 30%+ cost reduction is particularly notable for an LLMOps case study. The presentation attributed this to Amazon Nova models being approximately 4x less expensive than comparable models from other providers on Bedrock, while maintaining similar accuracy levels. For applications processing millions of data points with millions of users, cost becomes a fundamental scaling constraint, and this optimization enables sustainable growth.
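The arithmetic behind such a claim is worth making concrete. The sketch below uses placeholder per-token prices, not actual Bedrock pricing; note that a 4x per-token price gap does not automatically translate into 4x overall savings, since blended workloads, different model tiers, and migration effort all narrow the realized figure toward the reported 30%+.

```python
def migration_savings(tokens, old_price_per_1k, new_price_per_1k):
    """Worked cost comparison for a token volume across two providers.

    Prices are hypothetical placeholders; returns (old_cost, new_cost,
    fractional savings).
    """
    old_cost = tokens / 1000 * old_price_per_1k
    new_cost = tokens / 1000 * new_price_per_1k
    return old_cost, new_cost, 1 - new_cost / old_cost
```

For example, 2.5 billion tokens at a hypothetical $0.004 vs. $0.001 per 1K tokens would cost $10,000 vs. $2,500, a 75% per-token saving before blended-workload effects.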
Music Smatch indicated ongoing work to reduce costs further through two strategies: migrating additional workloads to lower-tier models as they improve, and employing model distillation more extensively to improve smaller model accuracy while maintaining cost advantages.
A critical LLMOps practice highlighted was incorporating human feedback into the production loop. While automated evaluation using LLM judges provided confidence in model performance, the team recognized that true quality assessment requires validation that outputs are valuable not just to other LLMs but to actual end users.
The production system includes human reviewers who evaluate outputs to ensure results meet user needs. This feedback loop, combined with the community review process for Tier 3 consensus failures in the evaluation framework, demonstrates a mature approach to maintaining quality in production LLM applications where automated metrics alone are insufficient.
While the presentation showcases an impressive production deployment, several aspects warrant balanced consideration:
Evaluation methodology assumptions: The LLM-as-judge approach assumes that frontier models provide reliable ground truth. However, different LLMs may have systematic biases or blind spots. The three-model consensus approach mitigates this somewhat, but doesn’t eliminate the risk that all three models might share similar limitations compared to human judgment. The Tier 3 escalation to human community review is a good practice that acknowledges this limitation.
Cost claims: The 30%+ cost savings claim is significant but depends on the baseline comparison. The presentation indicates Music Smatch was previously using another provider, but doesn’t specify which models or pricing. Cost comparisons can vary significantly based on usage patterns, context lengths, and optimization strategies. Additionally, the development effort for prompt rewriting and model migration has its own costs not captured in the operational savings metric.
Model selection transparency: While the presentation mentions eight use cases with different complexity levels and indicates Nova Pro as one primary model, it doesn’t provide complete transparency about which specific models handle which tasks, or what the full model ensemble looks like in production. This makes it difficult to fully assess the architecture choices.
Knowledge graph details: The presentation frequently mentions the knowledge graph as a core component for storing semantic relationships and enabling complex queries, but provides limited technical detail about the graph schema, update mechanisms, or how LLM outputs are transformed into graph structures. The interface between LLM analysis and graph storage is a critical LLMOps concern not fully explored.
Accuracy vs. speed tradeoffs: While cost and accuracy are discussed, there’s less detail about latency characteristics and real-time requirements. Processing 11+ million songs suggests extensive batch processing, but real-time query response for user applications implies cached/pre-computed results rather than on-demand LLM inference. The architecture for balancing these modes isn’t fully detailed.
Multilingual considerations: The platform appears to handle multiple languages (Italian, English, French rap were mentioned), but the presentation doesn’t address how well Nova models perform across languages, whether language-specific prompts or models are used, or how evaluation was conducted for non-English content.
Prompt enhancer as solution: The presentation mentions using Amazon Bedrock’s prompt enhancer to optimize prompts for Nova models, positioning this as helpful tooling. However, heavy reliance on automated prompt optimization can sometimes obscure the need for deeper understanding of model behavior and might not always produce optimal results for highly specialized domains like music lyric analysis.
Despite these considerations, LyricLens represents a sophisticated production LLM deployment demonstrating mature LLMOps practices including systematic evaluation, cost optimization through model selection and distillation strategies, prompt engineering, human feedback integration, and real-world scaling to billions of tokens across millions of documents. The migration from an existing provider to Amazon Nova models, while achieving cost savings and maintaining quality, showcases the kind of model comparison and transition work that production LLMOps teams increasingly face as the LLM landscape evolves.