A lead data scientist at Mastercard presents a comprehensive approach to implementing LLMs in production by focusing on linguistic features rather than just metrics. The case study demonstrates how understanding and implementing linguistic principles (syntax, morphology, semantics, pragmatics, and phonetics) can significantly improve LLM performance. A practical example showed how using pragmatic instruction with Falcon 7B and the guidance framework improved biology question answering accuracy from 35% to 85% while drastically reducing inference time compared to vanilla ChatGPT.
This case study is derived from a conference presentation by Chris Brousseau, a lead data scientist at Mastercard, discussing “linguistically informed LLMs.” The presentation offers a distinctive perspective on improving LLM performance by grounding model development and deployment in linguistic theory. Brousseau frames this as practical guidance for teams working with LLMs in production, drawing on a book he is co-authoring with Matthew Sharp about clean code for data scientists and the intersection of LLMOps and MLOps.
The core thesis is that LLMs are solving for language (not just mathematics or statistics), and therefore practitioners should leverage linguistic knowledge across five key dimensions: syntax, morphology, semantics, pragmatics, and phonetics. Brousseau uses an extended metaphor comparing LLM development to growing and maintaining a beard—emphasizing the importance of having clear goals, understanding growth phases, and potentially seeking expert guidance rather than just optimizing for metrics blindly.
Brousseau identifies a common mistake in LLM development: teams optimizing for metrics (precision, recall, F1 scores) without having clear goals in mind. He argues that if your KPIs are your goals, you are either in a late stage of development (which is fine) or you don’t actually know where you’re going. This is particularly problematic with LLMs because, unlike hairstyles, it’s much harder to “trim back” and course-correct once you’ve gone in the wrong direction.
For production LLM systems, this insight has significant implications. Teams need to define what success looks like for their specific use case before diving into model selection, training, or fine-tuning. At Mastercard, for example, the team works with financial language that “doesn’t really change all that fast,” which affects their considerations for model longevity and maintenance cadence.
Brousseau expresses the opinion that LLMs have essentially “solved” syntax through their implementation of transformational generative grammar (referencing Chomsky’s work). Modern LLMs can generate infinitely varied combinations while maintaining grammatical structure. This is considered a relatively solved problem in the context of LLM capabilities.
Morphology, by contrast, is estimated to be only about 75-80% solved and still presents significant challenges. The presentation highlights several tokenization problems that affect production LLM systems:
The Yeet Problem: Statistical tokenization methodologies like Byte Pair Encoding (BPE), SentencePiece, or the ChatGPT encoding determine token boundaries based on frequency rather than linguistic morpheme boundaries. This creates issues particularly with numbers and arithmetic. Brousseau cites the example of Goat 7B outperforming GPT-4 (1.7 trillion parameters) on arithmetic tasks precisely because GPT-4’s statistical tokenization groups commonly co-occurring numbers together, making it difficult for the model to “see” mathematical problems correctly.
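The frequency-driven merging behind this problem can be sketched in a few lines. This is a toy BPE trainer, not any production tokenizer, and the corpus of year-like strings is invented for illustration: because “19” co-occurs constantly, it gets fused into a single token, regardless of whether that grouping helps arithmetic.

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Learn byte-pair merges purely by pair frequency (no linguistic knowledge)."""
    # Each word starts as a sequence of single characters.
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair wins
        merges.append(best)
        merged = best[0] + best[1]
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

# A made-up corpus dominated by years: "1" followed by "9" is the most
# frequent pair, so the very first merge fuses the digits "19".
corpus = ["1999", "1998", "1914", "1945", "1929"] * 20
merges = bpe_merges(corpus, 1)
print(merges)  # [('1', '9')]
```

A model seeing “19” as one opaque token has a harder time treating it as the digits 1 and 9 when asked to do arithmetic, which is the mechanism behind the Goat 7B result cited above.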
The word “yeet” (popularized on Vine in 2014) illustrates how new words emerge. English has predictable sets of sounds and letters that can appear together, and these phonotactic constraints change much more slowly than vocabulary. Understanding these constraints can improve tokenization strategies.
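A toy version of a phonotactic check might look like the following. The onset inventory here is a deliberately partial, letter-based sample (real phonotactics operates on sounds, not spelling), but it shows why “yeet” is a plausible English word while a cluster like “ngft” is not:

```python
# Partial sample of legal English word-initial consonant clusters (onsets).
# This inventory changes far more slowly than the vocabulary itself.
VOWELS = set("aeiou")
ALLOWED_ONSETS = {"", "b", "bl", "br", "c", "ch", "d", "dr", "f", "fl",
                  "g", "gr", "h", "j", "k", "l", "m", "n", "p", "pl",
                  "pr", "r", "s", "sh", "sk", "sl", "sp", "st", "str",
                  "t", "th", "tr", "v", "w", "y", "z"}

def plausible_english_onset(word):
    """Return True if the word's initial consonant cluster is a known onset."""
    onset = ""
    for ch in word.lower():
        if ch in VOWELS:
            break
        onset += ch
    return onset in ALLOWED_ONSETS

print(plausible_english_onset("yeet"))  # True  -- new word, legal shape
print(plausible_english_onset("ngft"))  # False -- violates the constraints
```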
The Kimono Problem: When tokenizing borrowed words, models may incorrectly identify morpheme boundaries. “Kimono” might be split into “ki” and “mono” because “mono” is a recognizable morpheme in English (meaning “one”), but this is linguistically incorrect as kimono is a borrowed Japanese word with different morpheme structure. This highlights how tokenization that treats language as existing in a vacuum can lead to problems.
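A minimal sketch of how the false split arises, assuming a made-up English-centric subword vocabulary and longest-match-first segmentation (the strategy WordPiece-style tokenizers use):

```python
def greedy_segment(word, vocab):
    """Longest-match-first subword segmentation; single characters are
    always accepted as a fallback so segmentation never fails."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

# "mono" is a frequent English morpheme, so it lands in the vocabulary;
# segmentation then splits the Japanese loanword at a false boundary.
vocab = {"ki", "mono", "mon", "no"}
print(greedy_segment("kimono", vocab))  # ['ki', 'mono']
```

The split looks reasonable to an English-trained vocabulary but has nothing to do with the word’s actual morphology.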
Brousseau notes that multilingual models consistently outperform monolingual models on the same tasks, attributing this to better tokenization that comes from exposure to multiple languages and their different morphological patterns.
The presentation addresses a fundamental challenge with semantic meaning and model maintenance over time. Dictionaries are not authorities on word meaning but rather “snapshots in time of popular usage.” Major dictionaries (Dictionary.com, Merriam-Webster, Oxford English Dictionary) all have weekly updates to their corpora and yearly hard updates to represent current language usage.
For LLMOps, this raises critical questions about model longevity: if dictionaries need weekly corpus updates and yearly hard updates just to track popular usage, how quickly does a model trained on a fixed snapshot of language go stale, and what retraining or update cadence does a production deployment actually need?
For specialized domains like Mastercard’s financial applications, the language may change more slowly, potentially reducing the urgency of continuous semantic updates. This is an important consideration for production deployment strategies and maintenance schedules.
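The presentation does not prescribe a monitoring method, but one simple way to operationalize this concern is to compare a word’s embedding across model snapshots and flag large shifts. The sketch below is a minimal illustration of that idea; the 3-dimensional embeddings and the 0.8 threshold are fabricated stand-ins:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def drift_report(old_emb, new_emb, threshold=0.8):
    """Flag words whose embedding moved substantially between snapshots."""
    return {w: cosine(old_emb[w], new_emb[w])
            for w in old_emb
            if w in new_emb and cosine(old_emb[w], new_emb[w]) < threshold}

# Hypothetical embeddings from two training snapshots: the financial term
# is stable, while the slang term has shifted meaning.
old = {"charge": [0.9, 0.1, 0.0], "yeet": [0.1, 0.9, 0.0]}
new = {"charge": [0.88, 0.12, 0.0], "yeet": [0.0, 0.2, 0.95]}
flagged = drift_report(old, new)
print(sorted(flagged))  # ['yeet']
```

For a slow-moving domain like financial language, such a report would stay quiet for long stretches, supporting a less frequent update cadence.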
This section provides the most concrete production results in the presentation. Pragmatics refers to meaning derived from context rather than literal word definitions. Brousseau demonstrates the power of pragmatic instruction using a biology question-answering benchmark.
Baseline (Vanilla ChatGPT): 7 of 20 biology questions answered correctly (35% accuracy), with comparatively slow inference.
Optimized Approach (Falcon 7B Instruct with Guidance): 17 of 20 questions answered correctly (85% accuracy), with drastically reduced inference time.
This represents a dramatic improvement in both accuracy and speed while using a much smaller model (7B parameters vs. GPT-4’s 1.7T parameters). The key insight is that pragmatic context—providing the model with rules, structure, and guidance for interpretation—can unlock knowledge that may already exist in the model but isn’t easily accessible through naive prompting.
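The core mechanism can be shown in miniature. The guidance library’s real API runs against an actual model; the hedged sketch below instead uses a stub scorer to illustrate the idea of constrained decoding: the model only scores the legal continuations, so the output is always one of the allowed answers rather than free-form text. The question, choices, and scoring function are all invented:

```python
def constrained_answer(score_fn, question, choices):
    """Score each allowed continuation and return the best one.
    score_fn stands in for a real LM's log-likelihood."""
    scored = {c: score_fn(question + " Answer: " + c) for c in choices}
    return max(scored, key=scored.get)

# Stub scorer: pretend the model assigns higher likelihood to "ribosome".
def toy_score(text):
    return 1.0 if "ribosome" in text else 0.1

q = "Which organelle synthesizes proteins?"
choices = ["nucleus", "ribosome", "mitochondrion", "golgi apparatus"]
print(constrained_answer(toy_score, q, choices))  # ribosome
```

Restricting generation to the answer set is one reason a small instruct-tuned model can both answer faster and score higher than naive free-form prompting: the model’s latent knowledge only needs to rank four options, not produce and format a correct sentence.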
Brousseau recommends that teams adopt tools like Guidance and techniques like Chain of Thought prompting to apply pragmatic instruction at inference time.
He expresses the opinion that pragmatic instruction at inference time is “one of the things that I’m really looking to explode in the next little while” and strongly recommends adoption if teams aren’t already using these techniques.
While the video demonstrations didn’t play during the presentation, Brousseau discussed the challenges of preserving phonetic information in language models. He uses the sentence “I never said I loved you” to illustrate how emphasis on different words completely changes meaning—information that is lost when reducing speech to text.
The presentation compared approaches to handling speech: text-only pipelines, which discard prosodic information such as emphasis, versus speech-aware architectures that preserve it.
This has implications for multimodal LLM applications and suggests that production systems dealing with speech should consider architectures that preserve phonetic information rather than reducing everything to text.
The overarching message is that teams should have clear, linguistically-informed goals before optimizing for standard ML metrics. Understanding what linguistic capabilities your application requires (heavy syntax manipulation? semantic precision? pragmatic reasoning?) should drive architecture and model selection decisions.
The dictionary problem highlights that language is constantly evolving, and production LLM systems need maintenance strategies. For some domains (like Mastercard’s financial language), change may be slow enough that less frequent updates suffice. For others, continuous updating may be necessary. This should be factored into operational planning and budgets.
The tokenization problems discussed suggest that off-the-shelf tokenization may not be optimal for all use cases. Teams should consider whether custom tokenization, morphologically informed vocabularies, or multilingual training data would better fit their domain.
The dramatic improvements demonstrated with Guidance and Chain of Thought prompting suggest significant untapped potential in inference-time optimization. This is particularly relevant for LLMOps because these techniques require no retraining, can be applied to much smaller and cheaper models, and, in the demonstrated case, improved both accuracy and inference speed.
For applications involving speech or audio, the discussion of phonetic preservation suggests that text-only pipelines may lose important information. Production systems should consider whether speech-to-speech or phonetic-aware architectures are more appropriate than pure text-based approaches.
It’s worth noting that this presentation is primarily conceptual and educational rather than a detailed production case study. While Brousseau works at Mastercard, specific details about their production LLM deployments are not provided. The biology question example appears to be a demonstration rather than a Mastercard production system. The claims about performance improvements (7/20 to 17/20) are impressive but would benefit from more rigorous benchmarking across multiple runs and datasets.
The linguistic framework presented is valuable for thinking about LLM capabilities, but the degree to which teams can operationalize these insights in practice will vary. Some suggestions (like custom tokenization or multilingual training) may be out of reach for teams using commercial APIs or pre-trained models.
Nevertheless, the presentation offers useful heuristics for LLMOps practitioners: think about what linguistic capabilities you need, set goals beyond metrics, leverage pragmatic instruction at inference time, and consider the long-term maintenance implications of your model choices.