Propel is developing a comprehensive evaluation framework for testing how well different LLMs handle SNAP (food stamps) benefit-related queries. The project aims to assess model accuracy, safety, and appropriateness in handling complex policy questions while balancing strict accuracy with practical user needs. They've built a testing infrastructure including a Slackbot called Hydra for comparing multiple LLM outputs, and plan to release their evaluation framework publicly to help improve AI models' performance on SNAP-related tasks.
Propel is a company that builds mobile applications and tools for participants in the SNAP (Supplemental Nutrition Assistance Program, commonly known as food stamps) program in the United States. Their Director of Emerging Technology, Dave Guarino, authored this detailed case study about building domain-specific LLM evaluations for assessing AI model performance on SNAP-related questions. This represents a thoughtful approach to LLMOps in a high-stakes domain where incorrect information could materially harm vulnerable users.
The case study is notable for its transparency about methodology and its explicit goal of sharing knowledge to help others in similar domains (healthcare, disability benefits, housing, legal aid) build their own evaluations. It’s presented as Part 1 of a series, meaning this covers the foundational thinking and early-stage exploration rather than production deployment results.
Propel operates in the safety net benefits domain, specifically focused on SNAP. This presents unique LLMOps challenges that differ from typical enterprise AI deployments:
Niche Domain Coverage: Unlike general reasoning, math, or programming, SNAP is unlikely to have significant coverage in existing AI model evaluations. The domain experts (government workers, legal aid attorneys, policy analysts) rarely intersect with AI researchers.
High Stakes for Users: SNAP participants are by definition lower-income Americans who depend on these benefits. Providing incorrect eligibility information, procedural guidance, or policy interpretation could lead to missed benefits, improper denials, or other material harms.
Complex Nuance: The SNAP domain involves federal law, federal regulations, state-level policy variations, and practical navigation advice that differs from formal policy. A technically accurate answer may not be practically helpful.
State Variability: With 50+ state-level implementations of federal SNAP policy, answers often depend on jurisdiction, creating a combinatorial challenge for evaluation.
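This jurisdictional variation suggests parameterizing eval cases by state. The sketch below is an illustrative structure under assumed names (the `EvalCase` class, the placeholder state codes, and the expected phrasings are all hypothetical, not Propel's published design):

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    """One eval case keyed by jurisdiction (illustrative structure)."""
    question: str
    state: str            # two-letter state code
    expected_facts: list  # substrings a good answer should contain

# The same question can carry different expected answers per state;
# the state codes and phrasings below are hypothetical, not verified policy.
cases = [
    EvalCase(
        question="Is there an asset limit for SNAP in my state?",
        state="XX",
        expected_facts=["no asset limit"],
    ),
    EvalCase(
        question="Is there an asset limit for SNAP in my state?",
        state="YY",
        expected_facts=["$2,750"],
    ),
]

def matches(response: str, case: EvalCase) -> bool:
    """Check that every expected fact appears in the model's response."""
    return all(fact.lower() in response.lower() for fact in case.expected_facts)
```

Keeping the state as an explicit field makes the combinatorial problem tractable: one question template can fan out into dozens of state-specific cases, each with its own expected facts.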
The case study articulates a clear philosophy: evaluations should be empirical tests of AI capabilities in specific domains, not vibe-based assessments. Guarino references Amanda Askell’s guidance on test-driven development for prompts: “You don’t write down a system prompt and find ways to test it. You write down tests and find a system prompt that passes them.”
Propel identifies two primary goals for their SNAP eval:
Model Selection and Routing: The eval enables comparison of foundation models (OpenAI, Anthropic, Google, and open-source alternatives like Llama and DeepSeek) specifically on SNAP performance, with practical implications for which model to use for which query.
Product Development Through Test-Driven Iteration: With a defined eval, Propel can rapidly test different implementation approaches:
The case study includes a compelling example: they hypothesize that providing both federal and state policy documents might actually confuse models and produce more errors than providing only state policy. An automated eval allows rapid testing of such hypotheses.
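A minimal sketch of how such a hypothesis could be tested automatically follows; the harness, the `llm` client, and the case format are assumptions for illustration, not Propel's actual code:

```python
# Sketch: run the same eval cases under two context configurations and
# compare pass rates. The llm client and case format are hypothetical.

def run_config(llm, cases, context_docs):
    """Run every case with the given documents prepended; return the pass rate."""
    passed = 0
    for case in cases:
        prompt = f"{context_docs}\n\nQuestion: {case['question']}"
        response = llm.give_prompt(prompt)
        if case["expected"] in response:
            passed += 1
    return passed / len(cases)

# The hypothesis from the case study, expressed as a comparison:
# rate_state_only = run_config(llm, cases, state_policy_docs)
# rate_both = run_config(llm, cases, state_policy_docs + federal_policy_docs)
# If rate_state_only > rate_both, the extra federal context is hurting accuracy.
```

The value of the eval is exactly this: a context-engineering question that would otherwise be settled by intuition becomes a measurable comparison between two pass rates.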
One of the most valuable parts of this case study is the detailed walkthrough of the SNAP asset limit question as an evaluation case. This exemplifies the kind of domain expertise that makes evaluations meaningful.
The question “What is the SNAP asset limit?” or “I have $10,000 in the bank, can I get SNAP?” has a technically correct answer ($2,750, or $4,250 for elderly/disabled households) that is practically misleading. Due to Broad-Based Categorical Eligibility (BBCE) policy options, only 13 states still apply asset limits. For most applicants, assets are irrelevant to eligibility.
Guarino argues that a good answer must include this nuance. This represents a higher-level evaluation principle: strict accuracy must be balanced against providing information that actually helps people navigate SNAP. A response that recites the federal asset limit without mentioning that most people face no asset limit fails this standard.
The case study includes screenshots comparing different models' outputs side by side.
The case study emphasizes starting with domain experts using models extensively before writing code. Guarino (self-described as a SNAP domain expert, though humble about the depth of his expertise) has been systematically querying multiple AI models and taking notes on output quality.
The exploration spans several categories of SNAP questions.
Propel built an internal tool called Hydra, a Slackbot that allows anyone in the organization to prompt multiple frontier language models simultaneously from within Slack. This lowers the barrier to model comparison and enables broader organizational participation in evaluation development.
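The core of a Hydra-style fan-out can be sketched in a few lines; the client callables and their signatures here are illustrative stand-ins, not Hydra's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(prompt, clients):
    """Send one prompt to several models concurrently.

    clients: dict mapping a model name to a callable that takes a prompt
    string and returns the model's response text (hypothetical interface).
    """
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(call, prompt) for name, call in clients.items()}
        return {name: future.result() for name, future in futures.items()}
```

In a Slackbot, each entry of the returned dict would be posted back into the thread, so non-engineers can compare outputs side by side without leaving Slack.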
The case study previews technical content planned for later parts of the series.
A simple code example is provided showing string-matching evaluation:
def test_snap_max_benefit_one_person():
    test_prompt = "What is the maximum SNAP benefit amount for 1 person?"
    llm_response = llm.give_prompt(test_prompt)
    if "$292" in llm_response:
        print("Test passed")
    else:
        print("Test failed")
However, Guarino acknowledges that more nuanced evaluations (like checking if a response substantively conveys that asset limits don’t apply in most states) require more sophisticated approaches, likely using LLMs as judges.
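One hedged sketch of such an LLM-as-judge check, using the asset-limit nuance as the rubric. The `judge` object is a hypothetical model client mirroring the `give_prompt` interface of the string-matching example above; the rubric wording is illustrative:

```python
# Sketch of an LLM-as-judge check for the asset-limit nuance.
# `judge` is a hypothetical model client with a give_prompt(str) -> str method.

JUDGE_PROMPT = """You are grading an answer about SNAP asset limits.
Reply PASS only if the response conveys BOTH of the following:
1. The federal asset limit figure ($2,750, or $4,250 for elderly/disabled households).
2. That most states have eliminated asset limits through BBCE,
   so most applicants face no asset test.
Reply with exactly PASS or FAIL.

Response to grade:
{response}"""

def judge_asset_limit_answer(judge, llm_response):
    """Return True if the judge model deems the response accurate and practically helpful."""
    verdict = judge.give_prompt(JUDGE_PROMPT.format(response=llm_response))
    return verdict.strip().upper().startswith("PASS")
```

The tradeoff is that the judge model itself becomes part of the eval's trusted base: its rubric must encode the same domain expertise the eval is meant to test.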
A notable aspect of this case study is Propel’s commitment to publishing portions of their SNAP eval publicly. The rationale is that if all AI models improve at SNAP knowledge through better evaluation coverage, that benefits everyone building products on top of these models.
This represents a collaborative approach to LLMOps in specialized domains: by making evaluation data public, domain experts can contribute to improving foundation models even without direct access to model training.
This case study provides valuable methodological guidance for domain-specific LLM evaluation, but it also has limitations worth noting: as Part 1 of a series, it documents foundational thinking and early-stage exploration rather than production deployment results or quantitative outcomes.
The case study represents a thoughtful, methodical approach to a challenging LLMOps problem in a high-stakes domain. It provides a useful template for other organizations working on domain-specific AI applications, particularly in areas affecting vulnerable populations where the cost of errors is significant.