ZenML

Building a Systematic SNAP Benefits LLM Evaluation Framework

Propel 2025

Propel is developing a comprehensive evaluation framework for testing how well different LLMs handle SNAP (food stamps) benefit-related queries. The project aims to assess model accuracy, safety, and appropriateness in handling complex policy questions while balancing strict accuracy with practical user needs. They've built a testing infrastructure including a Slackbot called Hydra for comparing multiple LLM outputs, and plan to release their evaluation framework publicly to help improve AI models' performance on SNAP-related tasks.

Industry: Government

Overview

Propel is a company that builds mobile applications and tools for participants in the SNAP (Supplemental Nutrition Assistance Program, commonly known as food stamps) program in the United States. Their Director of Emerging Technology, Dave Guarino, authored this detailed case study about building domain-specific LLM evaluations for assessing AI model performance on SNAP-related questions. This represents a thoughtful approach to LLMOps in a high-stakes domain where incorrect information could materially harm vulnerable users.

The case study is notable for its transparency about methodology and its explicit goal of sharing knowledge to help others in similar domains (healthcare, disability benefits, housing, legal aid) build their own evaluations. It’s presented as Part 1 of a series, meaning this covers the foundational thinking and early-stage exploration rather than production deployment results.

The Problem Space

Propel operates in the safety-net benefits domain, specifically SNAP. This presents LLMOps challenges that differ from typical enterprise AI deployments: incorrect information can materially harm vulnerable users, and strictly accurate answers are not always the most useful ones.

Evaluation Philosophy and Approach

The case study articulates a clear philosophy: evaluations should be empirical tests of AI capabilities in specific domains, not vibe-based assessments. Guarino references Amanda Askell’s guidance on test-driven development for prompts: “You don’t write down a system prompt and find ways to test it. You write down tests and find a system prompt that passes them.”
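The quoted test-first approach can be sketched in code. This is a hypothetical illustration, not Propel's implementation: `ask` is a stand-in for a real model call, and the candidate system prompts and canned responses are invented for the example.

```python
# Hypothetical sketch of "write tests first, then find a system prompt that
# passes them". `ask` stands in for a real LLM call.
def ask(system_prompt: str, question: str) -> str:
    canned = {
        "You are a SNAP assistant. Be brief.":
            "The federal SNAP asset limit is $2,750.",
        "You are a SNAP assistant. Always note state-level variation.":
            "The federal limit is $2,750, but most states have "
            "no asset limit due to BBCE.",
    }
    return canned[system_prompt]

# The tests are fixed first; system prompts are searched to satisfy them.
TESTS = [
    ("What is the SNAP asset limit?",
     lambda r: "$2,750" in r and "no asset limit" in r),
]

def passes_all(system_prompt: str) -> bool:
    return all(check(ask(system_prompt, q)) for q, check in TESTS)

candidates = [
    "You are a SNAP assistant. Be brief.",
    "You are a SNAP assistant. Always note state-level variation.",
]
winners = [p for p in candidates if passes_all(p)]
```

The search loop is trivial here, but the inversion is the point: the test suite, not the prompt, is the stable artifact.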

Dual Purpose of Evaluations

Propel identifies two primary goals for their SNAP eval:

Model Selection and Routing: The eval enables comparison of foundation models (OpenAI, Anthropic, Google, and open-source alternatives like Llama and Deepseek) specifically on SNAP performance, which has practical implications for which models Propel deploys and how queries are routed.

Product Development Through Test-Driven Iteration: With a defined eval, Propel can rapidly test different implementation approaches, such as changes to prompts or to the policy documents supplied as context.

The case study includes a compelling example: they hypothesize that providing both federal and state policy documents might actually confuse models and produce more errors than providing only state policy. An automated eval allows rapid testing of such hypotheses.
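A harness for that hypothesis might look like the sketch below. Everything here is illustrative: `run_eval` is a stub returning made-up placeholder scores, while a real version would run every eval case through the model with the given documents in context and score the answers.

```python
# Hypothetical harness for the "federal + state context hurts accuracy"
# hypothesis. Scores are illustrative placeholders, not real results.
def run_eval(context_docs: tuple[str, ...]) -> float:
    illustrative_scores = {
        ("state_policy.pdf",): 0.91,
        ("state_policy.pdf", "federal_policy.pdf"): 0.84,
    }
    return illustrative_scores[context_docs]

configs = {
    "state_only": ("state_policy.pdf",),
    "state_plus_federal": ("state_policy.pdf", "federal_policy.pdf"),
}
results = {name: run_eval(docs) for name, docs in configs.items()}
```

Once the eval is automated, comparing context configurations reduces to comparing entries in `results`.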

The Asset Limit “Sniff Test”

One of the most valuable parts of this case study is the detailed walkthrough of the SNAP asset limit question as an evaluation case. This exemplifies the kind of domain expertise that makes evaluations meaningful.

The question “What is the SNAP asset limit?” or “I have $10,000 in the bank, can I get SNAP?” has a technically correct answer ($2,750, or $4,250 for elderly/disabled households) that is practically misleading. Due to Broad-Based Categorical Eligibility (BBCE) policy options, only 13 states still apply asset limits. For most applicants, assets are irrelevant to eligibility.

Guarino argues that a good answer must include this nuance. This represents a higher-level evaluation principle: strict accuracy must be balanced against providing information that actually helps people navigate SNAP. A response that recites the federal asset limit without mentioning that most people face no asset limit fails this standard.
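One way to encode that standard as data is to pair each question with the facts a good answer must convey, rather than a single expected string. The schema and the keyword grader below are illustrative assumptions, not Propel's actual implementation.

```python
# Illustrative eval case encoding the asset-limit "sniff test": a passing
# answer must convey the BBCE nuance, not just the federal number.
ASSET_LIMIT_CASE = {
    "prompt": "I have $10,000 in the bank, can I get SNAP?",
    "must_convey": "most states apply no asset limit (BBCE)",
    "reference": "federal limit: $2,750, or $4,250 for elderly/disabled",
}

def grade(response: str) -> bool:
    # Naive keyword check as a placeholder; judging whether a response
    # *substantively* conveys the nuance likely needs an LLM judge.
    r = response.lower()
    return "bbce" in r or "no asset limit" in r

good = "Federally it's $2,750, but most states have no asset limit under BBCE."
bad = "The SNAP asset limit is $2,750."
```

Under this rubric, `bad` fails even though it recites the technically correct federal figure.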

The case study includes screenshots comparing how different models answer this question.

Technical Implementation Details

Current Stage: Expert-Driven Exploration

The case study emphasizes starting with domain experts using models extensively before writing code. Guarino (self-described as a SNAP domain expert, though humble about the depth of his expertise) has been systematically querying multiple AI models and taking notes on output quality.

The exploration covers the dimensions described above: accuracy, safety, and the appropriateness of responses to complex policy questions.

Internal Tooling: Hydra Slackbot

Propel built an internal tool called Hydra, a Slackbot that allows anyone in the organization to prompt multiple frontier language models simultaneously from within Slack. This lowers the barrier to model comparison and enables broader organizational participation in evaluation development.
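The core Hydra pattern can be sketched as a concurrent fan-out. This is a minimal assumption-laden sketch: `call_model` stands in for per-provider API calls, the model names are placeholders, and a real bot would also handle Slack events and post the results back.

```python
# Minimal sketch of the Hydra pattern: fan one prompt out to several models
# concurrently and collect the outputs side by side.
from concurrent.futures import ThreadPoolExecutor

def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real provider API call.
    return f"[{model}] response to: {prompt}"

def hydra(prompt: str, models: list[str]) -> dict[str, str]:
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {m: pool.submit(call_model, m, prompt) for m in models}
        return {m: f.result() for m, f in futures.items()}

outputs = hydra("What is the SNAP asset limit?",
                ["model_a", "model_b", "model_c"])
```

Running the calls in a thread pool keeps the slowest provider, rather than the sum of all providers, as the response time.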

Planned Evaluation Mechanics

The case study previews the evaluation mechanics to be covered in later installments of the series.

A simple code example is provided showing string-matching evaluation:

```python
def test_snap_max_benefit_one_person():
    test_prompt = "What is the maximum SNAP benefit amount for 1 person?"
    llm_response = llm.give_prompt(test_prompt)  # `llm` is the client under test
    if "$292" in llm_response:
        print("Test passed")
    else:
        print("Test failed")
```

However, Guarino acknowledges that more nuanced evaluations (like checking if a response substantively conveys that asset limits don’t apply in most states) require more sophisticated approaches, likely using LLMs as judges.
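An LLM-as-judge check for the asset-limit case might look like the sketch below. `judge_llm` is a stand-in so the example runs; a real implementation would call a strong model there. The narrow YES/NO framing keeps the judge's output easy to parse and score.

```python
# Sketch of an LLM-as-judge check. `judge_llm` stands in for a real model call.
JUDGE_PROMPT = (
    "Does the following answer convey that most states have no SNAP asset "
    "limit due to BBCE? Reply YES or NO.\n\nAnswer: {answer}"
)

def judge_llm(prompt: str) -> str:
    # Stand-in logic so the sketch runs deterministically.
    return "YES" if "no asset limit" in prompt.lower() else "NO"

def substantively_correct(answer: str) -> bool:
    verdict = judge_llm(JUDGE_PROMPT.format(answer=answer))
    return verdict.strip().upper().startswith("YES")
```

Unlike string matching on "$292", this pattern can score paraphrased answers that convey the required nuance in different words.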

Public Good Orientation

A notable aspect of this case study is Propel’s commitment to publishing portions of their SNAP eval publicly. The rationale is that if all AI models improve at SNAP knowledge through better evaluation coverage, that benefits everyone building products on top of these models.

This represents a collaborative approach to LLMOps in specialized domains: by making evaluation data public, domain experts can contribute to improving foundation models even without direct access to model training.
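A published eval could be as simple as a JSONL file, one case per line, that anyone can run against any model. The schema below is an illustrative assumption, not Propel's actual format; the "$292" expectation comes from the string-matching example above.

```python
# Illustrative JSONL serialization of eval cases for public release.
import json

cases = [
    {"id": "max-benefit-1p",
     "prompt": "What is the maximum SNAP benefit amount for 1 person?",
     "expect_contains": ["$292"]},
    {"id": "asset-limit-bbce",
     "prompt": "What is the SNAP asset limit?",
     "expect_conveys": ["most states apply no asset limit (BBCE)"]},
]

jsonl = "\n".join(json.dumps(c) for c in cases)           # write side
loaded = [json.loads(line) for line in jsonl.splitlines()]  # read side
```

Cases with `expect_contains` can be scored by string matching; cases with `expect_conveys` would need the judge-style grading discussed above.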

Critical Assessment

This case study provides valuable methodological guidance for domain-specific LLM evaluation, but it has limitations worth noting: as Part 1 of a series, it covers foundational thinking and early-stage exploration rather than production deployment results or quantitative outcomes.
The case study represents a thoughtful, methodical approach to a challenging LLMOps problem in a high-stakes domain. It provides a useful template for other organizations working on domain-specific AI applications, particularly in areas affecting vulnerable populations where the cost of errors is significant.
