Propel developed an AI system to help SNAP (food stamp) recipients better understand official notices they receive. The system uses LLMs to analyze notice content and provide clear explanations of importance and required actions. The prototype successfully interprets complex government communications and provides simplified, actionable guidance while maintaining high safety standards for this sensitive use case.
Propel, a company that builds technology for SNAP (Supplemental Nutrition Assistance Program, commonly known as food stamps) recipients, is developing an AI-powered tool to help users understand government benefit notices. This case study represents an early-stage LLMOps project that is notable for its careful, safety-conscious approach to deploying AI for vulnerable populations who face “extreme negative consequences from failures or problems of a tool.”
The company’s mission involves helping low-income Americans navigate the complex SNAP benefits system. They identified a specific pain point through user research and social media observation: official notices from SNAP agencies are often confusing, leading to missed deadlines, unnecessary benefit denials, and high call volumes to already-strained state agencies.
SNAP notices serve as formal communications about critical matters such as benefit denials, approvals, amount changes, document requests, and missed appointments. These notices are legally mandated to contain specific information, but that very requirement tends to make them long, dense, and difficult for recipients to parse.
The team validated this problem by observing that people are already posting their SNAP notices on Reddit and Facebook asking for help understanding them. As the article notes, “People are already walking this route — we’re paving it for them.”
Propel is testing “a variety of models, prompts, designs, external context with real SNAP notices to see what generates helpful output.” The primary example shown uses the newest version of Anthropic’s Claude 3.5 Sonnet, though the team is clearly evaluating multiple options during development.
The choice to use LLMs for this problem is justified by the observation that “fundamentally most of the problems people have with notices are about language friction, and LLMs have strong capabilities when it comes to processing and manipulating language.”
The article is transparent about the actual prompt used in the prototype.
The prompt establishes the model’s persona as “a legal aid attorney specializing in SNAP benefits” and structures the output into two sections: an importance assessment (high/medium/low with an explanation) and a list of action items. Key prompt engineering decisions include the persona choice, the fixed two-section output structure, and the explicit importance scale.
This approach demonstrates thoughtful prompt design that considers both the functional requirements and the emotional context of users navigating a stressful benefits system.
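As a rough sketch, the prompt described above might be assembled like this. The wording and the `build_notice_prompt` helper are illustrative assumptions, not Propel’s actual code; only the persona and the two-section structure come from the article:

```python
# Hypothetical sketch of the prompt structure described in the article:
# a legal-aid-attorney persona with a two-part structured output.
# The exact wording is an assumption, not Propel's actual prompt.

def build_notice_prompt(notice_text: str) -> str:
    """Assemble a prompt asking for an importance rating and action items."""
    return (
        "You are a legal aid attorney specializing in SNAP benefits.\n"
        "A client has received the following official notice:\n\n"
        f"{notice_text}\n\n"
        "Respond in two sections:\n"
        "1. IMPORTANCE: rate this notice high, medium, or low, "
        "and briefly explain why.\n"
        "2. ACTION ITEMS: list any steps the client must take, "
        "with deadlines if stated.\n"
        "Use only information found on the notice itself."
    )

prompt = build_notice_prompt(
    "Your SNAP benefits will end on June 1 unless you submit "
    "proof of income by May 15."
)
```

Pinning the output to named sections makes downstream parsing and evaluation far easier than free-form prose.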
The tool is designed around two primary value propositions: assessing how important a notice is, and telling the user what actions, if any, they need to take.
This structured output approach is notable from an LLMOps perspective because it creates measurable dimensions for evaluation and allows for consistent user experience across different notice types.
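One practical benefit of the fixed sections is that outputs can be validated mechanically. A minimal sketch, assuming the model emits an `IMPORTANCE:` line as instructed (the parsing code is illustrative, not from the article):

```python
import re

ALLOWED_LEVELS = {"high", "medium", "low"}

def parse_importance(model_output: str) -> str:
    """Extract the importance rating and reject anything off-scale.
    Assumes the model was instructed to emit an 'IMPORTANCE:' line."""
    match = re.search(r"IMPORTANCE:\s*(\w+)", model_output, re.IGNORECASE)
    if not match:
        raise ValueError("no importance rating found")
    level = match.group(1).lower()
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"unexpected importance level: {level}")
    return level

level = parse_importance("IMPORTANCE: High - a benefit denial with a deadline.")
```

A check like this turns a free-text LLM response into a measurable dimension: any output that fails to parse is itself a trackable error metric.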
The team is using Streamlit for rapid prototyping, which is described as “an open source tool enabling rapid iteration on applications using AI models.” This choice reflects a common pattern in early-stage LLM application development where speed of iteration is prioritized over production-grade infrastructure.
This case study is particularly valuable for its explicit treatment of safety considerations before production deployment. The team outlines several key risk areas:
The team explicitly states “We are not deploying to users in this state.” They are instead gathering feedback from SNAP program experts and using real examples from social media to “harden this tool and proactively identify risks before moving forward.”
The document argues that notice interpretation may be inherently safer than open-ended chatbot applications: “Because most of the use of AI here is on processing information already included on the notice itself, this may be safer than other deployment paths, such as a chatbot generating novel answers to open-ended questions. Hallucination risk is lower by definition due to this.”
They also mention potential additional safeguards through prompt chaining: “we could have a separate model call evaluate the response to ensure all provided information is found on the original notice in some form.”
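The grounding idea can be approximated even without a second model call. The sketch below uses a naive lexical-overlap heuristic as a stand-in for the verifier call the team describes; a production check would likely use an LLM judge or an entailment model instead:

```python
def ungrounded_sentences(response: str, notice: str, threshold: float = 0.7):
    """Flag response sentences whose content words mostly don't appear
    on the original notice. A crude stand-in for the separate model
    call the team describes, not their actual safeguard."""
    notice_words = set(notice.lower().split())
    flagged = []
    for sentence in response.split("."):
        words = [w for w in sentence.lower().split() if len(w) > 3]
        if not words:
            continue
        overlap = sum(w in notice_words for w in words) / len(words)
        if overlap < threshold:
            flagged.append(sentence.strip())
    return flagged

flagged = ungrounded_sentences(
    "Submit proof of income by May 15. Call the governor immediately.",
    "You must submit proof of income by May 15 or your benefits will end.",
)
```

Here the first sentence is fully grounded in the notice while the second is not, so only the second would be escalated for review.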
There’s an interesting tension identified between simplifying information (the core value proposition) and potentially hiding important edge-case information. Proposed mitigations include always reinforcing that users should read the entire notice and proactively offering additional information on less common but potentially important topics.
The team acknowledges the challenge of using external model APIs with documents containing personally identifying information. They mention Microsoft’s open-source Presidio as a potential local redaction/deidentification solution, and note that many users already redact information when posting notices online.
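To illustrate the redaction step, here is a minimal regex-based sketch. The patterns and placeholder format are assumptions for illustration only; a real deployment would use a purpose-built library like Presidio, which handles far more PII formats:

```python
import re

# Minimal stand-in for local PII redaction before any external API call.
# The article mentions Microsoft Presidio for this; these regexes only
# illustrate the idea and would miss many real-world PII formats.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "CASE_NUMBER": re.compile(r"\bCase\s*#?\s*\d{6,}\b", re.IGNORECASE),
}

def redact(text: str) -> str:
    """Replace matched PII spans with typed placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

clean = redact("Case #1234567: call 555-123-4567 re: SSN 123-45-6789.")
```

Running redaction locally, before the notice text ever reaches an external model API, keeps the PII boundary inside infrastructure the team controls.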
An interesting edge case is addressed: what happens if the original notice itself contains errors? The team is “considering whether we can include additional information or run additional checks in highly consequential situations that could inform the user if the notice itself appears erroneous or in violation of policy.”
The article hints at more ambitious future capabilities:
The team envisions “an agent processing notices in the background and triaging the person’s attention to just those highest-importance notices.” This represents a shift from reactive (user-initiated) to proactive (system-initiated) AI assistance.
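A hypothetical sketch of that triage loop, where `classify` stands in for the LLM importance call described earlier (the function and its signature are assumptions, not anything the article specifies):

```python
# Hypothetical sketch of the background-triage idea: score each incoming
# notice and surface only the highest-importance ones to the user.
# classify() stands in for the LLM importance-rating call.

def triage(notices, classify):
    """Return only the high-importance notices, in scored order."""
    order = {"high": 0, "medium": 1, "low": 2}
    scored = [(order[classify(n)], n) for n in notices]
    return [n for rank, n in sorted(scored, key=lambda s: s[0]) if rank == 0]

urgent = triage(
    ["Benefits terminated: respond by May 15", "Annual EBT card mailer"],
    classify=lambda n: "high" if "respond" in n.lower() else "low",
)
```

The interesting LLMOps shift here is not the filtering logic but the trigger: the model runs on arrival of each notice rather than on user request.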
They discuss “bringing in external information in-context with the notice” to provide more complete assistance—for example, if a phone number on a notice is known to be frequently unavailable, the tool could provide alternative contact methods.
The notice tool is positioned as potentially complementary to broader benefits navigation assistance, since “many problems’ first step is assessing any notices received recently.”
The team is taking a multi-stakeholder approach to evaluation, combining feedback from SNAP program experts with real notice examples drawn from social media.
This case study represents a thoughtful, safety-conscious approach to deploying LLMs for a vulnerable population. The team demonstrates awareness of the heightened risks involved when AI failures could result in people losing access to food assistance.
However, it’s worth noting that this is still a prototype, and the actual production deployment challenges remain ahead. Key questions that would need to be answered before production include how to handle personally identifying information at scale, how to verify that outputs stay grounded in the notice text, and how to respond when the notice itself is erroneous.
The transparency about being in an early prototype phase, combined with the explicit safety framework, suggests a responsible development approach. The choice to seek expert feedback before user deployment is particularly notable in a landscape where many AI applications rush to production.
The use of social media posts as a source of both problem validation (people are confused and seeking help) and test cases (real notices with real questions) is a pragmatic approach to building evaluation datasets in a domain where labeled data may be scarce.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Rippling, an enterprise platform providing HR, payroll, IT, and finance solutions, has evolved its AI strategy from simple content summarization to building complex production agents that assist administrators and employees across their entire platform. Led by Anker, their head of AI, the company has developed agents that handle payroll troubleshooting, sales briefing automation, interview transcript summarization, and talent performance calibration. They've transitioned from deterministic workflow-based approaches to more flexible deep agent paradigms, leveraging LangChain and LangSmith for development and tracing. The company maintains a dual focus: embedding AI capabilities within their product for customers running businesses on their platform, and deploying AI internally to increase productivity across all teams. Early results show promise in handling complex, context-dependent queries that traditional rule-based systems couldn't address.
This podcast discussion between Galileo and Crew AI leadership explores the challenges and solutions for deploying AI agents in production environments at enterprise scale. The conversation covers the technical complexities of multi-agent systems, the need for robust evaluation and observability frameworks, and the emergence of new LLMOps practices specifically designed for non-deterministic agent workflows. Key topics include authentication protocols, custom evaluation metrics, governance frameworks for regulated industries, and the democratization of agent development through no-code platforms.