## Overview
ZenCity operates at the intersection of civic technology and advanced AI, providing local governments across the United States, Canada, and the UK with platforms to capture, understand, and act on community voices. The company works with cities, counties, states, and law enforcement agencies to ensure that resident feedback actively informs government decisions. Their platform addresses what they call the "same 10 people problem" - the phenomenon where traditional public engagement methods like in-person community meetings tend to hear from the same vocal minority rather than representing the diverse perspectives of entire communities.
The ZenCity team includes Shapes (SVP of R&D), Andrew Terio (VP of Data Science overseeing applied AI and methodology), and Noah (Product Lead). Together, they've built a sophisticated AI infrastructure that processes approximately one million social media items per day alongside surveys, 311 requests, news articles, and various forms of direct resident engagement. Their platform represents a compelling case study in LLMOps because it demonstrates how to deploy generative AI and traditional ML models in a production environment where accuracy, consistency, and trust are paramount.
## The Problem Space and Use Cases
Local government officials face several key workflows where community input is critical but traditionally difficult to gather and synthesize at scale. These workflows include annual budget planning (which can take six months), five-year strategic planning, performance management across departments like parks and recreation or public health, resident communications, crisis management (such as responding to school shootings or natural disasters), and law enforcement oversight. Each workflow has different stakeholders with different needs - a city manager requires a comprehensive brief across all government services, while a parks and recreation director needs insights specific to their domain.
The challenge isn't just collecting data but making it accessible and actionable for government officials who, as Noah points out, are often not particularly tech-savvy and work within very traditional workflows that have existed for centuries. The platform must deliver insights in formats ranging from digital dashboards to printed reports, and must reach different personas from hands-on analysts to executive stakeholders who consume information primarily through email or in prepared briefing packets.
## Data Architecture and Enrichment Layers
ZenCity has built what they describe as a multi-layered data architecture that progressively refines raw data into actionable insights. This architecture was built over years before the current wave of generative AI, but as Shapes notes, they "built the company to be ready for the AI era without knowing it" by collecting diverse data sources and creating use cases where AI tools could be applied.
The foundational layer consists of raw data from multiple sources: online surveys distributed through web and social media advertising, survey panels for hard-to-reach populations, SMS pilots, social media posts and comments, news articles, 311 service requests and internal operational records, post-interaction surveys following 311 requests, and direct resident engagements around specific planning projects. This data flows into a centralized data warehouse where the enrichment process begins.
The first enrichment layer applies AI-powered labeling to every data point. For sentiment analysis, ZenCity developed a custom model specifically tuned for local government contexts rather than using generic sentiment analysis. This is crucial because a comment like "this is a really big problem" in response to a city's post about fixing an issue isn't actually negative toward the government - it's acknowledging a problem the government is addressing. Their sentiment model understands this nuance and classifies sentiment relative to the government's actions.
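ZenCity's sentiment model is custom-trained rather than prompt-driven, but the labeling contract it implements can be illustrated with a minimal prompt-based stand-in. Everything below - the prompt wording, the label set, and the `call_llm` helper - is an assumption for illustration, not their actual model.

```python
# Illustrative only: ZenCity trained a custom sentiment model; this prompt-based
# stand-in just shows the "sentiment relative to government action" framing.
# call_llm is a hypothetical helper wrapping whatever model API is in use.

LABELS = ["positive", "negative", "neutral"]  # assumed label set

PROMPT = """You are classifying resident feedback for a local government.
Classify the sentiment of the COMMENT toward the government's handling of the
issue, not toward the issue itself. A comment acknowledging a problem that the
government is actively addressing is not negative toward the government.

POST (government): {post}
COMMENT (resident): {comment}

Answer with one word: positive, negative, or neutral."""


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call (e.g. a hosted or fine-tuned model)."""
    raise NotImplementedError


def classify_sentiment(post: str, comment: str) -> str:
    answer = call_llm(PROMPT.format(post=post, comment=comment)).strip().lower()
    return answer if answer in LABELS else "neutral"  # fall back on unexpected output


# Example from the text: this comment should NOT read as negative toward the city.
# classify_sentiment("We're repaving Main St. next month.", "This is a really big problem")
```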
Topic modeling represents the "glue" that binds different data sources together. Topics range from high-level categories like "police and law enforcement" or "parks and recreation" to more granular themes. The system applies both standard topic sets and custom topics on demand. When a survey covers a brand new area, the system can generate new labels specific to that content. Crucially, the same topic labels apply across all data sources - a survey response, a social media post, a news article, and a 311 request can all be tagged with the same topic, enabling cross-source synthesis. The platform also uses semantic search with text embeddings, allowing searches based on natural language content rather than just keyword matching.
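A minimal sketch of what a shared enrichment record and embedding-based search might look like, assuming a unified item schema and a generic `embed()` helper; the field names and topic vocabulary are illustrative, not ZenCity's actual schema.

```python
# Sketch of the cross-source labeling idea: every item, regardless of source,
# carries the same topic labels plus an embedding for semantic search.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class EnrichedItem:
    source: str                                 # "survey" | "social" | "news" | "311"
    text: str
    topics: list[str]                           # same label space for every source
    embedding: np.ndarray = field(repr=False)


def embed(text: str) -> np.ndarray:
    """Hypothetical embedding call (any sentence-embedding model would do)."""
    raise NotImplementedError


def semantic_search(query: str, items: list[EnrichedItem], top_k: int = 10) -> list[EnrichedItem]:
    """Rank items by cosine similarity to the query, not by keyword overlap."""
    q = embed(query)
    q = q / np.linalg.norm(q)
    scored = [
        (float(np.dot(q, it.embedding / np.linalg.norm(it.embedding))), it)
        for it in items
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [it for _, it in scored[:top_k]]


# Because topics are shared, a cross-source view for "parks and recreation" is one filter:
# parks_items = [it for it in items if "parks and recreation" in it.topics]
```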
Andrew describes an emerging "layer one and a half" they call the "elements" layer, which represents a critical optimization for LLM-driven analysis. They discovered that while LLMs excel at synthesis, they struggle with mathematical calculations. Their initial approach of feeding raw data directly to LLMs didn't work well. Moving to API calls through MCP servers works well for on-demand single queries but becomes inefficient for large-scale summarization exercises that need to examine thousands of data points. The elements layer pre-computes all possible permutations of API calls - every combination of demographics, time frames, geographic breakdowns, and so on - creating a broad set of pre-calculated results that can be accessed quickly without real-time computation.
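A minimal sketch of the pre-computation idea, assuming a small set of breakdown dimensions and a hypothetical `compute_metric()` call into the same aggregation APIs the dashboard uses; the real dimension set is presumably far larger.

```python
# Sketch of the "elements" layer: pre-compute every combination of breakdown
# dimensions so the LLM reads numbers instead of computing them.
from itertools import product

TOPICS = ["housing", "public safety", "parks and recreation"]
TIMEFRAMES = ["last_7_days", "last_30_days", "last_12_months"]
DEMOGRAPHICS = ["all", "age_18_34", "age_65_plus"]
GEOGRAPHIES = ["citywide", "district_1", "district_2"]


def compute_metric(topic, timeframe, demographic, geography) -> dict:
    """Hypothetical call into the same aggregation APIs the dashboard uses."""
    raise NotImplementedError


def build_elements() -> dict:
    """Exhaustively materialize every permutation as a keyed lookup table."""
    elements = {}
    for combo in product(TOPICS, TIMEFRAMES, DEMOGRAPHICS, GEOGRAPHIES):
        elements[combo] = compute_metric(*combo)
    return elements


# At summarization time the agent (or a deterministic template) just looks up
# elements[("housing", "last_30_days", "all", "citywide")] -- no math by the LLM.
```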
Above the elements layer sits the "highlights" layer, which algorithmically curates the most notable findings using predefined rule sets. For example, a highlight might state "discourse about homelessness grew by 30% last week." These highlights have been produced for years based on deterministic algorithms that identify meaningful trends.
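A sketch of one deterministic highlight rule of the kind described, using the homelessness example from the text; the threshold and function signature are assumptions.

```python
# Sketch of a rule-based highlight ("discourse about homelessness grew by 30% last week").

def volume_change_highlight(topic: str, this_week: int, last_week: int,
                            min_change: float = 0.25) -> str | None:
    """Emit a highlight sentence only when the week-over-week change is notable."""
    if last_week == 0:
        return None
    change = (this_week - last_week) / last_week
    if abs(change) < min_change:
        return None
    direction = "grew" if change > 0 else "fell"
    return f"Discourse about {topic} {direction} by {abs(change):.0%} last week."


# volume_change_highlight("homelessness", this_week=1300, last_week=1000)
# -> "Discourse about homelessness grew by 30% last week."
```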
The "insights" layer takes all the raw data and highlights for a specific topic within a specific time period and generates a detailed, valuable summary. As Shapes describes it, if the elements layer is a number and the highlights layer is a sentence putting that number in context, the insights layer is a paragraph providing comprehensive context around those highlights. This layer synthesizes information for a single topic.
Finally, the "briefs" or "golden layer" combines multiple topics relevant to a specific workflow. A budget planning brief might pull together insights on multiple government services, performance metrics, and resident priorities to create a comprehensive document that informs the entire budget process.
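A sketch of how a brief might be assembled from per-topic insights, assuming a persona-to-topic mapping and a hypothetical `call_llm` helper; the actual composition logic isn't described at this level of detail.

```python
# Sketch of the "briefs" layer: deterministic selection of which topics a persona
# cares about, then one LLM synthesis pass over already-generated insight paragraphs.

PERSONA_TOPICS = {
    "city_manager": ["housing", "public safety", "parks and recreation", "public health"],
    "parks_director": ["parks and recreation"],
}


def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical model call


def build_brief(persona: str, insights_by_topic: dict[str, str], workflow: str) -> str:
    selected = {t: insights_by_topic[t] for t in PERSONA_TOPICS[persona] if t in insights_by_topic}
    body = "\n\n".join(f"## {topic}\n{text}" for topic, text in selected.items())
    return call_llm(
        f"Synthesize the following topic insights into a {workflow} brief for a "
        f"{persona.replace('_', ' ')}. Do not add facts that are not present below.\n\n{body}"
    )
```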
## The AI Assistant and Agent Architecture
ZenCity developed an AI assistant that allows government users to make ad-hoc queries against their data. This went through two major iterations. The first version didn't adequately solve the underlying data access problem, so they rebuilt it after Anthropic introduced the MCP (Model Context Protocol). This represents a significant architectural decision that demonstrates the value of the tools-based approach to LLM agents.
The assistant operates as a fully agentic system. When a user submits a query like "I work in parks and recreation, what are the most popular programs that people also love?", the system follows an iterative, agent-driven workflow. First, it sanitizes the input to ensure it's a reasonable request - filtering out queries about recipes, politically problematic content, or information the system doesn't have. This sanitization happens as a separate query before the main agentic workflow begins.
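A sketch of what that pre-flight sanitization step could look like as a separate classification call, with assumed verdict categories and a hypothetical `call_llm` helper.

```python
# Sketch of the sanitization step: a cheap, separate classification before the
# agentic workflow starts.

SANITIZE_PROMPT = """Decide whether the user request below is something a local-government
community-insights assistant can answer from resident feedback data.

Reply with exactly one of:
  OK          - answerable from community feedback, surveys, 311, or social data
  OFF_TOPIC   - unrelated (recipes, trivia, etc.) or data the system does not hold
  DISALLOWED  - politically problematic or otherwise out of policy

Request: {query}"""


def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical model call


def sanitize(query: str) -> str:
    verdict = call_llm(SANITIZE_PROMPT.format(query=query)).strip().upper()
    return verdict if verdict in {"OK", "OFF_TOPIC", "DISALLOWED"} else "OFF_TOPIC"


# An off-topic question (favorite ice cream flavors, say) never reaches the agent;
# the assistant can simply answer that it has no such information.
```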
One particularly telling anecdote: when the CIO of Boston tested the assistant by asking "what's the favorite ice cream flavor from Bostonians?", the assistant correctly responded "I don't know, I have no such information." This demonstrates effective guardrailing against hallucination - the assistant doesn't generate plausible-sounding but fabricated answers.
Once a query passes sanitization, the agent determines what data sources are available. For a parks and recreation query, it would check whether any surveys asked about specific programs. It then formulates specific queries through the MCP servers to pull top-level results, followed by additional queries for demographic breakdowns or geographic subgroups. The entire process is 100% agentic - the agent decides what information it needs and iteratively retrieves it.
The move to MCP servers was transformative for several reasons. Previously, ZenCity tried building their own protocol where the model would request data, they'd collect it behind the scenes, and return it in one shot. This single-shot approach was limiting. With MCP, the model can negotiate for data through multiple cycles of communication. It can first ask "what surveys were conducted?", receive a list, then request "give me results from surveys 1, 2, and 7." This multi-turn negotiation happens between the model and the MCP server without custom middleware, significantly reducing response latency.
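A sketch of this multi-turn negotiation pattern, with hypothetical tool names and an `agent_step` stand-in for the model call; it illustrates the loop, not ZenCity's actual MCP server schema.

```python
# Sketch of the negotiation loop MCP enables: the model requests a tool call,
# reads the result, and decides what to request next, until it can answer.

TOOLS = {
    "list_surveys": lambda args, tenant: [{"id": 1, "name": "Community Survey Q2"},
                                          {"id": 7, "name": "Parks Program Pulse"}],
    "get_survey_results": lambda args, tenant: {"survey_id": args["survey_id"],
                                                "results": "..."},
}


def agent_step(conversation: list[dict]) -> dict:
    """Hypothetical model call returning either {'tool': ..., 'args': ...} or {'answer': ...}."""
    raise NotImplementedError


def run_agent(user_query: str, tenant_id: str, max_turns: int = 10) -> str:
    conversation = [{"role": "user", "content": user_query}]
    for _ in range(max_turns):
        step = agent_step(conversation)
        if "answer" in step:                                   # the model has what it needs
            return step["answer"]
        result = TOOLS[step["tool"]](step["args"], tenant_id)  # execute the requested tool
        conversation.append({"role": "tool", "name": step["tool"], "content": str(result)})
    return "I could not complete this request."
```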
Additionally, MCP servers built on top of their existing APIs provide strong multi-tenancy guarantees. The APIs were originally built for their dashboard interface and already enforce tenant isolation. By exposing these same APIs through MCP servers, they ensure that data from different cities never mingles. As Shapes emphasizes, this is critical for maintaining data security and privacy in a multi-tenant government platform.
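A sketch of how that guarantee can be enforced at the tool layer: the tenant identity is fixed server-side from the authenticated session and is never an argument the model controls. Class and endpoint names here are assumptions.

```python
# Sketch of tenant isolation in the tool layer: the model can never choose which
# city's data it reads, because tenant_id is bound at session creation.

class TenantScopedTools:
    def __init__(self, tenant_id: str, api_client):
        self._tenant_id = tenant_id          # fixed from auth, not from the model
        self._api = api_client               # the same REST API the dashboard uses

    def list_surveys(self):
        # tenant_id is not a tool argument; the model cannot pass a different one.
        return self._api.get("/surveys", params={"tenant_id": self._tenant_id})

    def get_survey_results(self, survey_id: int):
        return self._api.get(f"/surveys/{survey_id}/results",
                             params={"tenant_id": self._tenant_id})
```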
The analogy Andrew uses is illuminating: imagine giving a super-user analyst access to their dashboard and letting them click around for hours pulling information. The agent essentially does this in seconds, rapidly navigating through different data sources, applying filters, switching between surveys and social media, and synthesizing results.
## Workflow Automation and Orchestration
While the AI assistant handles ad-hoc queries, ZenCity recognized that many information needs follow predictable patterns tied to government workflows. Rather than requiring every user to know how to query the system, they created workflow-based briefs that automatically generate the right information at the right time for the right person.
For annual budget planning, the system generates an intensive brief covering everything heard from the community over the past year about different government services - satisfaction levels, priorities, emerging themes. This brief goes to the city manager at the start of the six-month budget process. Simultaneously, department-specific briefs go to individual department heads, giving them relevant insights for their areas. Similar workflows exist for crisis management, performance reviews (CompStat or similar frameworks in law enforcement), strategic planning, and resident communications.
The next evolution they're building is even more sophisticated orchestration. Since government meeting schedules and agendas are publicly available online, they plan to automatically identify upcoming council meetings, parse the agenda, and trigger their brief generation engine to create appropriate preparation materials for each participant based on their role and the agenda items. A city manager would receive a comprehensive prep packet while the parks director would receive only materials relevant to parks-related agenda items. This represents true workflow integration where AI-powered insights are injected into existing processes without requiring users to change how they work.
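A sketch of what that orchestration could look like, assuming hypothetical `fetch_agenda` and `generate_brief` functions and a role-to-topic mapping; the feature is described as still being built.

```python
# Sketch of agenda-driven brief orchestration: parse the public agenda, map items
# to topics, and trigger role-appropriate briefs.
from datetime import date


def fetch_agenda(city: str, meeting_date: date) -> list[str]:
    """Hypothetical scraper/parser for the publicly posted council agenda."""
    raise NotImplementedError


def generate_brief(role: str, topics: list[str], meeting_date: date) -> str:
    """Hypothetical call into the existing brief-generation engine."""
    raise NotImplementedError


ROLE_TOPIC_FILTER = {
    "city_manager": None,                       # receives everything
    "parks_director": {"parks and recreation"},
}


def prepare_meeting_packets(city: str, meeting_date: date) -> dict[str, str]:
    agenda_topics = fetch_agenda(city, meeting_date)
    packets = {}
    for role, allowed in ROLE_TOPIC_FILTER.items():
        topics = agenda_topics if allowed is None else [t for t in agenda_topics if t in allowed]
        if topics:
            packets[role] = generate_brief(role, topics, meeting_date)
    return packets
```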
## LLM Implementation and Quality Assurance
The team uses multiple LLMs at different points in their pipeline for different purposes. Sentiment analysis and categorization happen at the granular data item level. Trend detection and anomaly identification occur at the broader aggregation level. Summarization and format generation happen at the highest level when producing final deliverables. This multi-model approach allows them to optimize for each task rather than forcing a single model to handle all use cases.
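A sketch of task-level model routing under these assumptions; the task names and model tiers are illustrative, since the specific models in use aren't named in the source.

```python
# Sketch of routing tasks to different model tiers: cheap models at the item level,
# stronger models for synthesis.

MODEL_FOR_TASK = {
    "sentiment": "small-classifier-model",      # per-item labeling, high volume
    "topic_labeling": "small-classifier-model",
    "trend_detection": "mid-tier-model",        # aggregation-level analysis
    "insight_summary": "frontier-model",        # final synthesis and formatting
    "brief_generation": "frontier-model",
}


def call_model(model_name: str, prompt: str) -> str:
    raise NotImplementedError  # hypothetical dispatch to the chosen model


def run_task(task: str, prompt: str) -> str:
    return call_model(MODEL_FOR_TASK[task], prompt)
```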
For quality assurance and avoiding hallucination, ZenCity employs several strategies. Their system prompts for the AI assistant include explicit instructions: "Don't guess. Don't try to make up data. Use only the MCP servers we give you. For every fact you put in the text, cite the exact data point that supports it." This citation requirement serves as both a user trust mechanism and a quality check - if every piece of information links back to source data, human reviewers can validate accuracy.
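The exact system prompt isn't shared; the sketch below assembles the quoted rules into one plausible prompt, with the surrounding wording and citation format as assumptions.

```python
# The quoted instructions, assembled into a system prompt. Only the core rules come
# from the source; the framing and citation syntax are assumptions.

SYSTEM_PROMPT = """You are a community-insights assistant for local government staff.

Rules:
- Don't guess. Don't try to make up data.
- Use only the MCP servers you are given; never rely on outside knowledge for facts.
- For every fact you put in the text, cite the exact data point that supports it,
  e.g. [source_id], so a reviewer can trace it back to the underlying record.
- If the data needed to answer is not available, say so plainly.
"""
```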
They're developing what they call an "LLM as judge" system for quality evaluation. The first use case targets their topic modeling, where one model applies topic labels and a different model evaluates whether those labels are appropriate. The evaluating model receives the data item, the category definition with specific examples, and the applied label, then confirms or disputes the categorization. Disputed cases route to human review, both for immediate correction and to identify systematic patterns the models miss. This same workflow can apply to the AI assistant by evaluating whether inputs, extracted data, and outputs make sense together.
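A sketch of the judge step as described, with an assumed prompt format and a hypothetical `call_llm` helper pointing at a different model than the one that applied the label.

```python
# Sketch of "LLM as judge" for topic labels: a second model sees the item, the
# category definition with examples, and the applied label, then confirms or disputes.

JUDGE_PROMPT = """CATEGORY: {category}
DEFINITION AND EXAMPLES:
{definition}

DATA ITEM:
{item_text}

APPLIED LABEL: {category}

Does the label fit this item? Answer CONFIRM or DISPUTE, then one sentence of reasoning."""


def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical call to a different model than the labeler


def judge_label(item_text: str, category: str, definition: str) -> bool:
    verdict = call_llm(JUDGE_PROMPT.format(category=category, definition=definition,
                                           item_text=item_text))
    return verdict.strip().upper().startswith("CONFIRM")


# Disputed items (judge_label(...) == False) are routed to a human review queue,
# both to fix the individual item and to surface systematic labeling errors.
```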
However, the team is realistic about evaluation trade-offs. As Shapes notes, running extensive evals in-flight would double response times, which is unacceptable when a human is waiting for an answer. Their current approach performs weekly annotations to track accuracy, precision, and recall over time, ensuring consistency without impacting user experience. This differs from use cases like Neple's email responses where latency isn't an issue and every output can be evaluated before sending.
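A sketch of the offline metric computation such a weekly annotation pass would feed, assuming each sampled item gets a model label and a human label.

```python
# Sketch of the weekly offline check: sample recent outputs, annotate them, and
# track precision/recall over time instead of evaluating in-flight.

def precision_recall(annotations: list[tuple[bool, bool]]) -> tuple[float, float]:
    """annotations: (model_said_positive, human_said_positive) per sampled item."""
    tp = sum(1 for m, h in annotations if m and h)
    fp = sum(1 for m, h in annotations if m and not h)
    fn = sum(1 for m, h in annotations if not m and h)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Run weekly on a fresh sample and chart the two numbers; a drop flags drift
# without adding any latency to live requests.
```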
They've also learned hard lessons about what not to give LLMs. When generating multi-thread summaries, they initially passed anecdotes to the LLM to include in reports. Sometimes it would quote verbatim, sometimes it would slightly reword, and sometimes it would fabricate something that sounded similar. They realized they needed to provide exact quotes and links deterministically rather than letting the LLM handle this. This reflects a broader principle: being thoughtful about which steps should be LLM-driven versus deterministic. The LLM can suggest which anecdotes are relevant, but inserting the actual quotes should be a deterministic operation that doesn't risk alteration or fabrication.
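A sketch of that split, assuming a hypothetical anecdote store and `call_llm` helper: the model only returns IDs, and the quote text and link are appended verbatim by deterministic code.

```python
# Sketch of LLM-selects / code-inserts: quotes cannot be reworded or fabricated
# because the model never writes them into the report.

ANECDOTES = {
    "a17": {"quote": "The new rec center hours are a lifesaver for working parents.",
            "url": "https://example.org/item/a17"},
    # ...
}


def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical model call


def select_anecdote_ids(report_draft: str, candidate_ids: list[str]) -> list[str]:
    raw = call_llm(f"From these anecdote IDs {candidate_ids}, list the ones most relevant "
                   f"to this report, comma-separated, IDs only:\n\n{report_draft}")
    return [i.strip() for i in raw.split(",") if i.strip() in ANECDOTES]  # validate IDs


def insert_quotes(report_draft: str, ids: list[str]) -> str:
    block = "\n".join(f'> "{ANECDOTES[i]["quote"]}" ({ANECDOTES[i]["url"]})' for i in ids)
    return report_draft + "\n\nResident voices:\n" + block   # verbatim, by construction
```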
The team discusses evaluation topology - different types of evals at different layers. They need to evaluate correctness of elements, noteworthiness and correctness of highlights, accuracy of insights, and the holistic value of complete briefs for specific use cases. As Noah points out, evaluating whether a budget brief is valuable enough to drive budget conversations or kick off strategic offsites requires different methodology than evaluating whether a data point is correctly categorized. They acknowledge this is still an area where they have much to explore, recognizing that the ideal world of full eval coverage remains aspirational.
## Survey Design and Methodology Innovation
Beyond data analysis, ZenCity is applying AI to survey creation itself. Currently, their analysts manually design custom surveys using best practices accumulated over years. They're working to automate repetitive aspects of this process, using AI to ensure surveys are designed consistently and efficiently while incorporating established methodologies.
They maintain two core survey templates: community surveys covering broad aspects of life (healthcare access, schools, public facility maintenance, etc.) typically used by city managers and mayor's offices, and "Blockwise" public safety surveys used by law enforcement to track personal safety perceptions and community-police relationships. Beyond these, they offer "pulse surveys" - custom surveys for specific needs. Andrew described a recent example working with a client to understand price pressure on renters across the city, building a map of displacement risk that captures not just whether rent increased but how renters think about the impact on their financial situation and likelihood of moving.
The survey distribution approach uses mixed-mode recruitment primarily through direct web and social media advertising, supplemented with survey panels for hard-to-reach populations. They're piloting SMS distribution and constantly exploring new channels because, as Andrew notes, "there's no one single place" to reach everyone anymore. This multi-channel approach combined with social listening aims to capture voices beyond those who traditionally show up to in-person meetings.
## Technical Stack and Implementation Details
While the conversation doesn't exhaustively detail their technology stack, several key technical choices emerge. They use semantic search with embeddings for content-based search rather than keyword matching. They built MCP servers on top of REST APIs originally designed for dashboard interactions. Their data warehouse centralizes diverse sources before enrichment. They employ custom-trained sentiment models rather than off-the-shelf solutions. They use multiple LLMs for different tasks rather than a single model for everything.
The multi-tenancy architecture built into their APIs extends to their AI features, ensuring data isolation between government clients. This is particularly important given the sensitive nature of constituent data and the privacy considerations Noah emphasizes - they deliberately extract community voices rather than individual voices, obfuscating personal identifiers to focus on aggregate themes and trends.
## Results and Impact
While the transcript doesn't provide quantitative metrics on model performance or user adoption, several qualitative impacts are evident. The platform processes approximately one million social media items daily alongside surveys and other data sources. They successfully launched a second version of their AI assistant to general availability after the previous version didn't meet quality standards. They're working with major cities including Boston and serving clients across the US, Canada, and UK.
The impact on government workflows appears substantial. Where previously officials might need to hire analysts to spend hours reading through community feedback and generating reports, the system produces comprehensive summaries in seconds or minutes. Budget briefs that synthesize a year of community input across multiple departments can be automatically generated and personalized for different roles. Crisis communications can draw on real-time community sentiment analysis. Performance management can incorporate ongoing feedback rather than relying solely on periodic surveys.
Perhaps most importantly, the platform expands whose voices are heard. By combining surveys designed for representativeness with organic social media discussions and direct engagement, ZenCity helps governments hear from people who can't attend evening meetings due to work or childcare, who don't follow government social media accounts, or who engage through 311 calls rather than formal feedback mechanisms.
## Balanced Assessment
ZenCity presents a compelling case of production LLM deployment in a high-stakes domain where accuracy and trust are paramount. Their multi-layered architecture that progressively refines data shows sophisticated thinking about how to make diverse data sources accessible to LLMs. The decision to rebuild their AI assistant around MCP servers rather than persist with a suboptimal initial version demonstrates appropriate quality standards.
However, several areas warrant scrutiny. Without in-flight evaluation, the AI assistant relies heavily on prompt engineering and citation requirements rather than systematic validation. While they justify this based on latency concerns, and their weekly annotation process provides oversight, there's inherent risk in shipping production LLM outputs without real-time validation. Their acknowledgment that LLMs "aren't great at math still" led to architectural decisions that work around this limitation rather than solve it, which is pragmatic but means certain capabilities remain constrained.
The effectiveness of their sentiment analysis model tuned for government contexts is claimed but not independently validated in the transcript. The actual accuracy of their topic modeling, the precision of their multi-source synthesis, and the quality of their automated briefs compared to human-generated alternatives aren't quantified. These are understandable limitations in a podcast interview format, but prospective users should seek this data.
The workflow automation approach of automatically generating briefs for predictable government processes is elegant and shows deep understanding of their users' jobs to be done. However, this also creates dependency on ZenCity's interpretation of what information is relevant for each workflow. If the AI selects the wrong data sources or emphasizes the wrong themes, it could systematically bias how officials understand community sentiment.
Their emphasis on privacy and extracting community voices rather than individual voices is laudable and appropriate for government contexts. Yet the mechanism for this obfuscation isn't detailed - are individuals truly unidentifiable in the underlying data, or could analysts with access to the platform potentially de-anonymize specific voices? This matters particularly for controversial issues where identifying individual critics could pose risks.
The platform's value proposition assumes that broader community input leads to better government decisions, which is normatively appealing but not always empirically demonstrated. Some policy questions may benefit more from expert analysis than from broad constituent input. The system doesn't appear to distinguish between domains where community voice should be decisive versus informative versus supplementary to expertise.
Overall, ZenCity demonstrates sophisticated LLMOps practices including multi-layered data architecture, agentic workflows with MCP integration, workflow-based automation, multi-model deployment for different tasks, and pragmatic quality assurance balancing thoroughness with latency requirements. Their work shows how traditional ML (sentiment analysis, topic modeling) and generative AI (summarization, content generation) can be productively combined in production systems serving a non-technical user base in a domain where accuracy and trust are critical.