Val Town's journey in implementing and evolving code assistance features showcases the challenges and opportunities in productionizing LLMs for code generation. Through iterative improvements and fast-following industry innovations, they progressed from basic ChatGPT integration to sophisticated features including error detection, deployment automation, and multi-file code generation, while addressing key challenges like generation speed and accuracy.
Val Town is a code hosting service that has been on a multi-year journey to integrate cutting-edge LLM-powered code generation into their platform. This case study is particularly valuable because it offers an honest, retrospective account of their “fast-following” strategy—deliberately copying and adapting innovations from industry leaders while occasionally contributing their own improvements. The narrative provides insight into the practical challenges of building and maintaining LLM-powered development tools in a rapidly evolving landscape.
The company launched in 2022, and since then has navigated through multiple paradigm shifts in AI code assistance: from GitHub Copilot-style completions, through ChatGPT-era chat interfaces, to the current generation of agentic code assistants exemplified by Cursor, Windsurf, and Bolt. Their primary product, “Townie,” has evolved through multiple versions as they adapted to these shifts.
Val Town’s initial foray into LLM-powered features was autocomplete functionality similar to GitHub Copilot. Their first implementation used Asad Memon’s open-source codemirror-copilot library, which essentially prompted ChatGPT to “cosplay” as an autocomplete service. This approach had significant limitations: it was slow, occasionally the model would break character and produce unexpected outputs, and it lacked the accuracy of purpose-built completion models.
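A minimal sketch of that “cosplay” pattern illustrates why it struggled: a general-purpose chat model is simply instructed to act like a completion engine, with nothing enforcing that it stays in character. The prompt wording, model name, and tags below are illustrative assumptions, not codemirror-copilot’s actual implementation:

```typescript
// Hypothetical sketch of chat-model-as-autocomplete, in the spirit of
// codemirror-copilot. Nothing here constrains the output to be raw code.
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function completeAtCursor(prefix: string, suffix: string): Promise<string> {
  const res = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [
      {
        role: "system",
        content:
          "You are a code autocomplete engine. Given code before and after " +
          "the cursor, reply with ONLY the code that belongs in between. " +
          "No prose, no markdown fences.",
      },
      { role: "user", content: `<before>${prefix}</before><after>${suffix}</after>` },
    ],
  });
  // The failure mode: the model can "break character" and return an
  // explanation or fenced markdown instead of insertable code.
  return res.choices[0].message.content ?? "";
}
```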
The technical insight here is important for LLMOps practitioners: using a general-purpose chat model for a specialized task like code completion introduces latency and reliability issues that purpose-trained models avoid. The solution was migrating to Codeium in April 2024, which offered a properly trained “Fill in the Middle” model with documented APIs. They open-sourced their codemirror-codeium integration component, demonstrating a commitment to ecosystem contribution even while primarily consuming others’ innovations.
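For contrast, a purpose-trained Fill in the Middle model takes the code on both sides of the cursor as first-class inputs rather than as a chat transcript. A generic sketch of the request shape (an illustration of the FIM pattern, not Codeium’s actual API):

```typescript
// Generic FIM request shape: the model predicts only the gap between
// prefix and suffix, so there is no "character" to break and completions
// stay short, which keeps latency low.
interface FimRequest {
  prefix: string;    // code before the cursor
  suffix: string;    // code after the cursor
  maxTokens: number; // completions are brief by design
}

const request: FimRequest = {
  prefix: "function add(a: number, b: number) {\n  return ",
  suffix: ";\n}",
  maxTokens: 32,
};
```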
The first version of Townie was a straightforward ChatGPT-powered chat interface with a pre-filled system prompt and one-click code saving functionality. However, this proved inadequate because the feedback loop was poor—users needed iterative conversations to refine code, but the interface wasn’t optimized for this workflow.
Their subsequent experiment with OpenAI’s function calling (now “tool use”) provides a cautionary tale. Despite investing in cleaning up their OpenAPI spec and rebuilding Townie around structured function calling, the results were disappointing. The LLM would hallucinate functions that didn’t exist even with strict function definitions provided. While function calling has improved with Structured Outputs, Val Town concluded that the interface was “too generic”—capable of doing many things poorly rather than specific things well.
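To make the experiment concrete, here is a hedged sketch of the structured tool-calling pattern they describe. The tool name and schema are hypothetical stand-ins, not Val Town’s actual OpenAPI spec:

```typescript
// Hypothetical tool-calling setup: strict JSON Schema definitions are
// supplied, yet the model can still request tools that were never defined.
import OpenAI from "openai";

const openai = new OpenAI();

const res = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Create a val that says hello" }],
  tools: [
    {
      type: "function",
      function: {
        name: "create_val", // hypothetical tool derived from an API spec
        description: "Create a new val with the given name and source code",
        parameters: {
          type: "object",
          properties: {
            name: { type: "string" },
            code: { type: "string" },
          },
          required: ["name", "code"],
        },
      },
    },
  ],
});

// The hallucination problem: nothing guarantees the returned name matches a
// tool that was actually defined, so every call must be validated before
// being dispatched.
const call = res.choices[0].message.tool_calls?.[0];
console.log(call?.function.name, call?.function.arguments);
```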
This is a crucial LLMOps lesson: the promise of giving an LLM your API specification and expecting intelligent orchestration often underdelivers. The magic comes from carefully constraining what actions an agent can take and how it chains them together, rather than providing maximum flexibility.
The launch of Claude 3.5 Sonnet and Claude Artifacts in mid-2024 represented a turning point. Val Town observed that Claude 3.5 Sonnet was “dramatically better at generating code than anything we’d seen before” and that the Artifacts paradigm solved the tight feedback loop problem they had struggled with.
After about a month of prototyping, they launched the current version of Townie in August 2024. This version can generate fullstack applications—frontend, backend, and database—and deploy them in minutes. The architecture includes hosted runtime, persistent data storage (via @std/sqlite), and included LLM API access (@std/openai) so users can create AI-powered applications without managing their own API keys.
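The shape of a generated val follows from this architecture. Below is a minimal sketch of an HTTP val using both standard-library services; the table name, prompt, and model choice are illustrative assumptions:

```typescript
// Minimal sketch of a fullstack val: @std/sqlite for persistence and
// @std/openai for LLM access without a user-managed API key.
import { sqlite } from "https://esm.town/v/std/sqlite";
import { OpenAI } from "https://esm.town/v/std/openai";

export default async function (req: Request): Promise<Response> {
  await sqlite.execute(
    "CREATE TABLE IF NOT EXISTS greetings (id INTEGER PRIMARY KEY, text TEXT)",
  );

  const openai = new OpenAI(); // no API key needed inside Val Town
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini", // illustrative; the available models may differ
    messages: [{ role: "user", content: "Write a one-line greeting." }],
  });
  const greeting = completion.choices[0].message.content ?? "hello";

  await sqlite.execute({
    sql: "INSERT INTO greetings (text) VALUES (?)",
    args: [greeting],
  });

  return Response.json({ greeting });
}
```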
One of Val Town’s notable contributions to the space is their work on diff-based code generation, inspired by Aider. The motivation is clear: regenerating entire files for every iteration is slow and expensive. By having the LLM produce diffs instead, iteration cycles can be dramatically faster.
Their system prompt (which they keep publicly visible) includes specific instructions for handling diff versus full-code generation based on user requests. When users explicitly request diff format, Townie generates valid unified diffs based on existing code. However, this feature is currently off by default because reliability wasn’t sufficient—the model would sometimes produce malformed diffs or misapply changes.
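One way to cope with that failure mode is to validate a diff before applying it. The sketch below is our illustration, not Val Town’s code: it checks each hunk’s “@@ -a,b +c,d @@” header against the lines that follow, rejecting malformed output rather than misapplying it:

```typescript
// Hypothetical guard against malformed model-produced unified diffs.
// Simplified: assumes a single-file diff with explicit line counts and no
// "\ No newline at end of file" markers.
function validateUnifiedDiff(diff: string): boolean {
  const lines = diff.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const header = lines[i].match(/^@@ -\d+,(\d+) \+\d+,(\d+) @@/);
    if (!header) continue;
    let oldCount = 0;
    let newCount = 0;
    for (i++; i < lines.length; i++) {
      const tag = lines[i][0];
      if (tag === " ") { oldCount++; newCount++; } // context line
      else if (tag === "-") oldCount++;            // removed line
      else if (tag === "+") newCount++;            // added line
      else break;                                  // next hunk or junk
    }
    i--; // re-examine the line that ended the hunk
    if (oldCount !== Number(header[1]) || newCount !== Number(header[2])) {
      return false; // counts disagree with the header: reject the diff
    }
  }
  return true;
}
```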
The team expresses hope that Anthropic’s rumored “fast-edit mode” or OpenAI’s Predicted Outputs might solve this problem more robustly. They also point to faster inference hardware (Groq, Cerebras) and more efficient models (citing DeepSeek’s near-Sonnet-level model trained for only $6M) as potential paths to making the iteration speed problem less critical.
Val Town’s potentially novel contribution is their automatic error detection system. The implementation has two components: server-side error detection and client-side error detection in the generated application.
When errors are detected, Townie proactively asks users if they’d like it to attempt a fix. While the team modestly notes this isn’t particularly novel in concept, they suggest it may have influenced similar features in Anthropic’s tools and Bolt.
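The client-side half typically requires a small script injected into the generated page that reports uncaught errors back to the host, which can then offer the fix. A hypothetical sketch, with an assumed endpoint name and payload shape:

```typescript
// Hypothetical injected error reporter. The /__report-error endpoint and
// payload fields are assumptions for illustration.
window.addEventListener("error", (event: ErrorEvent) => {
  fetch("/__report-error", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      message: event.message,
      source: event.filename,
      line: event.lineno,
    }),
  });
});

// Rejected promises bypass the "error" event, so hook them separately.
window.addEventListener("unhandledrejection", (event: PromiseRejectionEvent) => {
  fetch("/__report-error", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ message: String(event.reason) }),
  });
});
```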
The case study reveals ongoing tension between model capability and speed/cost. Claude 3.5 Sonnet provides the best code generation quality, but inference is slow and expensive for iterative workflows. The team has explored alternatives like Cerebras-hosted models for near-instant feedback loops, and they express excitement about DeepSeek’s cost-efficient training approaches potentially enabling Sonnet-level quality at much lower costs.
Looking forward, Val Town envisions more autonomous, agent-like behavior inspired by tools like Windsurf and Devin.
They also mention interest in giving Townie access to search capabilities—across public vals, npm packages, and the broader internet—to find relevant code, documentation, and resources.
An interesting strategic tension emerges in the case study: should Val Town compete with dedicated AI editors like Cursor and Windsurf, or integrate with them? Their current approach is both—continuing to develop Townie while also improving their local development experience and API so external tools can “deploy to Val Town” similar to Netlify integrations.
The article is refreshingly candid about limitations and failed experiments. The tool-use version of Townie was explicitly called “a disappointment.” Diff generation doesn’t work reliably enough to be enabled by default. The first ChatGPT-based Townie “didn’t get much use” because the feedback loop was poor.
This transparency is valuable for LLMOps practitioners because it illustrates that even simple-seeming features often have subtle reliability challenges that only emerge in production. The team’s willingness to keep their system prompt open and blog about technical choices suggests a collaborative rather than secretive approach to competitive development in this space.
Val Town’s experience offers several lessons for teams building LLM-powered developer tools. Above all, it demonstrates that building production LLM features is as much about iteration and learning from failures as it is about initial implementation, and that keeping pace with rapid model improvements requires continuous adaptation of product architecture and prompting strategies.