## Overview
This case study is drawn from a live stream conversation featuring Joe, founder and CEO of Blueprint AI, discussing his journey with LLMs since the GPT-3 private beta in July 2020 and the practical lessons learned from building LLM-powered applications for production use. Blueprint AI, a startup incorporated in December 2022, uses large language models to solve problems in software development and product management, specifically improving communication between business and technical teams and providing automated visibility into development progress.
The conversation offers a unique perspective from someone who has been actively experimenting with LLMs for over three years—well before ChatGPT brought widespread attention to the technology. Joe's background as a serial entrepreneur (previously founder and CEO of Crowd Cow, an e-commerce company) combined with his hands-on technical approach provides practical insights into both the opportunities and challenges of deploying LLMs in production environments.
## Early LLM Experimentation and Production Use Cases
Joe's journey with LLMs began with the GPT-3 private beta in July 2020, making him an early practitioner with years of accumulated insights. He describes the early experience as working with a system that required extensive prompt engineering, particularly few-shot prompting with many examples to achieve reliable results. His early prompts were "really super long with tons and tons of examples" followed by the actual instruction.
Several concrete applications emerged from this experimentation period:
**Calorie Tracking Application**: One of his first experiments involved creating a calorie estimation app using few-shot prompting. He loaded the prompt with many examples of food items paired with their calorie counts and descriptions of healthiness. The key insight was that the model had essentially "already crawled the internet" so he didn't need to do the traditional data work of scraping, cleaning, and storing nutritional information. The app was further enhanced with Siri shortcut integration for voice input and response. While acknowledging the 80/20 limitation (the app works well most of the time but errors matter in calorie counting), he notes this was built in "half a day" with a better user experience than existing App Store alternatives.
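A minimal sketch of that few-shot pattern using the openai Python client; the example foods, calorie figures, and function name are illustrative, not Joe's actual prompt:

```python
# Few-shot calorie estimation: the prompt itself carries the "database".
# Illustrative sketch; the examples and wording are not Joe's actual prompt.
from openai import OpenAI

client = OpenAI()

FEW_SHOT = """\
Food: one medium banana
Calories: 105 (healthy: good source of potassium and fiber)

Food: two slices of pepperoni pizza
Calories: 570 (less healthy: high in saturated fat and sodium)

Food: grilled chicken salad with vinaigrette
Calories: 350 (healthy: lean protein and vegetables)
"""

def estimate_calories(food: str) -> str:
    # Append the new item in the same format and let the model complete it.
    prompt = f"{FEW_SHOT}\nFood: {food}\nCalories:"
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=60,
    )
    return resp.choices[0].message.content.strip()

print(estimate_calories("a bowl of oatmeal with blueberries"))
```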
**Home Automation Voice Interface**: Another early project involved creating a voice interface for home automation. The core challenge was getting the LLM to generate proper JSON to call existing home automation APIs. Using few-shot examples of commands like "turn off the lights in the kitchen" or "unlock the front door" paired with the corresponding JSON, he built a wrapper that would take the generated JSON and call the appropriate APIs. The breakthrough moment came when he tested compound commands like "turn off the kitchen lights and play music in the den" without having explicitly included array examples—the model correctly generated JSON with an array structure. Even more impressively, when given commands for features he hadn't implemented (like "open the windows in the garage"), the model would generate syntactically correct JSON using the patterns it had learned.
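A sketch of that command-to-JSON pattern; the schema, examples, and `call_home_api` wrapper are hypothetical, but the structure mirrors what Joe describes:

```python
# Few-shot prompt that maps natural-language commands to JSON for an
# existing home-automation API. The command schema and dispatch wrapper
# are hypothetical; the pattern is what is described above.
import json
from openai import OpenAI

client = OpenAI()

FEW_SHOT = """\
Command: turn off the lights in the kitchen
JSON: {"actions": [{"device": "lights", "room": "kitchen", "state": "off"}]}

Command: unlock the front door
JSON: {"actions": [{"device": "door_lock", "room": "front", "state": "unlocked"}]}
"""

def call_home_api(action: dict) -> None:
    # Thin wrapper around the real home-automation API (assumed).
    print(f"calling API with {action}")

def dispatch(command: str) -> None:
    prompt = f"{FEW_SHOT}\nCommand: {command}\nJSON:"
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    payload = json.loads(resp.choices[0].message.content)
    for action in payload["actions"]:  # compound commands yield multiple actions
        call_home_api(action)

# A compound command can produce a two-element array even though the
# few-shot examples above never demonstrated one.
dispatch("turn off the kitchen lights and play music in the den")
```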
## Blueprint AI: The Core Product
Blueprint AI's thesis centers on the observation that LLMs can be viewed as "very educated experts with years of experience who are willing to do the grunt work"—looking at every log file, every commit, tirelessly and consistently. The company targets several pain points in software development:
**The Business-Technical Divide**: There's significant tension between business stakeholders specifying requirements based on business goals and engineers implementing technical specifications. LLMs excel at translating between domains, helping these experts communicate more effectively and understand tradeoffs in real-time (e.g., understanding that a seemingly simple feature request might take a month to implement).
**Visibility and Status Reporting**: As organizations grow beyond small teams, it becomes exponentially harder to understand what's happening even at a high level. Reading git commits and release notes is time-consuming and requires technical knowledge that not everyone has.
The first product Blueprint AI built is an automated report generator that connects to Jira and GitHub (including code changes and GitHub Issues). The system runs 24/7, processing volumes of data that "would never fit even in a 100K prompt" through recursive summarization and inference (sketched after the list below). Key capabilities include:
- Surfacing what happened yesterday and last week
- Identifying relationships between Jira items and conversations elsewhere
- Detecting when teams might be blocked
- Sending real-time alerts when stakeholders need to take action (e.g., notifying when a customer-requested feature that will unlock revenue is ready)
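The recursive summarization mentioned above can be sketched compactly: split the input, summarize each chunk, then summarize the summaries until the result fits in one prompt. Chunk sizes, prompts, and the tiktoken-based splitting are illustrative assumptions, not Blueprint's implementation:

```python
# Recursive summarization: fold arbitrarily large inputs (commits, tickets,
# logs) into a single report by summarizing chunks, then summarizing the
# summaries. Chunk size and prompt wording are illustrative assumptions.
from openai import OpenAI
import tiktoken

client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-4")
CHUNK_TOKENS = 3000  # keep each call well under the model's context window

def summarize(text: str, instruction: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"{instruction}\n\n{text}"}],
    )
    return resp.choices[0].message.content

def chunk(text: str, max_tokens: int = CHUNK_TOKENS) -> list[str]:
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

def recursive_summarize(text: str) -> str:
    # Assumes each summarization pass shrinks the text.
    while len(enc.encode(text)) > CHUNK_TOKENS:
        parts = [summarize(c, "Summarize this development activity:")
                 for c in chunk(text)]
        text = "\n\n".join(parts)
    return summarize(text, "Write a status report of what happened and "
                           "flag anything that looks blocked:")
```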
The reports are currently delivered via Slack, with plans for more personalization in delivery methods. The company has also packaged this functionality as a free service monitoring popular open source projects like DuckDB, LangChain, and BabyAGI to demonstrate value and drive interest.
## Infrastructure and Tooling Approaches
Joe has built considerable personal infrastructure over time, maintaining a repo since June 2020 that has evolved to support rapid experimentation. Key components include:
**Prompt Templating System**: A database-backed system storing prompts with names and variables. This enables prompt engineering work to be stored and recalled into a web UI where only variable substitutions need to be typed. The system also exposes these prompts as APIs, so any application can call the prompt with a hash table of variables. This reusable infrastructure accelerates development across all projects.
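A minimal sketch of such a system using SQLite and `string.Template`; the schema and prompt names are assumptions:

```python
# Database-backed prompt templates, recalled by name and rendered with a
# dict of variables. Minimal sketch; schema and names are assumed.
import sqlite3
from string import Template

db = sqlite3.connect("prompts.db")
db.execute("CREATE TABLE IF NOT EXISTS prompts (name TEXT PRIMARY KEY, body TEXT)")

def save_prompt(name: str, body: str) -> None:
    db.execute("INSERT OR REPLACE INTO prompts VALUES (?, ?)", (name, body))
    db.commit()

def render_prompt(name: str, variables: dict) -> str:
    (body,) = db.execute("SELECT body FROM prompts WHERE name = ?", (name,)).fetchone()
    return Template(body).substitute(variables)

save_prompt("calorie_estimate", "Estimate the calories in: $food")
print(render_prompt("calorie_estimate", {"food": "a bowl of ramen"}))
```

Exposing `render_prompt` behind a small HTTP endpoint would turn every stored prompt into an API callable with a hash table of variables, as described.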
**Mega Prompts**: A pattern where GPT-4's 32K context window is loaded with extensive context: entire bios, resumes, LinkedIn profiles, company elevator pitches, brand style guides, and frequently asked questions. Each section is delimited with headers. The prompt ends with instructions like "on behalf of that guy up there, follow these instructions and then [blank]" where the blank is the only variable. This enables use cases like having an "AI Tinkerers organizer" that has all the context about the meetup and can write emails or respond to questions on its behalf.
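A sketch of the assembly step; the section names and file paths are placeholders:

```python
# Assembling a "mega prompt": long static context sections, each under a
# header, with a single instruction variable at the end. Section names and
# file paths are placeholders.
SECTIONS = {
    "BIO": open("bio.txt").read(),
    "ELEVATOR PITCH": open("pitch.txt").read(),
    "STYLE GUIDE": open("style_guide.txt").read(),
    "FAQ": open("faq.txt").read(),
}

def mega_prompt(instruction: str) -> str:
    context = "\n\n".join(f"## {name}\n{body}" for name, body in SECTIONS.items())
    return (f"{context}\n\n"
            f"On behalf of the person described above, follow these "
            f"instructions: {instruction}")

prompt = mega_prompt("Write a welcome email for new AI Tinkerers members.")
```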
**Document Summarization Pipeline**: Built to learn query-augmented retrieval and recursive summarization, this tool has expanded to include PDF ingestion, YouTube video downloading with speech-to-text, and notably a slideshow output feature. The slideshow output extracts key points, longer summaries, and super short summaries into a format that can be quickly flipped through. Joe admits he initially thought he'd "never use" the slideshow feature but finds it "super useful" for consuming long-form content like three-hour videos. This has increased his usage of the app 10x.
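The slideshow output might look like a single structured-extraction pass over the summarized content; the prompt wording and field names below are assumptions (for a three-hour video, the input would be the recursively summarized transcript):

```python
# Slideshow output: one pass that asks for key points plus summaries at two
# granularities as JSON, then renders them as flippable "slides".
# Prompt wording and field names are assumptions.
import json
from openai import OpenAI

client = OpenAI()

def slideshow(document: str) -> list[dict]:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content":
            "Return JSON with fields 'key_points' (list of strings), "
            "'summary' (a paragraph), and 'one_liner' (one sentence) "
            f"for this document:\n\n{document}"}],
        temperature=0,
    )
    data = json.loads(resp.choices[0].message.content)
    slides = [{"title": "One-liner", "body": data["one_liner"]},
              {"title": "Summary", "body": data["summary"]}]
    slides += [{"title": f"Key point {i + 1}", "body": p}
               for i, p in enumerate(data["key_points"])]
    return slides
```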
## Production Challenges and Lessons Learned
The conversation reveals several critical challenges when deploying LLMs in production:
**Performance, Speed, and Reliability**: This is identified as "the biggest problem right now." Demand has caused significant slowdowns and unreliability, with "on certain days every fifth call is just timing out." Ironically, GPT-4 32K was found to be much faster than standard GPT-4, presumably because fewer people are using it. This observation led to reconsidering cost-benefit calculations—using the more expensive model became worthwhile because it was faster and users were waiting. Recommendations include implementing streaming for any user-facing applications (providing "a huge psychological boost to your performance"), using caching and approximate caches for instant results, pre-computing where possible, and building systems to detect failures and retry or failover.
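A minimal sketch combining two of those recommendations, streaming and retry-with-failover, using the openai Python client; the model ordering, timeout, and error handling are illustrative assumptions:

```python
# Stream tokens to the user as they arrive, and fall back to another model
# when a call fails or times out. Model order and timeout are illustrative.
from openai import OpenAI

client = OpenAI()
MODELS = ["gpt-4-32k", "gpt-4"]  # ordered by observed latency/reliability

def stream_with_failover(prompt: str) -> str:
    last_err = None
    for model in MODELS:
        try:
            stream = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                stream=True,
                timeout=30,
            )
            out = []
            for chunk in stream:
                delta = chunk.choices[0].delta.content or ""
                print(delta, end="", flush=True)  # user sees progress immediately
                out.append(delta)
            return "".join(out)
        except Exception as err:  # timeouts, rate limits, 5xx
            last_err = err
    raise RuntimeError("all models failed") from last_err
```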
**Prompt Engineering and Regression Testing**: This is described as an "Art and Science" where no existing solutions feel satisfactory. The typical workflow involves extensive prompt engineering upfront, limited systematic testing across input ranges, then deployment. When edge cases are discovered (the "20%" that doesn't work well), tweaking the prompt raises concerns about regressions in the cases that were working. Joe notes there isn't a great "out of the box framework for detecting those regressions," and the problem is complicated by the diversity of prompts across applications—from trivial classification to creative open-ended analysis. He mentions that at an AI Tinkerers meetup of 100+ builders, only two people had actually built anything to address this problem. Solutions mentioned include building tools to look at logs and rerun them with new prompts.
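A bare-bones version of the log-replay approach he mentions: rerun logged inputs against a candidate prompt and flag divergences from saved baselines. The log format and the exact-match check are assumptions; open-ended tasks would need a semantic comparison instead:

```python
# Minimal prompt-regression harness: replay logged inputs against a
# candidate prompt and diff results against saved baselines.
import json
from openai import OpenAI

client = OpenAI()

def run(prompt_template: str, variables: dict) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt_template.format(**variables)}],
        temperature=0,
    )
    return resp.choices[0].message.content

def regression_test(candidate_prompt: str, log_path: str = "prompt_log.jsonl"):
    failures = []
    for line in open(log_path):
        case = json.loads(line)  # assumed shape: {"variables": {...}, "expected": "..."}
        got = run(candidate_prompt, case["variables"])
        if got.strip() != case["expected"].strip():  # exact match; swap in a
            failures.append((case, got))             # semantic check for open-ended tasks
    return failures
```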
**Context Window Performance**: Joe references work by a community member who systematically tested performance across different context window sizes using synthetic "needle in haystack" data. The results showed performance degradation around 6-7K tokens, with Claude actually outperforming GPT-4 for longer contexts above that threshold. This has practical implications for deciding when to break problems into multiple calls versus using longer contexts.
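A simplified version of that needle-in-a-haystack methodology, with hypothetical filler text and needle; the community experiment swept many more sizes and insertion depths:

```python
# Needle-in-a-haystack probe: bury a known fact at a chosen depth in filler
# text of a chosen token length and check whether the model retrieves it.
from openai import OpenAI
import tiktoken

client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-4")
NEEDLE = "The secret code is 48151623."
FILLER = "The sky was a pleasant shade of blue that afternoon. " * 2000

def probe(context_tokens: int, depth: float) -> bool:
    haystack = enc.decode(enc.encode(FILLER)[:context_tokens])
    cut = int(len(haystack) * depth)
    doc = haystack[:cut] + " " + NEEDLE + " " + haystack[cut:]
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"{doc}\n\nWhat is the secret code?"}],
        temperature=0,
    )
    return "48151623" in resp.choices[0].message.content

for size in (2000, 4000, 6000, 8000):
    print(size, probe(size, depth=0.5))
```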
**Model Selection and Routing**: The observation about Claude vs. GPT-4 performance at different context lengths suggests value in infrastructure that can route to the optimal model based on task characteristics. Joe expresses interest in platforms that could "help me decipher and figure that and move it to the right model at the right time" automatically.
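A trivial router along those lines; the 6.5K-token crossover and the wrapper functions are assumptions for illustration:

```python
# Route by context length: long contexts to Claude, shorter ones to GPT-4,
# per the crossover described above. Threshold and wrappers are assumed.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
CROSSOVER_TOKENS = 6500  # rough crossover from the experiments described above

def call_gpt4(prompt: str) -> str:
    raise NotImplementedError  # wrapper around OpenAI's API (assumed)

def call_claude(prompt: str) -> str:
    raise NotImplementedError  # wrapper around Anthropic's API (assumed)

def route(prompt: str) -> str:
    if len(enc.encode(prompt)) > CROSSOVER_TOKENS:
        return call_claude(prompt)
    return call_gpt4(prompt)
```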
**A/B Testing of Prompts**: Joe describes a workflow where he writes a specific prompt, then runs it back through GPT-4 asking it to "make that prompt better." The improved prompt often performs better. He suggests this recursive prompt engineering approach should be built into platforms—deploying code and waking up to find the system has run prompt testing overnight against real data from logs and suggests a better-performing version.
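A sketch of that recursive prompt-improvement loop; the scoring callback is an assumption (in practice it might replay logged cases or use an LLM-as-judge):

```python
# "Make that prompt better": feed the current prompt back to GPT-4, then
# keep whichever version scores higher. The score callback is an assumption.
from openai import OpenAI

client = OpenAI()

def improve_prompt(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": "Make this prompt better. Return only the "
                              f"improved prompt.\n\n{prompt}"}],
    )
    return resp.choices[0].message.content

def best_of(prompt: str, score) -> str:
    # score: callable that evaluates a prompt, e.g. against logged real data.
    candidate = improve_prompt(prompt)
    return candidate if score(candidate) > score(prompt) else prompt
```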
**Fine-tuning and User Feedback**: Incorporating user feedback (thumbs up, edits) to improve models is identified as a gap. While ML practitioners may have solutions for fine-tuning, there's a lack of accessible tooling for "app people" to push feedback through UI interactions back to model improvement. This is particularly challenging when using models like GPT-4 that don't yet offer fine-tuning.
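One way to start closing that gap is simply to capture feedback in a training-ready shape so the data exists once fine-tuning becomes available; the JSONL layout below mirrors common chat fine-tuning formats but is an assumption:

```python
# Capture UI feedback (thumbs, edits) as training-ready JSONL records.
# Record layout is an assumption, not an established tooling standard.
import json
import time

def log_feedback(prompt: str, completion: str, rating: int,
                 edited: str | None = None, path: str = "feedback.jsonl") -> None:
    record = {
        "ts": time.time(),
        "messages": [{"role": "user", "content": prompt},
                     {"role": "assistant", "content": edited or completion}],
        "rating": rating,          # +1 thumbs up, -1 thumbs down
        "was_edited": edited is not None,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```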
## Philosophy and Approach
Blueprint AI's current philosophy is "no training before product-market fit." They're monitoring open source developments and cheering on progress in models like LLaMA 2, but their startup focus is on rapid experimentation using the highest quality available model (GPT-4 via OpenAI). The founder may experiment with open source and fine-tuning on side projects "just to be current" but the company's resources are focused on finding product-market fit first.
The broader philosophy emphasizes "AI-native" workflows: starting any project work with a sandbox, loading context, and using the LLM as a "thought partner" by default. The attitude toward doing the work manually is "try not to do that—why bother? It's maybe faster and probably better" to use the AI assistance.
## Educational Applications
A notable side project demonstrates the potential for LLM-powered education tools. Joe built a reading comprehension test generator for his 12-year-old son. The system ingests text documents (from sources like Project Gutenberg) and uses prompting to generate SAT-style question sets with multiple-choice answers covering facts, style, and analogies. The JSON output is then run back through the model to generate answer sets with explanations for both correct and incorrect answers. Finally, GPT-4 generates an interactive HTML/JavaScript app that provides a UI for taking the tests with immediate feedback. The entire system was built in less than one day, and his son's reading improvement over "a month or two" exceeded a year of school progress.
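A sketch of the first two stages (question generation, then explanation generation); the prompt wording and JSON shape are assumptions:

```python
# Two-stage pipeline: document in, JSON question set out, then a second
# pass that adds explanations. Prompts and JSON shape are assumptions.
import json
from openai import OpenAI

client = OpenAI()

def generate_questions(passage: str, n: int = 5) -> list[dict]:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content":
            f"Write {n} SAT-style multiple-choice questions (facts, style, "
            f"analogies) about the passage below. Return a JSON list of "
            f'{{"question", "choices", "answer"}} objects.\n\n{passage}'}],
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

def add_explanations(questions: list[dict]) -> list[dict]:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content":
            "For each question, explain why the answer is correct and why "
            "each distractor is wrong. Return the same JSON with an added "
            f"'explanations' field.\n\n{json.dumps(questions)}"}],
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)
```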
## Community and Ecosystem
Joe runs AI Tinkerers, a technical meetup focused on practitioners actively building with LLMs. The meetup format emphasizes sharing technical problems and solutions rather than pitches or networking. Lightning-talk speakers are encouraged to show code and terminals rather than slides ("if your company name needs to be in your presentation, make it as part of a stack trace"). The community is expanding to multiple cities including London, Berlin, New York, San Francisco, Denver, and Los Angeles, with infrastructure to help organizers launch in new locations.