Thoughtworks built Boba, an experimental AI co-pilot for product strategy and ideation, to explore effective patterns for LLM-powered applications beyond simple chat interfaces. The team developed and documented key patterns including templated prompts, structured responses, real-time progress streaming, context management, and external knowledge integration. The case study provides detailed implementation insights for building sophisticated LLM applications with better user experiences.
Thoughtworks, a global technology consultancy, developed an experimental AI co-pilot called “Boba” designed to augment product strategy and generative ideation processes. The project, published in June 2023, serves as both a practical tool and a learning platform for understanding how to build LLM-powered generative applications that go beyond simple chat interfaces. The team documented their learnings in the form of eight reusable patterns that address common challenges when building production-ready LLM applications.
Boba is positioned as an “AI co-pilot” — an AI-powered assistant designed to help users with specific domain tasks, in this case early-stage strategy ideation and concept generation. The application mediates interactions between human users and OpenAI’s GPT-3.5/4 models, adding UI elements and prompt orchestration logic that help users who may not be skilled prompt engineers get better results from the underlying LLM.
Boba is built as a web application whose frontend communicates with a backend service, which in turn interfaces with OpenAI’s API. Key technology choices include:
The team noted a significant observation about development time allocation: approximately 80% of effort went into user interface development, while only 20% went into the AI/prompt engineering aspects. This suggests that building production LLM applications involves substantial frontend and UX work beyond just the model integration.
The first pattern addresses the need to enrich simple user inputs with additional context and structure before sending them to the LLM. Using LangChain’s templating capabilities (similar to JavaScript templating engines like Nunjucks or Handlebars), the team built prompt templates that incorporate user selections from the UI along with domain-specific context.
For example, when generating future scenarios, a user might simply enter “Show me the future of payments,” but the template enriches this with parameters for time horizon, optimism level, and realism constraints. The team emphasized keeping templates simple and avoiding complex conditional logic within templates — instead using different template files for substantially different use cases.
A key prompt engineering technique mentioned is the “Adopt a Persona” approach, where the prompt begins by telling the LLM to act as a specific role (e.g., “You are a visionary futurist”). The team found this technique particularly effective for producing relevant completions.
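A minimal sketch of the templated-prompt pattern, combined with the persona technique, might look like the following. The helper name and parameters are illustrative, not from the Boba codebase; Boba itself uses LangChain/Nunjucks-style templates rather than plain string interpolation.

```javascript
// Illustrative sketch: enrich a bare user query with persona, time horizon,
// and tone/realism parameters collected from the UI.
function renderScenarioPrompt({ query, timeHorizon, optimism, realism }) {
  return [
    "You are a visionary futurist.", // the "Adopt a Persona" technique
    `Given the strategic prompt: "${query}",`,
    `describe plausible scenarios ${timeHorizon} from now.`,
    `Optimism: ${optimism}. Realism: ${realism}.`,
  ].join("\n");
}

const prompt = renderScenarioPrompt({
  query: "Show me the future of payments",
  timeHorizon: "10 years",
  optimism: "optimistic",
  realism: "scientifically plausible",
});
```

Keeping the enrichment in a single flat template like this, rather than branching inside it, mirrors the team's advice to use separate template files for substantially different use cases.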
Almost all production LLM applications need to parse LLM output into structured data for further processing. The team focused on getting GPT to return well-formed JSON and reported being “quite surprised by how well and consistently GPT returns well-formed JSON based on the instructions.”
They documented two approaches for achieving structured output:
The team observed an interesting effect: by having the model repeat row and column values before generating ideas in the Creative Matrix feature, they nudged the quality of responses higher. This aligns with the idea that “LLMs think in tokens” — generating more contextual tokens before the final answer leads to better outputs.
They also mentioned OpenAI’s Function Calling feature (released around the time of writing) as an alternative approach for structured responses, particularly useful when invoking external tools.
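As a hedged sketch of the structured-response pattern (names and prompt wording are illustrative, not from Boba): describe the expected JSON shape in the prompt, then parse the completion defensively, since models occasionally wrap JSON in prose or code fences.

```javascript
// Instruction appended to the prompt, describing the desired JSON shape.
const formatInstruction =
  'Respond ONLY with JSON of the shape {"scenarios": [{"title": string, "summary": string}]}.';

// Defensive parse: extract the outermost JSON object and validate its shape.
function parseScenarios(completion) {
  const match = completion.match(/\{[\s\S]*\}/);
  if (!match) throw new Error("no JSON object found in completion");
  const data = JSON.parse(match[0]);
  if (!Array.isArray(data.scenarios)) throw new Error("missing 'scenarios' array");
  return data.scenarios;
}

const scenarios = parseScenarios(
  'Sure!\n{"scenarios": [{"title": "Cashless cities", "summary": "..."}]}'
);
```

With OpenAI's Function Calling, much of this validation shifts to the model side, since the schema is declared up front rather than described in prose.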
A critical UX challenge in LLM applications is latency. The team noted that “a user can only wait on a spinner for so long before losing patience” and recommended showing real-time progress for any operation taking more than a few seconds.
Their implementation uses LangChain’s streaming callbacks:
import { ChatOpenAI } from "langchain/chat_models/openai";
import { CallbackManager } from "langchain/callbacks";

const chat = new ChatOpenAI({
  streaming: true,
  callbackManager: CallbackManager.fromHandlers({
    // Called once per token as the completion streams in; onTokenStream
    // forwards each token to the UI layer.
    async handleLLMNewToken(token) {
      onTokenStream(token);
    },
  }),
});
However, they acknowledge this adds significant complexity, requiring best-effort JSON parsing during streaming and temporal state management during LLM calls. They mention the Vercel AI SDK as a promising library for simplifying streaming in web applications.
An important UX benefit of streaming is the ability to let users stop a generation mid-completion if the initial results don’t match expectations, improving the overall interactive experience.
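The “best-effort JSON parsing” the team mentions could be sketched roughly as follows (this is an assumption about the approach, not Boba's actual code): keep a buffer of streamed tokens and, after each token, try to close any open strings and brackets so the partial JSON can be parsed and rendered incrementally.

```javascript
// Best-effort parse of a partial JSON buffer: append the closers needed to
// balance open strings/brackets, then attempt a normal JSON.parse.
function bestEffortParse(buffer) {
  const closers = [];
  let inString = false;
  for (let i = 0; i < buffer.length; i++) {
    const ch = buffer[i];
    if (inString) {
      if (ch === "\\") i++; // skip the escaped character
      else if (ch === '"') inString = false;
    } else if (ch === '"') inString = true;
    else if (ch === "{") closers.push("}");
    else if (ch === "[") closers.push("]");
    else if (ch === "}" || ch === "]") closers.pop();
  }
  const candidate = buffer + (inString ? '"' : "") + closers.reverse().join("");
  try {
    return JSON.parse(candidate);
  } catch {
    return null; // not parseable yet; wait for more tokens
  }
}
```

Partial buffers that end mid-key or after a comma still fail to parse and return `null`, which is exactly the “best-effort” trade-off: the UI renders the last successful parse and tries again on the next token.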
This pattern addresses the limitation of single-threaded context in chat interfaces. By allowing users to select specific elements (scenarios, strategies, concepts) and perform actions on them, the application can narrow or broaden the scope of interaction dynamically.
Implementation varies in complexity depending on context size:
The team recommends watching Linus Lee’s talk “Generative Experiences Beyond Chat” for deeper exploration of this pattern.
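The core idea can be sketched with a hypothetical helper: instead of replaying an entire chat transcript, the prompt is scoped to the elements the user has selected in the UI plus the action they chose.

```javascript
// Sketch of selection-scoped context: only the selected items, not the whole
// conversation history, are serialized into the prompt.
function buildActionPrompt(action, selectedItems) {
  const context = selectedItems
    .map((item, i) => `${i + 1}. ${item.title}: ${item.summary}`)
    .join("\n");
  return `Given these selected scenarios:\n${context}\n\nTask: ${action}`;
}
```

Selecting one item narrows the scope to a single scenario; selecting several broadens it, without the prompt ever carrying unrelated conversational baggage.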
While Boba aims to break out of the chat interface paradigm, the team found it valuable to maintain a “fallback” channel for direct LLM conversation within specific contexts. This supports interactions not explicitly designed in the UI and cases where natural language conversation is genuinely the best UX.
Key implementation details include providing example messages/templates to help users understand the types of conversations possible, and rendering LLM responses as formatted Markdown for readability.
Based on the principle that “LLMs ‘think’ in tokens” (attributed to Andrej Karpathy), this pattern uses Chain of Thought (CoT) prompting to improve response quality. By asking the LLM to generate intermediate reasoning steps (such as questions that expand on the user’s prompt) before producing final answers, the team achieved higher-quality and more relevant outputs.
The team offers two variants:
They recommend creating UI affordances for toggling visibility of the reasoning process, giving users control over the level of detail they see.
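A two-call variant of this pattern might be sketched as follows, with `llm` as an injected async function (in Boba this would be a LangChain chat-model call) and the prompt wording purely illustrative.

```javascript
// Two-step Chain of Thought sketch: generate intermediate reasoning first,
// then answer with that reasoning included in the context window.
async function generateWithReasoning(llm, userPrompt) {
  // Step 1: produce expanding questions — extra tokens for the model to
  // "think" with before committing to an answer.
  const questions = await llm(
    `Before answering, list three questions that expand on: "${userPrompt}"`
  );
  // Step 2: answer with the intermediate reasoning in context.
  const answer = await llm(
    `Questions to consider:\n${questions}\n\nNow answer: ${userPrompt}`
  );
  return { questions, answer }; // the UI can toggle visibility of `questions`
}
```

Returning the reasoning separately from the answer is what makes the recommended UI affordance possible: the questions can be hidden by default and revealed on demand.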
Acknowledging that LLMs will inevitably misunderstand user intent or generate unsatisfactory responses, this pattern emphasizes building robust back-and-forth interaction capabilities. Approaches include:
A concrete example is Boba’s storyboarding feature, where users can iterate on Stable Diffusion image prompts for individual scenes without regenerating the entire storyboard.
The team mentions working on reinforcement learning-style feedback mechanisms (thumbs up/down, natural language feedback) to improve recommendations over time, similar to GitHub Copilot’s approach of demoting ignored suggestions.
This pattern addresses LLM knowledge cutoff limitations by combining LLMs with external data sources. The team’s implementation for the “Research Signals” feature follows a classic RAG (Retrieval-Augmented Generation) pipeline:
The implementation is notably concise with LangChain:
import { HNSWLib } from "langchain/vectorstores/hnswlib";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
import { VectorDBQAChain } from "langchain/chains";

// `docs` holds the scraped search results; `model` is the chat model and
// `prompt` the user's research query (both defined elsewhere).
const vectorStore = await HNSWLib.fromDocuments(docs, new OpenAIEmbeddings());
const chain = VectorDBQAChain.fromLLM(model, vectorStore);
const res = await chain.call({
  input_documents: docs,
  query: prompt + ". Be detailed in your response.",
});
For larger-scale or long-term memory use cases, the team recommends external vector databases like Pinecone or Weaviate instead of in-memory solutions.
An important benefit of this approach is providing proper source links and references — since search results come from a real search engine, the references won’t be hallucinations (the team humorously notes “as long as the search engine isn’t partaking of the wrong mushrooms”).
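Under the hood, the retrieval step of this pipeline amounts to ranking documents by similarity between query and document embeddings — a rough sketch, with toy 2-d vectors standing in for real OpenAIEmbeddings output:

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the k documents whose embeddings are most similar to the query's.
function topK(queryVec, docs, k) {
  return docs
    .map((d) => ({ ...d, score: cosine(queryVec, d.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

An in-memory store like HNSWLib replaces this brute-force scan with an approximate nearest-neighbor index, which is the part external vector databases like Pinecone or Weaviate scale beyond a single process.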
Several cross-cutting observations are valuable for LLMOps practitioners:
The article represents a practical, experience-based perspective on building LLM applications, with the patterns offering reusable approaches that other teams can adapt. The team acknowledges this is “just scratching the surface” and that many principles, patterns, and practices for LLM-powered applications are still being discovered.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Stripe, which processes approximately 1.3% of global GDP, has evolved from traditional ML-based fraud detection to transformer-based foundation models for payments that score every transaction in under 100ms. The company built a domain-specific foundation model that treats charges as tokens and behavior sequences as context windows, trained on tens of billions of transactions; for large merchants, it improved card-testing detection accuracy from 59% to 97%. Stripe also launched the Agentic Commerce Protocol (ACP) jointly with OpenAI to standardize how agents discover and purchase from merchant catalogs. Internally, AI adoption has reached 8,500 employees using LLM tools daily, with 65-70% of engineers using AI coding assistants and notable productivity gains, such as reducing payment method integrations from two months to two weeks.
Notion AI, serving over 100 million users with AI features including meeting notes, enterprise search, and deep research tools, demonstrates how rigorous evaluation and observability practices are essential for scaling AI product development. The company uses Braintrust as its evaluation platform to manage the complexity of supporting multilingual workspaces, rapid model switching, and product polish while building at the pace of AI industry innovation. Their approach holds that 90% of AI development time should be spent on evaluation and observability rather than prompting, with specialized data specialists creating targeted datasets and custom LLM-as-a-judge scoring functions to ensure consistent quality across their diverse AI product suite.