Company: Weights & Biases

Title: Building a Voice Assistant from Open Source LLMs: A Home Project Case Study

Industry: Tech

Year: 2023
Summary (short): A developer built a custom voice assistant similar to Alexa using open-source LLMs, demonstrating the journey from prototype to production-ready system. The project used Whisper for speech recognition and open LLMs (Llama 2, then Mistral) running on consumer hardware, with systematic improvements through prompt engineering and fine-tuning that raised command-interpretation accuracy to 98%. It shows how iterative improvement and a proper evaluation framework are crucial for LLM applications.
## Overview

This case study comes from a talk by the founder of Weights & Biases, an AI developer platform company that helps engineers build ML and GenAI applications. The presentation addresses a fundamental challenge facing organizations today: AI applications are remarkably easy to demonstrate but significantly more difficult to productionize. The speaker notes that over 70% of the audience already have LLM applications in production, yet many are still struggling with the process, highlighting how widespread this challenge is across the industry.

Weights & Biases works with a broad spectrum of customers, from foundation model builders (including most major LLM providers such as OpenAI, Mistral, and Meta's Llama team) to AI engineers building custom applications across healthcare, agtech, manufacturing, and other industries. The company has observed that the democratization of AI through conversational interfaces has enabled software developers, who vastly outnumber ML engineers, to build AI applications, creating both opportunities and challenges for productionization.

## The Core Challenge: Demo vs. Production

The speaker articulates a fundamental tension in AI development that drives much of LLMOps: AI is exceptionally easy to demo but extraordinarily hard to productionize. This gap is larger than in traditional software development for several reasons.

The software development process is fundamentally linear: you add features, add code, and things generally improve over time. In contrast, the AI development process is experimental and non-deterministic. When working with LLMs, most activities involve trying something and seeing what happens. You cannot write CI/CD tests that are both meaningful and guaranteed to pass 100% of the time. This represents a fundamentally different workflow that requires different tooling.

This distinction has profound implications for intellectual property and knowledge management. In software development, the code itself is the IP. In AI development, the learning is the IP: not the final model or prompt, but all the experiments, failed approaches, and insights accumulated along the way. If this learning isn't captured systematically, then when an engineer leaves, the IP leaves with them.

## Case Study: Building a Custom Voice Assistant

The speaker presents a personal project as an illustrative example that mirrors what enterprise customers experience. The goal was to build a custom voice assistant (an alternative to Alexa) that could understand natural language commands and execute skills such as playing music, checking the weather, doing math problems, and reading the news.

### Architecture and Technical Stack

The system architecture involves:

- Speech-to-text transcription using Whisper (an open-source model)
- On-device LLM processing to convert natural language into function calls
- A library of skills (simple Python functions that call APIs)

The technical challenge was translating natural speech into structured, Python-like code. For example, "What's the weather in Boston?" needs to become `weather(location="Boston")` to call the appropriate skill function.

The stack included:

- Whisper for speech transcription
- Llama 2 (later Mistral) for language understanding
- llama.cpp for running models on commodity hardware (specifically a Raspberry Pi-like device costing around $200)

Latency was a critical constraint: the entire pipeline needed to complete within a couple hundred milliseconds to provide an acceptable user experience.
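The following is a hypothetical sketch (not the speaker's actual code) of how the skill-dispatch layer might work: the LLM is prompted to emit a single Python-like call such as `weather(location="Boston")`, which is then parsed and routed to a registered skill function. All names here (`skill`, `SKILLS`, `dispatch`, the example skills) are illustrative assumptions.

```python
# Hypothetical skill registry and dispatcher for LLM-generated calls like
# weather(location="Boston"); not taken from the talk, shown only to make
# the "natural language -> function call" step concrete.
import ast

SKILLS = {}

def skill(fn):
    """Register a plain Python function as a callable skill."""
    SKILLS[fn.__name__] = fn
    return fn

@skill
def weather(location: str) -> str:
    # In the real assistant this would call a weather API.
    return f"It is currently sunny in {location}."

@skill
def play_music(artist: str) -> str:
    return f"Playing music by {artist}."

def dispatch(llm_output: str) -> str:
    """Parse an LLM-generated call string and run the matching registered skill."""
    tree = ast.parse(llm_output.strip(), mode="eval")
    call = tree.body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        raise ValueError(f"Not a simple function call: {llm_output!r}")
    if call.func.id not in SKILLS:
        raise ValueError(f"Unknown skill: {call.func.id}")
    args = [ast.literal_eval(a) for a in call.args]
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return SKILLS[call.func.id](*args, **kwargs)

if __name__ == "__main__":
    print(dispatch('weather(location="Boston")'))  # -> "It is currently sunny in Boston."
```

Restricting the model's output to a single call against a fixed registry keeps parsing cheap and predictable, which matters under a latency budget of a few hundred milliseconds.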
### The Iterative Improvement Journey

The project demonstrates the typical journey from demo to production:

**Starting Point (0% Accuracy):** Using Llama 2 with a default prompt produced zero working function calls. This reflects the common experience where initial attempts with LLMs fail completely.

**Prompt Engineering Phase:** Common-sense improvements to the prompt, such as laying out the available functions and providing clearer instructions, improved performance. The model began producing outputs closer to the desired format (e.g., "call: weather location equals Boston") but still not in a parseable form. This is the typical first step in LLMOps: optimizing prompts before considering more complex interventions.

**Model Selection:** Switching from the Llama 2 base model to Llama 2 Chat (fine-tuned for conversations) improved accuracy to 11%.

**Error Analysis and Iteration:** Examining the specific errors the model was making and incorporating that feedback into prompt refinements brought accuracy to 75%. This feedback loop is essential to production AI development. Switching to Mistral, which was released mid-project, then provided a further boost to 79% essentially for free, illustrating an important LLMOps principle: staying current with model releases can deliver significant improvements with minimal effort.

**Fine-Tuning Phase:** With 75-79% accuracy still being "incredibly annoying" in practice, the project turned to fine-tuning. Using QLoRA (Quantized Low-Rank Adaptation) made this tractable on consumer hardware (a single 4080 GPU). The training data was generated with a larger model (ChatGPT) producing synthetic examples, followed by manual curation; about 95% of the generated examples were usable. Fine-tuning on top of Mistral achieved 98% accuracy, finally production-viable.

**Unexpected Benefits:** The multilingual capabilities of both Whisper and the LLMs meant the system worked across languages without explicit training, demonstrating how modern LLMs can provide unexpected value.

## Key LLMOps Lessons and Best Practices

### Reproducibility and Experiment Tracking

Reproducibility is emphasized as critical but incredibly hard to achieve. The speaker argues that tracking must happen automatically through background processes; you cannot rely on humans to manually document everything. Without reproducibility:

- Knowledge walks out the door when engineers leave
- Teams cannot effectively collaborate
- Iteration speed suffers

The real ROI of tracking tools comes from enabling faster iteration through better collaboration. The project logged all experiments (including failures) in Weights & Biases, making them available for others to learn from.

### The Evaluation Framework Imperative

The talk identifies building an evaluation framework as the foundational requirement for productionizing LLM applications. The speaker recounts interviewing a prominent CEO who admitted to "testing by vibes." While there is some value in qualitative assessment, it creates an insurmountable barrier to releasing improved versions, because you cannot distinguish between 75% and 79% accuracy by feel.
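To make that point concrete, here is a minimal sketch of the kind of harness such a project might use: a small set of (command, expected function call) pairs scored by exact match, with the resulting accuracy logged to Weights & Biases. The `generate_call` helper and the test data are assumptions standing in for the real Whisper-plus-LLM pipeline; `wandb.init`, `run.log`, and `run.finish` are the standard W&B logging calls.

```python
# Minimal sketch of an accuracy-based evaluation loop for the function-calling task.
# generate_call() is an assumed placeholder for the real pipeline, not code from the talk.
import wandb

TEST_SET = [
    ("What's the weather in Boston?", 'weather(location="Boston")'),
    ("Play some Miles Davis", 'play_music(artist="Miles Davis")'),
    ("What's 17 times 23?", 'calculate(expression="17 * 23")'),
]

def generate_call(command: str) -> str:
    """Placeholder: prompt the on-device LLM with the transcribed command, return its raw output."""
    raise NotImplementedError

def evaluate(run_name: str) -> float:
    run = wandb.init(project="voice-assistant", name=run_name)
    correct, failures = 0, []
    for command, expected in TEST_SET:
        predicted = generate_call(command).strip()
        if predicted == expected:  # exact-match scoring keeps the check fast and unambiguous
            correct += 1
        else:
            failures.append({"command": command, "expected": expected, "got": predicted})
    accuracy = correct / len(TEST_SET)
    run.log({"accuracy": accuracy, "failures": len(failures)})
    run.finish()
    return accuracy
```

With a harness like this, the difference between a 75% and a 79% run is a number you can compare across experiments rather than a feeling, and each logged failure feeds the next round of prompt or fine-tuning work.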
Enterprise customers who successfully reach production typically implement:

- **Hard guardrails:** Tests for things that must never fail (a 100% pass rate is required)
- **Fast feedback loops:** Evaluations that run in 20-30 seconds for rapid iteration
- **Comprehensive nightly evaluations:** Larger test suites that provide deeper feedback

The speaker notes that mature production systems often track thousands or tens of thousands of metrics, with organizations sometimes needing regex search across their metrics because they have so many. This reflects the reality that user experience can fail in countless ways, each requiring monitoring.

### Correlation with User Experience

A persistent challenge in LLMOps is ensuring that metrics actually correlate with user value. This has been a challenge in MLOps for decades and remains so with LLMs. The speaker emphasizes that this is entirely application-dependent and requires significant investment.

### Practical Development Patterns

The talk identifies several anti-patterns and best practices:

**Anti-pattern:** Building step-by-step in sequence without getting something into users' hands quickly. This is basic agile development, but people frequently forget it with GenAI.

**Best practices:**

- Start with lightweight prototypes
- Incorporate end-user feedback continuously
- Iterate rapidly
- Use multiple techniques in combination (prompt engineering, RAG, fine-tuning); production systems rarely rely on just one approach

When asked "should I use RAG or fine-tuning or prompt engineering?", the speaker notes that this question usually indicates the person doesn't have a good evaluation system in place. With proper evaluation, you can quickly determine what works for your specific application.

## Tools and Technology Landscape

The presentation acknowledges the explosion of LLMOps tooling, noting an entire "LLMOps tools track" at the conference and many companies emerging to address these challenges. The speaker suggests that traditional software tools (observability, CI/CD, code versioning) have launched "AI versions" but don't fully address the experimental nature of AI development.

Weights & Biases positions itself as supporting the full spectrum from foundation model builders to software developers new to AI. The platform was used in training major models, including GPT variants, Mistral, and Llama. Their new tool, Weave, is mentioned as specifically targeting the challenges of getting AI applications into production.

## Market Observations

The talk includes interesting market observations: about 70% of the audience have LLM applications in production, with roughly 30% using custom solutions and significantly fewer using purchased solutions (unless GitHub Copilot is counted). This suggests a strong trend toward custom development, likely driven by the specific needs of production applications.

The speaker also notes that AI democratization has expanded the pool of people building AI applications from ML specialists to general software developers, which is both an opportunity (a larger market, more applications possible) and a challenge (less experience with ML-specific production considerations).
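As a closing illustration of the hard-guardrail pattern described in the evaluation section above, here is a hypothetical pytest-style sketch: a handful of core commands that must dispatch correctly before any new prompt or model version ships. The imported `generate_call` refers to the assumed helper from the earlier evaluation sketch, not to any real module from the talk.

```python
# Hypothetical "hard guardrail" tests: unlike the nightly accuracy sweep, these cases
# gate every release and are expected to pass 100% of the time.
import pytest

from assistant_eval import generate_call  # assumed module holding the earlier sketch's helper

MUST_NEVER_FAIL = [
    ("Stop the music", "stop_music()"),
    ("What time is it?", "current_time()"),
    ("What's the weather in Boston?", 'weather(location="Boston")'),
]

@pytest.mark.parametrize("command,expected", MUST_NEVER_FAIL)
def test_core_commands(command, expected):
    assert generate_call(command).strip() == expected
```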
