## Overview
This case study documents the development journey of Taralli, a calorie tracking application that uses LLMs to analyze user-provided food descriptions and return nutritional information. It is a practical example of how to systematically improve an LLM-based production system through evaluation-driven development. Written by an independent developer, the post provides valuable insights into the realities of shipping LLM-powered features and the importance of post-deployment iteration.
The core problem was straightforward but common in LLM applications: the initial system worked well enough to validate the product concept, but failed in predictable ways once exposed to real user inputs. The author's journey from a 17% accuracy rate to 76% demonstrates a pragmatic approach to LLMOps that prioritizes getting something working first, then systematically improving it based on real-world data.
## Initial System Architecture
The first implementation followed a deliberately simple approach that the author advocates for when validating new LLM-powered features. The system consisted of a single API endpoint that took a user's food description as a string, passed it to GPT-4o-mini with structured outputs, and returned a JSON response conforming to a Pydantic model.
The Pydantic schema was well-designed, capturing food items with their nutritional breakdown including name, quantity, calories, macronutrients (carbs, fat, protein, fiber), and food group classifications (dairy, meat and alternatives, grain, fruit, vegetable). This structured output approach ensured consistent response formatting from the LLM, with the author noting that format compliance was nearly 100%.
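While the exact model isn't reproduced in this summary, a minimal sketch of such a schema and the structured-output call might look like the following (field names, enum values, and the system prompt are illustrative assumptions, not the author's exact code):

```python
from enum import Enum
from openai import OpenAI
from pydantic import BaseModel


class FoodGroup(str, Enum):
    DAIRY = "dairy"
    MEAT_AND_ALTERNATIVES = "meat_and_alternatives"
    GRAIN = "grain"
    FRUIT = "fruit"
    VEGETABLE = "vegetable"


class FoodItem(BaseModel):
    name: str
    quantity: float
    calories: float  # per unit, so total calories = quantity * calories
    carbs_g: float
    fat_g: float
    protein_g: float
    fiber_g: float
    food_groups: list[FoodGroup]


class FoodAnalysis(BaseModel):
    items: list[FoodItem]


client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Analyze the food description and return its nutritional breakdown."},
        {"role": "user", "content": "2 eggs and a slice of toast"},
    ],
    response_format=FoodAnalysis,  # structured outputs enforce the schema
)
analysis = completion.choices[0].message.parsed  # a FoodAnalysis instance
```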
However, the semantic accuracy of the outputs proved problematic. The most significant failure mode was quantity misinterpretation. When users provided inputs like "100g of peanut butter," the model would set quantity to 100 and calories to 588 per unit, resulting in a wildly incorrect total of 58,800 calories. The expected behavior was to set the name to "100g of peanut butter" with quantity of 1, treating the measurement as part of the food item description rather than a multiplier.
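Using the hypothetical `FoodItem` model above, the contrast between observed and expected behavior looks roughly like this (macronutrient values are rough illustrations):

```python
# Observed behavior: the gram measurement is treated as a multiplier.
wrong = FoodItem(
    name="peanut butter", quantity=100, calories=588,
    carbs_g=20, fat_g=50, protein_g=25, fiber_g=6,
    food_groups=[FoodGroup.MEAT_AND_ALTERNATIVES],
)
print(wrong.quantity * wrong.calories)  # 58800.0 -- wildly incorrect

# Expected behavior: the measurement stays in the name, quantity is 1.
expected = FoodItem(
    name="100g of peanut butter", quantity=1, calories=588,
    carbs_g=20, fat_g=50, protein_g=25, fiber_g=6,
    food_groups=[FoodGroup.MEAT_AND_ALTERNATIVES],
)
print(expected.quantity * expected.calories)  # 588.0
```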
## Observability and Data Collection
A critical LLMOps practice demonstrated in this case study is the importance of production logging from day one. The author used W&B Weave (Weights & Biases' tracing and evaluation tool) with a simple decorator to log every input-output pair from the food analysis system. This decision was made even before the app launched, with users explicitly informed that their inputs would be logged.
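A minimal sketch of this kind of tracing, assuming a project name and a simplified analysis function (reusing the hypothetical `FoodAnalysis` model from above):

```python
import weave

weave.init("taralli-food-analysis")  # project name is an assumption


@weave.op()  # traces every call's inputs and outputs to W&B Weave
def analyze_food(description: str) -> FoodAnalysis:
    # ... call the LLM as in the earlier sketch and return the parsed result
    ...
```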
This observability infrastructure proved invaluable for identifying failure modes. By reviewing production data, the author could systematically identify categories of errors—wrong food groups, incorrect quantities, and other semantic misunderstandings. This real-world data became the foundation for building a golden dataset for evaluation.
The logging approach exemplifies a key principle: you cannot improve what you cannot measure. Having access to actual user inputs, rather than synthetic test cases, meant the golden dataset would reflect real usage patterns and edge cases that might not have been anticipated during initial development.
## Golden Dataset Creation
The author built a golden dataset by collecting problematic examples from production logs and creating corrected reference outputs. This process involved using more capable models (OpenAI's o3 and Google's Gemini 2.5 Pro) to help generate accurate nutritional analyses, with human review and editing to ensure correctness.
To facilitate this process, the author built a custom visualizer tool for viewing and editing dataset entries—described as "200% vibe coded," suggesting a quickly built internal tool. This infrastructure investment in tooling for dataset curation is often overlooked but can significantly accelerate the evaluation and improvement cycle.
The golden dataset was then split into training and validation sets, following standard machine learning practice. This separation ensures that performance improvements can be validated on data not seen during optimization, reducing the risk of overfitting to specific examples.
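One plausible way to represent the curated entries as DSPy examples and split them, assuming a JSON file of corrected records and a 70/30 split (both are assumptions):

```python
import json
import random

import dspy

# Each golden entry pairs a raw user description with a human-corrected reference.
with open("golden_dataset.json") as f:  # hypothetical file name
    golden = json.load(f)

examples = [
    dspy.Example(
        food_description=entry["food_description"],
        food_analysis=FoodAnalysis.model_validate(entry["food_analysis"]),
    ).with_inputs("food_description")  # only the description is a model input
    for entry in golden
]

random.seed(42)
random.shuffle(examples)
split = int(0.7 * len(examples))
trainset, valset = examples[:split], examples[split:]
```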
## Evaluation Metric Design
The evaluation metric was designed based on observed failure modes rather than theoretical correctness criteria. The author identified two primary categories of errors: incorrect calorie totals and missing or incorrect food groups. The metric reflects these priorities:
A prediction is considered correct if it meets two conditions: the total calculated calories are within 10% of the golden reference, and there is at least some overlap between predicted and expected food groups. Both conditions must be true for a positive evaluation.
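A minimal sketch of such a metric, assuming both the golden example and the prediction carry a `food_analysis` field shaped like the hypothetical schema above:

```python
def calories_and_groups_metric(example, prediction, trace=None):
    """True when total calories are within 10% of the golden reference
    and the predicted food groups overlap the expected ones."""

    def total_calories(analysis: FoodAnalysis) -> float:
        return sum(item.quantity * item.calories for item in analysis.items)

    def food_groups(analysis: FoodAnalysis) -> set:
        return {group for item in analysis.items for group in item.food_groups}

    gold, pred = example.food_analysis, prediction.food_analysis
    gold_calories = total_calories(gold)
    within_tolerance = abs(total_calories(pred) - gold_calories) <= 0.10 * gold_calories
    groups_overlap = bool(food_groups(pred) & food_groups(gold))
    return within_tolerance and groups_overlap
```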
This metric design represents a pragmatic tradeoff. The 10% tolerance for calories acknowledges that nutritional data has inherent variability and users likely care more about approximate accuracy than precision. The food group overlap requirement ensures basic categorical correctness without being overly strict about exact matches.
The author explicitly notes that the metric could be more refined, but this simple approach was sufficient to drive meaningful improvements. This is an important lesson in LLMOps: perfect metrics are less important than having some metric that correlates with user value and can guide iteration.
## Baseline Evaluation
With the golden dataset and metric in place, the author established baselines for the existing production system. The GPT-4o-mini implementation scored only 17.24% on the validation set—a sobering but not surprising result given the observed failure modes.
As an additional baseline, the author tested Gemini 2.5 Flash zero-shot, wrapping a simple DSPy signature in `dspy.Predict`. This achieved 25% accuracy, better than the production system but still far from acceptable. These baselines provided clear targets for improvement and demonstrated that simply switching models was not sufficient to solve the problem.
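A sketch of what that zero-shot baseline could look like, assuming a typed DSPy signature and a LiteLLM-style Gemini model string (both are assumptions):

```python
import dspy

# LiteLLM-style model identifier; the exact provider string is an assumption.
dspy.configure(lm=dspy.LM("gemini/gemini-2.5-flash"))


class AnalyzeFood(dspy.Signature):
    """Analyze a food description and return its nutritional breakdown."""

    food_description: str = dspy.InputField()
    food_analysis: FoodAnalysis = dspy.OutputField()


zero_shot = dspy.Predict(AnalyzeFood)
prediction = zero_shot(food_description="100g of peanut butter")
```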
The evaluation code used concurrent execution with `ThreadPoolExecutor` to parallelize metric computation across examples, a practical consideration for larger evaluation sets.
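A hand-rolled version of that parallel evaluation might look like this, reusing the metric, validation set, and zero-shot predictor sketched above (DSPy's built-in `Evaluate` utility with `num_threads` could serve the same purpose):

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate(program, dataset, metric, max_workers=8):
    def score(example) -> bool:
        prediction = program(food_description=example.food_description)
        return metric(example, prediction)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(score, dataset))
    return sum(results) / len(results)


baseline_accuracy = evaluate(zero_shot, valset, calories_and_groups_metric)
```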
## Prompt Optimization with DSPy
DSPy (Declarative Self-improving Python) is a framework for programmatically optimizing LLM prompts and pipelines. The author applied DSPy's `BootstrapFewShotWithRandomSearch` optimizer, which automatically selects optimal few-shot examples from the training set to include in the prompt.
The optimizer was configured with a maximum of 16 labeled demonstrations and 16 candidate programs to evaluate. After running the optimization on the training set and evaluating on the validation set, accuracy jumped from 25% (zero-shot) to 75.9%, a dramatic improvement achieved through automated example selection.
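A sketch of that optimization step, reusing the signature, metric, and training set from the earlier sketches (the output file name is an assumption):

```python
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

optimizer = BootstrapFewShotWithRandomSearch(
    metric=calories_and_groups_metric,
    max_labeled_demos=16,       # up to 16 demonstrations drawn from the train set
    num_candidate_programs=16,  # 16 candidate prompts scored against the metric
)
optimized = optimizer.compile(dspy.Predict(AnalyzeFood), trainset=trainset)

# Serialize the optimized program (selected few-shot demos included) for production.
optimized.save("food_analysis_optimized.json")
```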
The resulting optimized prompt is essentially a few-shot prompt with carefully selected examples that demonstrate the desired behavior for handling quantities, food group classification, and nutritional calculations. The author provides the full prompt in the original post, showing how DSPy transforms a simple signature into an effective few-shot prompt.
The author honestly acknowledges the tradeoffs: the optimized prompt includes several few-shot examples, which increases token count and can impact response latency. However, using the relatively fast Gemini 2.5 Flash model mitigates this concern for their use case.
## Production Integration
The optimized DSPy program was serialized to JSON and integrated into the FastAPI application. The integration code demonstrates several production-ready patterns:
The classifier is loaded once at application startup and configured with async support (`dspy.asyncify`), avoiding the overhead of reinitializing the model for each request. The async max workers parameter (4) controls concurrency for parallel LLM calls if needed.
The endpoint maintains W&B Weave logging through the `@weave.op()` decorator, ensuring that the new system's outputs are captured for ongoing monitoring and future golden dataset expansion. The API includes authentication via API key headers and proper error handling with type validation.
The new endpoint was deployed alongside the existing system as `/calories-v2`, allowing for a gradual rollout and comparison between old and new implementations.
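A condensed sketch of how these pieces could fit together, assuming the file, environment variable, and request model names shown here, and with the Weave op wrapping an analysis helper rather than the route itself:

```python
import os

import dspy
import weave
from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel

app = FastAPI()
weave.init("taralli-food-analysis")
dspy.configure(lm=dspy.LM("gemini/gemini-2.5-flash"), async_max_workers=4)

# Load the optimized program once at startup and wrap it for async use.
classifier = dspy.Predict(AnalyzeFood)
classifier.load("food_analysis_optimized.json")
async_classifier = dspy.asyncify(classifier)

EXPECTED_API_KEY = os.environ["TARALLI_API_KEY"]  # hypothetical env var


class CaloriesRequest(BaseModel):
    food_description: str


@weave.op()  # keeps logging new production traffic for future dataset expansion
async def analyze_food_v2(food_description: str):
    return await async_classifier(food_description=food_description)


@app.post("/calories-v2")
async def calories_v2(request: CaloriesRequest, x_api_key: str = Header(...)):
    if x_api_key != EXPECTED_API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API key")
    prediction = await analyze_food_v2(request.food_description)
    return prediction.food_analysis
```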
## The Flywheel Effect
One of the most valuable insights from this case study is the emphasis on creating a sustainable improvement cycle. The author describes a flywheel effect where user interactions generate new data, which can be reviewed and added to the golden dataset, which enables prompt re-optimization, which improves user experience, which drives more usage.
This continuous improvement mindset is central to effective LLMOps. Unlike traditional software where bugs are fixed and features are shipped, LLM-powered systems benefit from ongoing evaluation and refinement as the nature of inputs and edge cases evolves with user behavior.
## Lessons and Critical Assessment
The case study offers several valuable lessons for LLMOps practitioners. The author's philosophy of shipping early ("just put it out there") to validate before optimizing is pragmatic, though it does require being comfortable with initial quality issues reaching users.
The evaluation-driven approach demonstrated here is sound. Having concrete metrics, golden datasets, and systematic evaluation infrastructure transforms LLM improvement from guesswork to engineering. The 17% to 76% improvement is significant and achieved through relatively straightforward techniques.
However, it's worth noting some limitations. A 76% accuracy rate, while much improved, still means roughly one in four predictions may be problematic. For a calorie tracking app, this might be acceptable, but the threshold for "good enough" varies significantly by application domain. The author acknowledges "still room for improvement," suggesting this is a work in progress rather than a finished solution.
The golden dataset approach also requires careful consideration of dataset quality and coverage. With only 29 validation examples mentioned in the evaluation output, the dataset is relatively small, and performance on this set may not fully represent behavior on the long tail of real-world inputs.
The case study provides an honest and practical look at improving LLM-powered features through systematic evaluation and optimization, making it a valuable reference for developers facing similar challenges in production LLM applications.