Company: Nvidia
Title: Data Flywheels for Cost-Effective AI Agent Optimization
Industry: Tech
Year: 2025
## Summary
NVIDIA implemented a data flywheel approach to optimize their internal employee support AI agent, addressing the challenge of maintaining accuracy while reducing inference costs. The system continuously collects user feedback and production data to fine-tune smaller, more efficient models that can replace larger, more expensive foundation models. Through this approach, they achieved comparable accuracy (94-96%) with significantly smaller models (1B-8B parameters instead of 70B), resulting in 98% cost savings and 70x lower latency while maintaining the agent's effectiveness in routing employee queries across HR, IT, and product documentation domains.
## Overview

This case study from Nvidia's generative AI platforms team presents a compelling approach to maintaining and improving AI agents in production through what they call "data flywheels." The presentation, delivered by Silendrin from Nvidia, focuses on their internal employee support agent, NV Info Agent, which gives Nvidia employees access to enterprise knowledge across multiple domains, including HR benefits, financial earnings, IT help, and product documentation. The core thesis is that continuously improving AI agents doesn't require constantly upgrading to larger models; it requires systematic feedback loops that enable smaller, more efficient models to match or approach the accuracy of larger ones.

## The Problem: Scaling AI Agents in Production

The presentation identifies several key challenges when deploying AI agents at scale in enterprise environments:

- **Rapidly changing data**: Enterprise systems constantly ingest new business intelligence and data
- **Evolving user preferences**: Customer needs and expectations shift over time
- **Cost scaling with usage**: Deploying larger language models to maintain accuracy drives up inference costs proportionally with usage
- **Model drift**: Production models can become less accurate over time as the underlying data and use cases evolve

The specific technical challenge highlighted was the router agent within their multi-agent system. This router agent needs to understand user intent and context, then route queries to the appropriate expert agent (HR, IT, Finance, etc.). The goal was to ensure accurate routing while using faster and more cost-effective LLMs.

## The Data Flywheel Concept

Nvidia presents the data flywheel as a continuous loop comprising several interconnected stages:

- **Data processing and curation**: Collecting and refining training data from production systems
- **Model customization**: Fine-tuning models using techniques like LoRA, P-tuning, and full supervised fine-tuning
- **Evaluation**: Benchmarking against academic and institutional benchmarks, and using LLM-as-a-judge approaches
- **Guardrailing**: Ensuring safer interactions through privacy, security, and safety measures
- **RAG pipelines**: Building retrieval-augmented generation pipelines for accurate responses grounded in enterprise data

The key insight is that as AI agents run in production, this flywheel cycle drives continuous improvement by curating ground truth from inference data, business intelligence, and user feedback. This enables teams to surface smaller, more efficient models that maintain accuracy expectations while offering lower latency and reduced total cost of ownership.

## Architecture and Implementation

The NV Info Agent architecture consists of several components working together. Users (Nvidia employees) submit queries through a frontend that is guardrailed for safety and security. Behind this sits a router agent, orchestrated by an LLM, which routes queries to multiple expert agents. Each expert agent excels in its specific domain and is augmented with a RAG pipeline to fetch relevant information.

The data flywheel is set up to continuously build on user feedback and production inference logs. Using subject matter experts and human-in-the-loop feedback, ground truth data is continuously curated. Nvidia's Nemo Customizer and Evaluator tools are used to constantly evaluate multiple models and promote the most effective one to power the router agent.
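To make the routing-plus-logging pattern concrete, here is a minimal sketch in Python. It assumes the router model is served behind an OpenAI-compatible endpoint (the interface NVIDIA's NIM microservices expose); the prompt, label set, log format, and function names are illustrative rather than taken from the talk.

```python
# Illustrative router-agent sketch, not NVIDIA's actual implementation.
# Assumes an OpenAI-compatible endpoint serving the router model.
import json
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")
DOMAINS = ["hr", "it", "finance", "product_docs"]  # assumed label set

ROUTER_PROMPT = (
    "Classify the employee query into exactly one of these domains: "
    f"{', '.join(DOMAINS)}. Reply with the domain name only.\n\n"
    "Query: {query}"
)

def route(query: str, model: str = "meta/llama-3.1-70b-instruct") -> str:
    """Ask the router LLM which expert agent should handle the query."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": ROUTER_PROMPT.format(query=query)}],
        temperature=0.0,
    )
    domain = resp.choices[0].message.content.strip().lower()
    return domain if domain in DOMAINS else "product_docs"  # safe fallback

def log_interaction(query: str, routed_to: str, feedback: str | None) -> None:
    """Append the routing decision and any user feedback to the inference
    log; this log is the raw material the flywheel later curates."""
    with open("router_inference_log.jsonl", "a") as f:
        f.write(json.dumps({"ts": time.time(), "query": query,
                            "routed_to": routed_to, "feedback": feedback}) + "\n")

query = "How do I enroll in the updated 401k matching plan?"
domain = route(query)                                    # e.g. "hr"
log_interaction(query, domain, feedback="satisfactory")  # thumbs-up signal
```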
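The evaluate-and-promote step can be sketched the same way: score each candidate on a held-out routing test set and promote the smallest model that clears the accuracy bar. This reuses the hypothetical `route` function above as a stand-in for the Nemo Evaluator-driven comparison; the candidate names and the 94% floor are assumptions for illustration.

```python
# Illustrative evaluate-and-promote loop over candidate router models.
CANDIDATES = [                       # ordered largest -> smallest
    "meta/llama-3.1-70b-instruct",   # current production router
    "router-ft-llama-8b",            # hypothetical fine-tuned candidates
    "router-ft-llama-1b",
]

def routing_accuracy(model: str, test_set: list[dict]) -> float:
    """Fraction of held-out queries routed to the labeled domain."""
    hits = sum(route(q["query"], model=model) == q["label"] for q in test_set)
    return hits / len(test_set)

def pick_router(test_set: list[dict], accuracy_floor: float = 0.94) -> str:
    """Promote the smallest candidate that clears the accuracy floor;
    otherwise keep the current production model."""
    viable = [m for m in CANDIDATES
              if routing_accuracy(m, test_set) >= accuracy_floor]
    return viable[-1] if viable else CANDIDATES[0]
```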
## Quantitative Results and Model Comparison

The case study provides concrete performance metrics that demonstrate the value of the data flywheel approach.

**Baseline Performance (pre-trained models without fine-tuning):**

- Llama 3.1 70B / 3.3 70B: 96% routing accuracy, 26 seconds latency to first token
- Llama 8B variant: only 14% routing accuracy, but roughly 70% lower latency

The presentation makes an important observation about a common enterprise evaluation mistake: organizations see the 96% versus 14% accuracy gap and conclude they must use the larger model, assuming fine-tuning won't bridge such a significant gap. This is precisely where data flywheels prove their value.

**Data Collection Process:**

The team ran the 70B Llama variant in production and circulated feedback forms among Nvidia employees, collecting 1,224 data points in total:

- 729 satisfactory responses
- 495 unsatisfactory responses

Using Nemo Evaluator with LLM-as-a-judge, they investigated the 495 unsatisfactory responses and attributed 140 of them to incorrect routing. Further manual analysis with subject matter experts confirmed 32 cases of truly incorrect routing. This process yielded a curated ground truth dataset of 685 data points, split 60/40 into training and test sets.

**Post-Fine-tuning Results:**

- 8B variant: matched the 96% accuracy of the 70B model after fine-tuning
- 1B variant: achieved 94% accuracy, two percentage points below the 70B baseline
- 1B deployment: 98% savings in inference cost, a 70x reduction in model size, and 70x lower latency

This demonstrates that with a relatively small amount of curated, high-quality data (just 685 data points), significant improvements in smaller-model performance are achievable.
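Below is a minimal sketch of how the triage described above could be automated: an LLM-as-a-judge pass over the feedback-flagged responses, followed by a 60/40 split of the curated ground truth. It assumes the JSONL inference log from the earlier router sketch and an OpenAI-compatible judge endpoint; the prompt, label set, and variable names are illustrative, and the SME confirmation remains a manual step.

```python
# Illustrative sketch of the curation step, not NVIDIA's actual pipeline.
import json
import random
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")
DOMAINS = ["hr", "it", "finance", "product_docs"]  # assumed label set

JUDGE_PROMPT = (
    "You are auditing a router agent. Given an employee query and the expert "
    f"agent it was routed to, reply with the one domain ({', '.join(DOMAINS)}) "
    "the query SHOULD have been routed to, and nothing else.\n\n"
    "Query: {query}\nRouted to: {routed_to}"
)

def judge_route(query: str, routed_to: str) -> str:
    """LLM-as-a-judge: which domain should this query have gone to?"""
    resp = client.chat.completions.create(
        model="meta/llama-3.1-70b-instruct",  # assumed judge model
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(query=query,
                                                  routed_to=routed_to)}],
        temperature=0.0,
    )
    return resp.choices[0].message.content.strip().lower()

# Load the production log and pull out the feedback-flagged responses
# (495 of 1,224 in the case study).
with open("router_inference_log.jsonl") as f:
    records = [json.loads(line) for line in f]
unsatisfactory = [r for r in records if r.get("feedback") == "unsatisfactory"]

# Attribute errors: keep only responses the judge blames on routing; these
# then go to subject matter experts for manual confirmation (not shown).
routing_errors = [r for r in unsatisfactory
                  if judge_route(r["query"], r["routed_to"]) != r["routed_to"]]

# Satisfactory routings plus SME-corrected examples form the ground truth
# (685 points in the case study); split 60/40 for training and testing.
ground_truth = [{"query": r["query"], "label": r["routed_to"]}
                for r in records if r.get("feedback") == "satisfactory"]
random.shuffle(ground_truth)
cut = int(0.6 * len(ground_truth))
train_split, test_split = ground_truth[:cut], ground_truth[cut:]
```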
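The fine-tuning step can be sketched as well. The talk uses Nemo Customizer for this; as a stand-in, the snippet below shows the same idea with the open-source transformers + peft stack, treating routing as a sequence classification task over the curated splits from the previous sketch. The model ID, label set, and hyperparameters are assumptions for illustration.

```python
# Illustrative LoRA fine-tune of a small router model on the curated data;
# a stand-in for Nemo Customizer, using transformers + peft.
import numpy as np
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

DOMAINS = ["hr", "it", "finance", "product_docs"]  # assumed label set
MODEL_ID = "meta-llama/Llama-3.2-1B"               # assumed 1B candidate

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token          # Llama has no pad token

def build_dataset(examples):
    """Tokenize {"query", "label"} records from the curation step."""
    def encode(ex):
        enc = tokenizer(ex["query"], truncation=True, max_length=256)
        enc["labels"] = DOMAINS.index(ex["label"])
        return enc
    return Dataset.from_list(examples).map(encode)

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=len(DOMAINS))
model.config.pad_token_id = tokenizer.pad_token_id

# LoRA trains small adapter matrices instead of all base weights, which is
# what makes cheap, frequent flywheel cycles practical.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="SEQ_CLS"))

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, -1) == labels).mean())}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="router-lora", num_train_epochs=3,
                           per_device_train_batch_size=8, learning_rate=2e-4),
    train_dataset=build_dataset(train_split),  # 60% of the curated points
    eval_dataset=build_dataset(test_split),    # 40% held out
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=accuracy,
)
trainer.train()
print(trainer.evaluate())  # compare routing accuracy with the 70B baseline
```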
## Nvidia Nemo Microservices Platform

The presentation introduces Nvidia's Nemo microservices platform, which provides tooling for each stage of the data flywheel:

- **Nemo Curator**: Curates high-quality training datasets, including multimodal data
- **Nemo Customizer**: Fine-tunes models using LoRA, P-tuning, full SFT, and other techniques
- **Nemo Evaluator**: Benchmarks models against academic and institutional benchmarks and supports LLM-as-a-judge evaluation
- **Nemo Guardrails**: Provides guardrails for privacy, security, and safety
- **Nemo Retriever**: Builds state-of-the-art RAG pipelines

The platform exposes these capabilities as simple API endpoints, allowing model customization, evaluation, and guardrailing with just a few API calls. The microservices can run on premises, in the cloud, in data centers, or on edge devices.

It's worth noting that this presentation is inherently promotional of Nvidia's tooling, so claims about ease of use and enterprise-grade stability should be evaluated against actual implementation experience. However, the conceptual framework of data flywheels and the demonstrated results with the routing agent provide valuable insights regardless of the specific tools used.

## Framework for Building Data Flywheels

The presentation concludes with a four-step framework for implementing data flywheels:

**Monitor User Feedback**: Implement intuitive ways to collect user feedback signals, ensuring privacy compliance while capturing both implicit and explicit signals. The goal is to detect model drift and inaccuracies in the models powering the agentic system.

**Analyze and Attribute Errors**: Spend time classifying errors and attributing failures to understand why the agent behaves the way it does, and create ground truth datasets for the subsequent steps.

**Plan**: Identify candidate models for experimentation, generate synthetic datasets where needed, fine-tune those candidates, and optimize for resource and cost constraints.

**Execute**: Beyond triggering data flywheel cycles, establish a regular cadence for tracking accuracy and latency and for monitoring production performance. The presenter emphasizes the importance of managing the end-to-end GenAI ops pipeline.

## Critical Assessment

While the case study presents impressive results, several considerations merit attention.

The 685 data points used for fine-tuning represent a relatively small dataset. While the results are compelling, generalizability to more complex routing scenarios or other use cases may vary. The presentation focuses primarily on a classification task (routing), which may be more amenable to fine-tuning with small datasets than more complex generation tasks.

The presentation is inherently promotional, introducing Nvidia's Nemo microservices platform as the solution. Organizations should evaluate whether these specific tools fit their infrastructure and requirements, or whether similar approaches could be implemented with alternative tooling.

The latency figures (26 seconds to first token for the 70B model) seem notably high and may reflect specific infrastructure constraints or measurement methodology. Production implementations may see different results depending on hardware, optimization, and serving infrastructure.

Despite these caveats, the core concept of data flywheels (continuously learning from production data to improve models) represents a sound LLMOps practice. The demonstrated ability to achieve comparable accuracy with significantly smaller models has substantial implications for cost management and scalability in production AI systems.
