NVIDIA developed Agent Morpheus, an AI-powered system that automates the analysis of software vulnerabilities (CVEs) at enterprise scale. The system combines retrieval-augmented generation (RAG) with multiple specialized LLMs and AI agents in an event-driven workflow to analyze CVE exploitability, generate remediation plans, and produce standardized security documentation. The solution reduced CVE analysis time from hours/days to seconds and achieved a 9.3x speedup through parallel processing.
Nvidia presents Agent Morpheus, an internal production system designed to address the growing challenge of software vulnerability management at enterprise scale. With the CVE database hitting record highs (over 200,000 cumulative vulnerabilities reported by end of 2023), traditional approaches to scanning and patching have become unmanageable. The solution demonstrates a sophisticated LLMOps implementation that combines multiple LLMs, RAG, and AI agents in an event-driven architecture to automate the labor-intensive process of CVE analysis and exploitability determination.
The core innovation here is distinguishing between a vulnerability being present (a CVE signature detected) versus being exploitable (the vulnerability can actually be executed and abused). This nuanced analysis previously required security analysts to manually synthesize information from multiple sources—a process that could take hours or days per container. Agent Morpheus reduces this to seconds while maintaining the quality of analysis through intelligent automation and human-in-the-loop oversight.
The system employs four distinct Llama3 large language models, with three of them being LoRA (Low-Rank Adaptation) fine-tuned for specific tasks within the workflow:
Planning LLM: A LoRA fine-tuned model specifically trained to generate unique investigation checklists based on the CVE context. This model takes vulnerability and threat intelligence data and produces actionable task lists tailored to each specific CVE.
AI Agent LLM: Another LoRA fine-tuned model that executes checklist items within the context of a specific software project. This agent can autonomously retrieve information and make decisions by accessing project assets including source code, SBOMs (Software Bill of Materials), documentation, and internet search tools.
Summarization LLM: A LoRA fine-tuned model that combines all findings from the agent’s investigation into coherent summaries for human analysts.
VEX Formatting LLM: The base Llama3 model that standardizes justifications for non-exploitable CVEs into the common machine-readable VEX (Vulnerability Exploitability eXchange) format for distribution.
This multi-model architecture represents a thoughtful LLMOps design decision—rather than using a single general-purpose model for all tasks, Nvidia chose to specialize models through fine-tuning for their specific roles, likely improving accuracy and reliability for each stage of the pipeline.
The deployment leverages NVIDIA NIM inference microservices, which serves as the core inference infrastructure. A key architectural decision was hosting all four model variants (three LoRA adapters plus base model) using a single NIM container that dynamically loads LoRA adapters as needed. This approach optimizes resource utilization while maintaining the flexibility to serve different specialized models.
The choice of NIM was driven by several production requirements:
OpenAI API compatibility: NIM provides an API specification compatible with OpenAI’s interface, simplifying integration with existing tooling and agent frameworks.
Dynamic LoRA loading: The ability to serve multiple LoRA-customized models from a single container reduces infrastructure complexity and costs.
Variable workload handling: Agent Morpheus generates approximately 41 LLM queries per CVE on average. With container scans potentially generating dozens of CVEs per container, the system can produce thousands of outstanding LLM requests for a single container scan. NIM is designed to handle this bursty, variable workload pattern that would be challenging for custom LLM services.
The system is fully integrated into Nvidia’s container registry and security toolchain using the Morpheus cybersecurity framework. The workflow is triggered automatically when containers are uploaded to the registry, making it truly event-driven rather than batch-processed.
The pipeline flow operates as follows: A container upload event triggers a traditional CVE scan (using Anchore or similar tools). The scan results are passed to Agent Morpheus, which retrieves current vulnerability and threat intelligence for the detected CVEs. The planning LLM generates investigation checklists, the AI agent executes these autonomously, the summarization LLM consolidates findings, and finally results are presented to human analysts through a security dashboard.
One notable aspect of this architecture is that the AI agent operates autonomously without requiring human prompting during its analysis. The agent “talks to itself” by working through the generated checklist, retrieving necessary information, and making decisions. Human analysts are only engaged when sufficient information is available for them to make final decisions—a design that optimizes analyst time and attention.
The case study reveals practical approaches to overcoming known LLM limitations in production. The AI agent has access to multiple tools beyond just data retrieval:
Version comparison tool: The team discovered that LLMs struggle to correctly compare software version numbers (e.g., determining that version 1.9.1 comes before 1.10). Rather than attempting to solve this through prompting or fine-tuning, they built a dedicated version comparison tool that the agent can invoke when needed.
Calculator tools: A well-known weakness of LLMs is mathematical calculations. The system provides calculator access to overcome this limitation.
This pragmatic approach—using tools to handle tasks LLMs are poor at rather than trying to force LLMs to do everything—represents mature LLMOps thinking.
Using the Morpheus framework, the team built a pipeline that orchestrates the high volume of LLM requests asynchronously and in parallel. The key insight is that both the checklist items for each CVE and the CVEs themselves are completely independent, making them ideal candidates for parallelization.
The performance results are significant: processing a container with 20 CVEs takes 2842.35 seconds when run serially, but only 304.72 seconds when parallelized using Morpheus—a 9.3x speedup. This transforms the practical utility of the system from something that might take nearly an hour per container to completing in about 5 minutes.
The pipeline is exposed as a microservice using HttpServerSourceStage from Morpheus, enabling seamless integration with the container registry and security dashboard services.
The system implements a continuous improvement loop that leverages human analyst output. After Agent Morpheus generates its analysis, human analysts review the findings and may make corrections or additions. These human-approved patching exemptions and changes to the Agent Morpheus summaries are fed back into LLM fine-tuning datasets.
This creates a virtuous cycle where the models are continually retrained using analyst output, theoretically improving system accuracy over time based on real-world corrections. This approach addresses a common LLMOps challenge: how to maintain and improve model performance in production when ground truth labels are expensive to obtain.
The complete production workflow demonstrates enterprise-grade integration:
This end-to-end automation, from container upload to VEX document publication, represents a mature production deployment rather than a proof-of-concept.
While the case study presents impressive results, it’s worth noting several caveats:
Nevertheless, the technical architecture demonstrates sophisticated LLMOps practices including multi-model orchestration, LoRA fine-tuning for task specialization, tool augmentation for LLM limitations, parallel inference optimization, event-driven microservices architecture, and continuous learning from human feedback—all running in a production environment at enterprise scale.
Predibase, a fine-tuning and model serving platform, announced its acquisition by Rubrik, a data security and governance company, with the goal of combining Predibase's generative AI capabilities with Rubrik's secure data infrastructure. The integration aims to address the critical challenge that over 50% of AI pilots never reach production due to issues with security, model quality, latency, and cost. By combining Predibase's post-training and inference capabilities with Rubrik's data security posture management, the merged platform seeks to provide an end-to-end solution that enables enterprises to deploy generative AI applications securely and efficiently at scale.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Amazon teams faced challenges in deploying high-stakes LLM applications across healthcare, engineering, and e-commerce domains where basic prompt engineering and RAG approaches proved insufficient. Through systematic application of advanced fine-tuning techniques including Supervised Fine-Tuning (SFT), Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), and cutting-edge reasoning optimizations like Group-based Reinforcement Learning from Policy Optimization (GRPO) and Direct Advantage Policy Optimization (DAPO), three Amazon business units achieved production-grade results: Amazon Pharmacy reduced dangerous medication errors by 33%, Amazon Global Engineering Services achieved 80% human effort reduction in inspection reviews, and Amazon A+ Content improved quality assessment accuracy from 77% to 96%. These outcomes demonstrate that approximately one in four high-stakes enterprise applications require advanced fine-tuning beyond standard techniques to achieve necessary performance levels in production environments.