Vanguard, a global investment management firm, faced challenges because querying its complex datasets required SQL expertise and involved wait times of several days. To address this, the firm built a Virtual Analyst solution, a conversational AI system powered by foundation models that enables business users to access financial data through natural language queries. The implementation focused on establishing "AI-ready data" through eight guiding principles, including metadata cataloging, semantic layers, governance, and data quality checks. Built on AWS services including Amazon Bedrock for foundation models, Amazon Redshift for data warehousing, and AWS Glue for cataloging, the solution reduced time-to-insight from days to minutes, enabled non-technical users to access data independently, and achieved high accuracy in AI-generated SQL queries. It also established a reusable framework now being adopted across multiple business units.
Vanguard’s Virtual Analyst case study presents a comprehensive example of deploying large language models in production for enterprise financial analytics. The use case centers on enabling financial analysts and business stakeholders to query complex datasets using natural language instead of requiring SQL expertise and data team support. What makes this case study particularly valuable from an LLMOps perspective is its focus on the data infrastructure foundations required to make LLMs work reliably in production, rather than just the model selection itself. The team’s key insight was that building effective conversational AI “wasn’t a machine learning challenge—it was a data architecture challenge.”
The Virtual Analyst system transforms the analyst workflow from one requiring SQL expertise and multi-day turnaround times into a conversational interface delivering insights within minutes. However, the real contribution of this case study lies in Vanguard’s articulation of eight guiding principles for “AI-ready data” that emerged from real-world production challenges. This implementation required cross-functional collaboration between data engineers, business analysts, compliance officers, security teams, and business stakeholders—breaking down traditional organizational silos that often hinder enterprise AI deployments.
Vanguard built the Virtual Analyst on AWS infrastructure, leveraging multiple integrated services. Amazon Bedrock provides the foundation models that power natural language understanding, serving as the core LLM inference layer. The architecture employs Amazon Bedrock Guardrails to secure AI inputs and outputs, which is particularly critical for protecting sensitive financial data in a regulated industry. This demonstrates thoughtful consideration of the security requirements specific to production LLM deployments in financial services.
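The case study does not describe how the guardrail is wired into the inference path. A minimal sketch, assuming a guardrail has already been created in Bedrock and using placeholder model and guardrail identifiers (not values from the case study), might look like this:

```python
import boto3

# Invoke a Bedrock model with a pre-created guardrail applied to both the prompt
# and the response. Model ID, guardrail ID, and version below are placeholders.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model
    messages=[{
        "role": "user",
        "content": [{"text": "What was net asset growth for fund X last quarter?"}],
    }],
    guardrailConfig={
        "guardrailIdentifier": "my-financial-guardrail-id",  # hypothetical guardrail
        "guardrailVersion": "1",
    },
)

answer = response["output"]["message"]["content"][0]["text"]
stop_reason = response["stopReason"]  # "guardrail_intervened" if the guardrail blocked content
```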
For compute infrastructure, Vanguard uses Amazon Elastic Container Service (Amazon ECS) for scalable deployment, suggesting a containerized microservices architecture. Amazon DynamoDB handles conversation persistence across a horizontally scalable architecture with minimal latency, which is an important design choice for maintaining conversational context in a production chatbot system. The storage layer uses Amazon S3, while Amazon SageMaker provides experimentation capabilities—indicating that Vanguard maintains a separate environment for model testing and development before production deployment.
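The case study doesn't detail the conversation store's schema, but a minimal sketch of session persistence in DynamoDB, assuming a table keyed by a session ID (partition key) and a turn timestamp (sort key) with illustrative names, could look like this:

```python
import time
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
conversations = dynamodb.Table("virtual-analyst-conversations")  # hypothetical table name

def append_turn(session_id: str, role: str, text: str) -> None:
    """Persist one conversational turn so any container instance can resume the session."""
    conversations.put_item(Item={
        "session_id": session_id,
        "turn_ts": int(time.time() * 1000),
        "role": role,          # "user" or "assistant"
        "text": text,
    })

def load_history(session_id: str, limit: int = 10) -> list[dict]:
    """Fetch the most recent turns to rebuild context before generating the next query."""
    resp = conversations.query(
        KeyConditionExpression=Key("session_id").eq(session_id),
        ScanIndexForward=False,  # newest first
        Limit=limit,
    )
    return list(reversed(resp["Items"]))
```

Storing turns by session ID rather than in container memory is what allows the horizontally scaled ECS tasks to serve any request in a conversation without sticky routing.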
The data layer architecture centers on Amazon Redshift as the centralized data warehouse, with AWS Glue providing both data cataloging and powering numerous ETL jobs to consolidate accurate data. This integration between the catalog (AWS Glue) and the warehouse (Redshift) creates a foundation for the metadata-driven approach that enables the LLM to generate accurate SQL queries.
The core LLMOps contribution of this case study lies in Vanguard’s articulation of eight principles that extend foundational data capabilities to support AI systems in production:
Clear Data Product and Operating Models: Vanguard established accountability through defined data product owners responsible for business alignment and engineering stewards maintaining technical quality. They implemented service-level agreements for data freshness and reconciliation tolerance, with established support models for downstream consumers. This operational framework ensures that data products remain well-managed and designed to deliver outcomes, which is essential when LLMs depend on these data sources for generating accurate responses.
Governance and Security Measures: Working with compliance and security teams early in the process, Vanguard established enterprise identity management, role-based data access controls, query-level authorization, and retention policies. They implemented logging of authorization events to meet regulatory requirements while supporting business agility. The implementation includes row-level and column-level security where needed, mapped to the AI system’s access patterns. This is particularly important for LLM systems that generate queries dynamically, as the governance framework must evaluate permissions for programmatically generated data access rather than just human-initiated queries.
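The case study doesn't describe the enforcement mechanism, but a simplified sketch of query-level authorization with logged authorization events, placed in front of LLM-generated SQL, might look like the following. The role map, table names, and regex-based table extraction are all illustrative; a production system would resolve entitlements from the enterprise identity platform and parse SQL properly.

```python
import logging
import re

logger = logging.getLogger("virtual_analyst.authz")

# Hypothetical role-to-table entitlements; in practice these would come from the
# governance/identity system, not a hard-coded map.
ROLE_ENTITLEMENTS = {
    "fund_analyst": {"analytics.fund_performance", "analytics.fund_holdings"},
    "client_services": {"analytics.client_accounts"},
}

def authorize_generated_sql(user_id: str, role: str, sql: str) -> bool:
    """Approve or reject an LLM-generated query before execution, logging the decision."""
    # Naive table extraction for illustration only; use a real SQL parser in practice.
    referenced = set(re.findall(r"(?:from|join)\s+([a-z_\.]+)", sql, flags=re.IGNORECASE))
    allowed = referenced <= ROLE_ENTITLEMENTS.get(role, set())
    logger.info(
        "authz_event user=%s role=%s tables=%s allowed=%s",
        user_id, role, sorted(referenced), allowed,
    )
    return allowed
```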
Unified Metadata Catalog: Vanguard implemented a metadata catalog as a control plane that centralizes both technical and business metadata while exposing them via APIs. The technical metadata layer includes table and column descriptions with data types, data lineage across transformations, synonyms and categorical indicators, and relationship mappings between datasets. This layer is defined by technical domain experts and data stewards. The business metadata layer captures business definitions and rules for specific attributes, domain-specific terminology and ontologies, business ownership information, and usage context contributed by business users and domain experts. The integration of these two metadata types enables the LLM system to generate queries that align with both technical database structure and business meaning. The case study emphasizes versioning metadata and measuring mapping accuracy to maintain discoverability and precision.
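A sketch of how the technical half of such a catalog can be read programmatically and flattened into prompt context is shown below, using the AWS Glue Data Catalog. The database and table names are placeholders, and the business metadata (definitions, synonyms, ownership) would be merged in from whatever store holds it, which the case study does not specify.

```python
import boto3

glue = boto3.client("glue")

def describe_table_for_prompt(database: str, table: str) -> str:
    """Pull column names, types, and comments from the Glue catalog as LLM-readable text."""
    resp = glue.get_table(DatabaseName=database, Name=table)
    columns = resp["Table"]["StorageDescriptor"]["Columns"]
    lines = [f"Table {database}.{table}:"]
    for col in columns:
        description = col.get("Comment", "")
        lines.append(f"  - {col['Name']} ({col['Type']}): {description}")
    return "\n".join(lines)

schema_context = describe_table_for_prompt("analytics", "fund_performance")  # hypothetical names
```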
Semantic Layer Implementation: The semantic layer operationalizes the business metadata by transforming complex data structures into user-friendly formats. This implementation layer translates business definitions, rules, and ontologies into executable logic that standardizes metric definitions and relationships between data elements. For example, Vanguard’s semantic layer maintains the definition of “customer lifetime value” consistently across departments and systems by implementing business rules defined by business users. This layer is critical for LLM operations because it provides the standardized vocabulary and calculation logic that the model uses when translating natural language questions into SQL queries. Without this layer, the same business question asked in different ways might generate inconsistent queries.
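An illustrative sketch of a semantic-layer entry follows: a single, versioned definition of "customer lifetime value" that the query generator reuses instead of letting the model improvise a calculation. The table and column names, formula, and ownership fields are hypothetical.

```python
SEMANTIC_METRICS = {
    "customer_lifetime_value": {
        "description": "Projected net revenue from a client relationship over its lifetime.",
        "sql_expression": (
            "SUM(t.net_revenue) / NULLIF(COUNT(DISTINCT t.client_id), 0) "
            "* AVG(c.expected_tenure_years)"
        ),
        "source_tables": ["transactions t", "clients c"],
        "owner": "client-analytics-team",
        "version": "1.3.0",
    },
}

def resolve_metric(name: str) -> str:
    """Return the canonical SQL expression so every generated query computes the metric identically."""
    return SEMANTIC_METRICS[name]["sql_expression"]
```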
Ground Truth Examples (Exemplars): Vanguard built a library of over 50 question-to-SQL pairs that serve multiple purposes in their LLMOps workflow. These exemplars function as few-shot prompts for the AI model, providing example question-answer pairs to guide the model’s responses through in-context learning. They also serve as evaluation benchmarks for measuring accuracy against known correct answers, and as regression tests for verifying that new changes don’t break existing functionality. This approach demonstrates a sophisticated understanding of few-shot learning techniques for LLMs, where providing relevant examples in the prompt significantly improves performance on domain-specific tasks. The exemplars essentially encode domain expertise into reusable artifacts that guide model behavior without requiring fine-tuning.
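A sketch of the few-shot pattern is below: question-to-SQL pairs from the exemplar library are injected into the system prompt before calling Bedrock. The file name, prompt wording, schema snippet, and model ID are assumptions; only the overall pattern of in-context exemplars comes from the case study.

```python
import json
import boto3

with open("exemplars.json") as f:   # hypothetical store, e.g. [{"question": ..., "sql": ...}, ...]
    exemplars = json.load(f)

schema_context = "Table analytics.fund_performance(fund_id, period, net_assets, ...)"  # illustrative

def build_system_prompt(schema: str, examples: list[dict], k: int = 5) -> str:
    """Assemble schema context plus the first k exemplars as few-shot guidance."""
    shots = "\n\n".join(f"Question: {ex['question']}\nSQL: {ex['sql']}" for ex in examples[:k])
    return (
        "You translate business questions into Amazon Redshift SQL.\n"
        f"Schema:\n{schema}\n\nExamples:\n{shots}"
    )

bedrock = boto3.client("bedrock-runtime")
response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model ID
    system=[{"text": build_system_prompt(schema_context, exemplars)}],
    messages=[{"role": "user", "content": [{"text": "Which funds grew fastest last quarter?"}]}],
)
generated_sql = response["output"]["message"]["content"][0]["text"]
```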
Automated Data Quality Checks: Vanguard implemented observability tools to monitor data reliability through automated checks including distributional checks (detecting anomalies in data patterns), referential checks (verifying relationships between tables remain valid), reconciliation checks (confirming data consistency across systems), and freshness checks (confirming data updates occur on schedule). These quality checks are essential for LLMOps because the accuracy of LLM-generated queries depends entirely on the underlying data quality. If the data doesn’t meet quality standards, even perfectly generated SQL will produce unreliable results.
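Two of the described checks (freshness and referential integrity) could be expressed as scheduled queries against the warehouse, sketched here via the Redshift Data API. The workgroup, database, and table names are hypothetical; Vanguard's actual observability tooling is not described in the case study.

```python
import boto3

redshift_data = boto3.client("redshift-data")

CHECKS = {
    # Hours since the last load; alert if the table is stale.
    "freshness": """
        SELECT DATEDIFF(hour, MAX(load_ts), GETDATE()) AS hours_since_load
        FROM analytics.fund_performance
    """,
    # Rows referencing a fund that no longer exists in the dimension table.
    "referential": """
        SELECT COUNT(*) AS orphan_rows
        FROM analytics.fund_performance p
        LEFT JOIN analytics.funds f ON p.fund_id = f.fund_id
        WHERE f.fund_id IS NULL
    """,
}

def run_check(name: str) -> str:
    """Submit one quality check asynchronously; poll describe_statement for the result."""
    resp = redshift_data.execute_statement(
        WorkgroupName="analytics-wg",   # hypothetical Redshift Serverless workgroup
        Database="prod",
        Sql=CHECKS[name],
    )
    return resp["Id"]
```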
Change Control Processes: Treating semantic definitions, exemplars, and configurations as code under version control, Vanguard implements CI/CD processes with staged deployments and gated approvals. This approach requires stakeholder sign-off for changes affecting KPIs or SLAs while enabling safe, rapid deployment of improvements. The change control process manages the dynamic nature of the data landscape, ensuring that the Virtual Analyst can adapt to changes effectively. This is a critical LLMOps practice because the prompt engineering artifacts (semantic definitions, exemplars) and configurations are essentially the “code” that determines system behavior, and they require the same rigor in version control and deployment as traditional software.
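A sketch of what such a CI gate might look like follows: a pytest suite that replays the version-controlled exemplar library through the text-to-SQL pipeline before any change is promoted. The `generate_sql` import is a hypothetical stand-in for the Virtual Analyst's pipeline, and exact-string comparison after normalization is a simplification (a real gate might compare query results or use a semantic SQL diff).

```python
import json
import pytest

from virtual_analyst import generate_sql  # hypothetical module exposing the text-to-SQL pipeline

with open("exemplars.json") as f:
    EXEMPLARS = json.load(f)

def normalize(sql: str) -> str:
    """Collapse whitespace and case so cosmetic differences don't fail the gate."""
    return " ".join(sql.lower().split())

@pytest.mark.parametrize("exemplar", EXEMPLARS, ids=lambda e: e["question"][:40])
def test_exemplar_still_passes(exemplar):
    # Each approved question must still produce its known-correct SQL.
    assert normalize(generate_sql(exemplar["question"])) == normalize(exemplar["sql"])
```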
Continuous Evaluation Mechanisms: Vanguard defines business metrics including analyst hours saved, time-to-insight improvements, user satisfaction, and measurable revenue or profit impacts where possible. The system maintains continuous regression suites and user feedback loops to evolve examples and semantics, with automated alerts for model degradation and business impact tracking. This evaluation framework demonstrates mature LLMOps practices by connecting technical metrics (query accuracy) to business outcomes (analyst productivity) and establishing feedback loops for continuous improvement.
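One way to make such degradation alerts concrete, sketched here as an assumption rather than Vanguard's actual mechanism, is to publish the nightly exemplar pass rate as a CloudWatch metric that an alarm can watch. The namespace and metric name are made up for illustration.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_accuracy(passed: int, total: int) -> None:
    """Emit the regression-suite pass rate so an alarm can flag model or data degradation."""
    cloudwatch.put_metric_data(
        Namespace="VirtualAnalyst",             # hypothetical namespace
        MetricData=[{
            "MetricName": "ExemplarPassRate",
            "Value": 100.0 * passed / total,
            "Unit": "Percent",
        }],
    )

publish_accuracy(passed=48, total=52)  # e.g. fed from the nightly regression run
```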
The Virtual Analyst implementation demonstrates several important LLMOps practices worth examining critically. The use of few-shot learning through exemplars is a practical approach that avoids the complexity and cost of fine-tuning foundation models, but it does require careful curation of examples to cover the diversity of query patterns users might request. Vanguard’s approach of starting with 20-30 examples covering common patterns and expanding based on user feedback and edge cases is a pragmatic deployment strategy.
The architecture’s use of Amazon Bedrock Guardrails for securing inputs and outputs addresses the important security considerations for LLMs in production, particularly in financial services where data leakage or prompt injection could have serious regulatory consequences. However, the case study doesn’t provide specific details about what types of guardrails are implemented or how they balance security with system usability.
The conversation persistence layer using DynamoDB suggests that Virtual Analyst maintains context across multi-turn conversations, which is important for natural user interactions but also introduces complexity around managing conversation state, determining when to clear context, and ensuring consistent behavior across horizontally scaled instances. The case study doesn’t detail how conversation context is managed or how the system determines which previous context to include when generating new queries.
One notable aspect of the implementation is the separation of concerns between Amazon SageMaker for experimentation and the production deployment on ECS with Bedrock. This suggests that Vanguard has established separate environments for development and production, which is an important LLMOps practice for testing changes before they impact end users. However, the case study doesn’t elaborate on how experiments in SageMaker translate to production deployments or what the promotion process looks like.
The case study describes continuous evaluation through regression suites and user feedback loops, with automated alerts for model degradation. The regression testing approach using the exemplar library as test cases is practical because it provides a growing set of known-correct query pairs that can be used to validate that system changes don’t introduce regressions. However, the case study doesn’t specify what metrics are used to measure query accuracy or how the system detects when generated SQL is incorrect.
The emphasis on business metrics (analyst hours saved, time-to-insight improvements, user satisfaction) alongside technical metrics shows a mature approach to demonstrating value, but these business metrics may be difficult to attribute solely to the Virtual Analyst system versus other concurrent improvements. The case study claims “high accuracy in AI-generated SQL queries” but doesn’t provide specific accuracy percentages or define what constitutes a correct query.
The user feedback loop mechanism isn’t detailed, but it presumably allows analysts to flag incorrect queries, which can then be added to the exemplar library or used to identify gaps in the semantic layer. This human-in-the-loop approach is important for continuous improvement but requires operational processes for reviewing feedback and updating system artifacts.
What distinguishes this case study from typical LLM deployment stories is the emphasis on data architecture as the primary challenge. The team’s realization that “building effective conversational AI wasn’t a machine learning challenge—it was a data architecture challenge” reflects a mature understanding of what’s required to make LLMs work reliably with enterprise data. The text-to-SQL use case is particularly dependent on high-quality metadata, clear semantic definitions, and well-managed data products because the LLM must understand not just the structure of the database but the business meaning behind the data.
The unified metadata catalog serving as a “control plane” that exposes metadata via APIs suggests an architecture where the LLM can programmatically access information about available tables, columns, relationships, and business definitions when constructing queries. This metadata-driven approach is more scalable than hard-coding database knowledge into prompts and allows the system to adapt as the data warehouse schema evolves.
The semantic layer serves a critical function by standardizing how business concepts map to database queries. Without this layer, the same business question could be interpreted differently depending on which tables or calculation methods the LLM chooses, leading to inconsistent results. By enforcing business definitions documented in the metadata catalog, the semantic layer provides guardrails that constrain the LLM’s query generation to produce consistent, business-aligned results.
The case study emphasizes the collaborative imperative of bringing together traditionally siloed teams, with cross-functional collaboration between data engineers, business analysts, compliance officers, security teams, and business stakeholders. This organizational aspect is often underemphasized in LLMOps discussions but is critical for success. Each stakeholder group contributes essential expertise: data engineers understand technical infrastructure, business analysts know the semantic meaning of metrics, compliance teams ensure regulatory requirements are met, and business users provide real-world context.
The establishment of clear ownership models, semantic definitions, and quality standards that all teams can understand and contribute to creates a shared vocabulary and governance framework. This is particularly important for LLM systems because the system’s behavior depends on artifacts (metadata, semantic definitions, exemplars) that must be maintained by multiple teams with different expertise.
The case study notes that “the Virtual Analyst project served as a catalyst for new processes and frameworks that provide benefits far beyond the initial AI use case,” suggesting that the data infrastructure improvements enabled by focusing on AI-ready data have value independent of the Virtual Analyst system itself. This is an important consideration for organizations evaluating whether to invest in comprehensive data infrastructure—the benefits extend beyond any single AI application.
Vanguard is evaluating knowledge graphs and Retrieval-Augmented Generation (RAG) as opportunities to further enhance the Virtual Analyst. Knowledge graphs could provide explicit entity relationships, canonical resolution, and cross-domain context that would improve fuzzy matching, join inference, and explainability for generated queries. This suggests limitations in the current approach around understanding complex entity relationships that aren’t fully captured in the metadata catalog and semantic layer.
The mention of RAG systems using Amazon Bedrock Knowledge Bases to leverage the exemplar library suggests a potential architecture evolution where relevant question-SQL pairs are retrieved dynamically based on the user’s question, rather than including a fixed set of examples in every prompt. This could improve accuracy by providing more relevant examples for each query while managing prompt token limits. The reference to “intelligent feedback systems that will progressively refine model quality and reliability” indicates plans for more sophisticated feedback mechanisms beyond the current user feedback loops.
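A sketch of that evolution, assuming the exemplar library has been indexed into a Bedrock knowledge base (the knowledge base ID and result count below are placeholders, and this is an illustration of the pattern rather than Vanguard's implementation), might look like this:

```python
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

def retrieve_exemplars(question: str, top_k: int = 5) -> list[str]:
    """Fetch only the most relevant question-to-SQL exemplars for the incoming question."""
    resp = agent_runtime.retrieve(
        knowledgeBaseId="KB1234567890",   # placeholder knowledge base ID
        retrievalQuery={"text": question},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": top_k}},
    )
    return [result["content"]["text"] for result in resp["retrievalResults"]]

# The retrieved exemplars would replace the fixed few-shot set in the system prompt,
# keeping prompts short while tailoring examples to each question.
relevant_shots = retrieve_exemplars("Which funds grew fastest last quarter?")
```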
While the case study provides valuable insights into building production LLM systems for enterprise analytics, several aspects deserve critical examination. The text is published on an AWS blog and naturally emphasizes AWS services, but the architecture choices aren’t compared to alternatives or justified beyond general statements about AWS’s “comprehensive suite of integrated services” and “stringent requirements of the financial services industry.” It’s unclear whether other cloud platforms or architectures were seriously evaluated.
The reported business results include a reduction in time-to-insight “from days to minutes,” but the case study doesn’t specify what percentage of queries achieve this performance or how complex those queries are. The “high accuracy in AI-generated SQL queries” isn’t quantified with specific percentages or error rates. User satisfaction is mentioned as a metric but no satisfaction scores are provided. While these claims are plausible, the lack of specific metrics makes it difficult to assess the actual impact or compare to alternative approaches.
The case study doesn’t discuss failure modes, limitations, or challenges encountered during deployment. For example, what types of questions does Virtual Analyst struggle with? What percentage of generated queries require human review or correction? How do users know when to trust the generated SQL versus verifying it manually? These practical considerations are important for organizations evaluating similar implementations.
The emphasis on the eight guiding principles is valuable, but the principles are presented somewhat prescriptively without discussing tradeoffs or alternative approaches. For example, the recommendation to start with “20-30 examples covering your most common query patterns” is specific but not justified—why this number rather than more or fewer? The guidance to “define 3-5 key metrics that matter to your business stakeholders” similarly lacks justification for why this range is optimal.
The cross-functional collaboration requirements are mentioned but not explored in depth. How does Vanguard handle disagreements between teams about semantic definitions or data quality standards? What governance processes resolve conflicts? How are priorities set when different stakeholder groups have competing requirements? These organizational challenges are often the most difficult aspects of enterprise AI deployments but receive limited attention.
Despite these limitations in the public case study presentation, Vanguard’s Virtual Analyst provides a valuable example of production LLM deployment focused on the data infrastructure foundations required for reliable operation. The articulation of eight principles for AI-ready data, the emphasis on metadata cataloging and semantic layers, the use of exemplars for few-shot learning, and the establishment of continuous evaluation mechanisms all represent mature LLMOps practices. The case study’s core insight—that conversational AI for enterprise analytics is primarily a data architecture challenge rather than a machine learning challenge—is an important framing that shifts focus to the foundational capabilities that enable LLMs to work reliably with complex enterprise data.
The technical architecture demonstrates thoughtful integration of AWS services for model serving (Bedrock), data warehousing (Redshift), cataloging (Glue), conversation persistence (DynamoDB), and security (Bedrock Guardrails), though the case study would benefit from more detailed discussion of architectural decisions and tradeoffs. The organizational emphasis on cross-functional collaboration and establishing shared governance frameworks highlights the people and process dimensions of LLMOps that are often overshadowed by technical considerations. Overall, this case study contributes meaningfully to the understanding of what’s required to deploy LLMs successfully in regulated enterprise environments, even while leaving some important details about implementation specifics, quantitative results, and limitations unexplored.