Q4 Inc. developed a chatbot for Investor Relations Officers to query financial data using Amazon Bedrock and RAG with SQL generation. The solution addresses challenges with numerical and structured datasets by using LLMs to generate SQL queries rather than traditional RAG approaches, achieving high accuracy and single-digit second response times. The system uses multiple foundation models through Amazon Bedrock for different tasks (SQL generation, validation, summarization) optimized for performance and cost.
Q4 Inc. is a capital markets access platform headquartered in Toronto with offices in New York and London. The company provides solutions for investor relations including IR website products, virtual events, engagement analytics, CRM, shareholder analysis, and ESG tools. Their goal was to build a Q&A chatbot that would help Investor Relations Officers (IROs) efficiently access information from diverse datasets including CRM records, ownership data, and stock market information in a user-friendly format.
This case study is particularly interesting from an LLMOps perspective because it demonstrates the limitations of conventional RAG approaches when dealing with numerical and structured data, and shows how the team iterated through multiple approaches before arriving at a production-worthy solution using SQL generation.
The Q4 team initially attempted several common approaches to building their chatbot, each with significant challenges:
Pre-training was quickly ruled out as impractical. The team recognized that pre-training an LLM on their own dataset would be resource-intensive, cost-prohibitive, and require continuous incremental training as new time-series data arrived. This would have required a dedicated cross-disciplinary team with data science, ML, and domain expertise.
Fine-tuning a pre-trained foundation model showed initial promise but struggled with hallucination issues. The model had difficulty understanding nuanced contextual cues and frequently returned incorrect results.
Conventional RAG with semantic search was the most extensively tested approach. The team experimented with keyword search, semantic search, and embedding-based retrieval to extract context. When using embeddings, the dataset was converted to vectors, stored in a vector database, and matched against question embeddings. However, this approach proved fundamentally unsuitable for Q4's dataset: embeddings generated from numbers do not produce meaningful similarity rankings. Financial data consisting of numbers, transactions, stock quotes, and dates simply did not work well with embedding-based retrieval, leading to incorrect information being retrieved as context.
This is a critical insight for LLMOps practitioners: RAG with semantic search works well for text-based content with natural language, but numerical and structured datasets require different approaches for accurate context retrieval.
Q4’s hypothesis was that to achieve higher recall for context retrieval on numerical datasets, they needed to generate SQL from user questions. This would not only increase accuracy but also keep the context within the business domain. The solution architecture involves multiple stages:
The first stage translates the user’s natural language question into a SQL query. To generate accurate SQL, the LLM needs to be fully context-aware of the dataset structure, which means the prompt includes the database schema, sample data rows (2-5 rows were found to be sufficient), and human-readable explanations for fields that are not self-explanatory.
The second stage, which is optional, verifies the generated SQL for correct syntax and contextual sense. After testing, Q4 found that SQL generation quality was consistently high enough that this validation step could be eliminated, improving latency and reducing costs.
The third stage executes the verified SQL query against the database to retrieve the relevant context.
The final stage sends the user’s original question along with the retrieved context and instructions to the LLM for summarization, producing a contextual and accurate answer.
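The four stages can be sketched as a small pipeline. This is a hand-rolled illustration, not Q4's code: the LLM calls are stubbed out as plain callables, and a tiny in-memory SQLite table stands in for the financial datastore.

```python
import sqlite3
from typing import Callable, Optional

def answer_question(
    question: str,
    db: sqlite3.Connection,
    generate_sql: Callable[[str], str],    # stage 1: NL question -> SQL
    summarize: Callable[[str, str], str],  # stage 4: question + context -> answer
    validate_sql: Optional[Callable[[str], bool]] = None,  # stage 2 (optional)
) -> str:
    sql = generate_sql(question)
    if validate_sql is not None and not validate_sql(sql):
        return "sorry, I am unable to help"
    rows = db.execute(sql).fetchall()      # stage 3: retrieve context
    context = "\n".join(map(str, rows))
    return summarize(question, context)

# Toy datastore standing in for ownership/stock data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE holdings (investor TEXT, shares INTEGER)")
db.executemany("INSERT INTO holdings VALUES (?, ?)",
               [("Alpha Fund", 12000), ("Beta Capital", 8500)])

# Stubbed "models": a real system would call Amazon Bedrock here.
answer = answer_question(
    "Who holds the most shares?",
    db,
    generate_sql=lambda q: (
        "SELECT investor, shares FROM holdings ORDER BY shares DESC LIMIT 1"
    ),
    summarize=lambda q, ctx: f"Top holder based on the data: {ctx}",
)
print(answer)
```

Because validation is passed in as an optional callable, dropping it (as Q4 ultimately did) removes one LLM round trip without restructuring the pipeline.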
Amazon Bedrock serves as the fully managed, serverless foundation model platform. Its key advantage for Q4 was providing access to multiple LLMs through a single API, enabling the team to switch between models for different tasks without infrastructure changes. This flexibility was crucial for optimizing accuracy, performance, and cost at each stage of the pipeline.
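Switching models through Bedrock's single API amounts to changing a model-ID string. The sketch below uses boto3's `bedrock-runtime` client with the request shape of Anthropic's legacy text-completion format used by Claude V2 on Bedrock; the boto3 import is deferred so no AWS credentials are needed until a call is actually made.

```python
import json

def build_request(prompt: str, max_tokens: int = 512) -> str:
    """Anthropic legacy completion request body, as JSON."""
    return json.dumps({
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "max_tokens_to_sample": max_tokens,
        "temperature": 0.0,  # deterministic output for SQL generation
    })

def invoke(model_id: str, prompt: str) -> str:
    import boto3  # deferred: only needed when a call is actually made
    client = boto3.client("bedrock-runtime")
    resp = client.invoke_model(modelId=model_id, body=build_request(prompt))
    return json.loads(resp["body"].read())["completion"]

# Swapping models per pipeline stage is just a different modelId:
# invoke("anthropic.claude-v2", sql_prompt)          # SQL generation
# invoke("anthropic.claude-instant-v1", sum_prompt)  # summarization
body = json.loads(build_request("Generate SQL for ..."))
print(body["max_tokens_to_sample"])
```

No infrastructure changes are involved in a model swap, which is what made Q4's per-stage experimentation cheap.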
LangChain was adopted as the orchestration framework, providing pre-built modules for coordinating tasks between foundation models, data sources, and tools. This significantly reduced the development effort that would have been required to build the orchestration logic from scratch.
SQLDatabaseChain from langchain_experimental simplified the creation, implementation, and execution of SQL queries through text-to-SQL conversions. It handles the interactions between the database and the LLM, managing schema retrieval and query construction.
A key LLMOps lesson from this case study is the importance of selecting the right model for each task rather than using a single model for everything. Q4’s approach involved:
For SQL generation, Claude V2 (Anthropic) was selected for its advanced reasoning capabilities. However, testing revealed that Claude Instant could produce comparable results for simpler, more direct user questions. For sophisticated input, Claude V2 remained necessary to achieve the desired accuracy.
For SQL validation (when used), a smaller model like Claude Instant was considered sufficient since the task is simpler than generation.
For summarization, Titan Text Express or Claude Instant were found to provide adequate accuracy at better performance and cost points compared to larger models like Claude V2.
This multi-model approach allowed Q4 to optimize each step of the pipeline independently, balancing accuracy, latency, and cost based on task requirements.
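The per-task selection described above amounts to a small routing table. The model IDs below are the Bedrock identifiers for the models named in the text; the question-complexity heuristic is purely illustrative, since the source does not say how Q4 distinguished simple from sophisticated questions.

```python
# Illustrative task -> Bedrock model id routing, per the selections above.
MODEL_FOR_TASK = {
    "sql_generation": "anthropic.claude-v2",           # advanced reasoning
    "sql_generation_simple": "anthropic.claude-instant-v1",
    "sql_validation": "anthropic.claude-instant-v1",   # simpler task
    "summarization": "amazon.titan-text-express-v1",   # cheaper, faster
}

def pick_model(task: str, question: str = "") -> str:
    # Toy heuristic: short, single-clause questions go to the cheaper model.
    if (task == "sql_generation"
            and len(question.split()) <= 8
            and "," not in question):
        return MODEL_FOR_TASK["sql_generation_simple"]
    return MODEL_FOR_TASK[task]

print(pick_model("sql_generation", "What is the latest stock price?"))
print(pick_model("summarization"))
```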
The team developed specific prompt engineering techniques optimized for Claude models:
They used the proper human/assistant syntax that Claude is trained to understand. XML tags were employed extensively because Claude respects and understands them well. These were used to frame instructions, additional hints, database schema, table explanations, and example rows.
Clear instructions were added to prevent hallucination, including telling the model to say “sorry, I am unable to help” if it cannot produce relevant SQL. The prompt explicitly instructs the model to provide only SQL with no additional comments and not to make up answers.
The template structure includes sections for instructions, database_schema, table_description (explaining abbreviated column names), example_rows, the user question, and additional_hints. This structured approach improved both the quality and consistency of generated SQL.
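A prompt along the lines described might be assembled like this. The XML tag names mirror the sections listed above; the schema, abbreviated column names, and hints are made-up placeholders, not Q4's actual dataset.

```python
# Claude-style prompt: Human/Assistant turns with XML-tagged sections.
PROMPT_TEMPLATE = """

Human: <instructions>
You are a SQL expert. Generate a single SQL query answering the question.
Output only SQL, with no additional comments. If you cannot produce
relevant SQL, say "sorry, I am unable to help". Do not make up answers.
</instructions>

<database_schema>{schema}</database_schema>
<table_description>{table_description}</table_description>
<example_rows>{example_rows}</example_rows>
<additional_hints>{hints}</additional_hints>

<question>{question}</question>

Assistant:"""

prompt = PROMPT_TEMPLATE.format(
    schema="CREATE TABLE holdings (inv_nm TEXT, shr_cnt INTEGER)",
    table_description="inv_nm = investor name; shr_cnt = share count",
    example_rows="('Alpha Fund', 12000)\n('Beta Capital', 8500)",
    hints="Share counts are absolute values, not thousands.",
    question="Which investor holds the most shares?",
)
print(len(prompt))
```

Note the two example rows: per the text, 2-5 sample rows were found sufficient, and the `table_description` section decodes abbreviated column names the model could not infer on its own.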
Several optimizations were implemented to achieve the target of single-digit second response times:
The SQL validation step was eliminated after analysis showed consistently accurate SQL generation, removing one complete LLM round trip from the pipeline.
Smaller, more efficient models were selected for tasks that didn’t require the full capability of larger models, particularly for summarization where Titan Text Express provided better performance and cost efficiency.
Prompt optimization focused on providing the minimum tokens necessary with the right syntax and minimal yet optimal instructions. Rather than overwhelming the model with information, the focus was on efficiency.
Input/output token sizes were optimized to the smallest values that could produce the required accuracy level.
Q4 built their AI services as microservices accessible through APIs, enabling modular and scalable integration with their platform ecosystem. This architecture allows the Q&A capability to be exposed across multiple platform applications while maintaining consistent security and data privacy standards.
The financial services context imposed strict requirements around data privacy and response accuracy. Providing incorrect or outdated information in this regulated industry could impact investor trust and create compliance risks. These constraints guided the team’s emphasis on accuracy over speed, though they ultimately achieved both.
The solution maintains commercial feasibility by leveraging serverless infrastructure (Amazon Bedrock), open-source tooling (LangChain), and careful optimization of model selection and prompt engineering to balance quality with cost.
This case study illustrates several important principles for production LLM systems:
Context extraction strategy must match the data type. Semantic search and embeddings work well for natural language text but fail for numerical and structured data. SQL generation provides a more reliable retrieval mechanism for tabular data.
Multi-model pipelines can optimize cost, latency, and accuracy by selecting the right model for each task rather than using a single model for everything.
Serverless, API-based model platforms like Amazon Bedrock reduce operational overhead and provide flexibility to experiment with and switch between models.
Orchestration frameworks like LangChain reduce development effort by providing pre-built modules for common patterns like database interaction and LLM chaining.
Prompt engineering significantly impacts both quality and performance. Well-structured prompts using model-specific syntax (like XML tags for Claude) improve consistency and reduce token usage.
Validation steps should be evaluated for cost/benefit. If upstream quality is high enough, removing validation can improve latency without sacrificing accuracy.
Moody's Analytics, a century-old financial institution serving over 1,500 customers across 165 countries, transformed their approach to serving high-stakes financial decision-making by evolving from a basic RAG chatbot to a sophisticated multi-agent AI system on AWS. Facing challenges with unstructured financial data (PDFs with complex tables, charts, and regulatory documents), context window limitations, and the need for 100% accuracy in billion-dollar decisions, they architected a serverless multi-agent orchestration system using Amazon Bedrock, specialized task agents, custom workflows supporting up to 400 steps, and intelligent document processing pipelines. The solution processes over 1 million tokens daily in production, achieving 60% faster insights and 30% reduction in task completion times while maintaining the precision required for credit ratings, risk intelligence, and regulatory compliance across credit, climate, economics, and compliance domains.
Yahoo! Finance built a production-scale financial question answering system using a multi-agent architecture to address the information asymmetry between retail and institutional investors. The system leverages Amazon Bedrock AgentCore and employs a supervisor-subagent pattern in which specialized agents handle structured data (stock prices, financials), unstructured data (SEC filings, news), and various APIs. The solution processes heterogeneous financial data from multiple sources, handles the temporal complexities of fiscal years, and maintains context across sessions. Through a hybrid evaluation approach combining human and AI judges, the system achieves strong accuracy and coverage metrics while processing queries in 5-50 seconds at a cost of 2-5 cents per query, demonstrating production viability at scale with support for 100+ concurrent users.
Stripe, processing approximately 1.3% of global GDP, has evolved from traditional ML-based fraud detection to deploying transformer-based foundation models for payments that process every transaction in under 100ms. The company built a domain-specific foundation model treating charges as tokens and behavior sequences as context windows, ingesting tens of billions of transactions to power fraud detection, improving card-testing detection from 59% to 97% accuracy for large merchants. Stripe also launched the Agentic Commerce Protocol (ACP) jointly with OpenAI to standardize how agents discover and purchase from merchant catalogs, complemented by internal AI adoption reaching 8,500 employees daily using LLM tools, with 65-70% of engineers using AI coding assistants and achieving significant productivity gains like reducing payment method integrations from 2 months to 2 weeks.