## Overview
Q4 Inc. is a capital markets access platform headquartered in Toronto with offices in New York and London. The company provides solutions for investor relations including IR website products, virtual events, engagement analytics, CRM, shareholder analysis, and ESG tools. The company's goal was to build a Q&A chatbot that would help Investor Relations Officers (IROs) efficiently access information from diverse datasets including CRM records, ownership data, and stock market information in a user-friendly format.
This case study is particularly interesting from an LLMOps perspective because it demonstrates the limitations of conventional RAG approaches when dealing with numerical and structured data, and shows how the team iterated through multiple approaches before arriving at a production-worthy solution using SQL generation.
## The Problem with Traditional RAG for Numerical Data
The Q4 team initially attempted several common approaches to building their chatbot, each with significant challenges:
**Pre-training** was quickly ruled out as impractical. The team recognized that pre-training an LLM on their own dataset would be resource-intensive, cost-prohibitive, and require continuous incremental training as new time-series data arrived. This would have required a dedicated cross-disciplinary team with data science, ML, and domain expertise.
**Fine-tuning** a pre-trained foundation model showed initial promise but struggled with hallucination issues. The model had difficulty understanding nuanced contextual cues and frequently returned incorrect results.
**Conventional RAG with semantic search** was the most extensively tested approach. The team experimented with search, semantic search, and embeddings to extract context. When using embeddings, the dataset was converted to vectors, stored in a vector database, and matched with question embeddings. However, this approach proved fundamentally unsuitable for Q4's dataset because embeddings generated from numbers struggled with similarity ranking. Financial data consisting of numbers, transactions, stock quotes, and dates simply did not work well with embedding-based retrieval, leading to incorrect information being retrieved as context.
This is a critical insight for LLMOps practitioners: RAG with semantic search works well for text-based content with natural language, but numerical and structured datasets require different approaches for accurate context retrieval.
## The SQL Generation Solution
Q4's hypothesis was that to achieve higher recall for context retrieval on numerical datasets, they needed to generate SQL from user questions. This would not only increase accuracy but also keep the context within the business domain. The solution architecture involves multiple stages:
The first stage translates the user's natural language question into a SQL query. To generate accurate SQL, the LLM needs to be fully context-aware of the dataset structure, which means the prompt includes the database schema, sample data rows (2-5 rows were found to be sufficient), and human-readable explanations for fields that are not self-explanatory.
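The prompt-assembly step above can be sketched as follows. This is a minimal, hedged reconstruction: the `ownership` table, its columns, and the sample rows are illustrative placeholders, not Q4's actual schema.

```python
# Sketch of assembling a schema-aware text-to-SQL prompt.
# Table name, columns, and sample rows are illustrative, not Q4's schema.
SCHEMA = """CREATE TABLE ownership (
    inst_id INTEGER,
    shares_held INTEGER,
    report_date TEXT
);"""

# Human-readable notes for columns whose names are not self-explanatory.
TABLE_NOTES = {
    "inst_id": "internal identifier for the holding institution",
    "shares_held": "number of shares reported as held",
    "report_date": "filing date in YYYY-MM-DD format",
}

# Per the case study, 2-5 sample rows were found sufficient.
EXAMPLE_ROWS = [
    (101, 250000, "2023-09-30"),
    (102, 87500, "2023-09-30"),
]

def build_sql_prompt(question: str) -> str:
    notes = "\n".join(f"- {col}: {desc}" for col, desc in TABLE_NOTES.items())
    rows = "\n".join(str(r) for r in EXAMPLE_ROWS)
    return (
        "Generate a single SQL query answering the question below.\n"
        "Return only SQL, with no commentary.\n\n"
        f"Database schema:\n{SCHEMA}\n\n"
        f"Column notes:\n{notes}\n\n"
        f"Example rows:\n{rows}\n\n"
        f"Question: {question}\n"
    )

prompt = build_sql_prompt("Which institution held the most shares on 2023-09-30?")
print(prompt)
```

The key design point is that everything the model needs to ground the query, including schema, sample rows, and column explanations, travels in the prompt itself, so no fine-tuning is required.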
The optional second stage verifies the generated SQL for correct syntax and contextual sense. After testing, however, Q4 found that SQL generation quality was consistently high enough that this validation step could be eliminated, improving latency and reducing costs.
The third stage executes the verified SQL query against the database to retrieve the relevant context.
The final stage sends the user's original question along with the retrieved context and instructions to the LLM for summarization, producing a contextual and accurate answer.
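The four stages can be sketched end to end with the LLM calls stubbed out. This is an assumption-laden toy: the `quotes` table, the hard-coded query in `generate_sql`, and the `summarize` stub stand in for what would be Bedrock model invocations in production.

```python
import sqlite3

# End-to-end sketch of the pipeline with LLM calls stubbed out;
# in production the stubs would be Amazon Bedrock invocations.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE quotes (symbol TEXT, close REAL, day TEXT)")
db.executemany("INSERT INTO quotes VALUES (?, ?, ?)", [
    ("QFOR", 5.10, "2023-10-02"),
    ("QFOR", 5.25, "2023-10-03"),
])

def generate_sql(question: str) -> str:
    # Stage 1 stub: a real system would prompt an LLM with the schema,
    # sample rows, and column notes to produce this query.
    return "SELECT MAX(close) FROM quotes WHERE symbol = 'QFOR'"

def summarize(question: str, context) -> str:
    # Stage 4 stub: a real system would ask a smaller model to phrase
    # the retrieved rows as a natural-language answer.
    return f"The answer to '{question}' is {context[0][0]}"

question = "What was QFOR's highest close?"
sql = generate_sql(question)          # Stage 1: natural language -> SQL
# Stage 2 (validation) is skipped, mirroring Q4's optimization.
context = db.execute(sql).fetchall()  # Stage 3: execute SQL for context
answer = summarize(question, context) # Stage 4: summarize with context
print(answer)
```

The structural point is that retrieval happens through a deterministic SQL execution rather than a similarity search, so the context handed to the summarization model is exact rather than approximate.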
## Technology Stack and Implementation
**Amazon Bedrock** serves as the fully managed, serverless foundation model platform. Its key advantage for Q4 was providing access to multiple LLMs through a single API, enabling the team to switch between models for different tasks without infrastructure changes. This flexibility was crucial for optimizing accuracy, performance, and cost at each stage of the pipeline.
**LangChain** was adopted as the orchestration framework, providing pre-built modules for coordinating tasks between foundation models, data sources, and tools. This significantly reduced the development effort that would have been required to build the orchestration logic from scratch.
**SQLDatabaseChain** from langchain_experimental simplified the creation, implementation, and execution of SQL queries through text-to-SQL conversions. It handles the interactions between the database and the LLM, managing schema retrieval and query construction.
## Model Selection Strategy
A key LLMOps lesson from this case study is the importance of selecting the right model for each task rather than using a single model for everything. Q4's approach involved:
For **SQL generation**, Claude V2 (Anthropic) was selected for its advanced reasoning capabilities. However, testing revealed that Claude Instant could produce comparable results for simpler, more direct user questions. For sophisticated input, Claude V2 remained necessary to achieve the desired accuracy.
For **SQL validation** (when used), a smaller model like Claude Instant was considered sufficient since the task is simpler than generation.
For **summarization**, Titan Text Express or Claude Instant were found to provide adequate accuracy at better performance and cost points compared to larger models like Claude V2.
This multi-model approach allowed Q4 to optimize each step of the pipeline independently, balancing accuracy, latency, and cost based on task requirements.
## Prompt Engineering Techniques
The team developed specific prompt engineering techniques optimized for Claude models:
They used the proper human/assistant syntax that Claude is trained to understand. XML tags were employed extensively because Claude respects and understands them well. These were used to frame instructions, additional hints, database schema, table explanations, and example rows.
Clear instructions were added to prevent hallucination, including telling the model to say "sorry, I am unable to help" if it cannot produce relevant SQL. The prompt explicitly instructs the model to provide only SQL with no additional comments and not to make up answers.
The template structure includes sections for instructions, database_schema, table_description (explaining abbreviated column names), example_rows, the user question, and additional_hints. This structured approach improved both the quality and consistency of generated SQL.
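A hedged reconstruction of that template is shown below. The tag names follow the sections listed above, but the exact wording and the sample values are illustrative, not Q4's production prompt.

```python
# Illustrative Claude-style prompt template using Human/Assistant syntax
# and XML tags; wording and sample values are assumptions, not Q4's actual prompt.
TEMPLATE = """

Human: <instructions>
Generate one SQL query for the question below. Provide only SQL, with no
additional comments. Do not make up an answer; if you cannot produce
relevant SQL, say "sorry, I am unable to help".
</instructions>
<database_schema>{schema}</database_schema>
<table_description>{table_description}</table_description>
<example_rows>{example_rows}</example_rows>
<question>{question}</question>
<additional_hints>{hints}</additional_hints>

Assistant:"""

prompt = TEMPLATE.format(
    schema="CREATE TABLE trades (qty INTEGER, px REAL, ts TEXT)",
    table_description="qty: share quantity; px: execution price; ts: trade timestamp",
    example_rows="(100, 5.10, '2023-10-02T14:30:00Z')",
    question="What was the largest trade?",
    hints="Prices are in USD.",
)
print(prompt)
```

Ending the prompt with `Assistant:` cues the model to respond immediately with the SQL, and the explicit refusal instruction gives it a safe exit instead of hallucinating a query.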
## Performance Optimization
Several optimizations were implemented to achieve the target of single-digit-second response times:
The SQL validation step was eliminated after analysis showed consistently accurate SQL generation, removing one complete LLM round trip from the pipeline.
Smaller, more efficient models were selected for tasks that didn't require the full capability of larger models, particularly for summarization where Titan Text Express provided better performance and cost efficiency.
Prompt optimization focused on providing the minimum tokens necessary with the right syntax and minimal yet optimal instructions. Rather than overwhelming the model with information, the focus was on efficiency.
Input/output token sizes were optimized to the smallest values that could produce the required accuracy level.
## Production Considerations
Q4 built their AI services as microservices accessible through APIs, enabling modular and scalable integration with their platform ecosystem. This architecture allows the Q&A capability to be exposed across multiple platform applications while maintaining consistent security and data privacy standards.
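As one way to picture that microservice boundary, the sketch below exposes a stubbed Q&A pipeline behind a JSON-over-HTTP endpoint using only the standard library. The `/v1/qa` route, payload shape, and `answer_question` stub are hypothetical; the case study does not describe Q4's actual API surface.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import Request, urlopen

def answer_question(question: str) -> dict:
    # Stub for the SQL-generation / execution / summarization pipeline;
    # a real service would invoke those stages here.
    return {"question": question, "answer": "stubbed summary"}

class QAHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/v1/qa":  # hypothetical route, not Q4's API
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(answer_question(payload.get("question", ""))).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

# Bind to an ephemeral port and serve in the background.
server = ThreadingHTTPServer(("127.0.0.1", 0), QAHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

req = Request(
    f"http://127.0.0.1:{port}/v1/qa",
    data=json.dumps({"question": "Who are our top holders?"}).encode(),
    headers={"Content-Type": "application/json"},
)
with urlopen(req) as resp:
    result = json.loads(resp.read())
server.shutdown()
print(result["answer"])
```

Keeping the pipeline behind an API like this lets multiple platform applications share one Q&A capability while authentication, logging, and data-privacy controls stay centralized at the service boundary.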
The financial services context imposed strict requirements around data privacy and response accuracy. Providing incorrect or outdated information in this regulated industry could impact investor trust and create compliance risks. These constraints guided the team's emphasis on accuracy over speed, though they ultimately achieved both.
The solution maintains commercial feasibility by leveraging serverless infrastructure (Amazon Bedrock), open-source tooling (LangChain), and careful optimization of model selection and prompt engineering to balance quality with cost.
## Key Takeaways for LLMOps
This case study illustrates several important principles for production LLM systems:
Context extraction strategy must match the data type. Semantic search and embeddings work well for natural language text but fail for numerical and structured data. SQL generation provides a more reliable retrieval mechanism for tabular data.
Multi-model pipelines can optimize cost, latency, and accuracy by selecting the right model for each task rather than using a single model for everything.
Serverless, API-based model platforms like Amazon Bedrock reduce operational overhead and provide flexibility to experiment with and switch between models.
Orchestration frameworks like LangChain reduce development effort by providing pre-built modules for common patterns like database interaction and LLM chaining.
Prompt engineering significantly impacts both quality and performance. Well-structured prompts using model-specific syntax (like XML tags for Claude) improve consistency and reduce token usage.
Validation steps should be evaluated for cost/benefit. If upstream quality is high enough, removing validation can improve latency without sacrificing accuracy.