Company
jonfernandes
Title
Production RAG Stack Development Through 37 Iterations for Financial Services
Industry
Finance
Year
2025
Summary (short)
Independent AI engineer Jonathan Fernandez shares his experience developing a production-ready RAG (Retrieval Augmented Generation) stack through 37 failed iterations, focusing on building solutions for financial institutions. The case study demonstrates the evolution from a naive RAG implementation to a sophisticated system incorporating query processing, reranking, and monitoring components. The final architecture uses LlamaIndex for orchestration, Qdrant for vector storage, open-source embedding models, and Docker containerization for on-premises deployment, achieving significantly improved response quality for document-based question answering.
This case study presents the experience of Jonathan Fernandez, an independent AI engineer who specializes in building production-ready generative AI solutions, particularly for financial institutions. The presentation traces the iterative development of a RAG (Retrieval Augmented Generation) stack that went through 37 failed attempts before arriving at a working production solution. The case study is particularly valuable because it demonstrates the real-world challenges of deploying LLMs in regulated environments like financial services, where on-premises deployment and data security are critical requirements. The use case centers on building a question-answering system for a railway company operating in London, using a knowledge base of HTML files describing the services and assistance available to travelers. The target question "Where can I get help in London?" serves as the benchmark for evaluating system performance throughout the development process.

Fernandez structures his approach around two distinct deployment environments: prototyping and production. For prototyping, he leverages Google Colab for its free hardware accelerators, which allow rapid experimentation with different models and configurations. The production environment is designed specifically for financial institutions that require on-premises data processing and deployment, leading to a Docker-based containerized architecture that can run either on-premises or in the cloud as needed.

The technical architecture is broken down into several key components, each selected based on lessons learned from the 37 iterations. For orchestration, the system uses LlamaIndex in both the prototyping and production environments, though LangGraph is also considered for prototyping. This choice reflects the maturity and production-readiness of LlamaIndex for enterprise deployments.

The embedding strategy demonstrates a deliberate progression from closed to open models. During prototyping, closed models from providers like OpenAI (text-embedding-ada-002 and text-embedding-3-large) are used for their simplicity and API accessibility. The production environment shifts to open-source models, specifically the BGE-small model from the Beijing Academy of Artificial Intelligence (BAAI) and models from NVIDIA. This transition addresses the financial sector's requirements for data sovereignty and reduced dependency on external APIs while maintaining comparable performance.

Vector database selection focused on scalability as the primary concern. Qdrant was chosen because it can scale from a handful of documents to hundreds of thousands without architectural changes. This scalability matters for financial institutions that may start with limited use cases but need to expand across multiple business units and document repositories.

The language model strategy follows a similar pattern to embeddings: closed models such as GPT-3.5-turbo and GPT-4 are used during prototyping for ease of implementation, then replaced with open-source alternatives in production. For production deployment, Fernandez uses Llama 3.2 models and Alibaba Cloud's Qwen 3 models (4 billion parameters), served through either Ollama or Hugging Face's Text Generation Inference (TGI) engine. This approach provides cost control and eliminates external dependencies while maintaining acceptable performance.
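To make these component choices concrete, the sketch below shows roughly how such a pipeline could be wired together with LlamaIndex, a local Qdrant collection, the BAAI BGE-small embedding model, and a Llama 3.2 model served through Ollama. It is a minimal illustration under assumed recent LlamaIndex integration packages (llama-index-vector-stores-qdrant, llama-index-embeddings-huggingface, llama-index-llms-ollama); the talk does not publish its exact code, the local paths are placeholders, and import paths shift between library versions.

```python
# Minimal RAG pipeline sketch: LlamaIndex + Qdrant + BGE-small + Ollama.
# Assumes: pip install llama-index qdrant-client \
#   llama-index-vector-stores-qdrant llama-index-embeddings-huggingface llama-index-llms-ollama
import qdrant_client
from llama_index.core import Settings, SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Open-source embedding model (BAAI BGE-small) and a locally served Llama 3.2 model.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.llm = Ollama(model="llama3.2", request_timeout=120.0)

# Load the railway company's HTML knowledge base from a local folder (path is illustrative).
documents = SimpleDirectoryReader("./knowledge_base", required_exts=[".html"]).load_data()

# Store embeddings in a Qdrant collection; in production the client would point
# at the containerized Qdrant service instead of an in-memory instance.
client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="railway_docs")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# Ask the benchmark question used throughout the talk.
query_engine = index.as_query_engine(similarity_top_k=5)
print(query_engine.query("Where can I get help in London?"))
```

Because the orchestration layer stays the same across environments, moving from this prototype to the on-premises deployment is largely a matter of configuration: pointing the client at the Qdrant container and the LLM at a TGI or Ollama endpoint rather than in-process defaults.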
A critical aspect of the production system is its monitoring and tracing capability. The stack incorporates Phoenix (Arize Phoenix) for production monitoring and LangSmith during prototyping. These tools provide visibility into system performance, latency, and component-level timing. Such monitoring is particularly important in financial services, where system reliability and performance transparency are regulatory requirements.

The case study traces the evolution from a naive RAG implementation to a system that incorporates advanced retrieval techniques. The initial naive approach performed poorly, producing generic responses that did not effectively use the knowledge base. The improved system introduces query processing, including removal of personally identifiable information (PII), and post-retrieval reranking to improve response accuracy.

The reranking component represents a significant technical advance in the architecture. Fernandez explains the distinction between cross-encoders and bi-encoders: cross-encoders score the query and each document together with a BERT-based model and are more accurate, but they scale poorly; bi-encoders embed queries and documents independently, which is faster and more scalable at the cost of some accuracy. The final architecture therefore places bi-encoders at the vector database level for initial retrieval and a cross-encoder as a post-retrieval reranker, balancing speed and accuracy (a minimal sketch of this pattern appears after this overview). For reranking, the system uses Cohere's rerank model during prototyping and transitions to NVIDIA's open-source reranking solution in production, again moving from an accessible commercial service to an open alternative suited to regulated environments.

Evaluation is addressed through RAGAS (RAG Assessment), which scores RAG outputs across multiple quality dimensions (also sketched below). This evaluation framework is essential for maintaining system quality as the knowledge base grows and user queries become more diverse.

The production deployment is containerized using Docker Compose, with separate containers for data ingestion, the Qdrant vector database, the frontend application, model serving (through Ollama or Hugging Face TGI), Phoenix monitoring, and RAGAS evaluation. This microservices approach provides flexibility, scalability, and easier maintenance in production environments.

The case study demonstrates several important LLMOps principles in practice. First, the iterative development approach with 37 documented failures shows the importance of systematic experimentation and learning from failure. Second, the clear separation between prototyping and production environments reflects MLOps best practice: rapid experimentation must be balanced against production reliability and security requirements. Third, the comprehensive monitoring and evaluation strategy demonstrates mature production thinking, recognizing that deployment is only the beginning of the system lifecycle; tracing, performance monitoring, and quality evaluation are what allow the system to be maintained and improved over time.
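As an illustration of the bi-encoder/cross-encoder split described above, the sketch below rescores an initial vector-search candidate list with an open cross-encoder from the sentence-transformers library. The checkpoint name and candidate passages are placeholders chosen for the example; the talk's production system uses NVIDIA's reranker rather than this particular model.

```python
# Post-retrieval reranking sketch: a bi-encoder (vector search) proposes candidates,
# a cross-encoder rescores each (query, document) pair jointly for higher accuracy.
# Assumes: pip install sentence-transformers
from sentence_transformers import CrossEncoder

query = "Where can I get help in London?"

# Candidates as returned by the first-stage vector search (illustrative snippets).
candidates = [
    "Travel assistance points are available at major London stations.",
    "Our loyalty programme offers discounted fares on weekend journeys.",
    "Passengers needing help can visit the customer service desk at King's Cross.",
]

# Publicly available cross-encoder checkpoint, used here purely for illustration.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])

# Keep the highest-scoring passages and pass only those into the LLM prompt.
reranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
for score, doc in reranked:
    print(f"{score:.3f}  {doc}")
```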
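On the evaluation side, a RAGAS run over a small set of question/answer/context rows could look roughly like the following. This assumes the ragas 0.1-era API (column names and metric imports have changed across releases) and a judge LLM configured separately, typically via an OpenAI key or a locally configured model; the rows shown are toy data rather than the talk's actual evaluation set.

```python
# RAGAS evaluation sketch: score generated answers against retrieved contexts.
# Assumes: pip install ragas datasets, plus a judge LLM configured (e.g. OPENAI_API_KEY).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# Toy evaluation set; in practice these rows come from logged RAG traces.
eval_data = Dataset.from_dict({
    "question": ["Where can I get help in London?"],
    "answer": ["Help is available at staffed assistance points in major London stations."],
    "contexts": [[
        "Travel assistance points are available at major London stations.",
        "Passengers needing help can visit the customer service desk at King's Cross.",
    ]],
})

# Faithfulness checks the answer against the retrieved contexts;
# answer relevancy checks the answer against the question.
report = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
print(report)
```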
However, the case study should be viewed with some critical perspective. While Fernandez presents this as a systematic approach developed through 37 failures, the specific nature of those failures and the quantitative improvements achieved at each stage are not detailed. The performance gains are demonstrated through anecdotal examples rather than comprehensive benchmarking across diverse queries and knowledge bases. Additionally, while the focus on financial services requirements (on-premises deployment, data sovereignty) is valuable, the generalizability to industries with different constraints may be limited. The cost implications of running open-source models on-premises versus using commercial APIs are not thoroughly analyzed, which could be a significant factor for many organizations. The technical architecture, while sound, reflects choices made at a specific point in time and may not be the optimal stack for all use cases or future requirements; the rapid evolution of the LLM landscape means that model choices and architectural decisions can quickly become outdated.

Despite these limitations, the case study provides valuable insight into the practical challenges of deploying RAG systems in production, particularly in regulated industries. The emphasis on monitoring, evaluation, and iterative improvement reflects the mature engineering practices that successful LLMOps implementations require, and the detailed breakdown of architectural components and the rationale for each choice provides a useful framework for other practitioners developing similar systems.
