Company: Needl.ai
Title: Building Trust in RAG Systems Through Structured Feedback and User Collaboration
Industry: Finance
Year: 2025
Summary (short): Needl.ai's AskNeedl product faced challenges with user trust in their RAG-based AI system, where issues like missing citations, incomplete answers, and vague responses undermined confidence despite technical correctness. The team addressed this through a structured feedback loop involving query logging, pattern annotation, themed QA sets, and close collaboration with early adopter users from compliance and market analysis domains. Without retraining the underlying model, they improved retrieval strategies, tuned prompts for clarity, enhanced citation formatting, and prioritized fixes based on high-frequency queries and high-trust personas, ultimately transforming scattered user frustration into actionable improvements that restored trust in production.
## Overview

This case study describes Needl.ai's experience deploying and improving AskNeedl, a RAG-based AI system designed for enterprise knowledge management and financial services use cases. The company's platform serves compliance officers, market analysts, and documentation experts who rely on the system to answer complex queries about regulatory filings, risk disclosures, and corporate actions. The central challenge wasn't technical failure in the traditional sense (the system was functioning as designed) but rather a trust gap where users perceived outputs as unreliable due to missing citations, incomplete answers, and vague responses that felt "inferred" rather than grounded in source documents.

The case study is particularly valuable because it focuses on improving a production RAG system without retraining the underlying language model, instead emphasizing product management discipline, structured feedback loops, and collaborative iteration with real users. This represents a pragmatic approach to LLMOps that many organizations will find more accessible than extensive model retraining or benchmark development.

## The Trust Problem in Production RAG Systems

Needl.ai's core insight was recognizing that their RAG system exhibited what they termed "hallucination-adjacent failures": behaviors that weren't necessarily factually incorrect but broke user trust nonetheless. These patterns included missing or incomplete citations where the system provided answers without clear source attribution, vague or partial answers that felt confident but lacked necessary detail or context, and wrong reference matching where the system confused similar entity names like HDFC Bank versus HDFC Securities.

The team emphasized that in many cases, users weren't pointing out factual errors but rather flagging "breaks in trust." This distinction proved crucial for how they approached evaluation and prioritization. For enterprise users in compliance and regulatory contexts, auditability became as important as factual accuracy: users needed to trace answers back to known sources to feel confident using the information in reports or decision-making.

## Building a Lightweight, High-Signal QA Loop

Rather than investing in automated hallucination evaluators or extensive benchmark suites, Needl.ai built what they describe as a "lightweight, high-signal manual feedback loop." This operationalized approach to LLMOps quality management involved several key practices that translated scattered user frustration into structured, actionable insights.

The team logged all queries across internal testing and pilot user sessions, then focused analysis on failed or suspicious queries where users showed signs of dissatisfaction or confusion. Each problematic case was annotated with specific failure categories: "hallucination," "citation missing," "partial answer," or "retrieval gap." This taxonomy helped distinguish different failure modes that required different technical interventions.

A particularly effective practice was creating themed QA sets based on recurring pain points observed in real usage. Examples included query patterns like "stake change over 6 months," "revenue drivers," and "SEBI circular compliance." These theme-based collections allowed the team to identify systemic issues rather than treating each problem as isolated. The product manager maintained a simple shared spreadsheet to log issues, link correct documents, tag themes, and facilitate regular review cycles with the ML team.
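The case study doesn't publish the spreadsheet's actual columns, so the following is a minimal sketch of what one row of such a feedback log might look like in code; every field name and example value beyond the quoted failure categories and themes is an assumption.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

# Failure categories quoted in the case study; storing them as a constant set is an assumption.
FAILURE_CATEGORIES = {"hallucination", "citation missing", "partial answer", "retrieval gap"}


@dataclass
class FeedbackLogEntry:
    """One row of the shared QA spreadsheet (hypothetical schema)."""
    query: str                          # the user's original question
    date_logged: date
    persona: str                        # e.g. "compliance officer", "market analyst"
    theme: str                          # e.g. "stake change over 6 months", "SEBI circular compliance"
    failure_categories: List[str]       # subset of FAILURE_CATEGORIES
    expected_documents: List[str] = field(default_factory=list)  # links to the correct sources
    user_comment: Optional[str] = None  # verbatim feedback, e.g. "this feels vague"

    def __post_init__(self) -> None:
        # Keep the taxonomy consistent so themed QA sets can be built reliably.
        unknown = set(self.failure_categories) - FAILURE_CATEGORIES
        if unknown:
            raise ValueError(f"Unknown failure categories: {unknown}")


# Hypothetical example inspired by the HDFC Bank / HDFC Securities confusion described above.
entry = FeedbackLogEntry(
    query="What was HDFC Bank's stake change over 6 months?",
    date_logged=date(2025, 1, 15),
    persona="market analyst",
    theme="stake change over 6 months",
    failure_categories=["retrieval gap", "citation missing"],
    expected_documents=["hdfc_bank_shareholding_q3.pdf"],
    user_comment="Answer cited HDFC Securities filings instead of HDFC Bank.",
)
```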
Recently, the team integrated an MCP (Model Context Protocol) setup to partially automate response evaluation, adding structure to what was previously entirely manual. While this automation is used internally for QA, the same system powers how AskNeedl routes insights into reports, dashboards, and decision systems, demonstrating how internal LLMOps tooling can align with product functionality. The team notes this isn't a fully autonomous QA system but represents an evolution beyond "spreadsheets and gut feel."

## User Collaboration as Ground Truth

A distinguishing aspect of Needl.ai's approach was actively involving early adopter teams as essential participants in the evaluation process. These users (compliance officers, market analysts, and documentation experts already using AskNeedl in production or pilot settings) became the source of ground truth for what constituted acceptable quality in their specific contexts.

These early users helped define nuanced quality dimensions that wouldn't emerge from automated metrics alone: what counted as acceptable versus unacceptable paraphrasing, what level of citation coverage was sufficient versus insufficient, and crucially, which answers "felt right" versus those that "sounded confident but were incomplete." This last distinction captures the subtle trust signals that can make or break adoption in enterprise settings where users face consequences for incorrect information.

The product manager describes a translation role that became central to effective collaboration: converting user statements like "this feels vague" into technical specifications the ML team could act on, such as "retrieval precision dropped due to fuzzy match between HDFC Bank and HDFC Securities." This translation layer between user experience and technical implementation represents an often-underappreciated aspect of production LLMOps: the need for someone who can bridge these perspectives effectively.

## Targeted Optimization Without Model Retraining

Armed with structured feedback and clear, user-validated examples, the ML team could make targeted improvements to the RAG system without retraining the underlying language model. Key optimization areas included adjusting retrieval strategies to prioritize recent disclosures or relevant document types, tuning prompts for improved clarity in date range handling and entity matching, and enhancing citation formatting and fallback logic when sources were ambiguous or incomplete.

The shift from vague problem reports ("model is wrong") to specific, contextualized feedback ("here's what this user expected, why this output felt unreliable, and what could have made it better") enabled much more efficient iteration. This represents a practical LLMOps pattern: many production issues can be addressed through retrieval, prompt, and presentation improvements rather than requiring expensive model retraining.

## Prioritization in LLMOps

The case study highlights an important product management discipline in LLMOps: not all failures are equal, and prioritization decisions significantly impact where limited engineering resources should focus. Needl.ai prioritized based on three dimensions: high-frequency queries with repeat issues that affected many users, high-trust personas like compliance teams and investor relations professionals who needed especially reliable outputs, and high-sensitivity topics such as stock movements, regulatory changes, or stakeholder actions where errors could have serious consequences.
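The case study doesn't describe a formal scoring mechanism, so below is a minimal sketch of how these three dimensions could be combined into a simple triage score over the logged issues; the weights, caps, and category sets are assumptions rather than Needl.ai's actual process.

```python
from typing import Dict, List

# Assumed groupings for the three dimensions named in the case study.
HIGH_TRUST_PERSONAS = {"compliance officer", "investor relations"}
HIGH_SENSITIVITY_THEMES = {"stock movements", "regulatory changes", "stakeholder actions"}


def triage_score(issue: Dict, query_frequency: int) -> float:
    """Rank a logged issue by how urgently it threatens user trust (weights are assumptions)."""
    score = 0.0
    # 1. High-frequency queries with repeat issues affect many users; cap the contribution.
    score += min(query_frequency, 50) / 50.0
    # 2. High-trust personas need especially reliable outputs.
    if issue["persona"] in HIGH_TRUST_PERSONAS:
        score += 1.0
    # 3. High-sensitivity topics carry the most serious consequences for errors.
    if issue["theme"] in HIGH_SENSITIVITY_THEMES:
        score += 1.0
    return score


def prioritize(issues: List[Dict], frequencies: Dict[str, int]) -> List[Dict]:
    """Return issues sorted so the highest-risk ones are reviewed first."""
    return sorted(
        issues,
        key=lambda i: triage_score(i, frequencies.get(i["query"], 1)),
        reverse=True,
    )
```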
This prioritization framework allowed the team to direct ML and engineering effort toward business-critical trust restoration rather than pursuing model elegance or academic benchmarks. It represents a pragmatic approach to quality improvement in production LLM systems where perfect is the enemy of good and strategic focus yields better results than attempting to fix everything at once.

## Measuring Quality Without Ground Truth

One of the most thoughtful sections of the case study addresses the fundamental challenge of evaluating RAG output quality when there's no clear right answer. Many user queries in enterprise search contexts are inherently open-ended: "What are the key risk disclosures in the latest filings?" or "What's the company's outlook for next quarter?" These questions don't have single correct answers that can be scored against a gold standard.

Needl.ai's evaluation strategy acknowledged this reality by assessing quality across multiple dimensions rather than seeking binary correctness. Completeness asked whether outputs covered all relevant points. Factuality verified that information was pulled from real sources. Relevance assessed whether the system surfaced the right data. Trustworthiness captured whether outputs "felt" reliable to users, a subjective but crucial dimension for enterprise adoption.

The team developed a multi-layered, semi-manual evaluation approach combining several methods. They built a bank of approximately 200 task-specific queries, each tagged with expected behavior and red flags that would indicate problems. Live usage reviews sampled real user sessions to observe behavioral signals: did users reformulate the same question repeatedly, did they stop after receiving an answer or click through to sources, and were citations shown and actually clicked? These usage patterns served as proxies for satisfaction and trust even when direct feedback wasn't available.

The human evaluation layer remained central, with early users regularly asked questions like "Would you trust this answer in a report?", "Is anything missing?", and "Does this feel inferred or grounded?" This feedback formed what the team describes as the "human layer of our quality assurance process," acknowledging that in current LLMOps practice, human judgment remains irreplaceable for nuanced quality assessment.

## Learning From Industry Patterns

The case study also reflects on how leading AI products like ChatGPT, Perplexity, and Claude approach answer quality and trust, drawing lessons applicable to Needl.ai's own RAG system. Several industry patterns emerged from this analysis.

Human evaluation remains central even in the most advanced systems, which rely heavily on side-by-side output comparisons, human scores for factuality and usefulness, and internal red-teaming to stress-test edge cases. This observation validated Needl.ai's decision to center real-user feedback in their QA loops rather than rushing to full automation.

Leading tools evaluate based on multi-dimensional metrics rather than single scores, assessing faithfulness (grounding in retrieved content), coverage (completeness relative to the query), and confidence calibration (whether tone matches source certainty). This inspired Needl.ai to tag answers not just as correct or incorrect but with more nuanced categories: "fully grounded with citation," "partially retrieved," "fluent but unverifiable," or "factually incorrect or hallucinated."
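As a minimal sketch of how those four labels could be applied during review, assuming a simple rule-based mapping that the case study does not itself specify:

```python
from enum import Enum


class AnswerGrade(str, Enum):
    """The four nuanced categories the team used instead of a binary correct/incorrect label."""
    FULLY_GROUNDED = "fully grounded with citation"
    PARTIALLY_RETRIEVED = "partially retrieved"
    FLUENT_BUT_UNVERIFIABLE = "fluent but unverifiable"
    HALLUCINATED = "factually incorrect or hallucinated"


def grade_answer(has_citations: bool, all_claims_sourced: bool, factually_correct: bool) -> AnswerGrade:
    """Map reviewer judgments onto a grade; the decision rules here are assumptions."""
    if not factually_correct:
        return AnswerGrade.HALLUCINATED
    if has_citations and all_claims_sourced:
        return AnswerGrade.FULLY_GROUNDED
    if has_citations:
        return AnswerGrade.PARTIALLY_RETRIEVED
    return AnswerGrade.FLUENT_BUT_UNVERIFIABLE
```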
The use of adversarial prompting and stress tests in models like GPT-4 and Claude mirrored Needl.ai's own discovery that more ambiguous or summary-based queries produced more fragile RAG outputs, especially when citations were missing or retrieval was partial. This suggests a general pattern where RAG systems struggle predictably with certain query types, and understanding these boundaries helps set appropriate user expectations.

## The PM's Role in Building Trust

The case study concludes with reflections on product management's distinctive role in AI systems. Traditional PM skills around usability, conversion, and retention remain important, but AI products add what the team calls "credibility" as a core dimension. RAG systems don't just need to answer questions; they need to do so with transparency, humility, and traceability, because when they fail at this, even correct answers can feel wrong, and lost trust is difficult to recover.

Product managers may not train models or write prompts directly, but they shape the environment in which trust is earned. This includes decisions about what information to surface and when (like citations), how to handle ambiguity and partial answers gracefully, and what verification tools to provide users so they can trace, validate, and make informed decisions with the system's outputs.

## Critical Assessment

While this case study offers valuable practical insights into production RAG operations, several aspects deserve balanced consideration. The approach is heavily manual and relies significantly on access to engaged early users willing to provide detailed feedback; this may not scale easily or be available to all organizations. The case study is also essentially a first-person account from Needl.ai about their own product, so claims about effectiveness lack independent validation or quantitative results demonstrating actual improvement in user trust or adoption metrics.

The recent integration of MCP for partial automation suggests the manual approach had scaling limitations that eventually required tooling investment. Organizations should consider upfront what level of manual process they can sustain and when to invest in automation infrastructure. The focus on enterprise financial services users with specific needs around compliance and auditability may limit generalizability; different domains might require different trust signals and evaluation approaches.

The case study also doesn't deeply address some technical RAG challenges, such as how to handle fundamentally missing information in the knowledge base, strategies for keeping retrieval systems current as document collections evolve, or performance and latency considerations that affect user experience in production. The decision not to retrain the base model was presented as successful, but the case study doesn't explore whether there were limitations to what could be achieved without model improvements or what circumstances might eventually justify that investment.

## LLMOps Insights and Patterns

Despite these limitations, the case study illuminates several valuable LLMOps patterns. It demonstrates that many production RAG issues stem from retrieval, prompting, and presentation rather than core model capabilities, suggesting where to focus improvement efforts first. The emphasis on structured feedback loops and systematic categorization of failure modes offers a practical template for quality management without requiring sophisticated ML infrastructure.
The recognition that trust is multidimensional (involving completeness, factuality, relevance, auditability, and subjective confidence) helps frame evaluation strategies beyond simple accuracy metrics. The collaboration model between product, ML, and users, with the PM serving as translator between user experience and technical implementation, represents an organizational pattern that others might adopt.

Perhaps most importantly, the case study illustrates that in enterprise LLM applications, behavioral signals like reformulation rates, citation clicks, and willingness to use outputs in formal reports can serve as valuable quality proxies when ground truth is unavailable. This user-centric approach to evaluation acknowledges the ultimately social and contextual nature of trust in AI systems, particularly for high-stakes enterprise use cases where consequences of errors extend beyond individual users.
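As a closing illustration, here is a minimal sketch of how such behavioral proxies might be computed from logged sessions, assuming a hypothetical session schema that is not taken from the case study.

```python
from typing import Dict, List


def behavioral_trust_proxies(sessions: List[Dict]) -> Dict[str, float]:
    """Compute rough trust proxies from session logs (hypothetical schema).

    Each session is assumed to look like:
      {"queries": ["..."], "citations_shown": 3, "citations_clicked": 1, "used_in_report": True}
    """
    total_sessions = len(sessions) or 1
    # A real implementation would check query similarity; here any multi-query
    # session counts as potential reformulation.
    reformulated = sum(1 for s in sessions if len(s["queries"]) > 1)
    shown = sum(s["citations_shown"] for s in sessions)
    clicked = sum(s["citations_clicked"] for s in sessions)
    reported = sum(1 for s in sessions if s.get("used_in_report"))

    return {
        # Repeated rephrasing of a question suggests the first answer didn't satisfy the user.
        "reformulation_rate": reformulated / total_sessions,
        # Citation click-through indicates users are verifying (and finding) sources.
        "citation_click_through": clicked / shown if shown else 0.0,
        # Willingness to put an answer in a formal report is the strongest trust signal.
        "report_usage_rate": reported / total_sessions,
    }
```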
