Company
Flipkart
Title
Using LLMs for Automated Opinion Summary Evaluation in E-commerce
Industry
E-commerce
Year
2025
Summary (short)
Flipkart faced the challenge of evaluating AI-generated opinion summaries of customer reviews, where traditional metrics like ROUGE failed to align with human judgment and couldn't comprehensively assess summary quality across multiple dimensions. The company developed OP-I-PROMPT, a novel single-prompt framework that uses LLMs as evaluators across seven critical dimensions (fluency, coherence, relevance, faithfulness, aspect coverage, sentiment consistency, and specificity), along with SUMMEVAL-OP, a new benchmark dataset with 2,912 expert annotations. The solution achieved a 0.70 Spearman correlation with human judgments, significantly outperforming previous approaches, especially on open-source models like Mistral-7B, while demonstrating that high-quality summaries directly impact business metrics like conversion rates and product return rates.
## Overview and Business Context

Flipkart, a major e-commerce platform, faces a critical production challenge in their customer experience pipeline: evaluating the quality of AI-generated opinion summaries from customer reviews. The business case is straightforward yet compelling. When customers browse products like smartphones on Flipkart, they encounter hundreds of user reviews. Manually sifting through this feedback is time-consuming and cognitively demanding. Opinion summarization systems use AI to analyze customer reviews and generate concise, balanced summaries that help customers make informed purchase decisions more efficiently.

The quality of these summaries has direct business impact on key performance indicators. High-quality summaries increase conversion rates by giving customers confidence to purchase, improve customer satisfaction by creating a more helpful shopping experience, and reduce product return rates by setting realistic expectations. This establishes a clear business case for investing in robust evaluation methods for these production summarization systems.

## The Evaluation Problem in Production

The core LLMOps challenge that Flipkart addresses is fundamental to operating generative AI systems at scale: how do you reliably evaluate whether an AI-generated summary is actually good in a production context? This is not merely an academic question but a practical operational necessity. Traditional NLP evaluation metrics like ROUGE (which measures lexical overlap between generated and reference summaries) and BERTSCORE have been the standard approach, but Flipkart's research reveals these metrics are inadequate for modern LLM-generated content.

The problem with traditional metrics is multifaceted. They prioritize surface-level lexical overlap over semantic meaning and fail to capture crucial qualities that matter to actual users: coherence, faithfulness to source material, comprehensive aspect coverage, and sentiment accuracy. More critically for production operations, these metrics show poor or even negative correlation with human judgment. When your evaluation metrics don't align with human perception of quality, you lose the ability to make informed decisions about model improvements, A/B testing, or deployment strategies.
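To make the lexical-overlap problem concrete, here is a minimal sketch (assuming the open-source `rouge-score` package; the review texts are invented for illustration and are not from the paper) showing how a faithful paraphrase of a reference summary can still score poorly on ROUGE:

```python
# Minimal illustration of why lexical-overlap metrics can mislead.
# Assumes the open-source `rouge-score` package (pip install rouge-score);
# the example texts are invented, not taken from SUMMEVAL-OP.
from rouge_score import rouge_scorer

reference = (
    "Customers praise the phone's battery life and bright display, "
    "but several reviews complain about slow charging."
)
# A paraphrase that preserves the meaning with very little word overlap.
candidate = (
    "Most buyers like how long the device lasts on a charge and how vivid "
    "the screen looks, though recharging speed draws criticism."
)

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
for name, score in scores.items():
    print(f"{name}: F1 = {score.fmeasure:.2f}")
# The paraphrase scores low on lexical overlap despite being semantically faithful.
```

A human would judge both versions as conveying the same opinions, which is exactly the kind of mismatch that motivates the multi-dimensional, LLM-based evaluation described below.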
For opinion summarization specifically, the evaluation crisis is particularly acute. A summary must balance multiple competing objectives: it needs to be fluent and coherent while also being faithful to source reviews, covering all important product aspects, accurately reflecting sentiment distributions, and providing specific rather than generic information. Single-dimensional metrics simply cannot capture this complexity.

## The Solution: SUMMEVAL-OP Dataset and OP-I-PROMPT Framework

Flipkart's research team developed a two-pronged solution addressing both the data and methodology gaps in LLM evaluation for production systems. First, they created SUMMEVAL-OP, a benchmark dataset specifically designed for opinion summary evaluation. This dataset contains 2,912 expert annotations on summaries generated by 13 different models ranging from older self-supervised systems to modern LLMs like GPT-4, Solar-10.7B, and Mistral-7B. The dataset covers 32 different products from the Amazon review test set, making it highly relevant for real-world e-commerce applications.

The critical innovation in SUMMEVAL-OP is its multi-dimensional evaluation framework. Rather than assigning summaries a single quality score, expert annotators rated each summary on a 1-5 scale across seven distinct dimensions:

- **Fluency** assesses grammatical correctness and readability
- **Coherence** evaluates whether sentences flow logically together
- **Relevance** measures whether the summary focuses on the most important and widely-held opinions
- **Faithfulness** verifies that every piece of information is directly supported by source reviews
- **Aspect Coverage** checks whether all key topics are mentioned (battery life, screen quality, customer service, etc.)
- **Sentiment Consistency** ensures accurate reflection of positive, negative, or neutral sentiment for each aspect
- **Specificity** evaluates whether detailed information is provided rather than vague, generic statements

This comprehensive evaluation approach reflects the actual quality dimensions that matter in production e-commerce environments where users need actionable, trustworthy information to make purchase decisions.

## OP-I-PROMPT: A Modular Prompting Strategy

The second component of Flipkart's solution is OP-I-PROMPT (Opinion Summary Independent Prompt), an innovative prompting framework that enables LLMs to serve as automated evaluators without requiring fine-tuning. This is crucial from an LLMOps perspective because it dramatically reduces the operational overhead of evaluation systems. Fine-tuning models for evaluation tasks requires labeled data, computational resources, and ongoing maintenance as models evolve. A prompt-based approach that works across different LLMs offers far greater flexibility and sustainability.

OP-I-PROMPT's core innovation is its dimension-independent architecture. Unlike previous approaches that either use overly simplistic prompts or require unique, handcrafted prompts for each evaluation dimension, OP-I-PROMPT uses a modular design with a fixed skeleton prompt. The skeleton contains three constant components: Task Description, Evaluation Criteria (defining the 1-5 rating scale), and Evaluation Steps (guiding the reasoning process). To evaluate a different dimension, practitioners simply swap out a single component: the Metric Definition block that describes what aspect is being evaluated.

This plug-and-play architecture offers significant operational advantages. The standardized input format ensures stable LLM behavior across different metrics, making evaluation results more reliable and reproducible. The approach reduces prompt engineering effort since the core structure doesn't need to change for different dimensions. From a maintenance perspective, improvements to the skeleton benefit all evaluation dimensions simultaneously.

Critically, OP-I-PROMPT incorporates chain-of-thought reasoning by explicitly instructing the model to perform step-by-step analysis before assigning scores. This forces the model to justify its reasoning, leading to evaluations that correlate much more closely with human judgment. This aligns with broader LLMOps best practices around making model reasoning transparent and auditable, which is essential for building trust in automated evaluation systems.
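The exact prompt text is not reproduced in this write-up, so the following is only a rough sketch of what a dimension-independent evaluator prompt in the spirit of OP-I-PROMPT might look like; the skeleton wording, metric definitions, and function names are illustrative assumptions, not the authors' prompts:

```python
# Rough sketch of a dimension-independent evaluator prompt in the spirit of
# OP-I-PROMPT. The skeleton wording and metric definitions are illustrative
# placeholders based on this write-up, not the authors' actual prompts.
SKELETON = """Task Description:
You will be given a set of customer reviews for a product and a candidate
opinion summary. Evaluate the summary on a single quality dimension.

Metric Definition:
{metric_definition}

Evaluation Criteria:
Rate the summary on a scale of 1 (very poor) to 5 (excellent) for this dimension.

Evaluation Steps:
1. Read the reviews and the summary carefully.
2. Reason step by step about how well the summary satisfies the metric definition.
3. Only after writing out your reasoning, output a final line of the form "Score: <1-5>".

Reviews:
{reviews}

Summary:
{summary}
"""

# One short definition per dimension; only this block changes between metrics.
METRIC_DEFINITIONS = {
    "fluency": "The summary should be grammatically correct and easy to read.",
    "coherence": "Sentences in the summary should flow logically together.",
    "relevance": "The summary should focus on the most important and widely held opinions.",
    "faithfulness": "Every piece of information must be directly supported by the source reviews.",
    "aspect_coverage": "All key product aspects discussed in the reviews should be mentioned.",
    "sentiment_consistency": "The sentiment expressed for each aspect should accurately reflect the reviews.",
    "specificity": "The summary should provide detailed information rather than vague, generic statements.",
}

def build_eval_prompt(dimension: str, reviews: str, summary: str) -> str:
    """Swap in one metric definition; the rest of the skeleton stays fixed."""
    return SKELETON.format(
        metric_definition=METRIC_DEFINITIONS[dimension],
        reviews=reviews,
        summary=summary,
    )
```

Because only the metric definition slot changes, the same skeleton, and any future improvement to it, is shared by all seven dimensions.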
## Performance Results and Model Comparisons

The experimental results demonstrate that OP-I-PROMPT achieves an average Spearman correlation of 0.70 with human judgments across the seven evaluation dimensions. This strong alignment with human expert ratings is crucial for production deployment because it means the automated evaluation system can serve as a reliable proxy for human judgment, enabling scalable quality assessment without constant human annotation.

The performance comparisons reveal important insights for LLMOps practitioners making technology choices. On closed-source models like ChatGPT-3.5, OP-I-PROMPT performs comparably to G-EVAL, another LLM-based evaluation approach. However, on open-source models like Mistral-7B, OP-I-PROMPT significantly outperforms alternatives including G-EVAL. This is particularly valuable because it demonstrates that high-quality automated evaluation is accessible using open-source models, reducing dependency on proprietary APIs and providing more control over costs, data privacy, and deployment environments.

The research also reveals interesting findings about summary generation quality that inform model selection decisions. Human raters consistently preferred summaries generated by modern LLMs (GPT-4, Solar-10.7B, and Mistral-7B) over both older model outputs and even human-written reference summaries. This establishes a clear performance hierarchy with GPT-4 at the top, followed closely by Solar-10.7B and Mistral-7B, all of which outperformed ChatGPT-3.5. For organizations like Flipkart making production deployment decisions, this provides evidence that open-source models like Solar-10.7B and Mistral-7B can deliver near-GPT-4 quality at potentially lower cost and with greater operational flexibility.

## Critical Assessment and Production Considerations

While the research presents compelling results, several considerations warrant attention when evaluating this work from an LLMOps perspective. First, the claim that traditional metrics are "obsolete" should be tempered with nuance. While ROUGE and BERTSCORE indeed show poor correlation with human judgment for opinion summarization, they may still serve useful roles in production pipelines as fast, deterministic sanity checks or for detecting gross anomalies. Complete reliance on LLM-based evaluation introduces dependencies on model availability, costs, and potential API rate limits that may not exist with traditional metrics.

A Spearman correlation of 0.70 with human judgment, while strong, is still well short of perfect agreement, so a meaningful share of human rating behavior is not captured by the automated approach. For high-stakes production decisions, this gap may necessitate hybrid evaluation strategies that combine automated assessment with periodic human audits, especially for edge cases or when summary quality significantly impacts business outcomes.
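One possible sketch of such a hybrid setup (the audit workflow and helper names below are assumptions; only `scipy.stats.spearmanr`, the same rank-correlation statistic reported in the paper, is a real library call) is to periodically sample judge-scored summaries for human re-rating and track agreement over time:

```python
# Sketch of a periodic judge-vs-human audit for an LLM-as-judge pipeline.
# The sampling workflow and function names are hypothetical; only
# scipy.stats.spearmanr is a real library call.
import random
from scipy.stats import spearmanr

def audit_llm_judge(scored_summaries, human_rate, sample_size=50, min_rho=0.6):
    """Spot-check agreement between LLM judge scores and human ratings.

    scored_summaries: list of (summary_id, llm_score) pairs from production.
    human_rate: callable returning a human 1-5 rating for a summary_id.
    """
    sample = random.sample(scored_summaries, min(sample_size, len(scored_summaries)))
    llm_scores = [score for _, score in sample]
    human_scores = [human_rate(summary_id) for summary_id, _ in sample]

    rho, p_value = spearmanr(llm_scores, human_scores)
    if rho < min_rho:
        # Agreement has drifted below the acceptable threshold: revisit the
        # prompt, the judge model, or the kinds of summaries being sampled.
        print(f"WARNING: judge-human agreement dropped to rho={rho:.2f} (p={p_value:.3f})")
    return rho
```

Tracking this correlation per dimension over time turns the 0.70 figure from a one-off benchmark number into an ongoing health metric for the evaluator itself.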
The research focuses primarily on English-language Amazon reviews, and the generalizability to other languages, domains, or review types common on platforms like Flipkart (which serves diverse Indian markets) remains to be demonstrated. Cross-lingual evaluation and handling of code-mixed reviews (common in multilingual contexts) would be important extensions for full production deployment.

From a cost perspective, using LLMs as evaluators introduces ongoing inference costs that scale with the volume of summaries to evaluate. While the modular prompt design is efficient, organizations need to carefully model the economic tradeoffs between automated LLM evaluation costs, traditional metric computation costs, and human annotation costs to find the optimal evaluation strategy for their scale and use case.

The paper mentions that modern LLMs generate summaries preferred over human references, which is presented positively but raises questions about whether human references are the right benchmark. This finding could reflect limitations in how the human reference summaries were created rather than purely superior LLM capabilities. In production contexts, the ultimate validation is whether summaries drive desired user behaviors (purchases, reduced returns, higher satisfaction) rather than matching expert preferences.

## LLMOps Integration and Operational Implications

From an operational standpoint, OP-I-PROMPT represents a practical approach to the "LLM as judge" pattern that has become increasingly important in LLMOps. The framework addresses several key operational challenges. The dimension-independent design means that as evaluation requirements evolve (perhaps adding new dimensions like "personalization" or "cultural sensitivity"), the core infrastructure remains stable while only metric definitions change. This reduces technical debt and makes the evaluation system more maintainable.

The use of chain-of-thought prompting provides interpretability that is valuable for debugging and continuous improvement. When the automated evaluation disagrees with spot-check human assessments, practitioners can examine the model's reasoning chain to understand whether the issue stems from prompt design, model limitations, or genuine edge cases. This level of transparency is often missing from black-box evaluation approaches.

The research demonstrates feasibility with open-source models like Mistral-7B, which opens deployment options that may be preferable for organizations with data privacy constraints, cost sensitivity, or requirements for on-premises deployment. The ability to self-host evaluation models provides greater control over the entire ML pipeline and reduces dependency on external API providers.

For continuous integration and deployment pipelines, automated evaluation with OP-I-PROMPT could be integrated as a quality gate, automatically assessing summary quality for new model versions before promotion to production. The standardized 1-5 rating scale across dimensions enables setting clear quality thresholds and tracking quality metrics over time, supporting data-driven decisions about model updates and rollbacks.
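A minimal sketch of what such a gate could look like (the thresholds, dimension keys, and function below are illustrative assumptions, not part of the published work):

```python
# Hypothetical quality gate for a CI/CD pipeline: block promotion of a new
# summarization model if its per-dimension evaluation scores fall below
# illustrative thresholds. Names and numbers are assumptions, not from the paper.
DIMENSION_THRESHOLDS = {
    "fluency": 4.0,
    "coherence": 4.0,
    "relevance": 3.5,
    "faithfulness": 4.5,   # stricter: unsupported claims are costly in e-commerce
    "aspect_coverage": 3.5,
    "sentiment_consistency": 4.0,
    "specificity": 3.5,
}

def passes_quality_gate(avg_scores: dict[str, float]) -> bool:
    """avg_scores maps each dimension to the mean 1-5 judge score on a held-out set."""
    failures = {
        dim: score
        for dim, score in avg_scores.items()
        if score < DIMENSION_THRESHOLDS.get(dim, 0.0)
    }
    if failures:
        print(f"Quality gate failed on: {failures}")
        return False
    return True
```

Per-dimension thresholds also let teams encode business priorities, for example holding faithfulness to a stricter bar than specificity.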
## Research Contributions and Future Directions

The work makes several tangible contributions to the LLMOps community. SUMMEVAL-OP provides a reusable benchmark that other researchers and practitioners can use to evaluate their own summarization systems, enabling more consistent comparisons across different approaches. The multi-dimensional evaluation framework offers a template that could be adapted to other generation tasks beyond opinion summarization where quality is inherently multi-faceted.

The OP-I-PROMPT methodology demonstrates that thoughtful prompt engineering can achieve strong results without fine-tuning, which is valuable for organizations that lack the infrastructure or expertise for large-scale model training. The modular design pattern itself is generalizable to other evaluation scenarios where multiple related dimensions need assessment.

The authors acknowledge plans to extend this work to handle large-scale and multi-source evaluations, which would address important production scenarios like evaluating summaries across thousands of products simultaneously or aggregating reviews from multiple platforms. These extensions would further enhance the practical applicability of the approach.

## Business Impact and Broader Implications

The connection between summary quality and business metrics (conversion rates, customer satisfaction, return rates) establishes a clear value proposition for investing in sophisticated evaluation infrastructure. However, the research would be strengthened by quantitative evidence demonstrating this connection. Metrics showing how improvements in specific evaluation dimensions (e.g., increasing faithfulness from 3 to 4) translate to measurable changes in business KPIs would help justify the operational investment required to deploy and maintain LLM-based evaluation systems.

The work illustrates a broader trend in LLMOps where LLMs are used not just as primary generative systems but as components in evaluation, monitoring, and quality assurance infrastructure. This "LLMs evaluating LLMs" pattern introduces interesting considerations around potential biases (e.g., models favoring outputs similar to their own generation style) that warrant ongoing attention in production deployments.

Overall, Flipkart's research addresses a genuine pain point in operating production LLM systems at scale: the need for reliable, comprehensive, automated evaluation that aligns with human judgment. The practical focus on e-commerce applications, the emphasis on open-source model compatibility, and the modular design philosophy all reflect mature thinking about operational realities. While questions remain about generalizability, costs, and the completeness of any automated evaluation, the work represents a solid step toward more robust LLMOps practices for content generation systems.
