Company
Mercado Libre
Title
Practical Lessons from Deploying LLMs in Production at Scale
Industry
E-commerce
Year
2024
Summary (short)
Mercado Libre explored multiple production applications of Large Language Models across their e-commerce and technology platform, tackling challenges in knowledge retrieval, documentation generation, and natural language processing. The company implemented a RAG system for developer documentation using Llama Index, automated documentation generation for thousands of database tables, and built natural language input interpretation systems using function calling. Through iterative development, they learned critical lessons about the importance of underlying data quality, prompt engineering iteration, quality assurance for generated outputs, and the necessity of simplifying tasks for LLMs through proper data preprocessing and structured output formats.
## Overview

Mercado Libre, one of Latin America's largest e-commerce and technology platforms, embarked on a comprehensive exploration of Large Language Models in production settings, implementing multiple use cases that provide valuable insights into real-world LLMOps challenges. The article presents a candid behind-the-scenes look at three distinct production applications: a Retrieval Augmented Generation system for technical documentation, an automated documentation generation system for data assets, and natural language input interpretation for booking and product information extraction. What makes this case study particularly valuable is the company's willingness to discuss not just successes but also failures and iterative improvements, providing a realistic view of deploying LLMs in production environments.

## Use Case 1: Retrieval Augmented Generation for Developer Documentation

Mercado Libre's first major LLM production deployment focused on creating a centralized question-answering system for developers. The system needed to handle queries about both widely-used external tools (BigQuery, Tableau, Looker) and proprietary internal tools (Fury Data Applications, Data Flow, etc.). The implementation leveraged Llama Index, an open-source framework that manages the entire RAG pipeline including knowledge index construction, storage, context retrieval, and answer generation.

The technical architecture centered on building a searchable knowledge base from documentation, with the system retrieving relevant context based on user queries and generating responses complete with source material links.
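The article does not include code, but the pipeline it describes maps closely onto Llama Index's standard workflow. The following is a minimal sketch under assumed defaults (a local documentation folder, the library's default OpenAI-backed embedding and chat models, and a made-up query); the path and the query are illustrative, not Mercado Libre's actual configuration.

```python
# Minimal RAG sketch with LlamaIndex (imports shown for llama-index >= 0.10;
# older releases expose the same classes from the top-level `llama_index` package).
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# 1. Build a searchable knowledge index from documentation files.
documents = SimpleDirectoryReader("./internal_docs").load_data()  # hypothetical path
index = VectorStoreIndex.from_documents(documents)

# 2. Retrieve relevant context for a user query and generate a grounded answer.
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("How do I schedule a job in Data Flow?")

# 3. Surface the answer together with links back to the source material.
print(response.response)
for source in response.source_nodes:
    print(source.node.metadata.get("file_name"), source.score)
```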
The initial prototype demonstrated impressive capabilities and generated significant excitement among stakeholders, functioning effectively as an intelligent search engine for technical documentation. However, the production deployment quickly revealed critical limitations. As user adoption expanded beyond early enthusiasts, the team observed a concerning pattern: when users asked questions about topics not adequately covered in the documentation, the LLM would fall back on its general pre-training knowledge, leading to hallucinated responses that were confidently stated but factually incorrect in the context of Mercado Libre's specific tools and processes.

This led to their first major production learning: **LLMs cannot reliably answer questions beyond their contextual knowledge base**. While this seems obvious in retrospect, the smooth functioning of the system when documentation was adequate masked this fundamental limitation during initial testing. The team realized they needed a systematic approach to ensure the model only answered questions where it had sufficient contextual information.

The solution involved implementing rigorous testing protocols. The team developed test sets comprising queries they needed the system to answer accurately, as well as queries they explicitly wanted the system to decline or redirect. This evaluation process exposed significant gaps in the underlying documentation—certain tools, actions, and workflows that users frequently inquired about simply weren't documented at all. More subtly, the testing revealed documentation quality issues even where information technically existed. Some documentation described processes mechanically without explaining the "why"—the user problems these processes solved or the contexts in which they should be applied. This lack of problem-oriented documentation made it difficult for the RAG system to match user queries (which were typically problem-focused) with relevant documentation (which was procedure-focused). The team learned that **documentation quality and structure fundamentally determine RAG system performance**, and that improving LLM-based systems often means improving the underlying data assets rather than tuning the model or prompts.

## Use Case 2: Automated Documentation Generation at Scale

Building on insights from the RAG system deployment, Mercado Libre expanded their ambitions to enable the documentation system to answer questions about data sources—specifically helping users identify which tables and fields contained the information they needed. However, integrating table descriptions into the RAG system produced disappointing results, and the root cause was immediately apparent: their existing data catalog documentation was cursory and superficial.

The scope of the problem was daunting. Out of approximately 4,000 production tables, roughly 2,000 lacked adequate documentation. These undocumented tables were typically understood only by their creators or shared informally within specific teams, representing a massive knowledge management challenge. Creating comprehensive documentation manually was neither feasible nor cost-effective, making this an ideal application for LLM-based automation.

The technical approach involved leveraging existing metadata—table names, field names, field types, and whatever existing technical documentation was available—to generate human-readable, contextual documentation. The initial prompt was straightforward: "You're an expert documenter, please create documentation for table {TABLE_NAME} based on the following elements." This simple approach achieved remarkable results, with approximately 90% of table owners agreeing with the generated documentation after making only minor adjustments.

However, the 10% who rejected the generated documentation provided critical feedback that drove iterative improvements. The complaints clustered around three issues: lack of clear structure for easy comprehension, failure to explain internal acronyms and technical terminology specific to Mercado Libre, and cases where existing documentation was already adequate and the generated version added no value or introduced errors.

This experience taught them a crucial LLMOps lesson about the importance of **iterative prompt engineering combined with systematic quality assurance**. The team realized they needed to move beyond treating prompts as static instructions and instead view them as products requiring continuous improvement based on production feedback. They implemented QA protocols that asked specific questions about generated outputs: Does the text behave as intended? Is it using all available information? Are we lacking information needed for accurate responses?

The refined approach involved creating dynamic, structured prompts that adapted based on available input information. Rather than a single generic prompt, they developed a system that could operate at different levels of information availability—generating more detailed documentation when rich metadata was available, and more conservative documentation when information was sparse. They also established clear output schemas defining what good documentation should look like, providing the model with explicit formatting and structural guidance.
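The published piece does not reproduce the team's prompts, so the sketch below only illustrates the pattern: a prompt builder that includes optional context blocks when metadata exists, falls back to more conservative instructions when it does not, and always pins an explicit output structure. The field names, glossary, and schema text are hypothetical.

```python
# Illustrative documentation-generation prompt builder (not Mercado Libre's actual prompts).

OUTPUT_SCHEMA = """Structure the documentation as:
1. Purpose: one paragraph on what the table contains and who would use it.
2. Fields: one line per field with its name, type, and plain-language meaning.
3. Caveats: expanded acronyms, refresh assumptions, and any gaps in the metadata."""

def build_doc_prompt(table_name, fields, existing_doc=None, glossary=None):
    """Assemble a prompt that adapts to whatever metadata is available."""
    parts = [
        "You are an expert documenter for an e-commerce data platform.",
        f"Write documentation for the table `{table_name}`.",
        "Fields (name: type):",
        "\n".join(f"- {name}: {ftype}" for name, ftype in fields),
    ]
    # Optional blocks are added only when the information actually exists,
    # so sparsely documented tables get a more conservative prompt.
    if existing_doc:
        parts.append(f"Existing documentation to preserve and improve:\n{existing_doc}")
    if glossary:
        parts.append("Internal acronyms to expand:\n" +
                     "\n".join(f"- {k}: {v}" for k, v in glossary.items()))
    else:
        parts.append("If you meet an acronym you cannot expand, say so rather than guessing.")
    parts.append(OUTPUT_SCHEMA)
    return "\n\n".join(parts)
```

The resulting string is then sent to whichever chat model the team has selected; the design point is that structure and caveats live in the prompt and schema rather than in post-hoc cleanup of free-form output.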
This evolution from simple prompt-based generation to a sophisticated documentation pipeline demonstrates mature LLMOps thinking: treating LLM applications as systems requiring ongoing monitoring, evaluation against clear quality metrics, and continuous improvement based on stakeholder feedback.

## Use Case 3: Natural Language Input Interpretation with Function Calling

The third production use case addressed a different challenge: extracting structured information from unstructured natural language text. This went beyond traditional Named Entity Recognition (which handles entities like names, dates, and organizations) to tackle more complex interpretation challenges—understanding that "next Thursday" refers to a specific calendar date, or identifying which numbers in a product listing represent quantities versus specifications.

One concrete example involved product listings where unit quantities were expressed inconsistently: "ct.c/2000" (indicating a package containing 2000 items), "pcs" for pieces, "u" for units, and various other non-standardized formats. Traditional regex or rule-based extraction would struggle with this variability, but LLMs could leverage contextual understanding to interpret the intended meaning.

Another application was the "Data Doctors" booking system, which connected developers and business users with data experts (specialists in databases, dashboards, machine learning, etc.). The team wanted to enable natural language booking requests like "I want to consult an expert in Tableau who is available next Thursday." This required extracting structured information—topic expertise (Tableau) and date—from free-form text, with the particular challenge that relative date expressions like "next Thursday" needed conversion to absolute dates for calendar system integration.

The technical solution leveraged **function calling**, a capability introduced in GPT-3.5 and available in other models like LLaMA 2. Function calling allows LLMs to output structured data in predefined schemas rather than generating free-form text. By defining functions with specific parameters (expertise areas, dates in ISO format, etc.), the team could constrain the LLM to extract information in formats directly usable by downstream systems.

This approach proved particularly effective for **extracting specific information already contained in text** while ensuring consistency and preventing the verbose, contextual responses that LLMs naturally generate. Rather than receiving a sentence like "The user wants to book an appointment with a Tableau expert on December 12, 2024," the system would receive structured data: `{"expertise": "Tableau", "date": "2024-12-12"}`, ready for direct database insertion or API calls.
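As a concrete illustration of the pattern, the sketch below uses the OpenAI Python SDK's tools interface to turn the booking request into exactly that structured payload. The tool name, parameter schema, and the reference date supplied in the system message are assumptions for the example, not details taken from the article.

```python
# Sketch of structured extraction via function calling (OpenAI Python SDK >= 1.0).
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tools = [{
    "type": "function",
    "function": {
        "name": "book_expert_session",  # hypothetical downstream booking function
        "description": "Book a session with a data expert.",
        "parameters": {
            "type": "object",
            "properties": {
                "expertise": {"type": "string", "description": "Topic, e.g. Tableau or BigQuery"},
                "date": {"type": "string", "description": "Absolute date in ISO format (YYYY-MM-DD)"},
            },
            "required": ["expertise", "date"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        # Supplying today's date lets the model resolve relative expressions like "next Thursday".
        {"role": "system", "content": "Today is Thursday, 2024-12-05. Extract booking details from the request."},
        {"role": "user", "content": "I want to consult an expert in Tableau who is available next Thursday."},
    ],
    tools=tools,
    tool_choice="auto",
)

call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)  # e.g. {"expertise": "Tableau", "date": "2024-12-12"}
```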
The function calling use case demonstrates sophisticated LLMOps practice: understanding not just what LLMs can do, but selecting the appropriate interaction pattern for specific production requirements. Rather than treating LLMs solely as text generators, the team leveraged them as intelligent parsers and interpreters, extracting structured data from unstructured inputs.

## Cross-Cutting LLMOps Lessons and Production Practices

Throughout these three use cases, Mercado Libre developed several overarching insights about LLMs in production:

**Data Quality as Foundation**: Perhaps the most recurring theme is that LLM system performance fundamentally depends on underlying data quality. The RAG system required comprehensive, well-structured documentation. The documentation generation system's output quality correlated directly with the richness of input metadata. No amount of prompt engineering or model selection can compensate for inadequate source data, making data asset improvement often the most impactful intervention for LLM system performance.

**Iterative Development and Evaluation**: The team emphasized moving beyond initial prototypes to implement systematic testing and quality assurance. This includes creating evaluation sets covering both positive cases (queries that should work) and negative cases (queries that should be declined), monitoring production outputs for quality issues, and establishing feedback loops with end users. The evolution of their documentation generation prompts from simple and generic to complex and adaptive exemplifies this iterative approach.
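In practice, the evaluation sets mentioned above can start as nothing more than two curated lists of queries checked against the system on every change. The sketch below is a hypothetical minimal harness; the queries, the refusal marker, and the `rag_answer` callable are placeholders, not the team's actual test suite.

```python
# Hypothetical minimal evaluation harness for a RAG assistant.
SHOULD_ANSWER = [
    "How do I share a Looker dashboard with my team?",
    "Where can I find the logs for a Data Flow job?",
]
SHOULD_DECLINE = [
    "What is the revenue forecast for next quarter?",  # not covered by the documentation
    "How do I reset my personal laptop password?",     # out of scope for the assistant
]

def evaluate(rag_answer, decline_marker="I don't have documentation"):
    """Count correct answers on in-scope queries and correct refusals on out-of-scope ones."""
    answered = sum(decline_marker not in rag_answer(q) for q in SHOULD_ANSWER)
    declined = sum(decline_marker in rag_answer(q) for q in SHOULD_DECLINE)
    return {"answered_in_scope": answered, "declined_out_of_scope": declined}
```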
**Task Simplification and Preprocessing**: A key principle articulated in their final thoughts is that "raw" LLMs should not be treated as magic solutions. The team learned to **simplify tasks for LLMs through intensive data processing outside the model**. This means cleaning and structuring input data, breaking complex tasks into smaller steps, and handling deterministic logic with traditional code rather than expecting the LLM to handle everything. This approach reduces costs, improves reliability, and makes systems easier to debug.

**Appropriate Model Selection**: The article hints at cost-benefit considerations in model selection, noting companies should "use higher-cost models when necessary" after evaluating whether simpler, less expensive models can meet requirements. This suggests Mercado Libre developed practices around model tiering—using more capable (and expensive) models only for tasks that truly require their capabilities, while handling simpler tasks with smaller or cheaper alternatives.

**Structured Output Formats**: The function calling use case highlights the importance of constraining LLM outputs to structured formats when integrating with other systems. Rather than parsing free-form text responses, defining clear schemas for LLM outputs improves reliability and simplifies downstream processing.

## Production Challenges and Limitations

While the article presents these use cases positively, it's important to note limitations and potential concerns from an LLMOps perspective:

The article doesn't detail monitoring and observability practices beyond initial QA. Production LLM systems require ongoing monitoring of response quality, latency, cost, and failure modes—practices not explicitly discussed here.

There's limited discussion of handling edge cases and failure modes systematically. While the RAG system's hallucination issues are mentioned, the article doesn't describe comprehensive strategies for detecting and handling various failure scenarios in production.

Cost analysis is absent. LLM-based systems, particularly those using GPT-3.5-turbo or GPT-4, incur per-token costs that can scale significantly with usage. The documentation generation use case processing thousands of tables, and the RAG system handling developer queries, both represent non-trivial ongoing costs that aren't quantified.

The article doesn't discuss versioning and model updates. As underlying models (like GPT-3.5-turbo) are updated by providers, outputs can change, potentially breaking applications. How Mercado Libre handles this dependency on external model versions isn't covered.

Security and data privacy considerations aren't addressed. Sending proprietary table schemas, internal documentation, and potentially sensitive queries to external LLM APIs raises data governance questions not explored in the article.

## Balanced Assessment

This case study provides genuine value by honestly discussing challenges and failures alongside successes, offering a realistic view of LLM deployment rather than presenting an idealized success story. The technical approaches described—RAG with Llama Index, systematic prompt engineering, function calling—represent sound LLMOps practices that others can learn from.

However, readers should recognize that this represents relatively early-stage LLM adoption: the article dates from mid-2024, still within the first wave of broadly accessible LLMs built around GPT-3.5-turbo. The practices described, while solid, are foundational LLMOps rather than the comprehensive CI/CD integration, automated evaluation pipelines, and production monitoring sophistication we might expect in more mature deployments.

The emphasis on data quality and iterative improvement is valuable and well-founded, though the article's format as a lessons-learned retrospective means we don't see detailed architectural diagrams, code examples, or quantitative metrics that would enable deeper technical assessment. Claims about 90% stakeholder acceptance of generated documentation should be viewed as anecdotal rather than rigorously measured outcomes.

Overall, this case study represents honest, practical sharing of real-world LLM deployment experiences at a major e-commerce platform, providing valuable lessons for organizations beginning similar journeys while maintaining healthy skepticism about the technology's limitations.
