AI-Powered Ad Description Generation for Classifieds Platform

Leboncoin 2025
Leboncoin, a French classifieds platform, addressed the "blank page syndrome" where sellers struggled to write compelling ad descriptions, leading to poorly described items and reduced engagement. They developed an AI-powered feature using Claude Haiku via AWS Bedrock that automatically generates ad descriptions based on photos, titles, and item details while maintaining human control for editing. The solution was refined through extensive user testing to match the platform's authentic, conversational tone, and early results show a 20% increase in both inquiries and completed transactions for ads using the AI-generated descriptions.

Industry

E-commerce

Overview

Leboncoin, a major French classifieds platform serving 30 million monthly visitors, implemented a production-scale generative AI solution to address a persistent user experience challenge known as “blank page syndrome.” This case study illustrates the deployment of large language models in a consumer-facing application where the AI acts as a writing assistant rather than a replacement for human creativity. The implementation showcases thoughtful LLMOps practices including extensive prompt engineering, iterative user testing, cost management considerations, and careful integration of legal compliance into the development process.

The business problem was clear and measurable: sellers struggled to write effective ad descriptions, which created a negative cycle of poor-performing listings leading to frustrated users who were less likely to post future ads. This directly impacted both the quality and quantity of ads on the platform. Rather than providing simple templates or writing tips, Leboncoin chose to leverage generative AI to fundamentally transform the ad creation experience while maintaining the authentic, human character that defines their marketplace.

Technical Architecture and Infrastructure

The production system is built on Claude Haiku, accessed through AWS Bedrock infrastructure. This architectural choice is significant from an LLMOps perspective as it demonstrates the use of managed AI services rather than self-hosted models, which offers several operational advantages including simplified scaling, reduced infrastructure management overhead, and access to enterprise-grade security and compliance features. AWS Bedrock provides a unified API for accessing foundation models while handling the underlying infrastructure complexity.

The system operates as a multimodal solution, processing multiple input types to generate descriptions. The AI considers uploaded photos, item titles, and structured item details (category-specific attributes) to generate contextually relevant descriptions. This multimodal approach is critical for the use case, as visual information about the item’s condition and characteristics can significantly inform the quality of generated text. The architecture must handle image preprocessing, feature extraction, and coordinated processing of visual and textual inputs before generating the final description.
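
To make the multimodal pipeline concrete, here is a minimal sketch of what a call to Claude Haiku through AWS Bedrock might look like, combining a photo, a title, and structured attributes into a single request. The region, model ID, prompt wording, and function shape are illustrative assumptions, not details disclosed by Leboncoin.

```python
# Illustrative sketch only; not Leboncoin's actual implementation.
import base64
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-3")

def generate_description(image_bytes: bytes, title: str, attributes: dict) -> str:
    """Ask Claude Haiku for an ad description from a photo plus structured details."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 300,
        "messages": [{
            "role": "user",
            "content": [
                # The photo travels as base64 alongside the text prompt.
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/jpeg",
                    "data": base64.b64encode(image_bytes).decode(),
                }},
                {"type": "text", "text":
                    f"Write a short, friendly classified-ad description.\n"
                    f"Title: {title}\nDetails: {json.dumps(attributes)}"},
            ],
        }],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps(body),
    )
    return json.loads(response["body"].read())["content"][0]["text"]
```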

From a production deployment perspective, the feature is integrated directly into the ad creation flow across multiple categories including consumer goods and vehicles, with real estate being the next planned expansion. This category-by-category rollout strategy represents a pragmatic LLMOps approach that allows for domain-specific optimization and risk mitigation rather than attempting a platform-wide deployment simultaneously.

Prompt Engineering as Core Development Activity

One of the most revealing insights from this case study is how prompt engineering became the central focus of the development and iteration process. The product manager explicitly notes that “what’s particularly noteworthy is how much of our discovery process ultimately centered around prompt creation. The prompts became the critical interface between user needs and generative AI capabilities.” This observation highlights a fundamental shift in product development for LLM-powered features: the primary engineering artifact isn’t traditional code but rather the carefully crafted prompts that guide model behavior.

The team invested significant iteration cycles in refining prompts to achieve the right balance between structure and flexibility. Initial attempts produced descriptions that were too verbose and formal, reading “more like product catalogs than leboncoin’s ads.” This mismatch between model output and platform expectations required multiple rounds of prompt refinement. The team had to encode platform-specific knowledge into the prompts, including the platform’s characteristic conversational tone, expectations around description length, and the search keywords that make ads discoverable.
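
As an illustration of what encoding that platform knowledge can look like, here is a hypothetical system prompt along these lines. The real prompts are not public, so the wording and rules below are assumptions grounded only in the behaviors described above.

```python
# Hypothetical prompt template; the actual Leboncoin prompts are not public.
SYSTEM_PROMPT = """\
You write classified-ad descriptions for leboncoin, a French marketplace.
Style rules:
- Sound like a real person selling to a neighbour, not a product catalog.
- Keep it short: 2-4 sentences, under 80 words.
- Mention condition, notable features, and reason for selling if provided.
- Naturally include search keywords buyers would type (brand, model, size).
- Never invent details that are not in the title, photos, or attributes.
"""
```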

This iterative prompt development process was tightly coupled with user testing, creating a feedback loop where real user reactions directly informed prompt modifications. The case study notes “each iteration brought us closer to that perfect leboncoin tone,” suggesting a systematic approach to evaluating and refining prompt effectiveness based on qualitative user feedback.

Evaluation and Testing Methodology

The team employed a user-centric evaluation approach that prioritized qualitative assessment alongside any automated metrics. Multiple rounds of user interviews and testing sessions were conducted, with test participants providing feedback on different versions of AI-generated outputs. This human-in-the-loop evaluation methodology is particularly appropriate for a consumer-facing feature where subjective qualities like “authenticity” and “natural tone” are critical success factors that automated metrics struggle to capture.

Key evaluation dimensions included authenticity and natural tone, description length and style, and coverage of the keywords that make ads discoverable in search.

The product manager’s background context emphasizes that “our users became our co-creators, helping us fine-tune not just the length and style of descriptions, but also the essential keywords that make ads more discoverable.” This collaborative approach to evaluation represents a best practice in LLMOps where domain experts and end users contribute specialized knowledge that technical teams may lack.

Notably, the case study indicates that “the traditional discovery process remains essential, but with AI products, we found ourselves cycling through iterations much faster.” This acceleration of iteration cycles is both an advantage and a challenge in LLMOps—while rapid experimentation is possible, it requires disciplined evaluation frameworks to ensure changes represent genuine improvements rather than just differences.
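
A disciplined evaluation framework for this kind of qualitative iteration can be as simple as structured score-keeping per prompt version. The sketch below assumes 1-to-5 ratings collected in user-testing sessions; the rating dimensions mirror those the case study mentions, but the schema itself is illustrative.

```python
# Illustrative evaluation log for comparing prompt versions.
from collections import defaultdict
from statistics import mean

class PromptEvaluation:
    def __init__(self, prompt_version: str):
        self.prompt_version = prompt_version
        self.ratings: dict[str, list[int]] = defaultdict(list)

    def record(self, dimension: str, score: int) -> None:
        """Store a 1-5 rating from a user-testing session."""
        self.ratings[dimension].append(score)

    def summary(self) -> dict[str, float]:
        """Average score per dimension, for comparing prompt iterations."""
        return {dim: round(mean(scores), 2) for dim, scores in self.ratings.items()}

v2 = PromptEvaluation("tone_v2")
v2.record("authenticity", 4)
v2.record("keyword_coverage", 5)
print(v2.summary())  # e.g. {'authenticity': 4.0, 'keyword_coverage': 5.0}
```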

Production Constraints and Cost Management

The team made a deliberate architectural decision to limit users to one AI-generated description per ad. This constraint reflects sophisticated thinking about LLMOps operational considerations beyond pure technical capability. The rationale spans multiple dimensions: per-generation inference costs add up quickly at the platform’s scale, each generation carries an environmental footprint, and a single draft nudges users to edit and personalize rather than regenerate until something passable appears.

This single-generation constraint represents a thoughtful LLMOps tradeoff where business sustainability, environmental responsibility, and user behavior design align. It’s a reminder that production LLM systems must balance capability with practical operational constraints. The team essentially implemented a form of rate limiting at the feature level, baking resource management into the product design rather than treating it as a pure infrastructure concern.

From a technical implementation perspective, this constraint likely involves session tracking and state management to prevent repeated generations for the same ad draft, though the case study doesn’t detail the specific enforcement mechanisms.
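
As a concrete illustration of that enforcement, the sketch below implements a one-generation-per-draft rule in memory. This is an assumption about the mechanism, since the case study doesn’t describe it; a production version would use a shared store such as Redis rather than an in-process set.

```python
# Illustrative enforcement of the one-generation-per-ad rule.
class GenerationLimiter:
    def __init__(self):
        self._used_drafts: set[str] = set()

    def try_acquire(self, ad_draft_id: str) -> bool:
        """Return True the first time a draft requests a generation, False after."""
        if ad_draft_id in self._used_drafts:
            return False
        self._used_drafts.add(ad_draft_id)
        return True

limiter = GenerationLimiter()
assert limiter.try_acquire("draft-42") is True
assert limiter.try_acquire("draft-42") is False  # second attempt is refused
```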

Human-in-the-Loop Design and Control

A critical LLMOps principle demonstrated in this implementation is the positioning of AI as an assistive co-pilot rather than an autonomous agent. The generated description serves as a starting point that users can accept as-is, edit and refine, or completely replace. This design acknowledges several important considerations for production AI systems: sellers remain accountable for what their listings say, the model can miss or misread item-specific details, and preserving user agency keeps the marketplace’s authentic human voice.

This architectural choice reflects a mature understanding that production LLM systems often work best in collaboration with humans rather than attempting to replace human judgment entirely. The interface design must accommodate this workflow, providing easy editing capabilities while making the AI contribution valuable enough that users don’t simply delete everything and start over.
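
A minimal sketch of that contract between backend and interface might look like the following, assuming the generated text is returned as a pre-filled, fully editable draft; the field names are hypothetical.

```python
# Hypothetical response shape for the co-pilot pattern: the AI text arrives
# as a draft the user owns, not as final content.
def build_draft_response(generated_text: str) -> dict:
    return {
        "description": generated_text,  # pre-filled into the editable text field
        "source": "ai_draft",           # lets the product track accept/edit/replace
        "editable": True,               # user may keep, rework, or delete it
    }
```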

The case study highlights an innovative approach to legal compliance where legal team members were embedded in the design process from day one rather than serving as gatekeepers at the end. This collaborative model addressed several critical questions early in development, including how user photos and data would be processed by the model, what transparency and consent users must be given, and how the feature would comply with European data protection requirements.

This proactive legal integration represents a best practice for LLMOps, particularly in consumer-facing applications operating under European data protection frameworks like GDPR. The case notes this approach resulted in “no last-minute redesigns, no painful compromises,” suggesting that early legal involvement actually accelerated time-to-market by preventing late-stage blockers.

From an LLMOps governance perspective, this demonstrates the importance of establishing compliance frameworks before deployment rather than retrofitting them afterward. The team likely implemented mechanisms for data handling transparency, user consent flows, and audit trails that satisfy regulatory requirements while maintaining a seamless user experience.
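
As one illustration of what such an audit trail could record, here is a hypothetical per-generation record; every field is an assumption rather than a described part of Leboncoin’s system.

```python
# Hypothetical audit record sketching the kind of trail GDPR-conscious teams keep.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class GenerationAuditRecord:
    ad_draft_id: str
    model_id: str            # which model version produced the text
    user_consented: bool     # consent captured before photos were processed
    user_edited: bool        # whether the human changed the output
    created_at: datetime

record = GenerationAuditRecord(
    ad_draft_id="draft-42",
    model_id="anthropic.claude-3-haiku-20240307-v1:0",
    user_consented=True,
    user_edited=True,
    created_at=datetime.now(timezone.utc),
)
```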

Business Impact and Metrics

The production system has demonstrated measurable business impact, with ads using the AI-generated description feature showing a 20% increase in both inquiries and completed transactions. This metric is particularly meaningful because it measures actual business outcomes (successful transactions) rather than just engagement metrics or AI performance scores. The 20% improvement suggests the AI is genuinely creating better-performing ads, likely through more complete item information, more readable and engaging copy, and better coverage of the search keywords that drive discoverability.

These results validate the product hypothesis that better descriptions drive marketplace performance. However, it’s worth noting that the case study doesn’t provide detailed statistical methodology, control group definitions, or confidence intervals, so we should view these figures as indicative rather than rigorously controlled experimental results.

The team also reports strong qualitative feedback, with “this is exactly what I needed” becoming a common refrain in testing sessions. This combination of quantitative business metrics and qualitative user satisfaction suggests a successful production deployment.

The feature’s success was recognized externally when Leboncoin received the grand prize for innovation at the Grand Prix Favor’i e-commerce in March 2025, organized by FEVAD (the French e-commerce and distance selling federation).

Scaling and Category Expansion Strategy

The team’s approach to category-by-category expansion demonstrates pragmatic LLMOps scaling strategy. After initial deployment in consumer goods and mobility (vehicles), they’re now expanding to real estate—but explicitly not just rolling out the existing solution as-is. The product manager asks: “Why not just roll out what we already have? Well… selling an apartment is quite different from selling a Just Dance game.”

This recognition that different categories require domain-specific optimization is critical for production LLM systems. Real estate ads require a different information architecture: structured attributes such as location, surface area, and amenities carry far more weight than the conversational flourishes that suit consumer goods.

Each category essentially requires its own prompt engineering effort, evaluation methodology, and potentially different model configurations. This category-specific approach allows for domain-tuned prompts, category-appropriate evaluation criteria, and incremental risk management before each wider rollout.

The case study notes “the discovery process will be faster this time around” for real estate, suggesting the team has developed reusable frameworks and methodologies even though the specific prompts and evaluation criteria must be adapted for each domain.
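
One way to make such a reusable framework concrete is a per-category configuration registry, so that prompts, length budgets, and required attributes vary by domain while the surrounding pipeline stays shared. Everything in the sketch below is illustrative.

```python
# Sketch of a per-category configuration registry; names and values are assumptions.
CATEGORY_CONFIGS = {
    "consumer_goods": {
        "prompt_template": "conversational_seller_v3",
        "max_words": 80,
        "required_attributes": ["condition"],
    },
    "vehicles": {
        "prompt_template": "vehicle_spec_v2",
        "max_words": 120,
        "required_attributes": ["mileage", "year"],
    },
    "real_estate": {  # next planned expansion; different information needs
        "prompt_template": "real_estate_v1",
        "max_words": 200,
        "required_attributes": ["surface_area", "location"],
    },
}

def config_for(category: str) -> dict:
    """Look up the domain-specific settings for one ad category."""
    return CATEGORY_CONFIGS[category]
```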

Critical Analysis and Considerations

While the case study presents a largely positive narrative, a balanced assessment should consider several factors:

Limited technical transparency: The case provides minimal detail about the actual technical implementation—prompt structures, image processing pipelines, response time requirements, fallback mechanisms, or monitoring approaches. This makes it difficult to assess the sophistication of the LLMOps practices beyond what’s explicitly described.

Claimed results without methodology: The 20% improvement figure is impressive but lacks context about statistical significance, sample size, control group definition, or potential confounding factors. Was this a randomized experiment or an observational comparison? How were transactions attributed to the AI feature versus other factors?
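
For context, the kind of analysis that would substantiate the figure is straightforward: a two-proportion z-test on conversion rates between ads with and without AI descriptions. The counts below are invented purely to show the calculation.

```python
# Worked example of the check the case study doesn't report.
from math import sqrt
from statistics import NormalDist

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int):
    """z statistic and two-sided p-value for a difference in proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical: 10% baseline conversion vs 12% with AI descriptions (+20% relative).
z, p = two_proportion_z(success_a=1000, n_a=10000, success_b=1200, n_b=10000)
print(f"z={z:.2f}, p={p:.4g}")  # large samples make a 20% lift easily detectable
```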

Cost sustainability questions: While the team acknowledges AI is expensive and implements single-generation limits, the case doesn’t address whether the current architecture is economically sustainable at scale or what unit economics look like. As adoption grows, will infrastructure costs become prohibitive?

Model dependency risks: Relying on Claude Haiku via AWS Bedrock creates vendor dependency. What happens if pricing changes significantly, if model behavior shifts with updates, or if the service experiences outages? The case doesn’t discuss model versioning strategies or fallback mechanisms.

Content quality monitoring: How does the system detect and prevent problematic generations at scale? What monitoring and alerting exist for quality degradation? The human-in-the-loop design provides a safety layer, but are there automated quality checks before content reaches users?
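
Automated pre-display guardrails need not be elaborate; the sketch below shows the flavor of heuristic checks a team might run before generated text reaches the seller. The rules are illustrative assumptions, not a described part of the system.

```python
# Sketch of pre-display quality guardrails; thresholds and patterns are illustrative.
import re

MAX_WORDS = 120
BANNED_PATTERNS = [r"\bhttps?://", r"\b(call now|guaranteed)\b"]  # spam-like markers

def passes_quality_checks(text: str) -> bool:
    words = text.split()
    if not words or len(words) > MAX_WORDS:
        return False  # empty or runaway generation
    if any(re.search(p, text, re.IGNORECASE) for p in BANNED_PATTERNS):
        return False  # promotional or spam-like phrasing
    return True

assert passes_quality_checks("Selling a gently used bike, great condition.")
```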

Multimodal complexity: Processing images alongside text adds complexity. How robust is the system to various image qualities, angles, or types? What happens when images are ambiguous or misleading?

Despite these questions, the case study demonstrates several LLMOps strengths: thoughtful prompt engineering as a core discipline, user-centric evaluation methodology, deliberate operational constraints, human-centered design that maintains user agency, and proactive legal integration. The measurable business impact and external recognition suggest a genuinely successful production deployment that balances technical capability with practical considerations.

More Like This

Scaling AI Product Development with Rigorous Evaluation and Observability

Notion 2025

Notion AI, serving over 100 million users with multiple AI features including meeting notes, enterprise search, and deep research tools, demonstrates how rigorous evaluation and observability practices are essential for scaling AI product development. The company uses Braintrust as their evaluation platform to manage the complexity of supporting multilingual workspaces, rapid model switching, and maintaining product polish while building at the speed of AI industry innovation. Their approach emphasizes that 90% of AI development time should be spent on evaluation and observability rather than prompting, with specialized data specialists creating targeted datasets and custom LLM-as-a-judge scoring functions to ensure consistent quality across their diverse AI product suite.

Accelerating Game Asset Creation with Fine-Tuned Diffusion Models

Rovio 2025

Rovio, the Finnish gaming company behind Angry Birds, faced challenges in meeting the high demand for game art assets across multiple games and seasonal events, with artists spending significant time on repetitive tasks. The company developed "Beacon Picasso," a suite of generative AI tools powered by fine-tuned diffusion models running on AWS infrastructure (SageMaker, Bedrock, EC2 with GPUs). By training custom models on proprietary Angry Birds art data and building multiple user interfaces tailored to different user needs—from a simple Slackbot to advanced cloud-based workflows—Rovio achieved an 80% reduction in production time for specific use cases like season pass backgrounds, while maintaining brand quality standards and keeping artists in creative control. The solution enabled artists to focus on high-value creative work while AI handled repetitive variations, ultimately doubling content production capacity.

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
