A team at Prosus built web agents to automate food ordering across the company's e-commerce platforms. Rather than relying on APIs, the agents interact directly with websites, handling complex tasks such as searching, navigating menus, and placing orders. Through iterative development and optimization, the team reached its target of an 80% success rate on specific e-commerce tasks by implementing a modular architecture that separates planning from execution, combined with several operational modes for different scenarios.
This case study comes from a podcast discussion featuring the AI team at Prosus, a large technology company that operates numerous e-commerce and consumer platforms including iFood (food delivery), OLX (classifieds marketplace), and others, collectively serving approximately two billion consumers worldwide. The discussion centers on the team's six-month journey building web agents for production use, with a specific focus on automating food ordering tasks. The conversation features insights from Kiara, an engineer who led the web agent development, along with other team members who discuss the technical challenges and lessons learned.
Prosus recognized that a significant portion of e-commerce interactions happen through web interfaces rather than APIs. Their prediction that 10% of e-commerce transactions will be conducted by agents on behalf of consumers in the near future drove them to explore web agents as a critical capability. The fundamental challenge they faced was that most consumer-facing services—ordering food, booking flights, finding secondhand goods—don’t have APIs designed for agent interaction. These experiences are built for human navigation through browsers, involving complex multi-step processes with dynamic content, popups, captchas, and interactive elements.
The team specifically tackled building an agent to help users order food, which sounds deceptively simple but involves understanding user context (dietary restrictions, location, time of day, local events), navigating multiple food platforms, advising on available options and promotions, and ultimately completing the ordering transaction.
The team extensively evaluated available web agent frameworks before building their own solution. They tested frameworks including MultiOn (described as “probably the most famous at the moment”), WebVoyager, Agent-E, and various open-source projects from companies and research groups. Their findings were sobering:
The existing tools had significant limitations when applied to real-world production scenarios. Websites are built for humans and contain vast amounts of information not optimized for agent consumption. Dynamic content loading, enormous and changing DOM structures, and standard anti-bot measures like captchas all posed challenges. The WebVoyager benchmark, which many agents use for evaluation, showed that success rates varied dramatically between websites and tasks. Websites requiring many actions (like booking.com or Google Flights) proved particularly difficult because actions could trigger unexpected state changes: for example, selecting a departure airport might automatically change the available destination options.
Critically, the team discovered that benchmark performance on datasets like WebArena didn’t translate to their actual use cases. Results were “super unpredictable”: the same task might succeed one time and fail the next. They frequently observed agents getting stuck in loops or simply not knowing what action to take next. Even multimodal vision models struggled with a fundamental challenge: working out where on a webpage to click to execute a desired action, that is, translating intent into specific coordinates.
Given the limitations of existing tools, the team built their own web agent framework optimized for their specific use case. The architecture follows a planner-executor pattern, which represents an emerging best practice in agentic systems.
The core insight is that a web agent fundamentally mirrors API-based agents: it has access to information (screenshots, DOM content, browser state) and can take actions (clicking, typing, scrolling) which function as tools. However, the team made several key design decisions:
Planner-Executor Separation: They used distinct models for planning and execution. Planning requires sophisticated reasoning to understand user intent and devise multi-step strategies, so models with strong reasoning capabilities, such as Claude or similar, were preferred here. Execution, by contrast, operates in a constrained action space and doesn’t require deep reasoning; it needs reliable action selection and the ability to recognize task completion.
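The planner-executor split described above can be sketched roughly as follows. This is a minimal illustration, not the team's implementation: `Step`, `run_task`, `fake_planner`, and `fake_executor` are hypothetical names, and the two stand-in functions take the place of real model calls and browser actions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    action: str  # drawn from a small, fixed action space
    target: str  # URL, query text, or element label


def run_task(goal: str,
             planner: Callable[[str], List[Step]],
             executor: Callable[[Step], bool]) -> bool:
    """Plan once with the stronger model, then execute step by step."""
    steps = planner(goal)
    for step in steps:
        if not executor(step):
            return False  # stop at the first step the executor cannot complete
    return True


# Stand-ins for the two model calls (hypothetical, for illustration only).
ALLOWED_ACTIONS = {"open", "search", "click"}


def fake_planner(goal: str) -> List[Step]:
    return [Step("open", "https://example.com"),
            Step("search", goal),
            Step("click", "first result")]


def fake_executor(step: Step) -> bool:
    # A real executor would act in the browser; here we only check the action space.
    return step.action in ALLOWED_ACTIONS
```

Keeping the executor's action space small is what makes it plausible to use a cheaper, faster model for that role.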
Three Operational Modes: The team implemented three modes to optimize for speed and reliability:
The system falls back gracefully: it attempts fast approaches first, and if those fail, progressively uses slower but more thorough methods.
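A fallback chain of this kind might look like the sketch below. The mode names and handlers are placeholders, since the source does not spell out what each of the three modes does; only the "fast first, thorough later" ordering is taken from the text.

```python
from typing import Callable, Optional, Sequence, Tuple

# A mode is a (name, handler) pair; a handler returns None when it cannot finish the task.
Mode = Tuple[str, Callable[[str], Optional[str]]]


def run_with_fallback(task: str, modes: Sequence[Mode]) -> Tuple[str, str]:
    """Try each mode in order (fastest first) and return the first success."""
    for name, handler in modes:
        try:
            result = handler(task)
        except Exception:
            result = None  # treat a crash like a failure and move on
        if result is not None:
            return name, result
    raise RuntimeError(f"all modes failed for task: {task!r}")


# Hypothetical handlers standing in for real modes.
def fast_mode(task: str) -> Optional[str]:
    return None  # pretend the fast path did not work


def thorough_mode(task: str) -> Optional[str]:
    return f"done: {task}"


EXAMPLE_MODES = [("fast", fast_mode), ("thorough", thorough_mode)]
```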
A significant engineering effort went into DOM cleaning. Raw webpage DOMs contain enormous amounts of irrelevant information. The team filtered to only clickable elements and essential content, dramatically reducing the context the agent needed to process.
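As a rough illustration of this kind of filtering, the sketch below uses Python's standard `html.parser` to keep only elements that are likely to be clickable. The assumption that "clickable" can be approximated by tag name is ours, not the team's; a production cleaner would also need to handle elements made interactive through scripts or ARIA roles.

```python
from html.parser import HTMLParser

CLICKABLE_TAGS = {"a", "button", "input", "select", "textarea"}
VOID_TAGS = {"input"}  # no closing tag, so no inner text to collect


class ClickableExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.elements = []
        self._open = []  # clickable elements whose inner text is still being read

    def handle_starttag(self, tag, attrs):
        if tag not in CLICKABLE_TAGS:
            return
        attrs = dict(attrs)
        label = attrs.get("aria-label") or attrs.get("placeholder") or ""
        element = {"id": len(self.elements), "tag": tag, "text": label}
        self.elements.append(element)
        if tag not in VOID_TAGS:
            self._open.append(element)

    def handle_data(self, data):
        text = data.strip()
        if text and self._open:
            element = self._open[-1]
            element["text"] = (element["text"] + " " + text).strip()

    def handle_endtag(self, tag):
        if self._open and self._open[-1]["tag"] == tag:
            self._open.pop()


def clean_dom(html: str) -> list:
    """Return a compact list of clickable elements with short integer ids."""
    parser = ClickableExtractor()
    parser.feed(html)
    parser.close()
    return parser.elements
```

The short integer ids give the model something compact to refer to when choosing an element, instead of a full selector.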
Tool design followed a principle of minimizing agent choices. They merged tools that were always used together—for example, combining “type text” and “press enter” into a single search tool, since you essentially never type without submitting. This reduced the decision space and improved reliability.
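The merged search tool can be illustrated with a small sketch. `FakeBrowser` is a hypothetical stand-in that only records actions; a real implementation would drive a browser automation library instead.

```python
class FakeBrowser:
    """Records low-level actions instead of driving a real browser."""

    def __init__(self):
        self.log = []

    def type_text(self, selector: str, text: str) -> None:
        self.log.append(("type", selector, text))

    def press(self, key: str) -> None:
        self.log.append(("press", key))


def search(browser: FakeBrowser, selector: str, query: str) -> None:
    """Single tool exposed to the agent: type the query and submit it."""
    browser.type_text(selector, query)
    browser.press("Enter")


# Only the merged tool is exposed; the two primitives stay internal.
TOOLS = {"search": search}
```

With this arrangement the agent makes one decision (search) rather than two (type, then submit), which removes one place where it can stall or pick the wrong follow-up.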
Scrolling proved particularly challenging. Long restaurant menus or product lists required sophisticated scrolling strategies to capture all relevant information. The team adopted and refined techniques from open-source frameworks but optimized them for their specific use case.
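One simple scrolling strategy, shown here only as an illustration and not as the technique the team adopted, is to keep scrolling until a viewport yields nothing new. `get_visible` and `scroll_down` are hypothetical callbacks, and `FakePage` simulates a long menu.

```python
from typing import Callable, List


def collect_all_items(get_visible: Callable[[], List[str]],
                      scroll_down: Callable[[], None],
                      max_scrolls: int = 50) -> List[str]:
    """Scroll until no unseen items appear, keeping first-seen order."""
    seen: List[str] = []
    seen_set = set()
    for _ in range(max_scrolls):
        new_items = [item for item in get_visible() if item not in seen_set]
        if not new_items:
            break  # nothing new appeared, so assume the end of the list
        seen.extend(new_items)
        seen_set.update(new_items)
        scroll_down()
    return seen


class FakePage:
    """Simulates a list where only `window` items are visible at a time."""

    def __init__(self, items: List[str], window: int = 4):
        self.items, self.window, self.pos = items, window, 0

    def get_visible(self) -> List[str]:
        return self.items[self.pos:self.pos + self.window]

    def scroll_down(self) -> None:
        last_start = max(len(self.items) - self.window, 0)
        self.pos = min(self.pos + self.window, last_start)
```

The `max_scrolls` cap is a guard against infinitely loading pages.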
A powerful aspect of web-based agents is the ability to simulate and explore. The team sent agents to explore websites like web crawlers, defining success states (e.g., “found the item” or “added to cart”) and running tasks many times to learn successful trajectories. Out of 20 attempts at a given task, they would analyze which paths succeeded.
Successful trajectories were stored—including the path taken and page states encountered—creating a form of learned navigation expertise. This persistent memory allowed the agent to become “familiar” with websites over time, moving from exploration to expertise. The storage mechanism itself wasn’t critical as long as it was privacy-compliant, but having this historical data dramatically improved performance on repeated interactions.
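The explore-then-remember loop could be sketched as below. The in-memory `TrajectoryStore`, the choice of "shortest successful path" as the preferred trajectory, and the `fake_attempt` function are all assumptions made for illustration; the source only says that successful paths and page states were stored.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

Trajectory = List[str]  # ordered page or action labels


class TrajectoryStore:
    """Keeps successful trajectories per (site, task) pair."""

    def __init__(self):
        self._store: Dict[Tuple[str, str], List[Trajectory]] = defaultdict(list)

    def record(self, site: str, task: str, trajectory: Trajectory) -> None:
        self._store[(site, task)].append(list(trajectory))

    def best(self, site: str, task: str) -> Optional[Trajectory]:
        candidates = self._store.get((site, task))
        return min(candidates, key=len) if candidates else None


def explore(site: str, task: str,
            attempt: Callable[[int], Tuple[bool, Trajectory]],
            store: TrajectoryStore, runs: int = 20) -> float:
    """Run the task `runs` times, store the successes, return the success rate."""
    successes = 0
    for i in range(runs):
        ok, trajectory = attempt(i)
        if ok:
            successes += 1
            store.record(site, task, trajectory)
    return successes / runs


def fake_attempt(i: int) -> Tuple[bool, Trajectory]:
    # Hypothetical agent run: succeeds on even attempts, with varying detours.
    path = ["home", "search"] + ["scroll"] * (i % 3) + ["add_to_cart"]
    return i % 2 == 0, path
```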
Perhaps the most important lesson, emphasized by Kiara, was approaching this as a software engineering problem rather than a data science one. This meant:
A concrete example: the team initially built an agent to query user databases for dietary restrictions. They realized this was unnecessary—the data could simply be pulled and added to context deterministically. This revelation simplified the system and increased reliability.
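In code, the deterministic alternative is simply a lookup plus string formatting. The `USER_DB` dictionary and field names below are hypothetical placeholders for whatever user store is actually in use.

```python
# Hypothetical user store; a real system would query a database here.
USER_DB = {
    "u1": {"dietary_restrictions": ["vegetarian", "no peanuts"]},
}


def build_context(user_id: str, request: str, db: dict = USER_DB) -> str:
    """Fetch known user facts directly and place them in the prompt context."""
    profile = db.get(user_id, {})
    restrictions = profile.get("dietary_restrictions", [])
    restriction_text = ", ".join(restrictions) if restrictions else "none on record"
    return "\n".join([
        f"User request: {request}",
        f"Dietary restrictions: {restriction_text}",
    ])
```

No model call is involved, so this step cannot loop, hallucinate, or pick the wrong tool.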
The execution agent needed extremely detailed, broken-down instructions from the planner. The more thinking delegated to the planning phase, the better the executor performed. Instructions should specify each step clearly: “open the OLX page, then search in the bar for X, then…”
Rather than giving the agent access to all possible tools at all times, the team dynamically limited available tools based on the current page and task context. If you only need certain interactions on a particular page, only those tools should be available.
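A minimal way to scope tools by page is a lookup table, as sketched here. The page types and tool names are invented for the example and are not taken from the source.

```python
# Hypothetical page types and tool names, for illustration only.
TOOLS_BY_PAGE = {
    "home": {"search", "click"},
    "menu": {"click", "scroll", "add_to_cart"},
    "checkout": {"fill_address", "confirm_order"},
}

DEFAULT_TOOLS = {"click", "scroll"}  # navigation-only fallback for unknown pages


def tools_for(page_type: str) -> set:
    """Return the only tools the agent may see on this kind of page."""
    return TOOLS_BY_PAGE.get(page_type, DEFAULT_TOOLS)


def is_allowed(page_type: str, tool: str) -> bool:
    return tool in tools_for(page_type)
```

Defaulting unknown pages to a small navigation-only set, rather than to every tool, keeps the decision space narrow even when page classification fails.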
Instructions were loaded dynamically based on which platform the agent was interacting with. Different food delivery apps have different flows, and the agent’s prompts adapted accordingly.
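Platform-specific prompt loading can be as simple as the following sketch. The platform keys and instruction strings are made up for the example; the real flows for each food delivery app are not described in the source.

```python
BASE_PROMPT = "You operate a browser. Follow the steps exactly."

# Hypothetical platform keys and instructions, for illustration only.
PLATFORM_INSTRUCTIONS = {
    "platform_a": "Pick the delivery address before opening a restaurant page.",
    "platform_b": "Dismiss the promotions popup, then use the search bar.",
}


def build_system_prompt(platform: str) -> str:
    """Append platform-specific guidance to the shared base prompt, if any exists."""
    extra = PLATFORM_INSTRUCTIONS.get(platform)
    return BASE_PROMPT if extra is None else f"{BASE_PROMPT}\n{extra}"
```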
The team established a set of representative e-commerce interaction tasks as their evaluation suite, then added variations to stress-test the system. Their target was an 80% success rate, though they acknowledged that the achievable rate depends heavily on task complexity and website design.
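Measuring against such a target might look like the sketch below. The task names, the number of runs per task, and the `fake_run` stand-in are assumptions; only the 80% target comes from the text.

```python
from typing import Callable, Dict, List

TARGET = 0.80  # target success rate mentioned in the text


def success_rates(tasks: List[str],
                  run: Callable[[str, int], bool],
                  runs_per_task: int = 10) -> Dict[str, float]:
    """Run every task several times, since a single run is not representative."""
    return {
        task: sum(run(task, i) for i in range(runs_per_task)) / runs_per_task
        for task in tasks
    }


def failing_tasks(rates: Dict[str, float], target: float = TARGET) -> List[str]:
    return sorted(task for task, rate in rates.items() if rate < target)


def fake_run(task: str, attempt: int) -> bool:
    # Hypothetical agent: "checkout" fails on every even-numbered attempt.
    return not (task == "checkout" and attempt % 2 == 0)
```

Repeating each task matters here because, as noted earlier, the same task could succeed on one run and fail on the next.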
They found that user behavior significantly impacted success—agents couldn’t order food without an address, so the planning layer needed to ensure all required information was collected before attempting execution. Deterministic validation of prerequisites before agent execution proved essential.
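A deterministic prerequisite check can run before any agent step, as in this sketch. The field names in `REQUIRED_FIELDS` are hypothetical; the source only names the address as a required input.

```python
from typing import List

# Hypothetical required fields for a food order.
REQUIRED_FIELDS = ("delivery_address", "items")


def missing_prerequisites(order: dict) -> List[str]:
    """List required fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not order.get(field)]


def ready_to_execute(order: dict) -> bool:
    """Gate the web agent: only start browsing once nothing is missing."""
    return not missing_prerequisites(order)
```

When the check fails, the planning layer can ask the user for the missing fields instead of letting the executor fail partway through checkout.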
An interesting thread in the discussion touched on how websites might evolve to better serve agents. The team observed some websites beginning to add markdown descriptions of pages—making it much faster for agents to load and understand content, especially on e-commerce pages with many items. This represents a shift toward designing web experiences not just for humans but for the agents that will increasingly act on their behalf.
The team also noted that as platform owners (rather than pure outsiders), they have access to valuable data not displayed on websites—seller reputation, location data, supply/demand dynamics, reasonable price ranges. Integrating this rich marketplace context with web agent capabilities could create significantly more powerful AI-first commerce experiences, rather than just automating human-style browsing.
The case study provides valuable production insights, though some limitations are worth noting. A success rate of around 80% may still be insufficient for fully autonomous, high-stakes transactions. The solution required substantial custom engineering despite the proliferation of web agent frameworks, which suggests the technology remains immature for production deployment. The team’s access to platform-level data and their ability to customize agent behavior for specific websites they control are advantages not available to general-purpose web agent projects.
Nevertheless, their methodical approach—evaluating existing tools honestly, understanding their limitations, building modular systems with appropriate determinism, and treating this as engineering rather than pure ML—offers a realistic template for organizations attempting to deploy web agents in production environments.