Company
iFood
Title
Building ISO: A Hyperpersonalized AI Food Ordering Agent for Millions of Users
Industry
E-commerce
Year
2025
Summary (short)
iFood, Brazil's largest food delivery company, built ISO, an AI-powered food ordering agent, to address the decision paralysis users face when choosing what to eat from overwhelming options. The agent operates both within the iFood app and on WhatsApp, providing hyperpersonalized recommendations based on user behavior, handling complex intents beyond simple search, and autonomously taking actions such as applying coupons, managing carts, and facilitating payments. Through careful context management, latency optimization (reducing P95 latency from 30 to 10 seconds), and a layered evaluation framework, the team deployed ISO to millions of users in Brazil, improving the user experience through proactive engagement and intelligent personalization.

## Case Study Overview

iFood, Brazil's dominant food delivery platform processing 160 million monthly orders with 55 million active users across 1,500 cities, developed ISO, an AI-powered food ordering agent designed to solve a common e-commerce problem: decision paralysis when faced with too many choices. The company, already operating 150 proprietary AI models (many generative AI) making 14 billion real-time predictions monthly, built ISO as a true agentic system rather than a simple chatbot. The agent is deployed both within the native iFood app and on WhatsApp, which is critical in Brazil where over 150 million users actively use WhatsApp for e-commerce transactions, not just social messaging.

The fundamental problem ISO addresses is the anxiety users feel when deciding what to eat. Users arrive in different emotional states, from feeling they deserve something special after a hard week to wanting fast delivery without a clear idea of what they want, and each user has unique dietary patterns. The team highlighted three distinct user profiles: one enjoying traditional Brazilian dishes, another focused on high-protein health foods, and a third preferring sophisticated cuisine. This heterogeneity makes hyperpersonalization essential rather than optional.

## Agent Architecture and System Design

ISO implements what the team describes as a "single agent with a twist"—essentially a multi-agent setup implemented within a single agent architecture. The system uses state-dependent system prompts, meaning the agent's behavior changes based on which flow or tool is being used, creating something akin to multiple specialized agents within one framework. When a user message arrives, the LLM receives a system prompt that varies depending on the current state, along with contextual information about the user that's been pre-loaded.
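As a rough illustration of this state-dependent prompting, here is a minimal Python sketch. The state names, prompt text, and `build_system_prompt` helper are all hypothetical, not iFood's actual implementation; the point is only that the system prompt is composed per state, with pre-loaded user context appended.

```python
# Hypothetical sketch of state-dependent system prompts: the agent swaps its
# instructions based on the current conversation state, approximating several
# specialized agents inside one loop. All names here are illustrative.

BASE_PROMPT = "You are a food ordering assistant for iFood."

STATE_PROMPTS = {
    "browsing": "Help the user discover dishes matching their preferences.",
    "cart": "Help the user review quantities, apply coupons, and adjust items.",
    "checkout": "Confirm the order details and guide the user to payment.",
}

def build_system_prompt(state: str, user_context: str) -> str:
    """Compose the system prompt for the current state plus pre-loaded user context."""
    state_prompt = STATE_PROMPTS.get(state, STATE_PROMPTS["browsing"])
    return f"{BASE_PROMPT}\n{state_prompt}\nUser context: {user_context}"
```

On each turn, the agent would call `build_system_prompt(current_state, user_context)` before invoking the LLM, so the same loop behaves like a cart specialist, a discovery specialist, or a checkout specialist depending on state.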

The agent has access to a collection of domain-specific tools related to food ordering: searching for food, managing the shopping cart, applying coupons, and processing orders. Notably, some of these tools are themselves AI workflows that can operate independently with their own intelligence based on the main agent's task. This hierarchical approach allows the system to handle complex operations while maintaining clear separation of concerns. These tools are also tightly integrated with UI elements, generating carousels, buttons, and other interface components to minimize user typing—a critical design decision based on the learning that users in this context don't want to type extensively.
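The coupling between tools and UI elements can be sketched as a tool that returns two payloads, one for the model and one for the client. This is a minimal assumption-level illustration; the function name, field names, and stubbed retrieval are not from the source.

```python
# Illustrative sketch: a search tool whose result carries both the data the
# LLM reasons over and a UI payload (carousel cards, quick-reply buttons) so
# users can tap instead of typing. Names and shapes are assumptions.

def search_food(query: str, user_context: dict) -> dict:
    # In production this would invoke a retrieval workflow; stubbed here.
    dishes = [{"name": "Margherita Pizza", "price": 39.9},
              {"name": "Pepperoni Pizza", "price": 44.9}]
    return {
        "results": dishes,  # consumed by the LLM
        "ui": {             # consumed by the client app
            "type": "carousel",
            "cards": [{"title": d["name"], "cta": "Add to cart"} for d in dishes],
            "quick_replies": ["Show more options", "Apply a coupon"],
        },
    }
```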

The system also implements follow-up suggestions, creating a conversational flow that guides users toward completing their orders. This architecture enables ISO to function as a true agent capable of autonomous action rather than merely a conversational interface.

## Personalization Strategy and Implementation

The personalization system represents one of the most sophisticated aspects of ISO's implementation. When a user searches for something seemingly simple like "pizza," the system performs complex operations behind the scenes. The team built offline processing pipelines that create representations of user preferences based on behavioral data from the app. These representations capture order history, preferred food categories, time-of-day patterns (breakfast vs. dinner), and other contextual signals.

When a user makes a request, the agent considers both the immediate conversation context (what preferences the user has indicated in the current session) and this rich historical context. This information gets packaged and sent to the search tool, which has a self-contained workflow with all the information needed to retrieve optimal options. The process involves converting user input into multiple search queries—simple requests might trigger semantic and exact search, while complex requests like "I'm hungry" require expansion into multiple queries representing diverse user preferences.

After retrieving results, the system uses the user context for reranking, selecting the best options for that specific individual. The same query "pizza" produces dramatically different results for different users: meat-heavy options for carnivores, low-carb choices for health-conscious users, and sophisticated options for gourmets. The system shows multiple options, all carefully selected to match user interests. The team emphasized that context management is crucial here since the context can grow rapidly, requiring smart strategies to prevent bloat while maintaining personalization quality.
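The reranking step can be sketched as follows. Scoring by overlap with user preference tags is an assumption for illustration; the real system uses richer behavioral representations built offline.

```python
# Minimal sketch of context-aware reranking, assuming retrieval has already
# returned candidate dishes. Tag-overlap scoring is a simplification.

def rerank(candidates: list[dict], user_tags: set[str], k: int = 3) -> list[dict]:
    """Order retrieved dishes by overlap with the user's preference tags."""
    def score(dish: dict) -> int:
        return len(user_tags & set(dish["tags"]))
    return sorted(candidates, key=score, reverse=True)[:k]

candidates = [
    {"name": "Meat Lovers Pizza", "tags": {"meat", "hearty"}},
    {"name": "Cauliflower Crust Pizza", "tags": {"low-carb", "healthy"}},
    {"name": "Truffle Pizza", "tags": {"gourmet"}},
]
# The same "pizza" candidates rank differently per user profile.
health_user = rerank(candidates, {"healthy", "low-carb"}, k=1)
```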

## Latency Optimization: From 30 to 10 Seconds

Latency optimization emerged as a critical challenge since hungry users don't want to wait. The team's initial implementation had a P95 latency of around 30 seconds, which they successfully reduced to 10 seconds through systematic optimization across three main dimensions: flow simplification, context handling, and prompt compression.

For flow simplification, the team created shortcuts for simple requests that didn't require the full complexity of the agentic workflow. When users made straightforward queries like searching for a specific food item or requesting available promotions, the system could bypass complex reasoning steps. This avoided over-engineering simple interactions while preserving the capability to handle complex requests.
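A minimal sketch of such a fast-path router, assuming a keyword heuristic (a lightweight classifier would work equally well); the patterns and route names are illustrative, not iFood's.

```python
# Hedged sketch of flow simplification: route obviously simple requests to a
# direct handler and reserve the full agent loop for complex intents.

SIMPLE_PATTERNS = ("promotion", "coupon", "order status")

def route(message: str) -> str:
    text = message.lower()
    if any(p in text for p in SIMPLE_PATTERNS):
        return "fast_path"   # direct tool call, no multi-step reasoning
    if len(text.split()) <= 2:
        return "fast_path"   # short queries like "pizza" go straight to search
    return "agent_loop"      # full agentic workflow for complex intents
```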

Context handling proved particularly important. The team analyzed all context processed during user requests and identified opportunities to move processing to asynchronous workflows. For example, compacting context from previous messages and selecting the best information from user behavior could happen asynchronously rather than in the critical path. This approach didn't necessarily reduce total token count but significantly reduced tokens processed in the slowest, synchronous flow, directly improving perceived latency.
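The pattern of moving compaction off the critical path can be sketched with a background task that updates a snapshot the synchronous reply reads from. This is a toy, assumption-level illustration; the summarization call is stubbed and the function names are invented.

```python
# Sketch: compaction of prior turns runs as a background task, while the
# user-facing reply uses the last compacted snapshot rather than waiting.

import asyncio

compacted_context = {"summary": ""}  # last snapshot, read by the sync flow

async def compact_history(history: list[str]) -> None:
    # Stand-in for an LLM summarization call.
    await asyncio.sleep(0)
    compacted_context["summary"] = f"{len(history)} prior turns summarized"

async def handle_message(message: str, history: list[str]) -> str:
    # Fire-and-forget: compaction does not block the user-facing reply.
    asyncio.create_task(compact_history(history + [message]))
    return f"reply using context: {compacted_context['summary'] or 'none yet'}"
```

The total token count is unchanged, but the tokens processed inside `handle_message` shrink, which is what users perceive as latency.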

The team also documented what they call the "language tax": non-English languages often require significantly more tokens to express the same information. Research by team member Paul Vanderborg showed that prompts in languages other than English can require 50% or more additional tokens than their English equivalents, leading to both higher latency and faster context window exhaustion. The one exception they observed was Chinese models, which tokenize Chinese text more efficiently. Based on this research, the team standardized all prompts to English, achieving measurable token savings even though the product serves Portuguese-speaking Brazilian users.
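A simple harness for measuring this effect on your own prompts might look like the following. The whitespace tokenizer is a stand-in so the sketch stays self-contained; for real numbers you would swap in the model's tokenizer (e.g. tiktoken for OpenAI models), and the ratio below is purely illustrative.

```python
# Illustrative harness for measuring the "language tax": count tokens for the
# same prompt in two languages and report the inflation ratio.

def count_tokens(text: str) -> int:
    # Stand-in tokenizer; replace with the actual model tokenizer in practice.
    return len(text.split())

def language_tax(english: str, translated: str) -> float:
    """Return token inflation of the translated prompt relative to English."""
    return count_tokens(translated) / count_tokens(english)

ratio = language_tax(
    "Recommend a dish based on the user's order history.",
    "Recomende um prato com base no histórico de pedidos do usuário.",
)
```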

## Prompt Engineering and Combating Prompt Bloat

The team confronted a common challenge in agentic systems: prompt bloat. As with many production AI systems, it's tempting to add rules to the system prompt whenever bugs appear in production, user complaints arise, or errors surface in evaluations. This leads to bloated prompts full of edge cases—what the team characterized as a "code smell" indicating deeper problems.

Their approach to deflating bloated prompts involved first creating comprehensive evaluations for every edge case that had motivated a prompt addition. If a production error led to adding a rule to the prompt, they created an evaluation case for that scenario. This ensured that prompt simplification wouldn't regress on known issues.

The most impactful change came from improving tool names and variable names. The team used a simple heuristic: show someone unfamiliar with the agent the list of tools and their names, then ask if they could understand and use them with just those instructions. If not, the tool names needed improvement. The issue was that many tool names were specific to iFood's internal terminology but didn't make sense in the context of an agent without domain knowledge. This poor naming forced the system prompt to explicitly mention edge cases to clarify tool usage. By improving tool naming to be self-explanatory, they eliminated many edge case rules from the prompt while maintaining (and often improving) performance on evaluation scenarios. This dramatically reduced token count and improved latency.
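A hypothetical before/after makes the heuristic concrete; these tool names are invented for illustration and are not iFood's actual internal names.

```python
# Hypothetical before/after for the tool-naming heuristic: internal jargon
# forces the system prompt to explain each tool, while self-explanatory names
# let the model infer usage without extra edge-case rules.

INTERNAL_NAMES = {          # opaque without iFood domain knowledge
    "run_merchant_catalog_job": "search for dishes",
    "apply_benefit": "apply a coupon",
    "basket_sync": "update the shopping cart",
}

SELF_EXPLANATORY_NAMES = {  # understandable from the name alone
    "search_dishes": "search for dishes",
    "apply_coupon": "apply a coupon",
    "update_cart": "update the shopping cart",
}
```

Someone seeing only the left column of the first table needs documentation; someone seeing the second does not, and neither does the model.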

## Evaluation Framework and Testing Strategy

The team implemented multiple layers of evaluation, including standard production trace analysis and regular tests. However, they also developed an innovative approach: scenario-based evaluation defined in natural language. These scenarios include instructions describing what should happen, setup steps to establish the scenario, and expected agent behavior.

The motivation for natural language scenarios was twofold. First, it's sometimes difficult to specify correct agent behavior in a single LLM judge call, but it's relatively easy to pinpoint what's wrong and what should happen when reviewing specific failures. Second, natural language descriptions are maintainable by non-developers, democratizing the evaluation process.

The implementation uses an agent that acts as a simulated user, running through defined scenarios by pinging the ISO endpoint and evaluating both responses and UI elements. This approach enables testing diverse scenarios including guardrail evasion across multiple conversation turns. The team emphasized that this scenario-based testing complements rather than replaces production evaluations, creating a comprehensive testing strategy that catches different types of issues.
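The scenario structure and runner can be sketched as below. The `call_agent` and `judge` callables stand in for the simulated-user agent hitting the ISO endpoint and an LLM judge; their signatures and the `Scenario` fields are assumptions, not the team's actual schema.

```python
# Sketch of natural-language scenario evaluation: replay the setup turns
# against the agent, then ask a judge whether the expectation holds.

from dataclasses import dataclass

@dataclass
class Scenario:
    instructions: str   # what should happen, in natural language
    setup: list[str]    # user turns that establish the scenario
    expected: str       # expected agent behavior, in natural language

def run_scenario(scenario: Scenario, call_agent, judge) -> bool:
    """Replay the setup turns, then judge the transcript against the expectation."""
    transcript = []
    for turn in scenario.setup:
        reply = call_agent(turn)
        transcript.append((turn, reply))
    return judge(transcript, scenario.expected)
```

Because scenarios are plain data plus natural language, non-developers can add or edit them without touching the runner.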

## Multi-Channel Deployment: App vs. WhatsApp

Deploying ISO across both the native app and WhatsApp revealed interesting behavioral differences that influenced design decisions. Users on WhatsApp are significantly more open to longer conversations, which makes sense given WhatsApp's primary identity as a messaging platform. In contrast, app users prefer short interactions and expect ISO to quickly surface relevant dishes without extended dialogue.

WhatsApp deployment in Brazil is particularly important given the platform's dominance—it's described as "a way of life" where people commonly send voice notes to restaurants to place orders. This cultural context makes WhatsApp a natural fit for food ordering, despite presenting unique challenges. The team noted that WhatsApp has UI limitations and potential security concerns, particularly relevant since another company in the same corporate group (OLX, a secondhand marketplace) deals with significant scam activity on the platform.

To address security concerns, iFood implemented a hybrid authentication approach. When new users or unrecognized phone numbers interact with ISO on WhatsApp, they go through an authentication flow requiring validation in the native iFood app. This creates continuous movement between WhatsApp and the app, adding security while the team experiments with making WhatsApp more standalone.

## Agentic Capabilities and Autonomous Actions

ISO demonstrates true agentic behavior by taking real-world actions on behalf of users rather than merely providing information. The agent can apply coupons automatically, determine eligibility for loyalty program discounts, add items to the cart, and even process payments autonomously (in experimental features). This autonomous action capability distinguishes ISO from conversational search interfaces.

The agent is also contextually aware, understanding factors like weather and location. When users travel to new cities, ISO proactively notes the change and recommends restaurants in the new area that match their usual preferences. The system maintains memory across sessions, enabling it to reference previous orders—for example, noting that a user ordered a particular dish twice this week and offering recommendations from similar restaurants.

Proactivity represents another key agentic characteristic. The team recognized that many chatbots and agents fail because initial user excitement fades as people become unsure what to ask or how to continue conversations. To combat this, ISO doesn't wait for user initiation; it reaches out at appropriate moments. The agent listens to app events and can proactively engage after periods of inactivity, offering items based on user preferences. This proactive stance helps maintain engagement and prevents the agent from becoming dormant.

The team noted an interesting behavioral pattern: many users simply reorder the same food repeatedly, which might partially reflect difficulty expressing their current needs rather than true preference stagnation. An agent capable of understanding vague or complex intents ("I'm hungry," "surprise me," "I'm with two friends, what should we order?") potentially opens users to trying new items they wouldn't discover through traditional browsing.

## Production Scale and Operational Learnings

ISO operates at massive scale, serving millions of users across Brazil on both platforms. This production deployment revealed several important operational insights. The team emphasized that larger context windows (like Claude's 200K or Gemini's 1M tokens) don't eliminate the need for careful context management. Larger windows can actually worsen performance through a "needle in a haystack" problem: the model has more irrelevant information to sift through, degrading output quality even when everything fits within limits.

This problem is exacerbated by emerging standards like Model Context Protocol (MCP). While the team appreciates MCP's potential, they noted that connecting to GitHub MCP, for example, instantly adds 93 tools, completely bloating the context. This makes tool selection and organization critical rather than simply connecting all available tools.

The team identified several primary causes of context explosion: tool outputs (especially when tools return large amounts of data and the LLM gets to choose what to include), long conversations with multiple tool iterations, and excessive numbers of tools requiring extensive descriptions. Mitigation strategies include careful summarization, selecting only relevant context, combining tools that are always called sequentially (either making one tool available only after another is called, or merging them into a single workflow), and organizing tools strategically in multi-agent architectures.
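One of those mitigations, merging tools that are always called sequentially, can be sketched as follows. The function names and stubbed data are assumptions; the point is that the model sees one call and one trimmed output instead of two full payloads.

```python
# Hedged sketch: when two tools are always called in sequence, merge them
# into a single workflow and return only a truncated, relevant slice of the
# combined output to cap context growth.

def find_restaurants(query: str) -> list[dict]:
    # Stub for a restaurant search backend.
    return [{"id": 1, "name": "Pizza Place", "menu_size": 120}]

def fetch_menu(restaurant_id: int) -> list[dict]:
    # Stub for a menu lookup that would normally return a large payload.
    return [{"dish": "Margherita", "price": 39.9}] * 3

def search_and_fetch_menu(query: str, max_items: int = 2) -> dict:
    """Merged tool: run both steps, return only a trimmed, relevant slice."""
    restaurants = find_restaurants(query)
    top = restaurants[0]
    menu = fetch_menu(top["id"])[:max_items]  # truncate before it hits the context
    return {"restaurant": top["name"], "menu": menu}
```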

## Balanced Assessment

This case study demonstrates sophisticated LLMOps practices addressing real production challenges at scale. The team's focus on latency optimization, context management, and evaluation reflects mature operational thinking rather than just getting a prototype working. The language tax research provides genuinely valuable insights for international deployments, and the prompt deflation strategy of improving tool naming rather than adding edge cases represents best-practice prompt engineering.

However, some claims warrant careful consideration. The reduction from 30 to 10 seconds P95 latency is impressive, but 10 seconds still represents a significant wait time that may frustrate hungry users—the team's ongoing optimization suggests they recognize this. The hyperpersonalization claims are supported by the described architecture, but the presentation doesn't quantify improvement in user satisfaction, conversion rates, or other business metrics that would validate the approach's effectiveness.

The WhatsApp integration reflects good cultural understanding of the Brazilian market, but the security model of bouncing between WhatsApp and the app for authentication may create friction that undermines the convenience benefit of WhatsApp deployment. The proactive engagement strategy is innovative but could potentially become intrusive if not carefully calibrated—the presentation doesn't discuss how they manage the boundary between helpful proactivity and annoying interruption.

The multi-agent-in-single-agent architecture is pragmatic but adds complexity that may make debugging and maintenance challenging. The natural language scenario evaluation approach is creative and accessibility-focused, but it's unclear how they ensure consistency and comprehensiveness compared to more structured testing approaches.

Overall, this case study represents a strong example of production LLMOps at scale, with particular strengths in systematic optimization, thoughtful evaluation strategies, and cultural adaptation for the Brazilian market. The technical choices reflect real-world constraints and tradeoffs rather than academic ideals, making this valuable learning material for practitioners building similar systems.
