
Building a Data-Centric Multi-Agent Platform for Enterprise AI

Alibaba 2025

Alibaba shares their approach to building and deploying AI agents in production, centered on a data-centric intelligent platform that combines LLMs with enterprise data. Their solution uses the Spring-AI-Alibaba framework alongside Higress (API gateway), OpenTelemetry (observability), Nacos (prompt management), and RocketMQ (data synchronization) to handle customer queries and anomalies, achieving a reported resolution rate of over 95% for consulting issues and over 85% for anomalies.

Overview

This case study from Alibaba's Cloud Native community provides insight into their approach to building and operating AI agents in production environments. The article, authored by Yanlin and published in March 2025, discusses both the conceptual framework and the practical implementation of multi-agent systems at Alibaba. While the content is somewhat promotional (featuring Alibaba's open-source tools), it offers valuable perspectives on operationalizing LLMs through agent architectures in enterprise settings.

The fundamental premise is that LLMs alone are insufficient for solving real-world problems—they need to be embedded within agent systems that can perceive environments, make decisions, access tools, and execute actions. Alibaba’s approach centers on building a “data-centric intelligent agent platform” that can continuously improve through a feedback loop they call the “data flywheel.”
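The perceive-decide-act loop the authors describe can be sketched in miniature. Everything here is illustrative scaffolding, not the Spring-AI-Alibaba API: a stub `decide` function stands in for the LLM's decision step, and a tool registry stands in for the agent's toolset.

```python
# Minimal perceive-decide-act agent loop (illustrative only).
# A stub policy replaces the LLM; a dict replaces the tool system.

def lookup_order(order_id: str) -> str:
    """Toy tool: pretend to query an order system."""
    return f"order {order_id}: shipped"

TOOLS = {"lookup_order": lookup_order}

def decide(observation: str) -> tuple[str, str]:
    """Stand-in for the LLM: map an observation to (tool, argument)."""
    if observation.startswith("where is order"):
        return "lookup_order", observation.split()[-1]
    return "respond", "Sorry, I can't help with that."

def run_agent(query: str) -> str:
    tool_name, arg = decide(query)       # decision step
    if tool_name in TOOLS:
        return TOOLS[tool_name](arg)     # tool-execution step
    return arg                           # direct response
```

The point of the sketch is the shape, not the logic: the model alone never acts; it is embedded in a loop that observes, chooses a tool, and executes it.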

The Multi-Agent Architecture Philosophy

Alibaba’s perspective on AI Agent evolution is notable: they observe a progression from single-task, fixed agents toward multi-agent collaboration systems. The text acknowledges that while the ultimate goal might be a “super agent” capable of general problem-solving (AGI), the practical near-term path is building specialized agents that collaborate effectively.

Their framework for competitive AI products rests on three pillars: models, data, and scenarios. Regarding models, they note that public domain data has been extensively mined, and the focus is now shifting to cost and performance optimization (with DeepSeek accelerating this trend). Private domain data is positioned as the core competitive barrier for enterprises, with emphasis on mining, solidifying, and continuously optimizing proprietary data. Scenario selection focuses on high-frequency, structured, and risk-controllable use cases that can progressively extend specialization.

The Data Flywheel Concept

A central concept in Alibaba’s LLMOps approach is what they term the “data flywheel” for continuous improvement. The process involves several stages: first, applications collect and consolidate personalized data from customers; second, domain-specific data and SOPs (Standard Operating Procedures) are combined with customer data; third, evaluation datasets are prepared to meet customer SLA requirements before deployment; finally, post-deployment customer feedback is collected and used to analyze and optimize industry data, toolsets, and scenarios.

This continuous loop between evaluation systems and private data optimization aims to achieve alignment between customer demands and data quality. While the concept is sound, it’s worth noting that the article doesn’t provide detailed metrics on how this flywheel actually performs in practice, making it difficult to assess the real-world effectiveness of this approach.
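The flywheel's pre-deployment gate can be expressed as a simple loop: consolidate data, score a candidate against an evaluation set, and deploy only when the score clears the SLA. The function names, case format, and 0.95 threshold below are hypothetical, since the article gives no concrete API or metrics.

```python
# Sketch of the "data flywheel" SLA gate (hypothetical scaffolding).

def evaluate(eval_set: list[dict], answer_fn) -> float:
    """Fraction of evaluation cases the candidate answers correctly."""
    hits = sum(1 for case in eval_set if answer_fn(case["q"]) == case["a"])
    return hits / len(eval_set)

def flywheel_gate(eval_set: list[dict], answer_fn, sla: float = 0.95) -> dict:
    """Stage 3 of the loop: block deployment until the eval set meets SLA.
    Post-deployment feedback (stage 4) would feed back into eval_set."""
    score = evaluate(eval_set, answer_fn)
    return {"deploy": score >= sla, "score": score}

cases = [{"q": "ping", "a": "pong"}, {"q": "hi", "a": "hello"}]
perfect = flywheel_gate(cases, lambda q: "pong" if q == "ping" else "hello")
```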

Technical Architecture Components

Alibaba’s technical implementation relies on several key open-source and proprietary components that form their multi-agent architecture:

Spring-AI-Alibaba Framework: This is Alibaba’s primary framework for constructing intelligent agents, introduced at their Yunqi Conference. It provides the orchestration layer for building agent systems, with support for both “chat modes” for consultation-style interactions and “composer modes” for addressing customer anomalies through more complex workflows.

Higress (AI-Native API Gateway): Positioned as an open-source AI-native API gateway, Higress serves as the integration layer for multiple data sources and models. Its capabilities include one-click integration with multiple models under unified protocols, permissions, and disaster recovery; access to domain data through search tools and to customer data through MCP Server; standardized data-format conversion; short- and long-term memory built on caching and vector retrieval, which reduces LLM calls and costs while improving performance; and integrated observability for data compliance and quality assessment.

The gateway also supports end-to-end TLS encryption for model access chains, content safety measures for data compliance, centralized API key management to prevent leakage risks, and traffic and quota control based on internal API keys to prevent costly token overconsumption due to code bugs.
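Two of these gateway responsibilities, response caching to avoid repeat LLM calls and per-key token quotas to stop runaway spend, can be sketched together. This is not Higress's actual implementation or API; the class, the word-count "token" estimate, and the cache key are all invented for illustration.

```python
# Hypothetical gateway stub: a response cache plus per-key token quotas.

class GatewayStub:
    def __init__(self, quotas: dict[str, int]):
        self.quotas = dict(quotas)        # api_key -> remaining token budget
        self.cache: dict[str, str] = {}   # normalized prompt -> cached answer

    def call_model(self, api_key: str, prompt: str) -> str:
        key = prompt.strip().lower()
        if key in self.cache:             # cache hit: no tokens spent
            return self.cache[key]
        cost = len(prompt.split())        # crude token estimate
        if self.quotas.get(api_key, 0) < cost:
            raise RuntimeError("quota exceeded for " + api_key)
        self.quotas[api_key] -= cost      # charge the internal key's budget
        answer = f"echo: {prompt}"        # stand-in for the upstream model
        self.cache[key] = answer
        return answer
```

Because both checks sit in the gateway, a buggy agent that loops on the same prompt is served from cache, and one that generates novel prompts exhausts its budget and fails fast instead of accumulating cost.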

OpenTelemetry (OTel) Observability System: This provides full-chain data-quality tracking, enabling automatic analysis of the effectiveness of reasoning processes and recall results. When performance issues arise, end-to-end tracing lets operators follow a customer's search and reasoning path to determine whether problems originate in the knowledge base, the RAG pipeline, or the toolset. This observability layer is crucial for optimization efficiency in production environments.
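The attribution idea is simple to demonstrate with stdlib timing: wrap each pipeline stage in a named span so a slow answer points to the stage to investigate. A real deployment would emit OpenTelemetry spans to a collector rather than append to a list; the stage names and stub pipeline here are invented.

```python
# Stdlib sketch of per-stage tracing for latency attribution.
import time
from contextlib import contextmanager

SPANS: list[tuple[str, float]] = []   # (stage name, duration in seconds)

@contextmanager
def span(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        SPANS.append((name, time.perf_counter() - start))

def answer(query: str) -> str:
    with span("knowledge_base"):
        docs = ["doc-1"]                     # pretend KB lookup
    with span("rag_reasoning"):
        draft = f"answer using {docs[0]}"    # pretend LLM reasoning
    with span("toolset"):
        return draft                         # pretend tool execution

answer("why did my job fail?")
slowest = max(SPANS, key=lambda s: s[1])     # stage to investigate first
```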

Nacos: Used for dynamic prompt data updates, Nacos enables real-time pushing of prompt word changes without redeployment. It supports gray (canary) configuration for gradually monitoring prompt optimization effects, which is particularly valuable when concerns arise about prompt changes impacting production systems.
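The gray-release pattern described here, pushing a new prompt to only a slice of traffic without redeploying, can be sketched with deterministic user bucketing. This is not the Nacos API; the class, the push/lookup methods, and the MD5 bucketing scheme are illustrative assumptions.

```python
# Illustrative hot-reloaded prompt config with a gray/canary split.
import hashlib

class PromptConfig:
    def __init__(self, stable: str):
        self.stable = stable
        self.canary: str | None = None
        self.canary_percent = 0

    def push(self, new_prompt: str, percent: int) -> None:
        """Hot-update: route `percent`% of users to the new prompt."""
        self.canary, self.canary_percent = new_prompt, percent

    def prompt_for(self, user_id: str) -> str:
        # Deterministic bucket so a given user always sees the same variant.
        bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
        if self.canary is not None and bucket < self.canary_percent:
            return self.canary
        return self.stable

cfg = PromptConfig("You are a support agent.")
cfg.push("You are a concise support agent.", percent=10)  # ~10% see the change
```

If the canary regresses quality metrics, `push` with the old prompt (or percent 0) rolls it back instantly, which is exactly the safety property that makes prompt changes palatable in production.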

Apache RocketMQ: Addresses the timeliness challenge in RAG systems by syncing change events and data in real-time. This ensures that the most current data and effects are available for each inference, addressing a common pain point in RAG deployments where stale data can degrade system performance.
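The freshness mechanism amounts to applying a stream of change events to the retrieval index, so the index always reflects the latest source data. The event shape and the dict-as-vector-store below are invented for illustration; a real pipeline would consume from RocketMQ and re-embed on upsert.

```python
# Sketch of change-event-driven index sync for RAG freshness.

index: dict[str, str] = {}   # doc_id -> text (stand-in for a vector DB)

def apply_event(event: dict) -> None:
    if event["op"] == "upsert":
        index[event["id"]] = event["text"]   # re-embedding would happen here
    elif event["op"] == "delete":
        index.pop(event["id"], None)

# Replay a stream of events; later events supersede earlier ones.
for ev in [{"op": "upsert", "id": "faq-1", "text": "old refund policy"},
           {"op": "upsert", "id": "faq-1", "text": "new refund policy"},
           {"op": "delete", "id": "faq-9"}]:
    apply_event(ev)
# index now holds only the latest version of faq-1
```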

Knowledge Base and Vector Database Integration

The platform architecture includes building a corporate knowledge base where data is transformed into Markdown format through platform tools, then pushed to a vector database to build domain data. Tools help agents access structured customer data, creating a comprehensive data layer for agent operations.

The construction of data assessment sets and automated intelligent data evaluation systems is emphasized as critical infrastructure, though the specific evaluation methodologies are not detailed in the text.
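A toy version of the knowledge-base pipeline described above, Markdown chunked, embedded, and queried by similarity, makes the data flow concrete. The bag-of-words "embedding" and overlap score stand in for a real embedding model and vector database; none of this reflects Alibaba's actual tooling.

```python
# Toy Markdown -> chunks -> "embeddings" -> retrieval pipeline.
from collections import Counter

def chunk_markdown(md: str) -> list[str]:
    """Split on blank lines, the crudest possible chunking strategy."""
    return [c.strip() for c in md.split("\n\n") if c.strip()]

def embed(text: str) -> Counter:
    """Bag-of-words stand-in for a real embedding model."""
    return Counter(text.lower().split())

def retrieve(query: str, chunks: list[str]) -> str:
    q = embed(query)
    # Score = overlap of word counts (proxy for vector similarity).
    return max(chunks, key=lambda c: sum((embed(c) & q).values()))

kb = "## Refunds\nRefunds take 5 days.\n\n## Shipping\nShipping is free."
chunks = chunk_markdown(kb)
best = retrieve("how long do refunds take", chunks)
```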

Practical Application: Intelligent Diagnostic System

Alibaba describes an implementation using their technical stack combined with Alibaba Cloud’s native API gateway and Microservices Engine (MSE) to create an intelligent diagnostic system. The claimed results are impressive: solving over 95% of consulting issues and over 85% of anomalies. However, it’s important to note that the methodology for measuring these percentages is not disclosed, nor is there information about the baseline comparison, the volume of issues handled, or how “solving” is defined and verified.

The system uses Higress to shield underlying models and tool systems while building secure data links and account security systems. Spring-AI-Alibaba handles agent construction and orchestration.

DeepSeek Integration and Security Considerations

The article highlights integration with DeepSeek, noting that the “connected version” (with web search capabilities) represents the full-strength variant of the model. Customers are reportedly using Higress for one-click integration with DeepSeek combined with Quark search data.

Security relies on the same gateway mechanisms described earlier: end-to-end TLS on model access chains, content safety measures for data compliance, centralized API key management in which agents receive only internal keys, and traffic and quota controls to prevent runaway costs from code bugs.
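The internal-key design can be sketched as a two-level mapping: agents hold only internal keys, and the gateway resolves them to provider credentials, so real keys never appear in agent code or logs. All names and the mapping shape below are illustrative assumptions.

```python
# Sketch of centralized key management via internal-key indirection.

PROVIDER_KEYS = {"model-prod": "sk-REAL-SECRET"}   # held only by the gateway
INTERNAL_KEYS = {"agent-billing": "model-prod"}    # internal key -> provider

def resolve(internal_key: str) -> str:
    """Gateway-side lookup; agents never see the provider credential."""
    provider = INTERNAL_KEYS.get(internal_key)
    if provider is None:
        raise PermissionError("unknown internal key")
    return PROVIDER_KEYS[provider]
```

Revoking or rotating a provider key then happens in one place, and per-internal-key quota accounting (as in the gateway description earlier) attaches naturally to the same lookup.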

Critical Assessment

While this case study provides a useful framework for thinking about multi-agent LLMOps, several caveats should be noted. The article is clearly promotional in nature, showcasing Alibaba’s open-source tools and cloud services. The claimed results (95% and 85% resolution rates) lack methodological transparency. Technical details on specific challenges encountered during implementation are sparse. The “data flywheel” concept, while conceptually appealing, is presented without concrete metrics on improvement cycles or data volumes.

Nevertheless, the architectural patterns described—API gateway for model abstraction and security, observability for debugging, dynamic configuration for prompt management, and message queues for RAG timeliness—represent sound engineering practices that would be applicable across different tech stacks and cloud providers.

Key LLMOps Themes

The case study touches on several important LLMOps themes that are broadly applicable: the importance of observability and tracing in LLM applications for debugging and optimization; dynamic configuration management for prompts and system parameters without redeployment; data pipeline management for keeping RAG systems current; security layers including encryption, content safety, and API key management; cost management through caching, quota controls, and performance optimization; and evaluation infrastructure for maintaining quality as systems evolve.

These themes reflect industry-wide challenges in moving LLM applications from prototype to production, making the architectural patterns described here potentially valuable even for teams not using Alibaba’s specific toolchain.
