Company
Feedzai
Title
AI-Powered Fraud Detection Using Mixture of Experts and Federated Learning
Industry
Finance
Year
2025
Summary (short)
Feedzai developed TrustScore, an AI-powered fraud detection system that addresses the limitations of traditional rule-based and custom AI models in financial crime detection. The solution leverages a Mixture of Experts (MoE) architecture combined with federated learning to aggregate fraud intelligence from across Feedzai's network of financial institutions processing $8.02T in yearly transactions. Unlike traditional systems that require months of historical data and constant manual updates, TrustScore provides a zero-day, ready-to-use solution that continuously adapts to emerging fraud patterns while maintaining strict data privacy. Real-world deployments have demonstrated significant improvements in fraud detection rates and reductions in false positives compared to traditional out-of-the-box rule systems.
## Overview and Business Context

Feedzai, a financial crime prevention technology company, has developed TrustScore as part of their Feedzai IQ™ solution to address the escalating challenges of fraud detection in modern financial systems. The case study presents a compelling production AI deployment that tackles one of the most difficult adversarial machine learning problems in the financial sector: detecting fraud that represents less than 0.1% of all transactions while fraudsters continuously adapt their tactics. The authors note that the rise of GenAI technologies has actually exacerbated this problem, enabling fraudsters to develop more sophisticated attacks at faster rates, while traditional detection systems remain too rigid or too narrow to keep pace.

The business problem is framed around the limitations of existing approaches used by financial institutions. Traditional rule-based systems rely on manually defined conditions describing known fraud patterns, while custom AI models are trained exclusively on historical data from individual institutions. Both approaches suffer from significant operational constraints: they require months to design, tune, and validate before deployment; they demand continuous manual updates and oversight; and most critically, they provide only a siloed, reactive view of fraud, since they learn only after fraud has occurred within their specific environment. Given that modern fraudsters operate collaboratively across borders and institutions, Feedzai argues that defenders need a similarly collaborative approach.

## Technical Architecture and Production Implementation

The core technical innovation of TrustScore centers on a Mixture of Experts (MoE) architecture implemented through a federated learning framework. This represents a sophisticated production AI system in which multiple expert models are trained on different fraud scenarios and geographic environments, then combined to produce final risk decisions. The analogy presented is that of assembling a panel of medical specialists rather than consulting a single doctor—each expert model becomes a "specialist" in specific geographies and use cases such as banking, payment processing, or anti-money laundering.

The production deployment model is particularly noteworthy from an LLMOps perspective. Rather than requiring each new client to collect months of historical data and train custom models from scratch, TrustScore provides pre-trained expert models that encode collective fraud knowledge from across Feedzai's network of partners processing $8.02 trillion in yearly transactions. When deploying the solution for a new client, a relevant mixture of these pre-trained experts is assembled and becomes immediately operational—this is what Feedzai terms a "zero-day solution." This deployment strategy fundamentally changes the time-to-value proposition compared to traditional approaches.

The federated learning implementation maintains strict data privacy guarantees by design. Throughout the entire model training and deployment process, raw transaction data never leaves individual clients' secure environments. Instead, only high-level encodings of fraud patterns, represented as trained machine learning model parameters, are shared across institutions. This architecture ensures regulatory compliance while enabling collaborative intelligence, addressing one of the most significant challenges in deploying AI systems across financial institutions with stringent data protection requirements.
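The case study does not disclose Feedzai's federated protocol. As a minimal sketch of the general pattern it describes, the code below runs a FedAvg-style round in which each institution trains locally and shares only model parameters; the function names and the logistic-regression stand-in for an expert are illustrative assumptions, not Feedzai's implementation.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """Train a logistic-regression 'expert' locally for a few epochs.
    Raw transaction data (X, y) never leaves this institution."""
    w = weights.copy()
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-X @ w))   # fraud probability estimates
        grad = X.T @ (preds - y) / len(y)      # gradient of the log-loss
        w -= lr * grad
    return w

def federated_round(global_weights, institutions):
    """One FedAvg-style round: each institution shares only its updated
    parameters, which are averaged in proportion to local data volume."""
    updates, sizes = [], []
    for X, y in institutions:
        updates.append(local_update(global_weights, X, y))
        sizes.append(len(y))
    sizes = np.asarray(sizes, dtype=float)
    return np.average(np.stack(updates), axis=0, weights=sizes / sizes.sum())

# Toy example: three "institutions", each holding private transaction features.
rng = np.random.default_rng(0)
institutions = [(rng.normal(size=(200, 4)), rng.integers(0, 2, size=200))
                for _ in range(3)]

weights = np.zeros(4)
for _ in range(10):
    weights = federated_round(weights, institutions)
print("aggregated fraud-model parameters:", weights)
```

The key property the sketch preserves is the one the case study emphasizes: the aggregation step sees only parameters, never the underlying transactions.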
## Data Engineering and Standardization Challenges

The case study provides valuable insights into the data engineering challenges inherent in building a unified AI solution applicable across multiple financial institutions. Data standardization emerges as a critical technical consideration for making the MoE architecture operationally viable. Feedzai implements several sophisticated procedures to ensure that fraud patterns learned by expert models can be seamlessly integrated into new financial institutions' environments.

Schema alignment represents the first layer of standardization. Financial transaction data typically arrives in tabular format, with events as rows and information dimensions as columns. Feedzai defines a standard schema to unify table structure across institutions, with the important design decision that the schema supports incomplete tables—not all columns need to be populated. This flexibility allows institutions with varying data collection practices to participate in the network while still benefiting from the collective intelligence.

Categorical consistency poses another significant challenge, as different institutions may use different encodings to represent identical information, such as merchant type classifications. The solution employs mapping and harmonization techniques to ensure categorical feature representations remain consistent across environments. Currency normalization adds yet another layer of complexity, particularly given that normal versus abnormal spending behavior is context-dependent. Feedzai addresses this through currency conversions, cost-of-living adjustments, and percentile-based feature engineering to standardize signals across different economic contexts.

Time zone handling illustrates the global-scale considerations in this production system. Transaction timing serves as a critical fraud signal, and with Feedzai's worldwide reach, the system must simultaneously combine insights from institutions operating across all time zones. The engineering team normalizes across time zones and creates derived features, such as daytime versus nighttime spending patterns, that remain meaningfully interpretable regardless of geographic location.
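To make two of these normalizations concrete, here is a minimal pandas sketch of percentile-based amount features and time-zone-aware daytime/nighttime flags. The column names, FX rates, and schema are illustrative assumptions, not Feedzai's actual standard schema.

```python
import pandas as pd

# Hypothetical raw feed from one institution; columns are illustrative.
txns = pd.DataFrame({
    "amount": [12.0, 250.0, 8.5, 1900.0],
    "currency": ["EUR", "EUR", "BRL", "BRL"],
    "timestamp_utc": pd.to_datetime(
        ["2025-01-03 02:15", "2025-01-03 14:40",
         "2025-01-03 23:05", "2025-01-04 11:30"], utc=True),
    "local_tz": ["Europe/Lisbon", "Europe/Lisbon",
                 "America/Sao_Paulo", "America/Sao_Paulo"],
})

# Currency normalization: convert to a reference currency, then express each
# amount as a percentile within its currency so "large" stays context-dependent.
fx_to_usd = {"EUR": 1.05, "BRL": 0.17}              # placeholder rates
txns["amount_usd"] = txns["amount"] * txns["currency"].map(fx_to_usd)
txns["amount_pctile"] = txns.groupby("currency")["amount"].rank(pct=True)

# Time-zone handling: derive local-time features so "nighttime spending"
# means the same thing regardless of where the institution operates.
local_hours = txns.apply(
    lambda r: r["timestamp_utc"].tz_convert(r["local_tz"]).hour, axis=1)
txns["is_nighttime"] = (local_hours < 6) | (local_hours >= 22)

print(txns[["amount_usd", "amount_pctile", "is_nighttime"]])
```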
## Model Calibration and Ensemble Management

The production deployment of a Mixture of Experts system presents unique calibration challenges that the case study addresses directly. Since each expert model is trained independently, without context from other experts and their specific operational environments, ensuring that all experts contribute effectively to the final risk score requires careful calibration. Without proper calibration, some experts might dominate the ensemble while others contribute negligibly, or score ranges might be inconsistent across experts, making aggregation unreliable.

Feedzai applies calibration techniques to ensure that experts yield homogeneous fraud risk representations. While the specific calibration mechanisms remain proprietary (the authors note that the gating and aggregation mechanisms are confidential), the emphasis on this challenge highlights a critical consideration for production ensemble systems. The calibration must balance allowing each expert to contribute its specialized knowledge with ensuring score interpretability and consistency.

The requirement for a stable and interpretable risk score represents another production consideration often overlooked in academic ML discussions. As fraud patterns evolve and individual expert models are updated or replaced, the overall risk score distribution must remain consistent over time. Feedzai applies techniques to maintain score stability even as the underlying expert models change. This stability is crucial for integration with downstream systems and for maintaining consistent interpretation—a score of 0.8 today should represent the same fraud risk level as a score of 0.8 next month, even if the models producing those scores have been updated. This interpretability consideration also facilitates regulatory compliance and auditability in the heavily regulated financial services sector.
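Because the actual gating and aggregation mechanisms are confidential, the following is only a generic sketch of the underlying idea: calibrate each expert's raw scores onto a shared probability scale (here with scikit-learn's isotonic regression) before combining them, so that no expert dominates simply because of its score range. The two synthetic experts and the fixed mixing weights are illustrative assumptions.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)

# Two hypothetical experts whose raw scores live on very different scales:
# without calibration, expert_b would dominate any naive average.
labels   = rng.integers(0, 2, size=1000)
expert_a = np.clip(0.1 * labels + rng.normal(0.05, 0.03, 1000), 0, 1)
expert_b = np.clip(0.6 * labels + rng.normal(0.30, 0.10, 1000), 0, 1)

def calibrate(scores, labels):
    """Map raw expert scores to empirical fraud probabilities, so every
    expert speaks the same 'risk' language before aggregation."""
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(scores, labels)
    return iso

cal_a, cal_b = calibrate(expert_a, labels), calibrate(expert_b, labels)

def trust_score(raw_a, raw_b, w=(0.5, 0.5)):
    """Aggregate calibrated expert outputs. The real gating weights are
    confidential, so a fixed convex combination stands in here."""
    return w[0] * cal_a.predict([raw_a])[0] + w[1] * cal_b.predict([raw_b])[0]

print("risk for a suspicious event:", trust_score(0.14, 0.85))
```

A similar remapping applied after expert updates is one way to keep the output distribution stable over time, though the case study does not say which technique Feedzai uses.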
## Continuous Monitoring and Model Refreshing

The case study emphasizes continuous monitoring and refreshing as core operational capabilities distinguishing TrustScore from traditional static fraud detection systems. The production system implements monitoring at two distinct levels: individual expert model performance and the aggregated MoE system performance. This dual-level monitoring enables early detection of model drift, a critical concern in adversarial environments where fraudster tactics continuously evolve.

The modular architecture of the MoE system provides operational flexibility for managing the model lifecycle in production. Individual expert models can be added, removed, or updated independently whenever performance drops are detected or when integrating newly discovered fraud patterns. This modularity represents a significant operational advantage over monolithic models, where updates typically require retraining and redeploying entire systems. The ability to selectively update specific experts without disrupting the entire system reduces deployment risk and enables more agile responses to emerging fraud patterns.

The continuous refreshing process actively maintains and updates the expert models to integrate both newly identified fraud patterns and advances in machine learning techniques. While the case study doesn't specify the exact cadence or triggers for model updates, the emphasis on a "dynamic solution" that is "actively maintained" suggests a production system with established MLOps processes for model versioning, testing, validation, and deployment. This ongoing maintenance represents a significant operational commitment and differentiates the solution from traditional rule-based systems that require manual analyst intervention for updates.
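The case study does not name the drift metrics used. One common pattern for this kind of monitoring is the population stability index (PSI), sketched below for a single expert's score distribution; the thresholds and synthetic distributions are illustrative assumptions, not Feedzai's triggers.

```python
import numpy as np

def population_stability_index(expected, observed, bins=10):
    """PSI between a reference score distribution and a live window.
    Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 act."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    observed = np.clip(observed, edges[0], edges[-1])   # keep live scores in range
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    o_frac = np.histogram(observed, bins=edges)[0] / len(observed)
    e_frac = np.clip(e_frac, 1e-6, None)                # avoid log(0)
    o_frac = np.clip(o_frac, 1e-6, None)
    return float(np.sum((o_frac - e_frac) * np.log(o_frac / e_frac)))

rng = np.random.default_rng(2)
reference = rng.beta(2, 8, size=5000)   # expert's scores at deployment time
live      = rng.beta(2, 6, size=5000)   # this week's scores, slightly shifted

psi = population_stability_index(reference, live)
if psi > 0.25:
    print(f"PSI={psi:.3f}: flag this expert for retraining or replacement")
else:
    print(f"PSI={psi:.3f}: within tolerance")
```

Running the same check on the aggregated MoE score, alongside each expert, mirrors the dual-level monitoring the case study describes.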
## Personalization and Augmentation Capabilities

An interesting production deployment consideration highlighted in the case study is the system's capability for personalization after initial deployment. While TrustScore provides immediate value as a zero-day solution leveraging pre-trained experts, it can be fine-tuned and optimized for individual financial institutions once sufficient institution-specific data has been collected. This personalization capability addresses a common tension in production AI systems between generalization (working well across diverse contexts) and specialization (optimizing for specific operational environments).

The fine-tuning approach likely involves techniques such as transfer learning or domain adaptation, though the case study doesn't specify implementation details. This capability enables a graduated value proposition: immediate deployment with strong baseline performance from pre-trained experts, followed by continuous improvement as institution-specific patterns are learned. This addresses a critical business concern for financial institutions adopting new fraud detection systems—they need protection immediately while also wanting solutions that improve over time based on their unique customer populations and fraud exposure.

The augmentation capability represents another production deployment pattern worth noting. TrustScore can be combined with existing solutions such as legacy rule-based systems or custom AI models rather than requiring complete replacement. The risk score from TrustScore becomes an additional signal that can enhance existing fraud detection infrastructure. This integration flexibility significantly reduces deployment barriers and adoption risk, allowing financial institutions to incrementally introduce AI-powered detection without disrupting established operational workflows. The case study frames this as providing value "not only when activating a new customer, but in a long-term setting," suggesting that the system maintains relevance even as clients develop more sophisticated internal capabilities.
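As a minimal sketch of this augmentation pattern, the code below layers an external risk score on top of an unchanged legacy rule engine. The `Transaction` fields, rule logic, and score thresholds are all hypothetical; the case study only establishes that the score can serve as an additional signal alongside existing systems.

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    amount_usd: float
    is_new_device: bool
    trust_score: float   # external risk signal, e.g. from a TrustScore-like service

def legacy_rules(txn: Transaction) -> str:
    """Pre-existing rule engine, left unchanged: the new signal augments it."""
    if txn.amount_usd > 5000 and txn.is_new_device:
        return "block"
    return "approve"

def augmented_decision(txn: Transaction) -> str:
    """Layer the model score on top of the legacy rules instead of replacing
    them. The 0.9 / 0.6 thresholds are illustrative, not from the case study."""
    base = legacy_rules(txn)
    if base == "block":
        return base                   # rules keep their veto
    if txn.trust_score > 0.9:
        return "block"                # high model risk overrides approval
    if txn.trust_score > 0.6:
        return "review"               # route to the analyst queue
    return base

print(augmented_decision(Transaction(120.0, False, 0.93)))   # -> block
print(augmented_decision(Transaction(120.0, False, 0.70)))   # -> review
```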
## Responsible AI and Privacy Considerations

The case study emphasizes that TrustScore is "developed with responsibility at its core," guided by Feedzai's TRUST Framework. The strict privacy guarantees by design have already been mentioned in the context of federated learning, but the broader responsible AI considerations extend to transparency, fairness, and auditability. These considerations are particularly critical in financial services, where AI systems making risk decisions face regulatory scrutiny and potential legal liability.

The emphasis on auditability suggests that the production system maintains appropriate logging, versioning, and explainability capabilities to support regulatory compliance. Financial institutions typically need to justify fraud detection decisions, especially when those decisions result in blocking transactions or freezing accounts. The case study doesn't detail specific explainability mechanisms, but the focus on interpretable risk scores and stable scoring distributions suggests attention to making the system's outputs comprehensible to fraud analysts and regulators.

The fairness consideration is particularly important given that fraud detection systems can introduce discriminatory outcomes if not carefully designed and monitored. While the case study doesn't provide specifics on fairness testing or bias mitigation techniques, the explicit mention of fairness as a design principle suggests awareness of these risks. The global nature of the system, operating across diverse geographic and cultural contexts, likely necessitates sophisticated approaches to ensure that fraud detection patterns don't inadvertently discriminate against particular demographic groups or geographic regions.

## Performance Results and Business Impact

The case study reports that clients across geographies and use cases have experienced "significant performance improvement compared to traditional systems, as well as a reduction of time spent chasing false positives." While specific quantitative metrics are not provided (likely due to client confidentiality), the claimed improvements span both detection efficacy (catching more fraud) and operational efficiency (fewer false alerts requiring analyst investigation).

The reduction in false positives represents a critical business outcome often overlooked when focusing solely on detection rates. Fraud analyst time is expensive, and high false positive rates can overwhelm investigation teams while also degrading customer experience through unnecessary transaction blocks or account reviews. A system that simultaneously improves fraud detection while reducing false alerts delivers compound business value.

The claims about a "boost in fraud detection" and "faster time to market" position TrustScore as delivering immediate business impact. It is worth noting, however, that these performance claims come from Feedzai's own case study and lack independent validation. The comparison baseline of "traditional out-of-the-box rule systems" also matters—comparing against basic rule systems may be less impressive than comparing against well-tuned custom AI models, though the case study suggests advantages over both categories of traditional approaches.

## Limitations and Open Questions

While the case study provides substantial technical detail about the TrustScore architecture and deployment model, several aspects remain underspecified that would be valuable for fully assessing the production system. The specific machine learning algorithms underlying the expert models are not disclosed—we don't know whether these are gradient boosted trees, neural networks, or other model types. The gating and aggregation mechanisms for combining expert outputs are explicitly noted as confidential, making it difficult to assess the sophistication of the ensemble approach.

The case study also doesn't specify the operational cadence for model monitoring and updates. How frequently are expert models retrained? What performance thresholds trigger updates? How is the impact of updates validated before deployment? These MLOps process details would provide valuable insights into the operational maturity of the production system.

The personalization and fine-tuning capabilities are mentioned but not detailed. What specific techniques are used? How much institution-specific data is required before personalization becomes beneficial? How are personalized models integrated with the shared expert models? These questions remain unanswered but would be important for institutions considering deployment.

Finally, while the federated learning approach is described at a high level, the specific protocols for privacy-preserving model aggregation are not detailed. Are techniques like differential privacy, secure multi-party computation, or homomorphic encryption employed? Understanding the specific privacy guarantees would help assess the system's suitability for highly regulated environments.

## Connection to GenAI and LLMs

Interestingly, while the case study mentions GenAI in the context of fraudsters using these technologies to develop more sophisticated attacks, it doesn't explicitly describe TrustScore itself as leveraging large language models or generative AI technologies. The architecture described—Mixture of Experts with federated learning applied to tabular transaction data—represents a sophisticated production AI system, but not necessarily one that involves LLMs or generative models in the traditional sense. The mention of GenAI appears primarily as a business context factor (explaining why fraud detection is becoming more challenging) rather than as a technology incorporated into the solution itself.

This is worth noting because the case represents production AI and MLOps best practices that apply broadly across machine learning systems, not specifically LLM-based systems. The operational considerations around model deployment, monitoring, calibration, and continuous updating apply equally well to traditional discriminative models as they do to modern generative models.

That said, the Mixture of Experts architecture described does share conceptual similarities with MoE approaches used in some large language models, where different expert networks specialize in different types of inputs or tasks. The federated learning framework also has parallels in distributed training approaches used for large models. So while TrustScore may not be an LLM-based system per se, the architectural patterns and operational practices described have relevance for teams deploying LLMs in production settings, particularly in scenarios requiring privacy preservation and collaborative learning across organizational boundaries.
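To make that parallel concrete, here is a minimal generic MoE sketch in which a softmax gating function produces input-dependent mixing weights over experts, the same structural idea used in MoE layers of some LLMs. Everything here (the logistic experts, random gate weights, dimensions) is illustrative and is not Feedzai's confidential gating mechanism.

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Generic MoE: a gating function weights each expert's output per input.
# In LLM-style MoE the experts are sub-networks inside a layer; here they
# are standalone scorers, mirroring the fraud-expert analogy.
n_features, n_experts = 4, 3
gate_weights = rng.normal(size=(n_features, n_experts))  # learned in practice

# Each expert is a fixed random logistic scorer (stand-in for a trained model).
experts = [
    (lambda x, w=rng.normal(size=n_features): 1.0 / (1.0 + np.exp(-x @ w)))
    for _ in range(n_experts)
]

def moe_score(x):
    gates = softmax(x @ gate_weights)                    # shape: (n_experts,)
    return sum(g * expert(x) for g, expert in zip(gates, experts))

print("mixture output:", moe_score(rng.normal(size=n_features)))
```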
