Company
Bunq
Title
Multi-Agent AI Banking Assistant Using Amazon Bedrock
Industry
Finance
Year
2026
Summary (short)
Bunq, Europe's second-largest neobank serving 20 million users, faced challenges delivering consistent, round-the-clock multilingual customer support across multiple time zones while maintaining strict banking security and compliance standards. Traditional support models created frustrating bottlenecks and strained internal resources as users expected instant access to banking functions such as transaction disputes, account management, and financial advice. The company built Finn, a proprietary multi-agent generative AI assistant using Amazon Bedrock with Anthropic's Claude models, Amazon ECS for orchestration, DynamoDB for session management, and OpenSearch Serverless for RAG capabilities. The solution evolved from a problematic router-based architecture to a flexible orchestrator pattern in which primary agents dynamically invoke specialized agents as tools. Results include handling 97% of support interactions with 82% fully automated, reducing average response times to 47 seconds, translating the app into 38 languages, and moving from concept to production in 3 months with a cross-functional team of 80 people shipping updates three times daily.
## Overview

Bunq is Europe's second-largest neobank, founded in 2012 and serving 20 million users across Europe with a focus on international lifestyle banking. This case study documents the development and production deployment of Finn, an in-house generative AI assistant built using Amazon Bedrock and a sophisticated multi-agent architecture. The case represents a significant LLMOps implementation in the highly regulated financial services sector, where security, compliance, and 24/7 availability are paramount concerns.

The business problem centered on delivering consistent, high-quality customer support across multiple channels, languages, and time zones while maintaining strict security protocols and regulatory compliance. Users expected instant access to essential banking functions including transaction disputes, account management, and personalized financial advice. Traditional support models created bottlenecks and strained internal resources, while the team also needed efficient mechanisms to analyze feature requests and bug reports for continuous improvement. The challenge exemplifies common LLMOps concerns around scalability, reliability, and maintaining context across customer interactions in production environments.

## Solution Architecture and Technology Stack

Finn was launched in 2023 as part of Bunq's proprietary AI stack and has undergone continuous evolution since. The system uses Amazon Bedrock as its foundational service, providing access to Anthropic's Claude models through a unified API with enhanced security features critical for banking applications. The choice of Amazon Bedrock reflects typical LLMOps considerations around model access, compliance requirements, and the desire to avoid managing infrastructure for model serving.

The core infrastructure relies on several AWS services working in concert. Amazon ECS (Elastic Container Service) provides fully managed container orchestration for deploying and managing the multi-agent architecture, allowing Bunq to focus on agent logic rather than cluster management; this containerization approach enables horizontal scalability and streamlined deployment practices. DynamoDB serves as the persistence layer for agent memory, conversation history, and session data, delivering the single-digit-millisecond performance needed to maintain context across customer interactions, a critical requirement for production conversational AI systems. Amazon OpenSearch Serverless provides vector search capabilities for the RAG implementation, enabling semantic search across Bunq's knowledge base with automatic scaling based on application demand.

The architecture includes additional supporting services: Amazon S3 for document storage, Amazon MemoryDB for real-time session management, and a comprehensive observability stack using AWS CloudTrail, Amazon GuardDuty, and Amazon CloudWatch for monitoring performance, detecting security threats, and maintaining compliance. User access is secured through AWS WAF and Amazon CloudFront, with authentication flowing through Bunq's proprietary identity system. Amazon SageMaker hosts fine-tuned models for specialized banking scenarios, complementing the foundation models accessed through Amazon Bedrock.
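The case study does not include code, but the Bedrock building block described above can be illustrated with a minimal sketch. The example below calls a Claude model through the Bedrock Converse API using boto3; the model ID, region, system prompt, and inference parameters are illustrative assumptions rather than details from Bunq's implementation.

```python
# Minimal sketch (not Bunq's code): calling a Claude model on Amazon Bedrock
# with the Converse API. Model ID, region, and prompts are illustrative only.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-central-1")

def ask_assistant(user_message: str) -> str:
    """Send a single customer message to a Claude model and return the reply."""
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example model ID
        system=[{"text": "You are a banking support assistant. Answer concisely."}],
        messages=[{"role": "user", "content": [{"text": user_message}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]

if __name__ == "__main__":
    print(ask_assistant("How do I dispute a card transaction?"))
```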
## Multi-Agent Architecture Evolution

The case study provides valuable insights into the evolution of multi-agent architectures in production LLMOps settings. Bunq's initial implementation followed a straightforward router-based pattern in which a central router agent directed user queries to specialized sub-agents handling specific domains such as technical support, general inquiries, transaction status, and account management. However, as the system scaled, three critical problems emerged that are instructive for LLMOps practitioners.

First, routing complexity increased dramatically as more specialized agents were added to the expanding ecosystem; the router required increasingly sophisticated logic to determine the correct destination, creating a maintenance burden. Second, overlapping capabilities meant multiple agents needed access to the same data sources and capabilities, forcing the router to predict not just primary intent but also which secondary agents might be needed downstream, an essentially impossible task at scale. Third, the router became a scalability bottleneck and single point of failure, where every new agent or capability required updating router logic and comprehensive testing of all routing scenarios.

The architectural redesign centered on an orchestrator agent that fundamentally differs from the previous router approach. Rather than attempting to route to all possible agents, the orchestrator routes queries to only three to five primary agents and empowers those primary agents to invoke other agents as tools when needed. This "agent-as-tool" pattern delegates decision-making to the agents themselves rather than concentrating intelligence in a central routing component. Primary agents detect when they need specialized help and invoke tool agents dynamically through well-defined interfaces. This hierarchical structure avoids routing complexity while maintaining flexibility, a key lesson for designing scalable multi-agent systems in production.

The orchestrator maintains these primary agents as containerized services on Amazon ECS, enabling horizontal scaling where additional agent instances can be spun up automatically as demand increases. Specialized agents act as tools that primary agents call upon for specific capabilities such as analyzing transaction data, retrieving documentation, or processing complex queries. This design pattern represents an important evolution in production LLMOps architecture, moving from rigid pre-defined routing to dynamic, context-aware agent composition (a code sketch of the pattern follows the RAG discussion below).

## RAG Implementation and Knowledge Management

The system implements Retrieval Augmented Generation (RAG) using Amazon OpenSearch Serverless as the vector store, enabling semantic search across Bunq's knowledge base. This approach allows Finn to ground responses in up-to-date banking documentation, policies, and procedures while reducing the hallucination risks inherent in purely generative approaches. The OpenSearch Serverless configuration scales automatically with application needs, eliminating the operational overhead of managing search cluster capacity, an important consideration for production LLMOps where usage patterns can vary significantly.

The RAG implementation appears to support Finn's capabilities around summarizing complex banking information, providing financial insights and budgeting advice, and accessing relevant documentation to answer user queries. The case study mentions that agents can access "only pertinent data to the request" while maintaining strict security and privacy controls, suggesting sophisticated data access patterns and potentially fine-grained permission systems to ensure compliance with banking regulations.
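Neither the orchestrator nor the retrieval layer is shown in the source, but the agent-as-tool pattern and the RAG grounding described above can be sketched together. In the hypothetical example below, a primary support agent exposes a specialized knowledge agent as a Bedrock tool; when the model requests it, the knowledge agent embeds the question, runs a k-NN query against an OpenSearch index, and returns passages for grounding. All endpoints, index names, fields, model IDs, and prompts are assumptions, not details of Bunq's system.

```python
# Hypothetical sketch (not Bunq's code) of the agent-as-tool pattern with RAG
# grounding: a primary agent lets the model call a specialized knowledge agent,
# which retrieves passages from an OpenSearch vector index.
import json

import boto3
from opensearchpy import AWSV4SignerAuth, OpenSearch, RequestsHttpConnection

REGION = "eu-central-1"
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # example model ID

bedrock = boto3.client("bedrock-runtime", region_name=REGION)
credentials = boto3.Session().get_credentials()
search = OpenSearch(
    hosts=[{"host": "example-collection.eu-central-1.aoss.amazonaws.com", "port": 443}],
    http_auth=AWSV4SignerAuth(credentials, REGION, "aoss"),
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

def knowledge_agent(question: str) -> str:
    """Tool agent: embed the question and run a k-NN search over help articles."""
    embedding = json.loads(bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",  # assumed embedding model
        body=json.dumps({"inputText": question}),
    )["body"].read())["embedding"]
    hits = search.search(index="kb-articles", body={
        "size": 3,
        "query": {"knn": {"embedding": {"vector": embedding, "k": 3}}},
    })["hits"]["hits"]
    return "\n\n".join(hit["_source"]["text"] for hit in hits)

TOOL_FUNCTIONS = {"knowledge_agent": knowledge_agent}
TOOL_CONFIG = {"tools": [{"toolSpec": {
    "name": "knowledge_agent",
    "description": "Looks up help-center passages relevant to a banking question.",
    "inputSchema": {"json": {
        "type": "object",
        "properties": {"question": {"type": "string"}},
        "required": ["question"],
    }},
}}]}

def primary_agent(user_message: str) -> str:
    """Primary agent: the model decides when to invoke the tool agent."""
    messages = [{"role": "user", "content": [{"text": user_message}]}]
    while True:
        reply = bedrock.converse(modelId=MODEL_ID, messages=messages,
                                 toolConfig=TOOL_CONFIG)
        assistant_message = reply["output"]["message"]
        messages.append(assistant_message)
        if reply["stopReason"] != "tool_use":
            return "".join(block.get("text", "")
                           for block in assistant_message["content"])
        # Run each requested tool call and feed the results back to the model.
        tool_results = []
        for block in assistant_message["content"]:
            if "toolUse" in block:
                call = block["toolUse"]
                output = TOOL_FUNCTIONS[call["name"]](**call["input"])
                tool_results.append({"toolResult": {
                    "toolUseId": call["toolUseId"],
                    "content": [{"text": output}],
                }})
        messages.append({"role": "user", "content": tool_results})
```

In a production setting like the one described, each tool agent would presumably run as its own ECS service behind a well-defined interface rather than as an in-process function, but the control flow is the same: the primary agent decides when to call, and the specialized agent does the work.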
## Model Selection and Inference

Bunq uses Anthropic's Claude models, accessed through Amazon Bedrock, as the primary foundation models for natural language understanding and generation. The choice of Claude likely reflects considerations around reasoning capabilities, context window size, and performance on complex financial queries. The case study also mentions fine-tuned models hosted on Amazon SageMaker for "specialized banking scenarios," indicating a hybrid approach in which foundation models handle general language understanding while custom models address domain-specific tasks. This dual-model strategy represents a pragmatic LLMOps approach that balances the broad capabilities of large foundation models with the precision of task-specific fine-tuned models. However, the case study provides limited technical detail about the fine-tuning process, training data strategies, model versioning, or how the system decides when to use foundation models versus fine-tuned models, details that would be valuable for understanding the full LLMOps implementation.

## Multilingual Capabilities and Translation

Finn supports translation of the Bunq application into 38 languages and provides real-time speech-to-speech translation for support calls, described as "a first in global banking." These multilingual capabilities likely leverage Claude's inherent multilingual understanding combined with specialized translation services, though the technical implementation details are not fully specified in the case study. Supporting 38 languages in production at scale presents significant LLMOps challenges around consistency, cultural nuance, and maintaining response quality across languages. The speech-to-speech translation capability suggests integration with speech recognition and text-to-speech services, potentially AWS services such as Amazon Transcribe and Amazon Polly, though these are not explicitly mentioned. Real-time translation imposes strict latency constraints that affect model selection, infrastructure design, and optimization strategies in the LLMOps pipeline.

## Multimodal Capabilities

Beyond text, Finn includes image recognition capabilities for automating tasks such as invoice processing and document verification. This multimodal functionality extends the LLMOps challenges beyond pure language models to include computer vision, potentially leveraging Claude's vision capabilities or separate vision models. The case study specifically mentions receipt processing and document verification, high-value use cases in banking that reduce manual processing overhead while requiring high accuracy to maintain user trust and regulatory compliance.
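The case study does not describe the image-processing pipeline. If receipts are handled with Claude's vision capabilities, as speculated above, a request might look roughly like the sketch below; the prompt, extracted fields, and model ID are assumptions for illustration only.

```python
# Hypothetical sketch: sending a receipt image to a vision-capable Claude model
# on Bedrock and asking for structured fields. Not Bunq's actual pipeline.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-central-1")

def extract_receipt_fields(image_path: str) -> dict:
    """Ask the model to read merchant, date, currency, and total from a receipt."""
    with open(image_path, "rb") as f:
        image_bytes = f.read()
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example model ID
        messages=[{
            "role": "user",
            "content": [
                {"image": {"format": "jpeg", "source": {"bytes": image_bytes}}},
                {"text": "Extract merchant, date, currency, and total amount "
                         "from this receipt. Reply with JSON only."},
            ],
        }],
        inferenceConfig={"maxTokens": 300, "temperature": 0},
    )
    # The model is instructed to reply with JSON only; parse it directly.
    return json.loads(response["output"]["message"]["content"][0]["text"])
```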
## Deployment and Development Velocity

The deployment timeline provides important insights into LLMOps practices at Bunq. The team moved from concept to production in 3 months starting in January 2025, assembling a cross-functional team of 80 people including AI engineers and support staff. During the initial rollout, the team deployed updates three times per day, indicating a highly automated CI/CD pipeline and robust testing infrastructure capable of sustaining this cadence without disrupting the production banking service.

This rapid iteration cycle represents best-in-class LLMOps practice but also raises questions about quality assurance, testing strategies, and rollback procedures that are not detailed in the case study. Deploying three times daily for a production banking application serving 20 million users suggests sophisticated canary, A/B, or blue-green deployment strategies, along with comprehensive monitoring to detect issues before they affect the broader user base. The containerization approach using Amazon ECS clearly enables this deployment velocity by providing consistent environments across development, testing, and production. The architecture likely includes automated testing pipelines, model performance monitoring, and rollback capabilities, though these operational details are not extensively covered in the marketing-focused case study.

## Observability and Monitoring

The architecture includes a comprehensive observability stack: AWS CloudTrail for audit logging, Amazon GuardDuty for threat detection, and Amazon CloudWatch for performance monitoring. These services address critical LLMOps concerns around compliance (particularly important in banking), security threat detection, and operational visibility into system performance. However, the case study provides limited detail about AI-specific monitoring such as tracking model performance degradation, monitoring for hallucinations or inappropriate responses, measuring user satisfaction, or detecting when agents make incorrect decisions. Production LLMOps for customer-facing banking applications would typically require specialized monitoring of response quality, accuracy, escalation rates, and compliance with banking regulations. The claim that Finn handles 97% of support with 82% fully automated suggests such metrics are being tracked, but the specific monitoring and alerting infrastructure is not detailed.

## Session Management and Context Maintenance

DynamoDB serves as the persistence layer for agent memory, conversation history, and session data, enabling Finn to maintain context across customer interactions. This stateful architecture is essential for coherent multi-turn conversations in which users ask follow-up questions or switch between topics. Amazon MemoryDB handles real-time session management, providing in-memory performance for active conversations. The dual-database approach (DynamoDB for persistent storage, MemoryDB for active sessions) is a thoughtful way to address differing performance and durability requirements: active conversations need millisecond-level latency for a responsive user experience, while historical conversation data supports longer-term analytics, compliance requirements, and potentially model improvement through fine-tuning or few-shot learning.
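The session data model is not described in the case study. As a minimal sketch of the dual-store pattern discussed above, and assuming a simple turn-per-item schema, the code below writes each conversational turn to a Redis-compatible in-memory store (MemoryDB speaks the Redis protocol) for the active session and to a DynamoDB table for durable history; the endpoints, table name, key schema, and TTL are all assumptions.

```python
# Minimal sketch (assumed schema, not Bunq's): dual-store session management.
# Active-session turns go to a Redis-compatible MemoryDB endpoint; durable
# conversation history goes to a DynamoDB table.
import json
import time

import boto3
import redis  # MemoryDB is Redis-compatible, so the standard client applies

memory = redis.Redis(host="example-memorydb.cluster.local", port=6379, ssl=True)
history = boto3.resource("dynamodb", region_name="eu-central-1").Table("finn_conversations")

ACTIVE_SESSION_TTL_SECONDS = 1800  # assumption: expire idle sessions after 30 minutes

def record_turn(session_id: str, role: str, text: str) -> None:
    """Append one conversational turn to both the hot and the durable store."""
    turn = {"role": role, "text": text, "ts": int(time.time() * 1000)}
    # Hot path: keep the running transcript in memory for low-latency context reads.
    memory.rpush(f"session:{session_id}", json.dumps(turn))
    memory.expire(f"session:{session_id}", ACTIVE_SESSION_TTL_SECONDS)
    # Durable path: one item per turn, keyed by session ID and timestamp.
    history.put_item(Item={"session_id": session_id, "ts": turn["ts"],
                           "role": role, "text": text})

def active_context(session_id: str, max_turns: int = 20) -> list:
    """Read the most recent turns of the active session for prompt construction."""
    raw = memory.lrange(f"session:{session_id}", -max_turns, -1)
    return [json.loads(t) for t in raw]
```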
## Security and Compliance

Operating in the banking sector imposes strict security and compliance requirements that significantly shape the LLMOps implementation. The case study notes that the system maintains "strict security protocols and compliance standards" and accesses "only pertinent data to the request" while maintaining "strict security and privacy controls." The infrastructure operates within a secure VPC (virtual private cloud), with AWS WAF providing web application firewall protection and GuardDuty providing threat detection.

These security layers are essential for production LLMOps in regulated industries, but the case study provides limited technical detail about data handling practices, model access controls, audit trails for AI decisions, or how the system ensures compliance with European banking regulations such as GDPR and PSD2. The proprietary identity system suggests careful attention to authentication and authorization, but the specifics of how user data is isolated, how agent decisions are audited, and how the system prevents unauthorized access or data leakage are not extensively covered.

## Performance Metrics and Business Impact

The case study presents impressive performance metrics that indicate a successful production deployment. Finn now handles 97% of Bunq's user support activity, with 82% fully automated (the remaining 15 percentage points presumably requiring some human intervention or oversight). Average response times of 47 seconds represent a significant improvement over traditional support models, though this metric likely includes time for complex operations such as transaction analysis rather than just initial response generation.

The transformation enabled Bunq to position itself as "Europe's first AI-powered bank," expanding reach to 38 languages and serving 20 million users across Europe with round-the-clock availability. The system provides capabilities that traditional support could not deliver, including real-time speech-to-speech translation, image recognition for receipt and document processing, and intelligent financial insights.

However, several important LLMOps metrics are not provided. There is no detail about accuracy rates, user satisfaction scores, escalation patterns, or false positive/negative rates for automated responses. The boundary between interactions that are "fully automated" and those requiring human oversight is not clearly defined, and there is no discussion of edge cases, failure modes, or how the system handles queries it cannot confidently answer. These omissions are typical of marketing-focused case studies but limit technical assessment of the LLMOps implementation.

## Critical Assessment and Limitations

While the case study presents an impressive implementation, several aspects warrant critical consideration. The source is an AWS blog post co-authored with Bunq, which inherently presents the solution in favorable terms to promote AWS services. Claims such as "Europe's first AI-powered bank" and handling "97% of support" should be read in that marketing context.

The architectural evolution from router to orchestrator represents genuine learning about multi-agent systems at scale, but the case study does not discuss challenges encountered, failed approaches, or limitations of the current system. There is no mention of hallucination mitigation strategies beyond RAG, no discussion of how the system handles adversarial inputs or edge cases, and limited detail about testing and validation approaches for banking-critical decisions. The rapid 3-month deployment and three-times-daily update cadence, while impressive, raise questions about testing rigor, regression prevention, and quality assurance processes that are not addressed. For a production banking system handling potentially sensitive financial decisions, the governance, approval, and validation processes would be critical components of responsible LLMOps and deserve more attention.

The case study also lacks detail about cost management, a significant concern for production LLM deployments at scale.
With 20 million users and 97% of support interactions handled by the system, the inference costs, storage costs for conversation history, and vector database costs would be substantial, yet there is no discussion of optimization strategies, caching approaches, or cost monitoring, all important LLMOps considerations for sustainable production deployment.

## Lessons for LLMOps Practitioners

Despite the marketing framing, the case study offers valuable lessons for production LLMOps implementations. The evolution from a router-based to an orchestrator-based multi-agent architecture demonstrates the importance of flexible, decentralized decision-making in complex agent systems. The agent-as-tool pattern, in which specialized agents are dynamically invoked by primary agents, is an important architectural pattern for scaling multi-agent systems without creating routing bottlenecks.

The use of managed services (Amazon Bedrock for model access, ECS for orchestration, OpenSearch Serverless for vector search) reflects pragmatic LLMOps decisions to minimize operational overhead and focus engineering resources on business logic rather than infrastructure management. The dual-database approach to session management (MemoryDB for active sessions, DynamoDB for persistent storage) demonstrates thoughtful architecture that addresses differing performance and durability requirements.

The cross-functional team of 80 people, including AI engineers and support staff, highlights the organizational requirements for successful LLMOps, where domain expertise, user feedback, and technical implementation must work together. The high deployment velocity suggests mature DevOps practices adapted to AI/ML workloads, though the specific practices are not detailed.

Overall, this case study represents a significant production LLMOps implementation in the challenging financial services domain, demonstrating how modern AI architectures can transform customer support operations while maintaining the security and compliance requirements essential to banking. However, readers should recognize the marketing context and the absence of discussion around challenges, limitations, and operational complexities that would provide a more complete picture of the LLMOps practices required to build and maintain such a system.
