## Overview
This case study presents two distinct but complementary approaches to deploying LLMs in production within highly regulated environments. The presentation, delivered at AWS re:Invent, features Amanda Quinto (AWS Solutions Architect), Edson Lisboa (IT Executive at Sicoob), and Andre Gretenberg (from Holland Casino). The contrast between these two organizations offers valuable insights into different architectural patterns for LLMOps—one focused on infrastructure-level control using Kubernetes and open-source models, the other emphasizing managed services and rapid agent development.
Sicoob represents Brazil's largest cooperative financial system, with a presence in nearly 2,500 Brazilian cities and over 9 million members served through more than 300 credit unions. It operates under strict regulations from Brazil's Central Bank, navigating emerging AI legislation alongside existing data protection laws similar to GDPR. Holland Casino, which has operated under Dutch government mandate for 50 years, runs 13 physical casinos and one online casino and is subject to extremely strict oversight from the Dutch gaming authorities, where non-compliance can result in casino closures, significant financial penalties, and license suspension. Both organizations demonstrate that sophisticated LLMOps implementations are possible within restrictive regulatory frameworks.
## Regulatory Framework and Compliance Considerations
The presentation emphasizes that modern LLMOps must navigate dense layers of laws, standards, and regulations. The speakers identify four common pillars across global AI regulations: compliance and governance, security, legal and privacy controls, and risk management. These appear consistently across different jurisdictions despite variations in specific requirements.
The regulatory landscape includes over 1,000 different AI regulations spanning 69 countries. Europe has implemented the EU AI Act with risk-based frameworks that completely ban certain AI uses and impose strict obligations on high-risk systems. Brazil combines emerging AI legislation with existing data protection requirements, including federal obligations mandating that public sector data remain within Brazil's borders and industry-specific requirements from regulators like the Central Bank.
The speakers present a layered compliance approach, building up from the application itself:

- the generative AI application at the base;
- country-specific regulatory layers;
- broader frameworks, including AWS Responsible AI principles;
- compliance standards such as ISO certifications, particularly ISO 42001 for AI systems, where AWS Bedrock was the first cloud provider to achieve certification;
- risk management frameworks such as NIST AI 600-1;
- security patterns from OWASP, including the OWASP Top 10 for LLM vulnerabilities;
- and finally the AWS Well-Architected Framework with its specific lens for generative AI workloads.
A critical compliance point emphasized repeatedly: AWS does not use customer data to train or improve foundation models, customer data is not shared between customers, data remains in the specified region unless customers explicitly configure cross-region inference, and customers maintain full control over which models and regions they use. For highly regulated industries, these data sovereignty and isolation guarantees form the foundation of compliant LLMOps.
## Sicoob's Infrastructure-Centric Approach: Kubernetes and Open-Source LLMs
Sicoob's architecture represents a sophisticated infrastructure-as-code approach to LLMOps, centered on Amazon EKS (Elastic Kubernetes Service) running GPU-enabled EC2 instances. Their decision to use Kubernetes rather than managed services like Bedrock stems from several factors: existing Kubernetes expertise within their organization, desire for maximum control and flexibility, commitment to open-source models, and need for cost optimization at scale.
### Technical Architecture
The core architecture runs on Amazon EKS clusters deployed across three availability zones in Brazil's AWS region to meet data residency requirements. The infrastructure leverages GPU-enabled EC2 instances, specifically optimized for AI workloads using specialized AMIs that include pre-configured NVIDIA drivers and other dependencies, avoiding the operational complexity of managing these components manually.
Two critical open-source tools enable their cost-effective operation at scale:
**Karpenter** handles cluster autoscaling by dynamically provisioning and deprovisioning GPU instances based on workload demands. Rather than pre-defining static node groups, Karpenter intelligently selects the most appropriate and cost-effective GPU instance types across availability zones, prioritizing spot instances where possible. This approach enables true pay-as-you-go economics for expensive GPU infrastructure. When workload demands decrease, Karpenter removes instances, eliminating idle GPU costs.
**KEDA (Kubernetes Event-Driven Autoscaling)** complements Karpenter by scaling pods within the cluster based on incoming requests. When users aren't actively using the AI services, no pods run, meaning no GPU instances are provisioned. When requests arrive, KEDA triggers pod creation, which in turn signals Karpenter to provision appropriate GPU instances. While this introduces latency (seconds to deploy), this tradeoff is acceptable for their non-transactional use cases, delivering substantial cost savings.
This two-layer autoscaling approach—KEDA for pod-level scaling and Karpenter for infrastructure-level scaling—creates an efficient system where GPU resources are consumed only when genuinely needed. The speakers emphasize this is a key advantage of the Kubernetes-based approach for LLMOps in cost-sensitive environments.
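As a concrete illustration of the pod-level half of this pattern, the sketch below (written against the official `kubernetes` Python client) applies a hypothetical KEDA ScaledObject that scales an inference Deployment to zero when idle. The deployment name, namespace, Prometheus address, and threshold are illustrative assumptions, not Sicoob's actual configuration.

```python
# Hypothetical KEDA ScaledObject: scale the "llm-inference" Deployment between
# 0 and 8 replicas based on request rate observed in Prometheus. Karpenter then
# adds or removes GPU nodes to fit whatever pods KEDA creates.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

scaled_object = {
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {"name": "llm-inference-scaler", "namespace": "ai"},
    "spec": {
        "scaleTargetRef": {"name": "llm-inference"},  # Deployment serving the model
        "minReplicaCount": 0,   # no pods (and eventually no GPU nodes) when idle
        "maxReplicaCount": 8,
        "triggers": [{
            "type": "prometheus",
            "metadata": {
                "serverAddress": "http://prometheus.monitoring.svc:9090",
                "query": 'sum(rate(http_requests_total{app="llm-inference"}[1m]))',
                "threshold": "5",
            },
        }],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="keda.sh", version="v1alpha1", namespace="ai",
    plural="scaledobjects", body=scaled_object,
)
```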
### Model Serving and Management
Sicoob runs multiple open-source foundation models simultaneously within the same cluster infrastructure: Meta's Llama family, Mistral (the talk cites version 5.4), the Chinese DeepSeek model, and IBM's Granite. They deliberately avoid "falling in love" with any single model, instead selecting the best-performing model for each specific use case. The architecture supports running multiple models side-by-side, enabling A/B testing and gradual migration as better models emerge.
For model serving and inference optimization, they recently standardized on vLLM as their primary inference engine, migrating from their earlier serving setup. The presenters note that vLLM provides significantly faster inference and better computational optimization compared to alternatives. As an actively maintained open-source project receiving continuous improvements from the community, vLLM represents the current state of the art for self-hosted LLM inference.
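For reference, a minimal vLLM invocation looks like the sketch below; the model name and sampling parameters are illustrative rather than taken from Sicoob's deployment, and in their architecture the engine would typically run behind an HTTP serving layer rather than in a one-off script.

```python
# Minimal self-hosted inference with vLLM (illustrative model and settings).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # weights can be pre-staged from S3
params = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.generate(
    ["Summarize the key points of this credit policy for a new member."],
    params,
)
print(outputs[0].outputs[0].text)
```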
For model management and providing user interfaces to their models, Sicoob deployed Open WebUI (formerly Ollama WebUI), an open-source solution that provides a user-friendly interface for interacting with locally hosted models rather than requiring command-line access. This democratizes access to the AI capabilities across their organization.
Models themselves are stored in Amazon S3, with the infrastructure configured to load models from S3 rather than downloading them each time pods start. This significantly accelerates deployment times and reduces external bandwidth dependencies.
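A minimal sketch of that startup step, assuming a hypothetical bucket and prefix, might look like the following: the pod pulls the model artifacts from S3 into a local volume before the inference server starts, rather than fetching them from an external model hub on every cold start.

```python
# Pre-pull model weights from S3 into the pod's local volume (illustrative names).
import os
import boto3

s3 = boto3.client("s3")
bucket = "example-model-artifacts"          # hypothetical bucket
prefix = "llama-3.1-8b/"                    # hypothetical model prefix
local_dir = "/models/llama-3.1-8b"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith("/"):        # skip directory markers
            continue
        target = os.path.join(local_dir, os.path.relpath(obj["Key"], prefix))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        s3.download_file(bucket, obj["Key"], target)
```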
Container images for their AI workloads are managed through Amazon ECR (Elastic Container Registry), and they expose model endpoints using AWS Application Load Balancers integrated with Kubernetes Ingress controllers, providing resilient and scalable access points for applications consuming the AI services.
### Production Use Cases
Sicoob has deployed four major production use cases leveraging their LLMOps infrastructure:
**Sicoob AI Code Assistant** integrates directly into developer IDEs, providing code autocompletion, code recommendations, and accelerating onboarding for new developers. Serving approximately 10,500 developers, this internal tool improves code quality, reduces time-to-delivery for new features, and provides continuous support throughout the development lifecycle. Notably, they chose to build this rather than purchase commercial solutions, maintaining control and customization capabilities.
**Back-office Automation** uses AI agents integrated with their robotic process automation platform to handle complex manual tasks. This implementation has saved approximately 400,000 human hours, representing massive operational efficiency gains. The combination of traditional RPA with generative AI capabilities enables automation of tasks previously considered too complex or variable for pure rule-based automation.
**Investment Advisor** provides specialized support for investment decisions, creating personalized recommendations based on member profiles, market best practices, and Sicoob's available investment products. This use case demonstrates AI deployment in a heavily regulated domain (financial advice) while maintaining compliance with Central Bank requirements.
**Sicoob Smart Assistant** (core banking assistant) offers multiple capabilities including interaction with legal documents and contracts using natural language, intelligent search across documentation, automated analysis, and automatic generation of credit loan decision reports. This represents sophisticated document understanding and generation in a compliance-critical context.
The diversity of these use cases—spanning code generation, process automation, financial advisory, and document intelligence—demonstrates the flexibility of their Kubernetes-based infrastructure. All use cases run on the same underlying platform, with model selection and scaling handled dynamically based on demand patterns.
### Operational Considerations and Tradeoffs
The presenters are candid about the operational overhead of their chosen approach. Running production LLMOps on Kubernetes requires significant expertise in container orchestration, GPU management, model serving frameworks, and integration of multiple open-source tools. Organizations without existing Kubernetes capabilities would face a steep learning curve.
However, for organizations already operating Kubernetes in production, extending the platform to AI workloads provides several advantages: consistent operational patterns across all workloads, reuse of existing disaster recovery and monitoring infrastructure, unified cost visibility and management, and leveraging the rich open-source ecosystem around Kubernetes and AI.
The emphasis on open-source models and infrastructure reflects both philosophical commitment and practical necessity. Open-source models provide transparency into training data and model behavior (important for compliance), avoid vendor lock-in, enable on-premises testing before cloud deployment, and offer cost advantages at scale compared to API-based managed services.
The presenters specifically recommend their approach for organizations that already have Kubernetes expertise and production workloads, need maximum control and customization, operate at sufficient scale to justify the operational investment, have specific compliance requirements around model hosting and data residency, and are committed to open-source technologies. For organizations lacking these characteristics, managed services like Bedrock may offer better tradeoffs.
## Holland Casino's Managed Service Approach: Bedrock and Agent Core
Holland Casino's LLMOps journey represents a contrasting approach, leveraging AWS managed services to achieve rapid deployment with minimal operational overhead while maintaining the strict compliance requirements of regulated gaming operations.
### Organizational Context and Requirements
Holland Casino's regulatory environment is exceptionally strict. The Dutch gaming authorities mandate comprehensive player safety measures, fraud detection, and anti-money laundering controls. Non-compliance results in immediate casino closures until issues are remediated, severe reputational damage to a trusted national brand, financial penalties substantially exceeding typical regulatory fines, and potential license suspension.
Regulations change frequently in response to political decisions, sometimes with implementation deadlines as short as one to three months. Holland Casino depends on third-party applications for gaming machines, jackpot systems, and casino management platforms, but these vendors cannot respond quickly enough to regulatory changes. This created a strategic imperative to establish rapid-response capabilities for regulatory compliance within AWS, where they maintain full control.
Since 2017, Holland Casino has hosted regulatory flows, reporting systems, and alerting infrastructure in AWS. Their team—combining developers, testers, architects, product owners, and IT managers, supported by consultants from Easy2Cloud—has developed deep AWS expertise. Over time, success with regulatory workloads led them to migrate additional systems including their central casino management system, business intelligence platforms, data analysis workloads using SageMaker, and even legacy enterprise bus systems on EC2 for stability.
This established AWS foundation and organizational culture of "start small, gain confidence, then scale" directly informed their approach to LLMOps. Rather than building complex infrastructure, they sought to leverage managed services that would allow them to focus on use case delivery rather than operational concerns.
### The Management Insight Gap Use Case
Holland Casino identified a specific but important problem: management and stakeholders need oversight of costs, security, and compliance but shouldn't need to log into the AWS Console. The existing pattern had stakeholders requesting ad-hoc reports from the internal team, who would write Python scripts to generate these reports. This created unnecessary dependencies, didn't scale, and consumed engineering resources on repetitive tasks.
The solution: provide management with AI agents that deliver self-service access to the information they need, supporting natural language dialogue rather than rigid reporting templates. Initial agents focus on three domains: cost analysis and billing, security posture and compliance status, and operational metrics and insights.
This use case exemplifies effective LLMOps strategy—identifying a clear business problem with definable scope, choosing technology appropriate to the problem scale, and delivering measurable value (reduced dependency on engineering team, faster access to insights for decision-makers) rather than deploying AI for its own sake.
### Technical Implementation with Strands Agents
Holland Casino chose the Strands agent framework for agent development, citing several advantages for their context. Strands agents are remarkably simple to implement—the entire agent definition fits in a few lines of Python code. The framework is compact but extensible with straightforward integration of custom tools and MCP (Model Context Protocol) servers. Being Python-based aligned with their existing codebase and team skills. As an open-source project, they could inspect exactly how the framework operates, important for compliance and security validation.
A typical Strands agent definition includes specifying the model (they use Anthropic's Claude 3.5 Sonnet via Bedrock), providing a system prompt with detailed instructions (the presenters note they invest significant effort crafting comprehensive system prompts, though space constraints prevented showing the full prompt in their presentation), and registering tools that the agent can invoke.
Critically, Holland Casino discovered that their existing ad-hoc reporting scripts could be easily transformed into agent tools with relatively minor refactoring. Rather than discarding this prior work, they repurposed it, emphasizing the importance of writing high-quality tool specifications that clearly describe tool capabilities, parameters, and expected outputs to help the model select appropriate tools.
Once an agent is defined and invoked with a session ID, users can have multi-turn conversations, asking follow-up questions like comparing current costs to previous months or drilling into specific services. This dialogue capability transforms static reports into interactive exploration, substantially improving the user experience for management stakeholders.
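The sketch below illustrates the pattern described in this subsection, assuming the open-source Strands Agents SDK; the tool, system prompt, and model identifier are illustrative stand-ins (a Cost Explorer query repackaged as a tool), not Holland Casino's actual code, and exact SDK parameter names may differ from the released library.

```python
# Illustrative Strands agent: one focused tool refactored from an ad-hoc
# reporting script, a detailed system prompt, and a Claude model on Bedrock.
import boto3
from strands import Agent, tool


@tool
def monthly_cost_report(month: str) -> str:
    """Return total AWS unblended cost for a month given in YYYY-MM format."""
    ce = boto3.client("ce")  # Cost Explorer, as in the original ad-hoc script
    result = ce.get_cost_and_usage(
        TimePeriod={"Start": f"{month}-01", "End": f"{month}-28"},  # simplified range
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
    )
    amount = result["ResultsByTime"][0]["Total"]["UnblendedCost"]["Amount"]
    return f"Total unblended cost for {month}: ${float(amount):,.2f}"


agent = Agent(
    model="anthropic.claude-3-5-sonnet-20241022-v2:0",  # served via Bedrock
    system_prompt=(
        "You are a cost-analysis assistant for management. Answer only "
        "questions about AWS spend, and use the provided tools for figures."
    ),
    tools=[monthly_cost_report],
)

# Multi-turn dialogue: the agent retains conversation history between calls.
print(agent("What did we spend in January 2025?"))
print(agent("How does that compare with the month before?"))
```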
### Deployment Evolution: From Manual to Bedrock Agent Core
The presenters describe Holland Casino's deployment evolution, providing valuable insights into practical LLMOps decisions. Initially, getting a Strands agent running locally is trivial—"it works on my machine, and I guarantee it works on every one of your machines," Andre notes. However, production deployment requires addressing several concerns: hosting the agent in AWS Cloud with proper security, enabling the agent to scale based on demand, implementing serverless architecture to minimize costs, streaming responses for better user experience, and enforcing authentication and authorization.
Their first approach involved packaging the agent as a FastAPI application, containerizing it with Docker, deploying to AWS Lambda for serverless execution, implementing auth layers, establishing versioning practices, and creating deployment pipelines. This works—Andre confirms they successfully deployed this way—but requires substantial boilerplate code and operational investment for what is fundamentally a simple agent.
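A rough sketch of that first-generation wrapper, assuming the agent from the previous example lives in a hypothetical `my_agent` module, shows how much scaffolding surrounds a fundamentally simple agent (Mangum adapts the FastAPI app to the Lambda runtime):

```python
# First-generation deployment sketch: FastAPI endpoint around the agent,
# adapted for AWS Lambda with Mangum. Auth, sessions, streaming, versioning,
# and the container/pipeline plumbing still have to be built by hand.
from fastapi import FastAPI
from mangum import Mangum
from pydantic import BaseModel

from my_agent import agent  # hypothetical module holding the Strands agent

app = FastAPI()


class Question(BaseModel):
    session_id: str
    text: str


@app.post("/ask")
def ask(question: Question) -> dict:
    result = agent(question.text)
    return {"session_id": question.session_id, "answer": str(result)}


handler = Mangum(app)  # Lambda entry point
```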
The introduction of Amazon Bedrock Agent Core dramatically simplified their deployment process. Agent Core allows developers to wrap their Strands agent (or other custom agents) in a Python function decorated with `@app.entry_point`, test locally exactly as before, run `bedrock-agent-core configure` to automatically generate all the boilerplate infrastructure code previously written manually, and deploy to production with `bedrock-agent-core launch`.
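A minimal sketch of the Agent Core pattern follows; the package, class, and decorator names are assumptions based on published Bedrock AgentCore SDK examples and may not match the presentation's exact wording, and `my_agent` is the same hypothetical module used above.

```python
# With Agent Core, hosting, scaling, identity, and session isolation are managed;
# only the agent invocation itself remains application code.
from bedrock_agentcore.runtime import BedrockAgentCoreApp

from my_agent import agent  # hypothetical module holding the Strands agent

app = BedrockAgentCoreApp()


@app.entrypoint
def invoke(payload: dict) -> str:
    # Payload shape is an assumption; Agent Core passes the request body here.
    return str(agent(payload.get("prompt", "")))


if __name__ == "__main__":
    app.run()  # local testing works the same way as before deployment
```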
This approach provides enormous advantages for rapid prototyping and experimentation, seamless CI/CD integration, and access to the full suite of Agent Core services including managed runtimes, identity and authentication services, and sandbox environments for code interpretation and browser tools.
The architectural pattern Holland Casino adopted places Agent Core agents at the center, with access to Bedrock foundation models (primarily Anthropic Claude), Bedrock Guardrails for safety and compliance, Bedrock Knowledge Bases for RAG-based access to regulatory documentation, and custom tools for AWS API interactions. Two types of applications consume these agents. In-house applications hosted on AWS Amplify authenticate through Cognito user pools federated to Active Directory, with Cognito identity pools issuing temporary STS credentials for direct API calls that invoke the agents. Third-party applications that cannot obtain STS credentials reach the agents through API Gateway instead.
The overall deployment follows a standard multi-account AWS architecture with separate pre-production and production accounts for AI workloads, surrounded by shared accounts for security, compliance, monitoring, and networking—treating AI infrastructure no differently than other production applications.
### Operational Lessons and Best Practices
Andre shares several hard-won lessons from production LLMOps that merit emphasis. First, non-deterministic behavior is inherent to LLMs and must be explicitly managed. Holland Casino's mitigation strategies include:

- investing heavily in system prompt engineering, with clear instructions, explicit do's and don'ts, and specified output formats;
- implementing Bedrock Guardrails to constrain outputs within acceptable bounds;
- designing agents with single, focused responsibilities rather than attempting to build one agent that "rules them all" (Andre's initial approach, which failed due to tool selection confusion);
- ensuring tool specifications are crystal clear and non-overlapping, so models can select the appropriate tool;
- investing in realistic evaluation jobs using Bedrock's evaluation capabilities, particularly when knowledge bases are in flux or models change;
- implementing easy feedback mechanisms so end users in production can report issues or unexpected behavior.
The emphasis on system prompt quality cannot be overstated—the presenters repeatedly return to this as perhaps the most important factor in agent reliability. Combined with Guardrails, careful prompting provides the primary mechanism for ensuring consistent, compliant behavior from inherently probabilistic models.
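As an illustration of how guardrails attach at inference time, the sketch below calls the Bedrock Converse API with a guardrail configured; the guardrail identifier, version, model ID, and region are placeholders rather than Holland Casino's actual values.

```python
# Attach a Bedrock Guardrail to a Converse API call so model output is checked
# against the configured policies before it reaches the user.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # placeholder region

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",             # placeholder model
    system=[{"text": "You answer questions about AWS cost reports only."}],
    messages=[{"role": "user", "content": [{"text": "How did spend change versus last month?"}]}],
    guardrailConfig={
        "guardrailIdentifier": "example-guardrail-id",                # placeholder
        "guardrailVersion": "1",
    },
)
print(response["output"]["message"]["content"][0]["text"])
```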
For knowledge bases and RAG implementations, continuous evaluation becomes critical. As source documents change or models are updated, retrieval quality and answer accuracy can drift. Automated evaluation jobs using Bedrock's built-in metrics provide visibility into this drift, enabling proactive remediation before users encounter problems.
## Cross-Cutting LLMOps Themes and Technical Insights
Several themes emerge across both implementations that provide broader lessons for production LLMOps:
**Model Selection Philosophy**: Both organizations emphasize avoiding commitment to specific models. Sicoob explicitly runs multiple models simultaneously, selecting the best for each use case. Holland Casino chose Claude for their current needs but architected their systems to enable model switching. The presenters repeatedly stress that models are rapidly improving, new options constantly emerge, and different models excel at different tasks. Production LLMOps must accommodate model evolution as a first-class concern.
**Security and Data Isolation**: Both implementations prioritize data security and isolation. Sicoob's Kubernetes architecture provides per-session isolation, and their data residency controls ensure compliance with Brazilian regulations. Holland Casino leverages Bedrock Agent Core's session isolation and identity management capabilities. The presenters emphasize that security is "non-negotiable" in regulated industries and must be architectural from the start, not added later.
**Cost Management**: Both organizations implement sophisticated cost optimization strategies appropriate to their architectural choices. Sicoob's Karpenter and KEDA combination provides fine-grained control over expensive GPU resources. Holland Casino's serverless approach with Agent Core eliminates idle costs. The presenters suggest that effective LLMOps at scale requires treating cost as a first-class concern equal to functionality and performance.
**Compliance as Enabler**: A recurring theme is that compliance requirements should not prevent AI adoption but rather guide it. The presenters argue that AI can actually improve compliance by helping organizations understand and operationalize complex regulations. Both organizations use AI to assist with regulatory interpretation and reporting. This reframing—from "compliance blocks AI" to "AI enables compliance"—represents an important mindset shift.
**Guardrails and Safety**: Both implementations use guardrails, though implemented differently. Sicoob implements application-level controls and monitoring. Holland Casino uses Bedrock Guardrails directly. The presenters emphasize that because LLMs are non-deterministic, guardrails aren't optional enhancements but essential components for production deployment, particularly in regulated environments where outputs must remain within defined bounds.
**Open Source Ecosystems**: The presentations highlight the maturity and vitality of open-source tooling for LLMOps. Sicoob's entire stack relies on open-source components (Karpenter, KEDA, vLLM, Open WebUI, plus the models themselves). Holland Casino uses open-source Strands agents. The presenters note that the open-source ecosystem for Kubernetes-based AI has matured dramatically over just 18 months, evolving from minimal tooling to production-ready solutions for GPU cluster management, optimized inference engines, and comprehensive model serving frameworks.
**Start Small, Scale Thoughtfully**: Both organizations explicitly followed a pattern of starting with focused use cases, gaining organizational confidence in the technology, and then expanding scope. Holland Casino's management insight gap and Sicoob's code assistant both represent bounded problems with clear success criteria. This approach reduces risk, enables learning, and builds organizational capability before tackling more ambitious use cases.
**Team Skills and Organizational Readiness**: The presenters are candid about skill requirements. Sicoob's approach requires deep Kubernetes expertise, understanding of GPU infrastructure, and familiarity with multiple open-source tools. Holland Casino's approach requires solid Python skills and AWS service knowledge but substantially less infrastructure expertise. Organizations should assess their existing capabilities and choose approaches that align with or slightly extend current skills rather than requiring complete capability transformation.
## Responsible AI and Transparency
Amanda dedicates significant discussion to AWS Responsible AI principles, noting comprehensive documentation covering data handling, customer safeguarding, security, and prompt engineering structures to detect and prevent malicious use. Following these patterns substantially increases the likelihood of meeting country-specific regulatory requirements.
The presentation highlights emerging transparency practices, specifically mentioning Anthropic's system cards (140 pages documenting model training, data sources, and trustworthiness) and AWS scorecards for Amazon Nova models. This transparency—understanding what data trained models and how they behave—is increasingly important for compliance and trust in regulated industries.
The presenters note ISO 42001, the new international standard specifically for AI management systems, with AWS Bedrock achieving this certification before other major cloud providers. This demonstrates the maturity of managed AI services for compliance-critical workloads.
## Framework and Standards Landscape
The presentation provides a useful taxonomy of applicable frameworks for AI governance:

- **NIST AI Risk Management Framework** (particularly NIST AI 600-1), providing structured approaches to identifying and managing AI risks; NIST publishes frameworks rather than conducting audits.
- **OWASP Top 10 for LLM Applications**, documenting common vulnerabilities in LLM systems and mitigation approaches, and actively evolving as new vulnerabilities emerge.
- **AWS Well-Architected Framework** with its specific lens for generative AI workloads, providing prescriptive guidance for building on AWS and presented as a "living document" that continues to evolve.
- **Country-specific regulatory frameworks**, including the EU AI Act with its risk-based classifications, Brazilian Central Bank requirements for financial institutions, and Dutch gaming authority regulations.
The layered approach presented—starting with specific regulations, then applying broader frameworks, then implementing AWS best practices—provides a practical methodology for navigating the complex compliance landscape.
## Evaluation and Monitoring
While not extensively detailed, both presentations touch on evaluation and monitoring as critical LLMOps concerns. Holland Casino specifically emphasizes investing in realistic evaluation jobs for knowledge bases, particularly when data changes or models are updated. The mention of Bedrock's evaluation capabilities suggests they use AWS-native tools for this assessment.
The emphasis on easy user feedback mechanisms in production suggests a pragmatic approach where automated evaluation is supplemented by real-world user experience reporting, creating a continuous improvement loop.
## Infrastructure and Deployment Patterns
The contrast between approaches illuminates fundamental LLMOps architectural decisions. Sicoob's infrastructure-as-code approach using Kubernetes provides maximum flexibility and control, enables use of any open-source model, supports advanced optimization strategies like spot instances and fine-grained autoscaling, and aligns with existing organizational capabilities and cultural preferences. However, it requires significant operational expertise, ongoing maintenance and updates, and careful management of multiple integrated components.
Holland Casino's managed service approach using Bedrock and Agent Core dramatically reduces operational overhead, provides built-in security and compliance features, enables rapid development and deployment cycles, and benefits from continuous AWS service improvements. However, it offers less flexibility in model selection, can carry higher costs at very large scale, and creates some dependency on the AWS service roadmap and capabilities.
Neither approach is inherently superior—the choice depends on organizational context, existing capabilities, scale requirements, compliance needs, and philosophical preferences around control versus convenience. The presentations effectively demonstrate that both paths can lead to successful production LLMOps in highly regulated environments.
## Conclusion and Industry Implications
This case study provides unusually detailed and honest insights into production LLMOps in regulated industries. The presenters balance enthusiasm for AI capabilities with realistic assessments of challenges, operational requirements, and necessary compromises. The inclusion of two contrasting approaches—infrastructure-centric and managed-service-centric—within the same presentation offers valuable perspective on the range of valid implementation strategies.
The demonstrated ability of both a Brazilian financial cooperative and a Dutch gaming operator to successfully deploy production AI while maintaining strict compliance provides an existence proof that regulatory requirements need not block AI adoption when it is approached thoughtfully: with appropriate architectural patterns, a strong emphasis on security and governance, a choice between infrastructure control and managed services that matches organizational capabilities, and a commitment to responsible AI practices and transparency.
The presentations emphasize that AWS and its ecosystem have matured substantially for LLMOps over the past 18-24 months, with both open-source tooling and managed services now production-ready for regulated industries. The speakers note that organizations should expect continued rapid evolution in models, frameworks, and best practices, requiring architectural flexibility to accommodate change.