## Overview
The GENIAC (Generative AI Accelerator Challenge) program represents a significant national-level initiative by Japan's Ministry of Economy, Trade and Industry (METI) to accelerate foundation model development. AWS served as the cloud provider for GENIAC's second cycle, supporting 12 participating organizations with massive compute resources and technical guidance. This case study provides valuable insights into the operational challenges and solutions for large-scale foundation model training in production environments.
The program successfully deployed 127 Amazon EC2 P5 instances (NVIDIA H100 Tensor Core GPU servers) and 24 Amazon EC2 Trn1 instances (AWS Trainium servers) in a single day, leading to the successful training of multiple large-scale models, including Stockmark-2-100B-Instruct-beta, Llama 3.1 Shisa V2 405B, and Llama-3.1-Future-Code-Ja-8B. The key learning, however, was that allocating thousands of accelerators was merely the starting point; the real challenge lay in architecting reliable systems and overcoming distributed training obstacles.
## Organizational Structure and Cross-Functional Teams
One of the most critical insights from GENIAC was that running a multi-organization, national-scale machine learning initiative requires coordinated support across diverse internal teams. AWS established a virtual team structure that brought together account teams, specialist Solutions Architects, and service teams in a layered approach to provide scalable technical guidance.
The engagement model consisted of multiple layers:

- Customers (business and technical leads, ML and platform engineers) responsible for executing training workloads.
- AWS account teams (Solutions Architects and Account Managers) managing relationships and communication flows.
- The World Wide Specialist Organization (WWSO) Frameworks team, specializing in large-scale ML workloads with a focus on core HPC and container services.
- Service teams for Amazon EC2, S3, FSx, and SageMaker HyperPod, providing deep technical expertise.
This structure enabled technical guidance to scale effectively across complex foundation model training workloads. The WWSO Frameworks team worked closely with Lead Solutions Architects designated specifically to support GENIAC engagements; these Lead SAs served as extensions of the specialist team, working directly with customers and interfacing with framework specialists for in-depth technical discussions.
## Communication and Documentation Infrastructure
AWS established robust communication channels that proved critical to success. A dedicated internal Slack channel for GENIAC program coordination connected AWS account teams with lead SAs, enabling real-time troubleshooting, knowledge sharing, and rapid escalation of customer issues. An external Slack channel bridged AWS teams with customers, creating a collaborative environment for questions, insights, and immediate support, significantly reducing resolution times and fostering a community of practice.
Comprehensive workload tracking documents captured the details of each customer's training implementation (model architecture, distributed training frameworks, software components) alongside infrastructure specifications (instance types and quantities, cluster configurations for AWS ParallelCluster or SageMaker HyperPod deployments, and storage solutions including FSx for Lustre and S3). The same tracking system maintained a chronological history of customer interactions and support cases.
Weekly review meetings tracked outstanding customer inquiries and technical issues, enabling team members to share lessons learned and apply them across engagements. This structured approach helped identify common challenges, such as a misconfigured NCCL library degrading multi-node performance, share solutions across teams, and continuously refine the engagement model.
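To illustrate, the sketch below shows the kind of pre-flight check that catches such misconfigurations. The variable names are standard NCCL and EFA settings, but the expected values are assumptions for a typical EFA-enabled GPU cluster, not the actual GENIAC configuration.

```python
"""Sanity-check NCCL/EFA environment variables before a multi-node run.

A minimal sketch: the expected values below are common settings for
EFA-enabled GPU instances, not the exact GENIAC configuration.
"""
import os

# Commonly recommended settings for NCCL over EFA (values are assumptions;
# verify against the cluster's reference architecture).
EXPECTED = {
    "FI_PROVIDER": "efa",           # use the EFA libfabric provider
    "FI_EFA_USE_DEVICE_RDMA": "1",  # enable GPUDirect RDMA where supported
    "NCCL_DEBUG": "INFO",           # surface NCCL topology/transport logs
}

def check_env() -> list[str]:
    """Return human-readable warnings for missing or unexpected settings."""
    warnings = []
    for key, expected in EXPECTED.items():
        actual = os.environ.get(key)
        if actual is None:
            warnings.append(f"{key} is unset (expected '{expected}')")
        elif actual != expected:
            warnings.append(f"{key}={actual} (expected '{expected}')")
    return warnings

if __name__ == "__main__":
    for line in check_env() or ["NCCL/EFA environment looks consistent"]:
        print(line)
```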
## Reference Architectures and Infrastructure Templates
Rather than having each team configure clusters from scratch, AWS created pre-validated templates and automation for two main approaches: AWS ParallelCluster for user-managed HPC clusters and SageMaker HyperPod for managed, resilient cluster services. These reference architectures covered the full stack, from compute, network, and storage to container environments and monitoring, and were delivered as a GitHub repository so teams could deploy them with minimal friction.
AWS ParallelCluster proved invaluable as an open source cluster management tool for multi-node GPU training, automating the setup of Slurm-based HPC clusters from a simple YAML configuration. SageMaker HyperPod provided a managed alternative, provisioning GPU and Trainium clusters for large-scale ML and integrating with orchestrators such as Slurm or Kubernetes for scheduling, with additional managed functionality around cluster resiliency.
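As a rough illustration of that workflow, the sketch below embeds a minimal Slurm cluster definition and submits it with the `pcluster` CLI. The subnet IDs, key name, and capacity numbers are placeholders, and the field layout follows the public ParallelCluster 3.x schema rather than the exact GENIAC reference templates.

```python
"""Sketch: define and launch a Slurm-based GPU cluster with AWS ParallelCluster.

Hypothetical values throughout (subnet IDs, key name, instance counts); the YAML
schema follows ParallelCluster 3.x documentation, not the GENIAC templates.
"""
import pathlib
import subprocess

CLUSTER_CONFIG = """\
Region: ap-northeast-1
Image:
  Os: ubuntu2204
HeadNode:
  InstanceType: m5.8xlarge
  Networking:
    SubnetId: subnet-0123456789abcdef0   # placeholder
  Ssh:
    KeyName: my-keypair                  # placeholder
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu
      ComputeResources:
        - Name: p5
          InstanceType: p5.48xlarge
          MinCount: 0
          MaxCount: 16                   # placeholder capacity
          Efa:
            Enabled: true                # EFA for multi-node NCCL traffic
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0     # placeholder
        PlacementGroup:
          Enabled: true                  # keep nodes network-adjacent
"""

if __name__ == "__main__":
    config_path = pathlib.Path("cluster-config.yaml")
    config_path.write_text(CLUSTER_CONFIG)
    # Requires the aws-parallelcluster CLI to be installed and AWS credentials configured.
    subprocess.run(
        ["pcluster", "create-cluster",
         "--cluster-name", "geniac-demo",
         "--cluster-configuration", str(config_path)],
        check=True,
    )
```

A HyperPod deployment follows a similar pattern, with the cluster definition passed to the SageMaker APIs instead of the `pcluster` CLI.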
The reference architecture combined compute, networking, storage, and monitoring into an integrated system. The base infrastructure stack was available as an AWS CloudFormation template that provisioned the complete infrastructure with minimal effort, automatically configuring a dedicated VPC with optimized networking settings and a high-performance FSx for Lustre file system for training data, complemented by optional FSx for OpenZFS support for shared home directories.
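For illustration, deploying such a base stack programmatically might look like the boto3 sketch below; the template URL and parameter names are placeholders rather than the actual GENIAC templates.

```python
"""Sketch: deploy a base infrastructure stack (VPC, FSx for Lustre) from a
CloudFormation template. Template URL and parameter names are placeholders."""
import boto3

cfn = boto3.client("cloudformation", region_name="ap-northeast-1")

response = cfn.create_stack(
    StackName="ml-base-infra",
    # Placeholder location; the real templates live in the reference-architecture repo.
    TemplateURL="https://example-bucket.s3.amazonaws.com/base-infra.yaml",
    Parameters=[
        {"ParameterKey": "AvailabilityZone", "ParameterValue": "ap-northeast-1a"},
    ],
    Capabilities=["CAPABILITY_IAM"],  # allow the stack to create IAM roles
)
print("Stack creation started:", response["StackId"])
```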
The architecture employed a hierarchical storage approach that balanced performance and cost-effectiveness: S3 provided durable, long-term storage for training data and checkpoints and was linked to the Lustre file system through a data repository association (DRA), enabling automatic, transparent data transfer between S3 and FSx for Lustre for high-performance access without manual copying.
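The sketch below shows how such a DRA might be created with boto3; the file system ID, paths, and bucket name are placeholders.

```python
"""Sketch: link an S3 bucket to an FSx for Lustre file system via a data
repository association (DRA). IDs, paths, and bucket names are placeholders."""
import boto3

fsx = boto3.client("fsx", region_name="ap-northeast-1")

response = fsx.create_data_repository_association(
    FileSystemId="fs-0123456789abcdef0",       # placeholder FSx for Lustre ID
    FileSystemPath="/training-data",           # mount path exposed on the cluster
    DataRepositoryPath="s3://example-training-bucket/datasets/",  # placeholder bucket
    BatchImportMetaDataOnCreate=True,          # import existing S3 metadata up front
    S3={
        # Keep file listings in sync automatically in both directions.
        "AutoImportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]},
        "AutoExportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]},
    },
)
print("DRA created:", response["Association"]["AssociationId"])
```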
## Monitoring and Observability Infrastructure
The monitoring infrastructure combined Amazon Managed Service for Prometheus and Amazon Managed Grafana (or self-managed Grafana on EC2) to provide comprehensive observability. It integrated the DCGM Exporter for GPU metrics and the EFA Exporter for network metrics, enabling continuous, real-time tracking of GPU health, network performance, and training progress.
Grafana dashboards provided automated alerting for anomalies, including a GPU Health Dashboard that surfaced common GPU error conditions: uncorrectable remapped rows, correctable remapped rows, XID error codes, row remap failures, thermal violations, and GPUs missing from nvidia-smi, helping users identify hardware failures quickly.
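As an illustration of how such signals can be consumed programmatically, the sketch below runs an instant PromQL query against a Prometheus endpoint for recent XID errors. The endpoint is a placeholder (an Amazon Managed Service for Prometheus workspace additionally requires SigV4-signed requests), and the metric and label names assume the DCGM Exporter defaults.

```python
"""Sketch: query Prometheus for GPU health signals exported by DCGM Exporter.

The endpoint is a placeholder and the metric/label names assume the
exporter's default DCGM field naming."""
import requests

# Placeholder endpoint; a managed Prometheus workspace has its own query URL.
PROMETHEUS_URL = "http://localhost:9090/api/v1/query"

def query(promql: str) -> list[dict]:
    """Run an instant PromQL query and return the result vector."""
    resp = requests.get(PROMETHEUS_URL, params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

if __name__ == "__main__":
    # Any XID error reported in the last 10 minutes usually warrants a look
    # at the affected node before the next training run.
    for sample in query("max_over_time(DCGM_FI_DEV_XID_ERRORS[10m]) > 0"):
        labels = sample["metric"]
        print(f"XID error on {labels.get('Hostname', '?')} "
              f"GPU {labels.get('gpu', '?')}: value={sample['value'][1]}")
```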
## Training and Enablement Programs
Even the best reference architectures are only valuable if teams know how to use them. AWS Japan and the WWSO Frameworks team conducted a mass enablement session for GENIAC Cycle 2 participants, welcoming more than 80 participants for comprehensive lectures, hands-on labs, and group discussions, and earning a CSAT score of 4.75.
The lecture sessions covered infrastructure fundamentals, exploring orchestration options such as AWS ParallelCluster, Amazon EKS, and SageMaker HyperPod, along with software components necessary for building and training large-scale foundation models. Sessions highlighted practical challenges in foundation model development including massive compute requirements, scalable networking, and high-throughput storage, mapping them to appropriate AWS services and best practices.
Best practices sessions taught attendees to set up performance dashboards with Prometheus and Grafana, monitor EFA traffic, and troubleshoot GPU failures using NVIDIA's DCGM toolkit and custom Grafana dashboards based on the Frameworks team's experience managing clusters with 2,000 P5 instances.
The WWSO team prepared workshops for both AWS ParallelCluster and SageMaker HyperPod, providing detailed deployment guides for the reference architecture. Participants conducted hands-on exercises deploying training clusters using Slurm with file systems including FSx for Lustre and FSx for OpenZFS, running multi-node PyTorch distributed training. Workshop segments focused on observability and performance tuning, teaching participants to monitor resource utilization, network throughput (EFA traffic), and system health.
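A minimal sketch of the kind of multi-node job used in those exercises is shown below. The model and training loop are toy placeholders, and the script assumes it is launched with torchrun (for example from an sbatch script) so that the rendezvous environment variables are set.

```python
"""Minimal multi-node PyTorch DDP sketch of the kind used in hands-on labs.

Model, data, and hyperparameters are toy placeholders; launch one process
per GPU with torchrun so RANK, LOCAL_RANK, and WORLD_SIZE are set."""
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main() -> None:
    # NCCL is the backend; on EFA-enabled instances it carries inter-node GPU traffic.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)

    model = torch.nn.Linear(1024, 1024).to(device)   # toy stand-in for a real model
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

    for step in range(10):                            # toy training loop
        optimizer.zero_grad()
        batch = torch.randn(32, 1024, device=device)  # synthetic batch
        loss = ddp_model(batch).square().mean()
        loss.backward()                               # gradients all-reduced across ranks
        optimizer.step()
        if dist.get_rank() == 0:
            print(f"step {step} loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched, for example, with `torchrun --nnodes 2 --nproc_per_node 8 --rdzv_backend c10d --rdzv_endpoint $HEAD_NODE_IP:29500 train.py` on each node.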
Following the workshops, customers participated in onboarding sessions: structured, hands-on meetings with Lead SAs focused on customer-specific cluster deployments tailored to each team's unique use case. Lead SAs worked directly with teams to deploy training environments, validate the setup using NCCL tests, and resolve technical issues in real time.
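A hedged sketch of such a validation step is shown below: it launches the all_reduce_perf binary from the open source nccl-tests suite through Slurm. The install path and node counts are placeholders, and the flag meanings follow the public nccl-tests documentation.

```python
"""Sketch: run nccl-tests all_reduce_perf across nodes via Slurm to validate a
freshly deployed cluster. Binary path and node/GPU counts are placeholders."""
import subprocess

NODES = 2
GPUS_PER_NODE = 8

cmd = [
    "srun",
    "--nodes", str(NODES),
    "--ntasks-per-node", str(GPUS_PER_NODE),   # one rank per GPU
    "/opt/nccl-tests/build/all_reduce_perf",   # placeholder install path
    "-b", "8",        # start at 8 bytes
    "-e", "4G",       # sweep up to 4 GiB messages
    "-f", "2",        # double the message size each step
    "-g", "1",        # one GPU per rank
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
# The busbw column in the output should approach the expected EFA bandwidth.
print(result.stdout)
```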
## Technical Achievements and Production Results
The program achieved significant technical milestones, with 12 customers launching 127 P5 instances and 24 Trn1 instances across multiple AWS Regions, including Asia Pacific (Tokyo), in a single day. Multiple large language models and custom models were trained successfully, including a 32B multimodal model on Trainium and a 405B tourism-focused multilingual model.
Customer feedback highlighted the program's impact. AI Inside's Executive Officer and CTO, Takuma Inoue, noted significant improvements in processing accuracy and cost efficiency: the team applied two-stage reasoning and autonomous learning with an SLM and LLM for regular items, and visual learning with a VLM using 100,000 synthetic data samples for detailed items, using Amazon EC2 P5 instances to enhance research and development efficiency.
Future's Chief Research Engineer Makoto Morishita emphasized that AWS's wide range of tools, such as AWS ParallelCluster, and strong support from AWS Solutions Architects enabled the team to start large-scale training quickly when developing large-scale language models specialized for Japanese and software development.
## Lessons Learned and Future Implications
The GENIAC program demonstrated that training foundation models at scale is fundamentally an organizational challenge, not merely a hardware one. Through structured support, reproducible templates, and cross-functional engagement teams, even small teams can successfully execute massive workloads in the cloud.
Key insights include:

- The critical importance of structured organizational support beyond raw compute resources
- The value of pre-validated reference architectures and automation templates
- The necessity of comprehensive monitoring and observability infrastructure
- The impact of structured training and enablement programs
- The benefits of dedicated communication channels and documentation systems
AWS has begun preparations for the next cycle of GENIAC, hosting a comprehensive technical event in Tokyo attended by over 50 participants and underscoring its commitment to supporting scalable, resilient generative AI infrastructure. The event featured intensive workshops on SageMaker HyperPod and Slurm, with participants gaining hands-on experience with multi-node GPU clusters, distributed PyTorch training, and observability tools.
The technical engagement framework established through GENIAC Cycle 2 has provided crucial insights into large-scale foundation model development, with AWS advancing improvements across multiple dimensions: engagement models, technical assets, and implementation guidance. This includes strengthening cross-functional collaboration, systematizing knowledge sharing, enhancing reference architectures and automated training templates, and codifying practical technical workshops and best practices based on lessons learned.
The success of GENIAC serves as a blueprint for enabling organizations to build and scale their AI capabilities effectively, demonstrating that successful large-scale foundation model training requires not just access to powerful hardware, but a comprehensive operational framework encompassing infrastructure automation, monitoring, training, and organizational support structures.