Various: Climate Tech Foundation Models for Environmental AI Applications

LLMOps Database

Energy

Various

Company

Various

Title

Climate Tech Foundation Models for Environmental AI Applications

Industry

Energy

Link

https://aws.amazon.com/blogs/machine-learning/how-climate-tech-startups-are-building-foundation-models-with-amazon-sagemaker-hyperpod?tag=soumet-20

Year

2025

Summary (short)

Climate tech startups are leveraging Amazon SageMaker HyperPod to build specialized foundation models that address critical environmental challenges including weather prediction, sustainable material discovery, ecosystem monitoring, and geological modeling. Companies like Orbital Materials and Hum.AI are training custom models from scratch on massive environmental datasets, achieving significant breakthroughs such as tenfold performance improvements in carbon capture materials and the ability to see underwater from satellite imagery. These startups are moving beyond traditional LLM fine-tuning to create domain-specific models with billions of parameters that process multimodal environmental data including satellite imagery, sensor networks, and atmospheric measurements at scale.

## Climate Tech Foundation Models Case Study Overview This case study examines how climate technology startups are building specialized foundation models to address environmental challenges using Amazon SageMaker HyperPod as their primary MLOps infrastructure. The case covers multiple companies including Orbital Materials and Hum.AI, representing a new wave of climate tech companies that have moved beyond traditional LLM fine-tuning to develop custom foundation models trained from scratch on environmental datasets. The climate tech sector has evolved through distinct phases of AI adoption. Initially in early 2023, companies focused on operational optimization using existing LLMs through Amazon Bedrock and fine-tuning on AWS Trainium. The second wave involved building intelligent assistants by fine-tuning models like Llama 7B for specific use cases. The current third wave represents companies building entirely new foundation models specifically designed for environmental applications, processing real-world data rather than text-based datasets. ## Technical Implementation and Architecture ### Orbital Materials: Diffusion Models for Material Discovery Orbital Materials has developed a proprietary AI platform called "Orb" that uses generative AI to design, synthesize, and test new sustainable materials. Their approach replaces traditional trial-and-error laboratory methods with AI-driven design processes. The company built Orb as a diffusion model trained from scratch using SageMaker HyperPod, focusing initially on developing sorbents for carbon capture in direct air capture facilities. The technical achievement is significant - since establishing their laboratory in Q1 2024, Orbital achieved a tenfold improvement in material performance using their AI platform, representing an order of magnitude faster development than traditional approaches. This improvement directly impacts the economics of carbon removal by driving down costs and enabling rapid scale-up of carbon capture technologies. From an LLMOps perspective, Orbital Materials chose SageMaker HyperPod for its integrated management capabilities, describing it as a "one-stop shop for control and monitoring." The platform's deep health checks for stress testing GPU instances allowed them to reduce total cost of ownership by automatically swapping out faulty nodes. The automatic node replacement and training restart from checkpoints freed up significant engineering time that would otherwise be spent managing infrastructure failures. The SageMaker HyperPod monitoring agent provides comprehensive oversight, continually detecting memory exhaustion, disk failures, GPU anomalies, kernel deadlocks, container runtime issues, and out-of-memory crashes. Based on the specific issue detected, the system either replaces or reboots nodes automatically, ensuring training continuity without manual intervention. With the launch of SageMaker HyperPod on Amazon EKS, Orbital established a unified control plane managing both CPU-based workloads and GPU-accelerated tasks within a single Kubernetes cluster. This architectural approach eliminates the complexity of managing separate clusters for different compute resources, significantly reducing operational overhead. The integration with Amazon CloudWatch Container Insights provides enhanced observability, collecting and aggregating metrics and logs from containerized applications with detailed performance insights down to the container level. ### Hum.AI: Hybrid Architecture for Earth Observation Hum.AI represents another compelling example of climate tech foundation model development, building generative AI models that provide intelligence about the natural world. Their platform enables tracking and prediction of ecosystems and biodiversity, with applications including coastal ecosystem restoration and biodiversity protection. The company works with coastal communities to restore ecosystems and improve biodiversity outcomes. The technical architecture employed by Hum.AI is particularly sophisticated, utilizing a variational autoencoder (VAE) and generative adversarial network (GAN) hybrid design specifically optimized for satellite imagery analysis. This encoder-decoder model transforms satellite data into a learned latent space while the decoder reconstructs imagery after processing, maintaining consistency across different satellite sources. The discriminator network provides both adversarial training signals and feature-wise reconstruction metrics. This architectural approach preserves important ecosystem details that would typically be lost with traditional pixel-based comparisons, particularly for underwater environments where water reflections interfere with visibility. The company achieved a breakthrough capability to see underwater from space for the first time, overcoming historical challenges posed by water reflections. Hum.AI trains their models on 50 years of historic satellite data, amounting to thousands of petabytes of information. Processing this massive dataset required the scalable infrastructure provided by SageMaker HyperPod. The distributed training approach simultaneously optimizes both VAE and GAN objectives across multiple GPUs, paired with the auto-resume feature that automatically restarts training from the latest checkpoint, providing continuity even through node failures. The company leveraged comprehensive observability features through Amazon Managed Service for Prometheus and Amazon Managed Service for Grafana for metric tracking. Their distributed training monitoring included dashboards for cluster performance, GPU metrics, network traffic, and storage operations. This extensive monitoring infrastructure enabled optimization of training processes and maintained high resource utilization throughout model development. ## LLMOps Infrastructure and Operational Excellence ### SageMaker HyperPod Capabilities The case study demonstrates several critical LLMOps capabilities that SageMaker HyperPod provides for foundation model development. The platform removes undifferentiated heavy lifting for climate tech startups, enabling them to focus on model development rather than infrastructure management. The service provides deep infrastructure control optimized for processing complex environmental data, featuring secure access to Amazon EC2 instances and seamless integration with orchestration tools including Slurm and Amazon EKS. The intelligent resource management capabilities prove particularly valuable for climate modeling applications, automatically governing task priorities and resource allocation while reducing operational overhead by up to 40%. This efficiency is crucial for climate tech startups processing vast environmental datasets, as the system maintains progress through checkpointing while ensuring critical climate modeling workloads receive necessary resources. The platform includes a library of over 30 curated model training recipes that accelerate development, allowing teams to begin training environmental models in minutes rather than weeks. Integration with Amazon EKS provides robust fault tolerance and high availability, essential for maintaining continuous environmental monitoring and analysis. ### Distributed Training and Fault Tolerance Both companies highlighted the critical importance of fault tolerance in their foundation model training. Hum.AI's CEO Kelly Zheng emphasized that SageMaker HyperPod "was the only service out there where you can continue training through failure." The ability to train larger models faster through large-scale clusters and redundancy offered significant advantages over alternative approaches. The automatic hot-swapping of GPUs when failures occur saves thousands of dollars in lost progress between checkpoints. The SageMaker HyperPod team provided direct support to help set up and execute large-scale training rapidly and easily, demonstrating the importance of expert support in complex foundation model development projects. The fault tolerance mechanisms include sophisticated checkpointing strategies that enable training to resume from the exact point of failure, rather than requiring restarts from the beginning. This capability is particularly crucial for foundation models that may require weeks or months of training time on massive datasets. ### Resource Optimization and Cost Management The case study demonstrates several approaches to resource optimization and cost management in foundation model training. SageMaker HyperPod's flexible training plans allow organizations to specify completion dates and resource requirements while automatically optimizing capacity for complex environmental data processing. The system's ability to suggest alternative plans provides optimal resource utilization for computationally intensive climate modeling tasks. Support for next-generation AI accelerators such as AWS Trainium chips, combined with comprehensive monitoring tools, provides climate tech startups with sustainable and efficient infrastructure for developing sophisticated environmental solutions. This enables organizations to focus on their core mission of addressing climate challenges while maintaining operational efficiency and environmental responsibility. ## Sustainable Computing Practices Climate tech companies demonstrate particular awareness of sustainable computing practices, which aligns with their environmental mission. Key approaches include meticulous monitoring and optimization of energy consumption during computational processes. By adopting efficient training strategies, such as reducing unnecessary training iterations and employing energy-efficient algorithms, startups significantly lower their carbon footprint. The integration of renewable energy sources to power data centers plays a crucial role in minimizing environmental impact. AWS has committed to making the cloud the cleanest and most energy-efficient way to run customer infrastructure, achieving 100% renewable energy matching across operations seven years ahead of the original 2030 timeline. Companies are implementing carbon-aware computing principles, scheduling computational tasks to coincide with periods of low carbon intensity on the grid. This practice ensures that energy used for computing has lower environmental impact while promoting cost efficiency and resource conservation. ## Model Architecture Trends and Technical Innovations The case study reveals several important trends in foundation model architecture for climate applications. Unlike language-based models with hundreds of billions of parameters, climate tech startups are building smaller, more focused models with just a few billion parameters. This approach results in faster and less expensive training while maintaining effectiveness for specific environmental applications. The top use cases for climate foundation models include weather prediction trained on historic weather data for hyperaccurate, hyperlocal predictions; sustainable material discovery using scientific data to invent new sustainable materials; natural ecosystem analysis combining satellite, lidar, and ground sensor data; and geological modeling for optimizing geothermal and mining operations. Multimodal data integration represents a critical technical challenge, requiring sophisticated attention mechanisms for spatial-temporal data and reinforcement learning approaches. The complexity of environmental data demands robust data infrastructure and specialized model architectures that can effectively process and analyze diverse data types simultaneously. ## Partnership and Ecosystem Development The case study demonstrates the importance of deep partnerships in foundation model development. AWS and Orbital Materials established a multiyear partnership where Orbital builds foundation models with SageMaker HyperPod while developing new data center decarbonization and efficiency technologies. This creates a beneficial flywheel effect where both companies advance their respective goals. Orbital Materials is making their open-source AI model "Orb" available to AWS customers through Amazon SageMaker JumpStart and AWS Marketplace, marking the first AI-for-materials model available on AWS platforms. This enables AWS customers working on advanced materials and technologies including semiconductors, batteries, and electronics to access accelerated research and development within a secure and unified cloud environment. ## Conclusion and Future Implications This case study demonstrates how climate tech startups are leveraging advanced LLMOps infrastructure to build specialized foundation models that address critical environmental challenges. The success of companies like Orbital Materials and Hum.AI illustrates the potential for domain-specific foundation models to achieve breakthrough capabilities that were previously impossible with traditional approaches. The technical achievements - including tenfold improvements in material performance and the ability to see underwater from satellite imagery - represent significant advances that could have substantial environmental impact at scale. The LLMOps infrastructure provided by SageMaker HyperPod enables these breakthroughs by handling the complexity of distributed training, fault tolerance, and resource optimization, allowing companies to focus on innovation rather than infrastructure management. The case study also highlights the evolution of AI applications in climate tech, moving from operational optimization and intelligent assistants to custom foundation models trained on environmental datasets. This progression represents a maturing field that is developing increasingly sophisticated technical solutions to address the climate crisis through advanced artificial intelligence capabilities.

Start your new ML Project today with ZenML Pro

Join 1,000s of members already deploying models with ZenML.

Learn more

Try Free