This study tested HPC application performance across three clouds and on-premises HPC. The repository is organized as follows:
- docker: includes container builds for different environments. Containers are shared between environments when possible to reduce redundancy.
- experiments: organized first by cloud, and then by the underlying environment. Each includes a README with the full experiment protocol (and usually the commands to run).
  - Google Cloud includes HPC Toolkit (Compute Engine) and GKE (Kubernetes), each for CPU and GPU.
  - Amazon Web Services includes Parallel Cluster (EC2) and EKS (Kubernetes), each for CPU and GPU.
  - Microsoft Azure includes CycleCloud (VMs) and AKS (Kubernetes), each for CPU and GPU.

Experiment status by environment:

- Microsoft Azure CycleCloud CPU (date)
  - size 32 (abhik done 6 apps 8/28/2024, done milroy 8/30/2024)
  - size 64 (abhik done 6 apps 8/28/2024, done milroy 8/30/2024)
  - size 128 (done milroy 8/30/2024)
  - size 256 (done milroy 8/31/2024)
- Microsoft Azure CycleCloud GPU (date)
  - size 4 (milroy and ani 8/31/2024)
  - size 8 (milroy and ani 8/31/2024)
  - size 16 (milroy and ani 8/31/2024)
  - size 32 (milroy and ani 8/31/2024)
- AWS GPU Parallel Cluster
  - size 32 (not going to do, could not build image)
  - size 64 (not going to do, could not build image)
  - size 128 (not going to do, could not build image)
  - size 256 (not going to do, could not build image)
- AWS CPU Parallel Cluster
  - size 32 (done milroy 8/29/2024-8/30/2024)
  - size 64 (done ani 8/29/2024-8/30/2024)
  - size 128 (done ani 8/29/2024-8/30/2024)
  - size 256 (done ani 8/29/2024-8/30/2024)
- Google Cloud Compute Engine CPU (redone several times due to app configurations)
  - size 32 (vsoch done 8/26/2024)
  - size 64 (vsoch done 8/26/2024)
  - size 128 (vsoch done 8/27/2024)
  - size 256 (vsoch done 8/27/2024)
- Google Compute Engine GPU
  - done on llnl-flux
  - New VM and automation needed with Terraform (vsoch, early 9/2024)
  - size 4 (vsoch 9/6/2024)
  - size 8 (vsoch 9/7/2024)
  - size 16 (vsoch 9/8/2024)
  - size 32 (vsoch 9/8/2024)
  - quicksilver and osu all reduce need runs at all sizes (vsoch 9/9/2024)
- Microsoft Azure AKS CPU
  - size 32 (vsoch done 8/24/2024), redone with placement (vsoch 8/28/2024)
  - size 64 (vsoch done 8/24/2024), redone with placement (vsoch 8/28/2024)
  - size 128 (vsoch done 8/28/2024)
  - size 256 (vsoch TBA 8/29/2024)
- Google Cloud GKE CPU
  - size 32 (vsoch done 8/21/2024)
  - size 64 (vsoch done 8/22/2024)
  - size 128 (vsoch done 8/23/2024)
  - size 256 (vsoch done 8/23/2024)
- AWS CPU EKS
  - size 32 (vsoch done 8/21/2024-8/22/2024)
  - size 64 (vsoch done 8/22/2024)
  - size 128 (vsoch done 8/22/2024)
  - size 256 (vsoch done 8/31/2024)
- AWS GPU EKS
  - size 4 (done vsoch 8/26/2024, milroy lammps/osu 8/27/2024)
  - size 8 (done vsoch 8/26/2024, milroy lammps/osu 8/27/2024)
  - size 16 (done vsoch, milroy lammps/osu 8/27/2024)
  - size 32 (not possible, could not get more than 16 nodes from AWS)
- Google Cloud GKE GPU
  - size 4 (done vsoch 8/29/2024)
  - size 8 (done vsoch TBA 8/29/2024)
  - size 16 (done vsoch 8/30/2024)
  - size 32 (done vsoch 8/30/2024)
  - milroy figured out installing latest drivers, which was the key to success here!
- Microsoft Azure AKS GPU
  - size 4 (done vsoch 8/31/2024)
  - size 8 (done vsoch 8/31/2024)
  - size 16 (done vsoch 8/31/2024)
  - size 32 (done vsoch 8/31/2024)
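
The matrix tracked in the list above (cloud, environment, CPU or GPU, cluster size) can be enumerated programmatically, which may help when checking coverage. The following is a minimal sketch, not part of the repository: the short keys (`google`, `parallel-cluster`, and so on) and the function name `experiment_matrix` are hypothetical labels chosen for illustration, and the sizes are the common CPU (32 to 256) and GPU (4 to 32) node counts listed above. It excludes only the combinations the list marks as not run; it does not encode completion dates or entries still marked TBA.

```python
from itertools import product

# Node counts per device type, taken from the status list.
SIZES = {"cpu": [32, 64, 128, 256], "gpu": [4, 8, 16, 32]}

# Cloud -> environments (hypothetical short keys, for illustration only).
ENVIRONMENTS = {
    "google": ["compute-engine", "gke"],
    "aws": ["parallel-cluster", "eks"],
    "azure": ["cyclecloud", "aks"],
}

# Whole combinations that were not run (image could not be built).
SKIPPED_COMBINATIONS = {("aws", "parallel-cluster", "gpu")}

# Individual sizes that were not run (AWS provided at most 16 GPU nodes).
SKIPPED_SIZES = {("aws", "eks", "gpu"): {32}}


def experiment_matrix():
    """Yield (cloud, environment, device, size) for every planned run."""
    for cloud, environments in ENVIRONMENTS.items():
        for environment, device in product(environments, SIZES):
            key = (cloud, environment, device)
            if key in SKIPPED_COMBINATIONS:
                continue
            skipped = SKIPPED_SIZES.get(key, set())
            for size in SIZES[device]:
                if size not in skipped:
                    yield cloud, environment, device, size


if __name__ == "__main__":
    for run in experiment_matrix():
        print(*run)
```

Keeping the skipped entries in separate tables, rather than deleting them from `SIZES`, keeps the intended design visible alongside what could actually be run.
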
HPCIC DevTools is distributed under the terms of the MIT license. All new contributions must be made under this license.
See LICENSE, COPYRIGHT, and NOTICE for details.
SPDX-License-Identifier: (MIT)
LLNL-CODE-842614