PyTorch implementation for the climate segmentation benchmark, based on the Exascale Deep Learning for Climate Analytics codebase here: https://github.com/azrael417/ClimDeepLearn, and the paper: https://arxiv.org/abs/1810.01993
The dataset for this benchmark comes from CAM5 [1] simulations and is hosted at NERSC. The samples are stored in HDF5 files with input images of shape (768, 1152, 16) and pixel-level labels of shape (768, 1152). The labels have three target classes (background, atmospheric river, tropical cycline) and were produced with TECA [2].
The current recommended way to get the data is to use GLOBUS and the following globus endpoint:
https://app.globus.org/file-manager?origin_id=0b226e2c-4de0-11ea-971a-021304b0cca7&origin_path=%2F
The dataset folder contains a README with some technical description of the dataset and an All-Hist folder containing all of the data files.
The dataset is already pre-split in train/val/test and a summary statistics file is shipped as well. Note that the test split is never used by the benchmark (i.e. not required for submission) but can be used to perform additional crosschecks in terms of generalization and training stability.
The split was generated by a selection of scripts which achieves this. The splitting scripts are under src/utils
. If you want to manually split and summarize the dataset, feel free to use this scripts, but note that for a valid benchmark submission the dataset split and preprocessing has to be the same as in the reference implementation.
For splitting the dataset, please change the lines 5 and 6 (inputdir
and outputdir
) in split_data.py
accordingly. The first variable should specify the absolute path to the full dataset, the second variable specifies the parent directory of where the train/validation/test splits end up. Instead of copying the files, symbolic links will be created. Therefore, if you plan to run the code from a container or system with different mount points than those used for the splitting, the links might be invalid and files not found. In this case, perform the splitting in the same environment used for the runs later.
For summarizing the dataset (i.e. computing summary statistics for input normalization), use script summarize_data.py
in the same directory. Please modify line 85, data_path_prefix
accordingly. It should point to the parent directory which hosts all the split, i.e. is equal to the output_dir
from the above mentioned splitting script. Note that the summary script uses mpi4py
for distributed computing, as the whole summarization on a single CPU can take a few hours. Once the stats.h5
file is created, place it inside the training, test and validation directories.
Make sure you have a working python environment with pytorch
, h5py
, basemap
and wandb
setup.
If you want to use learning rate warmup, you must also install the warmup-scheduler package
available at https://github.com/ildoonet/pytorch-gradual-warmup-lr.
The training uses Weights & Biases (WandB/W&B, https://app.wandb.ai) as logging facility.
In order to use it, please sign up, log in and create a new project.
Create a file named .wandbirc
containing the user login and the API key as follows:
<login> <API key>
Place this file in a directory accessible by the workers.
Submission scripts are in src/deepCam/run_scripts
.
To submit to the Cori KNL system, set up a conda env called
mlperf_deepcam
which contains all the prereqs, such as h5py
, wandb
and basemap
.
Please edit the entries
export PROJ_LIB=/global/homes/t/tkurth/.conda/envs/mlperf_deepcam/share/basemap
export PYTHONPATH=/global/homes/t/tkurth/.conda/envs/mlperf_deepcam/lib/python3.7/site-packages:${PYTHONPATH}
in src/deepCam/run_scripts/run_training_cori.sh
to point to the correct paths and add
--wandb_certdir <my-cert-dir>
to the arguments passed to the python training script. Here, <my-cert-dir>
should point to the directory which contains the .wandbirc
file created before.
Then run
# This example runs on 64 nodes.
cd src/deepCam/run_scripts
sbatch -N 64 run_training_cori.sh
The implementation comes with a Dockerfile optimized for NVIDIA DGX-2 workstations but usable on
other NVIDIA multi-gpu systems. Use the Dockerfile
docker/Dockerfile.train
to build the container and the script src/deepCam/run_scripts/run_training_dgx2.sh
for training. Please create a file no-git/wandb_cert.key
formatted as mentioned above before building the image so that the builder can properly place your WandB credentials inside the container.
- Wehner, M. F., Reed, K. A., Li, F., Bacmeister, J., Chen, C.-T., Paciorek, C., Gleckler, P. J., Sperber, K. R., Collins, W. D., Gettelman, A., et al.: The effect of horizontal resolution on simulation quality in the Community Atmospheric Model, CAM5. 1, Journal of Advances in Modeling Earth Systems, 6, 980-997, 2014.
- Prabhat, Byna, S., Vishwanath, V., Dart, E., Wehner, M., Collins, W. D., et al.: TECA: Petascale pattern recognition for climate science, in: International Conference on Computer Analysis of Images and Patterns, pp. 426-436, Springer, 2015b.