Skip to content

Latest commit

 

History

History
90 lines (63 loc) · 5.26 KB

README.md

File metadata and controls

90 lines (63 loc) · 5.26 KB

Deep Learning Climate Segmentation Benchmark

PyTorch implementation for the climate segmentation benchmark, based on the Exascale Deep Learning for Climate Analytics codebase here: https://github.com/azrael417/ClimDeepLearn, and the paper: https://arxiv.org/abs/1810.01993

Dataset

The dataset for this benchmark comes from CAM5 [1] simulations and is hosted at NERSC. The samples are stored in HDF5 files with input images of shape (768, 1152, 16) and pixel-level labels of shape (768, 1152). The labels have three target classes (background, atmospheric river, tropical cycline) and were produced with TECA [2].

The current recommended way to get the data is to use GLOBUS and the following globus endpoint:

https://app.globus.org/file-manager?origin_id=0b226e2c-4de0-11ea-971a-021304b0cca7&origin_path=%2F

The dataset folder contains a README with some technical description of the dataset and an All-Hist folder containing all of the data files.

Preprocessing

The dataset is already pre-split in train/val/test and a summary statistics file is shipped as well. Note that the test split is never used by the benchmark (i.e. not required for submission) but can be used to perform additional crosschecks in terms of generalization and training stability.

The split was generated by a selection of scripts which achieves this. The splitting scripts are under src/utils. If you want to manually split and summarize the dataset, feel free to use this scripts, but note that for a valid benchmark submission the dataset split and preprocessing has to be the same as in the reference implementation.

For splitting the dataset, please change the lines 5 and 6 (inputdir and outputdir) in split_data.py accordingly. The first variable should specify the absolute path to the full dataset, the second variable specifies the parent directory of where the train/validation/test splits end up. Instead of copying the files, symbolic links will be created. Therefore, if you plan to run the code from a container or system with different mount points than those used for the splitting, the links might be invalid and files not found. In this case, perform the splitting in the same environment used for the runs later.

For summarizing the dataset (i.e. computing summary statistics for input normalization), use script summarize_data.py in the same directory. Please modify line 85, data_path_prefix accordingly. It should point to the parent directory which hosts all the split, i.e. is equal to the output_dir from the above mentioned splitting script. Note that the summary script uses mpi4py for distributed computing, as the whole summarization on a single CPU can take a few hours. Once the stats.h5 file is created, place it inside the training, test and validation directories.

Before you run

Make sure you have a working python environment with pytorch, h5py, basemap and wandb setup. If you want to use learning rate warmup, you must also install the warmup-scheduler package available at https://github.com/ildoonet/pytorch-gradual-warmup-lr.

The training uses Weights & Biases (WandB/W&B, https://app.wandb.ai) as logging facility. In order to use it, please sign up, log in and create a new project. Create a file named .wandbirc containing the user login and the API key as follows:

<login> <API key>

Place this file in a directory accessible by the workers.

How to run the benchmark

Submission scripts are in src/deepCam/run_scripts.

Running at NERSC

To submit to the Cori KNL system, set up a conda env called mlperf_deepcam which contains all the prereqs, such as h5py, wandb and basemap. Please edit the entries

export PROJ_LIB=/global/homes/t/tkurth/.conda/envs/mlperf_deepcam/share/basemap
export PYTHONPATH=/global/homes/t/tkurth/.conda/envs/mlperf_deepcam/lib/python3.7/site-packages:${PYTHONPATH}

in src/deepCam/run_scripts/run_training_cori.sh to point to the correct paths and add

--wandb_certdir <my-cert-dir>

to the arguments passed to the python training script. Here, <my-cert-dir> should point to the directory which contains the .wandbirc file created before. Then run

# This example runs on 64 nodes.
cd src/deepCam/run_scripts
sbatch -N 64 run_training_cori.sh

Using Docker

The implementation comes with a Dockerfile optimized for NVIDIA DGX-2 workstations but usable on other NVIDIA multi-gpu systems. Use the Dockerfile docker/Dockerfile.train to build the container and the script src/deepCam/run_scripts/run_training_dgx2.sh for training. Please create a file no-git/wandb_cert.key formatted as mentioned above before building the image so that the builder can properly place your WandB credentials inside the container.

References

  1. Wehner, M. F., Reed, K. A., Li, F., Bacmeister, J., Chen, C.-T., Paciorek, C., Gleckler, P. J., Sperber, K. R., Collins, W. D., Gettelman, A., et al.: The effect of horizontal resolution on simulation quality in the Community Atmospheric Model, CAM5. 1, Journal of Advances in Modeling Earth Systems, 6, 980-997, 2014.
  2. Prabhat, Byna, S., Vishwanath, V., Dart, E., Wehner, M., Collins, W. D., et al.: TECA: Petascale pattern recognition for climate science, in: International Conference on Computer Analysis of Images and Patterns, pp. 426-436, Springer, 2015b.