Merge pull request #61 from KatherLab/exp_odelia_multisite_revision
Exp odelia multisite revision
Ultimate-Storm authored Feb 28, 2024
2 parents 856bb9e + e2a93e3 commit 3a3a131
Showing 54 changed files with 769 additions and 3,867 deletions.
37 changes: 37 additions & 0 deletions DUKE_dataset_preparation.md
@@ -0,0 +1,37 @@
## Data Preparation
### Notes
This will take a long time. Run either command: `get_dataset_gdown.sh` is recommended if you have not yet completed step 2, while `get_dataset_scp.sh` is recommended after step 2 is done.
`get_dataset_gdown.sh` will download the dataset from Google Drive.
```sh
$ sh workspace/automate_scripts/sl_env_setup/get_dataset_gdown.sh
```
The `[-s sentinel_ip]` flag is only necessary for `get_dataset_scp.sh`, which will download the dataset from the sentinel node.
```sh
$ sh workspace/automate_scripts/sl_env_setup/get_dataset_scp.sh -s <sentinel_ip>
```

### Instructions

1. Make sure you have downloaded the Duke dataset.

2. Create the folder `WP1` and, inside it, the folders `test` and `train_val`:
```bash
mkdir workspace/<workspace-name>/user/data-and-scratch/data/WP1
mkdir workspace/<workspace-name>/user/data-and-scratch/data/WP1/{test,train_val}
```
3. Search for your institution in the [Node list](#nodelist) and note the data series in the "Data" column.

4. Prepare the clinical tables
```sh
cp workspace/<workspace-name>/user/data-and-scratch/data/*.xlsx workspace/<workspace-name>/user/data-and-scratch/data/WP1
```

5. Copy the NIfTI folders for cases 801 to 922 from the feature folder into `WP1/test`
```sh
cp -r workspace/<workspace-name>/user/data-and-scratch/data/odelia_dataset_only_sub/{801..922}_{right,left} workspace/<workspace-name>/user/data-and-scratch/data/WP1/test
```

6. Copy the NIfTI folders for the data series you noted (from `<first_number>` to `<second_number>`) from the feature folder into `WP1/train_val`; an optional sanity check is sketched after this step
```sh
cp -r workspace/<workspace-name>/user/data-and-scratch/data/odelia_dataset_only_sub/{<first_number>..<second_number>} workspace/<workspace-name>/user/data-and-scratch/data/WP1/train_val
```
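
Optionally, the sketch below (using the same `<workspace-name>` placeholder as above) counts what landed in each folder so you can confirm the copies succeeded; the expected test count follows from step 5, and this is an illustrative check rather than part of the official setup.
```sh
# Optional sanity check (a sketch; assumes the WP1 layout created above).
DATA=workspace/<workspace-name>/user/data-and-scratch/data/WP1
ls -d "$DATA"/test/*_left "$DATA"/test/*_right 2>/dev/null | wc -l   # up to 244 folders (cases 801-922, left + right)
ls -d "$DATA"/train_val/*/ 2>/dev/null | wc -l                       # should match the data series you noted
ls "$DATA"/*.xlsx                                                    # clinical tables copied in step 4
```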
176 changes: 112 additions & 64 deletions README.md
@@ -2,7 +2,7 @@

[![standard-readme compliant](https://img.shields.io/badge/readme%20style-standard-brightgreen.svg?style=flat-square)](https://github.com/RichardLitt/standard-readme)

Swarm learning based on HPE platform, experiments performed based on HPE Swarm Learning version number 2.1.0
Swarm learning based on HPE platform, experiments performed based on HPE Swarm Learning version number 2.2.0

This repository contains:

@@ -65,47 +65,56 @@ This is the Swarm Learning framework:
* Any experimental Ubuntu release newer than LTS 20.04 MAY prevent the SWOP node from running successfully.
* It also works on WSL2 (Ubuntu 20.04.2 LTS) on Windows systems. WSL1 may have issues with the Docker service.

### Upgrade the Swarm Learning Environment
### Upgrade the Swarm Learning Environment from Older Version
1. Run the following command to upgrade the Swarm Learning Environment from 1.x.x to 2.x.x
```sh
$ sh workspace/automate_scripts/server_setup/cleanup_old_sl.sh
sh workspace/automate_scripts/server_setup/cleanup_old_sl.sh
```
Then proceed to step 1, `Prerequisite`, in [Setting up the Swarm Learning Environment](#setting-up-the-swarm-learning-environment)

### Setting up the user and repository
1. Create a user named "swarm" and add it to the sudoers group.
Log in with user "swarm".
```sh
$ sudo adduser swarm
$ sudo usermod -aG sudo swarm
$ sudo su - swarm
sudo adduser swarm
sudo usermod -aG sudo swarm
sudo su - swarm
```
2. Add the "swarm" user to the docker group
```sh
sudo usermod -aG docker swarm
```
After running this command, you will need to log out and log back in for the changes to take effect, or you can use the newgrp command like so:
```sh
newgrp docker
```
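To confirm the group change took effect (assuming Docker itself is already installed and running), a quick check is:
```sh
groups        # "docker" should appear in the output
docker ps     # should list containers without needing sudo
```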
2. Run the following commands to set up the repository:

3. Run the following commands to set up the repository:

```sh
$ cd / && sudo mkdir opt/hpe && cd opt/hpe && sudo chmod 777 -R /opt/hpe
$ git clone https://github.com/KatherLab/swarm-learning-hpe.git && cd swarm-learning-hpe
cd / && sudo mkdir opt/hpe && cd opt/hpe && sudo chmod 777 -R /opt/hpe
git clone https://github.com/KatherLab/swarm-learning-hpe.git && cd swarm-learning-hpe
```

3. Install the CUDA environment and NVIDIA drivers. As soon as you see correct output from the following command, you may proceed.
4. Install the CUDA environment and NVIDIA drivers. As soon as you see correct output from the following command, you may proceed.
```sh
$ nvidia-smi
nvidia-smi
```
Please disable Secure Boot. On some systems, Secure Boot might prevent unsigned kernel modules (like NVIDIA's) from loading.
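One way to check the current Secure Boot state (assuming the `mokutil` utility is installed) is:
```sh
mokutil --sb-state   # prints "SecureBoot enabled" or "SecureBoot disabled"
```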
Check Loaded Kernel Modules:
- To see if the NVIDIA kernel module is loaded:
```sh
$ lsmod | grep nvidia
lsmod | grep nvidia
```
Review System Logs:
- Sometimes, system logs can provide insights into any issues with the GPU or driver:
```sh
$ dmesg | grep -i nvidia
dmesg | grep -i nvidia
```
Manually Load the NVIDIA Module:
- You can try manually loading the NVIDIA kernel module using the modprobe command:
```sh
$ sudo modprobe nvidia
sudo modprobe nvidia
```
Requirements and dependencies will be automatically installed by the script mentioned in the following section.

@@ -120,30 +129,27 @@ Requirements and dependencies will be automatically installed by the script mentioned in the following section.

**Please only proceed to the next step after observing "... is done successfully" in the log**

0. Optional: download preprocessed datasets. This will take a long time. Just run with either command, `get_dataset_gdown.sh` is recommended to run before you have done step 2, `get_dataset_scp.sh` is recommended to run after you have done step 2.
`get_dataset_gdown.sh` will download the dataset from Google Drive.
```sh
$ sh workspace/automate_scripts/sl_env_setup/get_dataset_gdown.sh
```
The [-s sentinel_ip] flag is only necessary for `get_dataset_scp.sh` The script will download the dataset from the sentinel node.
```sh
$ sh workspace/automate_scripts/sl_env_setup/get_dataset_scp.sh -s <sentinel_ip>
```
0. Optional: download preprocessed datasets. Please refer to the [Data Preparation](DUKE_dataset_preparation.md) section for more details.

1. `Prerequisite`: Runs scripts that check for required software and open/exposed ports.
```sh
$ sh workspace/automate_scripts/automate.sh -a
sh workspace/automate_scripts/automate.sh -a
```
2. `Server setup`: Runs scripts that set up the swarm learning environment on a server.
```sh
$ sh workspace/automate_scripts/automate.sh -b -s <sentinel_ip> -d <host_index>
sh workspace/automate_scripts/automate.sh -b -s <sentinel_ip> -d <host_index>
```
3. `Final setup`: Runs scripts that finalize the setup of the swarm learning environment. Only the arguments in angle brackets (`<...>`) are required; the `[-n num_peers]` and `[-e num_epochs]` flags are optional (a filled-in example is sketched after the command below).
```sh
$ sh workspace/automate_scripts/automate.sh -c -w <workspace_name> -s <sentinel_ip> -d <host_index> [-n num_peers] [-e num_epochs]
sh workspace/automate_scripts/automate.sh -c -w <workspace_name> -s <sentinel_ip> -d <host_index> [-n num_peers] [-e num_epochs]
```
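For illustration only, a hypothetical filled-in invocation might look like the following; the workspace name is taken from this repository, while the IP address, host index, and optional values are placeholders, not recommended settings.
```sh
sh workspace/automate_scripts/automate.sh -c -w odelia-breast-mri -s 192.0.2.10 -d 2 -n 3 -e 100
```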

Optional 5. Reconnect to VPN
In case your machine was restarted or lost the VPN connection, here is the guide to reconnect: [VPN connect guide](https://support.goodaccess.com/configuration-guides/linux/linux-terminal)
```sh
sh /workspace/automate_scripts/server_setup/setup_vpntunnel.sh
```

In case your machine was restarted or lost the VPN connection, here is the guide to reconnect: [VPN connect guide](https://support.goodaccess.com/configuration-guides/linux)
`file.ovpn` is the configuration file that TUD assigned to you.
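If you need to bring the tunnel up manually instead of using the helper script, a typical OpenVPN invocation (assuming the OpenVPN client is installed; the filename stands in for the config TUD assigned to you) is:
```sh
sudo openvpn --config file.ovpn
```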

If a problem is encountered, please refer to this [README.md](workspace%2Fautomate_scripts%2FREADME.md) file for step-by-step setup; specific instructions are given on how to run each command.
@@ -152,56 +158,93 @@ All the processes are automated, so you can just run the above command and wait
If any problem occurs, please first try to figure out which step is going wrong, search for solutions online, and check [Troubleshooting.md](Troubleshooting.md). If that does not help, contact the maintainer of the Swarm Learning Environment and document the error in the Troubleshooting.md file.

## Usage
### Data Preparation
1. Make sure you have downloaded Duke data.

2. Create the folder `WP1` and in it `test` and `train_val`
```bash
mkdir workspace/<workspace-name>/user/data-and-scratch/data/WP1
mkdir workspace/<workspace-name>/user/data-and-scratch/data/WP1/{test,train_val}
```
3. Search for your institution in the [Node list](#nodelist) and note the data series in the column "Data"
### Ensuring Dataset Structure

To ensure proper organization of your dataset, please follow the steps outlined below:

1. **Directory Location**

Place your dataset under the specified path:

/workspace/odelia-breast-mri/user/data-and-scratch/data


Within this path, create a folder named `multi_ext`. Your directory structure should then resemble:
/opt/hpe/swarm-learning-hpe/workspace/odelia-breast-mri/user/data-and-scratch/data
└── multi_ext
├── datasheet.csv # Your clinical tabular data
├── test # External validation dataset
├── train_val # Your own site training data
└── segmentation_metadata_unilateral.csv # External validation table

2. **Data Organization**

Inside the `train_val` or `test` directories, place folders that directly contain NIfTI files. The folders should be named according to the following convention:

<patientID>_right
<patientID>_left

Here, `<patientID>` should correspond to the patient ID in your tables (`datasheet.csv` and `segmentation_metadata_unilateral.csv`). This convention links the imaging data with the respective clinical information.

#### Summary

- **Step 1:** Ensure your dataset is placed within `/workspace/odelia-breast-mri/user/data-and-scratch/data/multi_ext`.
- **Step 2:** Organize your clinical tabular data, external validation dataset, your own site training data, and external validation table as described.
- **Step 3:** Name folders within `train_val` and `test` as `<patientID>_right` or `<patientID>_left`, matching the patient IDs in your datasheets.

Following these steps keeps the dataset well organized and simplifies data management and processing in your projects.
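
As an optional aid, here is a minimal sanity-check sketch for this layout. It assumes the patient ID sits in the first column of `datasheet.csv` and that the `multi_ext` path above is in place; adjust `DATA_DIR` and the `cut` field if your files differ.
```sh
#!/bin/bash
# Minimal sanity check for the multi_ext layout (a sketch; assumes patient IDs
# are in the first column of datasheet.csv -- adjust DATA_DIR and the cut field if not).
DATA_DIR=/opt/hpe/swarm-learning-hpe/workspace/odelia-breast-mri/user/data-and-scratch/data/multi_ext

for split in train_val test; do
  for folder in "$DATA_DIR/$split"/*/; do
    name=$(basename "$folder")
    case "$name" in
      *_left|*_right) ;;                                  # expected naming convention
      *) echo "Unexpected folder name: $split/$name"; continue ;;
    esac
    pid=${name%_*}                                        # strip the _left / _right suffix
    if ! cut -d, -f1 "$DATA_DIR/datasheet.csv" | grep -qx "$pid"; then
      echo "Patient ID $pid ($split/$name) not found in datasheet.csv"
    fi
  done
done
```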

4. Prepare the clinical tables
```sh
cp workspace/<workspace-name>/user/data-and-scratch/data/*.xlsx workspace/<workspace-name>/user/data-and-scratch/data/WP1
```

5. Copy the nifty files from feature folder into `WP1/test` from 801 to 922
```sh
cp -r workspace/<workspace-name>/user/data-and-scratch/data/odelia_dataset_only_sub/{801..922}_{right,left} workspace/<workspace-name>/user/data-and-scratch/data/WP1/test
```

6. Copy the nifty files from feature folder with the order you noted into `WP1/train_val` from xxx to yyy
```sh
cp -r workspace/<workspace-name>/user/data-and-scratch/data/odelia_dataset_only_sub/{<first_number>..<second_number>} workspace/<workspace-name>/user/data-and-scratch/data/WP1/train_val
```

### Running Swarm Learning Nodes
Run the nodes in the order Swarm Network node -> Swarm SWOP node -> Swarm SWCI node. Please open a separate terminal for each node and use the following commands:
#### SN
- To run a Swarm Network (or sentinel) node:
```sh
$ ./workspace/automate_scripts/launch_sl/run_sn.sh -s <sentinel_ip_address> -d <host_index>
./workspace/automate_scripts/launch_sl/run_sn.sh -s <sentinel_ip_address> -d <host_index>
```

or
```sh
runsn
```
#### SWOP
- To run a Swarm SWOP node:
```sh
$ ./workspace/automate_scripts/launch_sl/run_swop.sh -w <workspace_name> -s <sentinel_ip_address> -d <host_index>
./workspace/automate_scripts/launch_sl/run_swop.sh -w <workspace_name> -s <sentinel_ip_address> -d <host_index>
```
or
```sh
runswop
```
#### SWCI

- To run a Swarm SWCI node (the SWCI node is used to generate training task runners; it can be initiated by any host, but we currently suggest that only the sentinel host initiates it):
```sh
$ ./workspace/automate_scripts/launch_sl/run_swci.sh -w <workspace_name> -s <sentinel_ip_address> -d <host_index>
./workspace/automate_scripts/launch_sl/run_swci.sh -w <workspace_name> -s <sentinel_ip_address> -d <host_index>
```
or
```sh
runswci
```


- To check the logs from training:
```sh
$ ./workspace/automate_scripts/launch_sl/check_latest_log.sh
./workspace/automate_scripts/launch_sl/check_latest_log.sh
```
or
```sh
cklog [--ml] [--swci] [--swop] [--sn]
```


- To stop the Swarm Learning nodes (`--[node_type]` is optional; if not specified, all nodes will be stopped, otherwise you can specify e.g. `--sn` or `--swop`):
```sh
$ ./workspace/swarm_learning_scripts/stop-swarm --[node_type]
./workspace/swarm_learning_scripts/stop-swarm --[node_type]
```
or
```sh
stopswarm [--node_type]
```

- To view results, see logs under `workspace/<workspace_name>/user/data-and-scratch/scratch`
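  For example, to list the most recently written files there (same placeholder path as above):
```sh
ls -lt workspace/<workspace_name>/user/data-and-scratch/scratch | head
```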
@@ -220,15 +263,20 @@ Please observe [Troubleshooting.md](Troubleshooting.md) section 10 for successful
Nodes will be added to the VPN and will be able to communicate with each other after setting up the Swarm Learning Environment as described in [Install](#install)
| Project | Node Name | Location | Hostname | Data | Maintainer |
| ------- | --------- | ------------------| ---------| --------- | ------------------------------------------|
| Sentinel node | TUD | Dresden, Germany | swarm | 1-100 | [@Jeff](https://github.com/Ultimate-Storm) |
| ODELIA | VHIO | Madrid, Spain | radiomics | 401-500 | [@Adrià]([email protected]) |
| | UKA | Aachen, Germany | swarm | 101-200 | [@Gustav]([email protected]) |
| | RADBOUD | Nijmegen, Netherlands | swarm | 501-600 | [@Tianyu]([email protected]) |
| | MITERA | | | 201-300 | |
| | RIBERA | | | 301-400 | |
| | UTRECHT | | | 601-700 | |
| | CAMBRIDGE | | | 701-800 | |
| | ZURICH | | | | |
| Sentinel node | TUD | Dresden, Germany | swarm | | [@Jeff](https://github.com/Ultimate-Storm) |
| ODELIA | VHIO | Madrid, Spain | radiomics | | [@Adrià]([email protected]) |
| | UKA | Aachen, Germany | swarm | | [@Gustav]([email protected]) |
| | RADBOUD | Nijmegen, Netherlands | swarm | | [@Tianyu]([email protected]) |
| | MITERA | Paul, Greece | | | |
| | RIBERA | Lopez, Spain | | | |
| | UTRECHT | | | | |
| | CAMBRIDGE | Nick, Britain | | | |
| | ZURICH | Sreenath, Switzerland | | | |
| SWAG | | | swarm | | |
| DECADE | | | swarm | | |
| Other nodes | UCHICAGO | Chicago, USA | swarm | | [@Sid]([email protected]) |

## Models implemented
52 changes: 0 additions & 52 deletions sllib/src/README.md

This file was deleted.

19 changes: 0 additions & 19 deletions sllib/src/python-client/pyproject.toml

This file was deleted.

