swarm-learning-hpe

standard-readme compliant

Swarm Learning based on the HPE platform. Experiments were performed with HPE Swarm Learning version 2.2.0.

This repository contains:

  1. SWARM Learning for Histopathology Image Analysis and Radiology Image Analysis
  2. A workflow section to help keep track of what is in progress.
  3. An issue section where people can share ideas and raise questions encountered while using this repo.
  4. A working version of marugoto_mri for an attention-based MIL model, originally built for histopathology images and modified by Marta to work with MRI images.
  5. A working version of odelia-breast-mri for a 3D-CNN model by @Gustav.

Table of Contents

Background

Brief description of the HPE platform

A course on Swarm Learning, explained in a generally understandable way: https://learn.software.hpe.com/swarm-learning-essentials

HPE Swarm Learning extends the concept of federated learning to decentralized learning by adding functionality that obviates the need for a central leader. It combines the use of AI, edge computing, and blockchain.

HPE Swarm Learning is a decentralized, privacy-preserving Machine Learning (ML) framework. The Swarm Learning framework uses the computing power at, or near, the distributed data sources to run the ML algorithms that train the models. It uses the security of a blockchain platform to share learnings with peers safely and securely. In Swarm Learning, training of the model occurs at the edge, where data is most recent and where prompt, data-driven decisions are most necessary. In this decentralized architecture, only the insights learned are shared with the collaborating ML peers, not the raw data. This tremendously enhances data security and privacy.

The following image (img.png) provides an overview of the Swarm Learning framework. Only the model parameters (learnings) are exchanged between the various edge nodes, not the raw data. This ensures that the privacy of data is preserved.

This is the Swarm Learning framework: sl_node_structure.png

Install

Prerequisites

Hardware recommendations

  • 64 GB of RAM (32 GB is the absolute minimum)
  • 16 CPU cores (8 is the absolute minimum)
  • an NVIDIA GPU with 48 GB of VRAM (24 GB is the absolute minimum)
  • 8 TB of Storage (4 TB is the absolute minimum)
  • We deliberately want to show that we can work with lightweight hardware like this. Here are three quotes for systems like this for less than 10k EUR (Lambda, Dell Precision, and Dell Alienware)
  • A typical installation takes about 30 minutes to install the necessary dependencies and another hour to set up the environment for running the demo experiments

Operating System

  • Ubuntu 20.04 LTS
    • We have tested the Swarm Learning Environment on Ubuntu 20.04 LTS, Ubuntu 22.04.2 LTS, and Ubuntu 20.04.5 LTS, and they all work fine.
    • Any experimental (non-LTS) Ubuntu release newer than 20.04 LTS MAY result in the SWOP node failing to run.
    • It also works on WSL2 (Ubuntu 20.04.2 LTS) on Windows systems. WSL1 may have issues with the Docker service; a note on starting the Docker daemon under WSL2 follows below.
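Under WSL2 the Docker daemon is not always started automatically. A minimal generic workaround, assuming a standard Docker installation (this is not part of this repo's scripts), is:

# Start the Docker daemon manually inside WSL2 (generic Docker command, not repo-specific)
sudo service docker start
# Confirm the daemon is reachable
docker info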

Upgrade the Swarm Learning Environment from Older Version

  1. Run the following command to upgrade the Swarm Learning Environment from 1.x.x to 2.x.x
sh workspace/automate_scripts/server_setup/cleanup_old_sl.sh

Then proceed to step 1 (Prerequisite) in Setting up the Swarm Learning Environment.

Setting up the user and repository

  1. Create a user named "swarm" and add it to the sudoers group. Log in as the "swarm" user.
sudo adduser swarm
sudo usermod -aG sudo swarm
sudo su - swarm
  2. Add the "swarm" user to the docker group
sudo usermod -aG docker swarm

After running this command, you will need to log out and log back in for the changes to take effect, or you can use the newgrp command like so:

newgrp docker
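To verify that the group change took effect and the "swarm" user can reach the Docker daemon without sudo, a quick generic smoke test (not part of the repo scripts) is:

# Should list running containers (possibly none) without a permission error
docker ps
# Optional end-to-end check that pulls and runs a tiny test image
docker run --rm hello-world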
  3. Run the following commands to set up the repository:
cd / && sudo mkdir opt/hpe && cd opt/hpe && sudo chmod 777 -R /opt/hpe
git clone https://github.com/KatherLab/swarm-learning-hpe.git && cd swarm-learning-hpe
  4. Install the CUDA environment and NVIDIA drivers. Once the following command produces the expected output, you may proceed.
nvidia-smi

Please disable Secure Boot. On some systems, Secure Boot might prevent unsigned kernel modules (like NVIDIA's) from loading.

Check Loaded Kernel Modules:

  • To see if the NVIDIA kernel module is loaded:
lsmod | grep nvidia

Review System Logs:

  • Sometimes, system logs can provide insights into any issues with the GPU or driver:
dmesg | grep -i nvidia

Manually Load the NVIDIA Module:

  • You can try manually loading the NVIDIA kernel module using the modprobe command:
sudo modprobe nvidia
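If modprobe fails, Secure Boot is a likely culprit. Its state can be checked with the generic mokutil utility (this check is our suggestion, not something the repo scripts run for you):

# Prints "SecureBoot enabled" or "SecureBoot disabled"
mokutil --sb-state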

Requirements and dependencies will be automatically installed by the script mentioned in the following section.

Setting up the Swarm Learning Environment

PLEASE REPLACE THE <PLACEHOLDER> WITH THE CORRESPONDING VALUE!

<sentinel_ip> = 172.24.4.67. Currently this is the IP assigned by the VPN server to the TUD host.

<host_index> = Your institute's name. For the ODELIA project it should be chosen from: TUD, Ribera, VHIO, Radboud, UKA, Utrecht, Mitera, Cambridge, Zurich.

<workspace_name> = The name of the workspace you want to work on. You can find the available modules under the workspace/ folder; each module corresponds to a radiology model. Currently we suggest using odelia-breast-mri or marugoto_mri here.

Please only proceed to the next step after observing "... is done successfully" in the log.

  1. Optional: download preprocessed datasets. Please refer to the Data Preparation section for more details.

  2. Prerequisite: Runs scripts that check for required software and open/exposed ports.

sh workspace/automate_scripts/automate.sh -a
  3. Server setup: Runs scripts that set up the swarm learning environment on a server.
sh workspace/automate_scripts/automate.sh -b -s <sentinel_ip> -d <host_index>
  4. Final setup: Runs scripts that finalize the setup of the swarm learning environment. Only the <...> arguments are required; the [-n num_peers] and [-e num_epochs] flags are optional.
sh workspace/automate_scripts/automate.sh -c -w <workspace_name> -s <sentinel_ip> -d <host_index> [-n num_peers] [-e num_epochs]
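As a purely illustrative example, a host at UKA using the odelia-breast-mri workspace and the sentinel IP listed above would run the three steps like this (substitute your own <host_index> and <workspace_name>):

# Step 2: prerequisite checks
sh workspace/automate_scripts/automate.sh -a
# Step 3: server setup against the TUD sentinel node
sh workspace/automate_scripts/automate.sh -b -s 172.24.4.67 -d UKA
# Step 4: final setup for the odelia-breast-mri workspace
sh workspace/automate_scripts/automate.sh -c -w odelia-breast-mri -s 172.24.4.67 -d UKA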

Optional 5. Reconnect to VPN

sh /workspace/automate_scripts/server_setup/setup_vpntunnel.sh

In case your machine was restarted or lost the VPN connection, here is the guide to reconnect: VPN connect guide. The .ovpn file is the config file that TUD assigned to you.

If a problem is encountered, please consult this README.md file for step-by-step setup; specific instructions are given on how to run the commands. All the processes are automated, so you can simply run the commands above and wait for them to finish.

If any problem occurs, please first try to figure out which step is failing, search online for solutions, and check Troubleshooting.md. Then contact the maintainer of the Swarm Learning Environment and document the error in the Troubleshooting.md file.

Usage

Ensuring Dataset Structure

To ensure proper organization of your dataset, please follow the steps outlined below:

  1. Directory Location

    Place your dataset under the specified path:

/workspace/odelia-breast-mri/user/data-and-scratch/data

Within this path, create a folder named multi_ext. Your directory structure should then resemble:

/opt/hpe/swarm-learning-hpe/workspace/odelia-breast-mri/user/data-and-scratch/data
└── multi_ext
    ├── datasheet.csv                             # Your clinical tabular data
    ├── test                                      # External validation dataset
    ├── train_val                                 # Your own site training data
    └── segmentation_metadata_unilateral.csv      # External validation table
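A minimal sketch for creating this skeleton, assuming the absolute path above (the CSV files and the imaging folders still have to be copied in by you):

# Create the multi_ext folder with its train_val and test subfolders
cd /opt/hpe/swarm-learning-hpe/workspace/odelia-breast-mri/user/data-and-scratch/data
mkdir -p multi_ext/train_val multi_ext/test
# Place datasheet.csv and segmentation_metadata_unilateral.csv directly inside multi_ext/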

  2. Data Organization

Inside the train_val or test directories, place folders that directly contain NIfTI files. The folders should be named according to the following convention:

<patientID>_right
<patientID>_left

Here, <patientID> should correspond with the patient ID in your tables (datasheet.csv and segmentation_metadata_unilateral.csv). This convention assists in linking the imaging data with the respective clinical information efficiently.
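For example, for a hypothetical patient ID 0001 that appears in datasheet.csv, the per-laterality folders would be created like this (the ID is illustrative only):

# One folder per breast laterality; the NIfTI files go directly inside each folder
mkdir -p multi_ext/train_val/0001_right multi_ext/train_val/0001_left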

Summary

  • Step 1: Ensure your dataset is placed within /workspace/odelia-breast-mri/user/data-and-scratch/data/multi_ext.
  • Step 2: Organize your clinical tabular data, external validation dataset, your own site training data, and external validation table as described.
  • Step 3: Name folders within train_val and test as <patientID>_right or <patientID>_left, matching the patient IDs in your datasheets.

Following these structured steps will help in maintaining a well-organized dataset, thereby enhancing data management and processing in your projects.

Running Swarm Learning Nodes

Start the nodes in this order: Swarm Network (SN) node -> SWOP node -> SWCI node. Please open a separate terminal for each node. Use the following commands:

SN

  • To run a Swarm Network (or sentinel) node:
./workspace/automate_scripts/launch_sl/run_sn.sh -s <sentinel_ip_address> -d <host_index>

or

runsn

SWOP

  • To run a Swarm SWOP node:
./workspace/automate_scripts/launch_sl/run_swop.sh -w <workspace_name> -s <sentinel_ip_address>  -d <host_index>

or

runswop

SWCI

  • To run a Swarm SWCI node (the SWCI node is used to generate training task runners; it can be initiated by any host, but currently we suggest that only the sentinel host initiates it):
./workspace/automate_scripts/launch_sl/run_swci.sh -w <workspace_name> -s <sentinel_ip_address>  -d <host_index>

or

runswci
  • To check the logs from training:
./workspace/automate_scripts/launch_sl/check_latest_log.sh

or

cklog [--ml] [--swci] [--swop] [--sn] 
  • To stop the Swarm Learning nodes. The --[node_type] flag is optional; if not specified, all nodes will be stopped. Otherwise, you can specify --sn or --swop, for example.
./workspace/swarm_learning_scripts/stop-swarm --[node_type]

or

stopswarm [--node_type]
  • To view results, see logs under workspace/<workspace_name>/user/data-and-scratch/scratch

Please see Troubleshooting.md section 10 for examples of successful running logs of the SWOP and SN nodes. The process keeps running and displaying logs, so you will need to open a new terminal to run other commands.

  • A typical run takes about 3 hours for an experiment trained on the DUKE dataset with ResNet50-3D and three participating nodes.
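Putting it together, a hypothetical sentinel host at TUD running the odelia-breast-mri workspace would start the three nodes in three separate terminals roughly as follows (values are illustrative, taken from the placeholder descriptions above):

# Terminal 1: Swarm Network (sentinel) node
./workspace/automate_scripts/launch_sl/run_sn.sh -s 172.24.4.67 -d TUD
# Terminal 2: SWOP node
./workspace/automate_scripts/launch_sl/run_swop.sh -w odelia-breast-mri -s 172.24.4.67 -d TUD
# Terminal 3: SWCI node (sentinel host only)
./workspace/automate_scripts/launch_sl/run_swci.sh -w odelia-breast-mri -s 172.24.4.67 -d TUD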

Workflow

Workflow.png
Swarm model training protocol.png

Node list

Nodes are added to the VPN and can communicate with each other after setting up the Swarm Learning Environment as described in Install.

| Project | Node Name | Location | Hostname | Data | Maintainer |
|---|---|---|---|---|---|
| Sentinel node | TUD | Dresden, Germany | swarm | | @Jeff |
| ODELIA | VHIO | Madrid, Spain | radiomics | | @Adrià |
| | UKA | Aachen, Germany | swarm | | @Gustav |
| | RADBOUD | Nijmegen, Netherlands | swarm | | @Tianyu |
| | MITERA | Paul, Greece | | | |
| | RIBERA | Lopez, Spain | | | |
| | UTRECHT | | | | |
| | CAMBRIDGE | Nick, Britain | | | |
| | ZURICH | Sreenath, Switzerland | | | |
| SWAG | | | swarm | | |
| DECADE | | | swarm | | |
| Other nodes | UCHICAGO | Chicago, USA | swarm | | @Sid |

Models implemented

TUD benchmarking on the Duke breast MRI dataset: TUD experiments result.png

Report: Swarm learning report.pdf

Maintainers

TUD Swarm learning team

@Jeff.

Want 24-hour support? Configure your TeamViewer with the following steps and contact me through Slack. Thanks to @Adrià for the instructions.

  1. Enable remote control in the Ubuntu settings. ubuntu_remote_control.png
  2. Install TeamViewer and log in with username: [email protected] and password: 2wuHih4qC5tEREM
  3. Add the computer to the account so that it can be controlled. You have it here: link TV add device.png
  4. I'd advise setting the computer to never enter sleep mode or darken the screen, just in case. Also, if you want to use different users, remember this has to be done for all of them, and the TeamViewer session needs to be signed in for all of them as well.

Milestone

See this link

NotionPage

See this link

Contributing

Feel free to dive in! Open an issue or submit PRs.

Before creating a pull request, please take some time to look at our wiki page to ensure good code quality and sufficient documentation. You don't need to follow all of the guidelines at this moment, but it would be really helpful!

Contributors

This project exists thanks to all the people who contribute. @Oliver @Kevin

Credits

This project uses the platform from the following repositories:

License

MIT