Foldseek_Tweaker lets you adjust the alphabet size used by Foldseek and helps beginners understand Foldseek's training code.
Foldseek is a breakthrough that converts protein structures into sequences, so that fast sequence alignment methods can be used to compare structures.
However, Foldseek fixes its alphabet size at 20 as a trade-off between effectiveness and expressiveness. What if we want to favor one side? Can we adjust this size? Let's try!
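To make the trade-off a little more concrete, the sketch below computes two simple counting bounds for the alphabet sizes mentioned in this README: the maximum information one structural letter can carry, and the number of distinct scores in a symmetric substitution matrix. The helper name `alphabet_stats` is ours and is not part of Foldseek; the numbers are upper bounds from counting, not measured alignment quality.

```python
import math

def alphabet_stats(k: int) -> tuple:
    """Return (max bits per residue, unique entries of a symmetric k x k substitution matrix)."""
    bits = math.log2(k)          # information capacity of one letter
    pairs = k * (k + 1) // 2     # upper triangle including the diagonal
    return bits, pairs

for k in (20, 40, 80, 128, 512):
    bits, pairs = alphabet_stats(k)
    print(f"k={k:>3}: {bits:.2f} bits/residue, {pairs} substitution scores")
```

A larger alphabet can describe local structure in more detail, but the substitution matrix grows quadratically, so more training data is needed to estimate every score reliably.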
Here are some configurations from our device for reference:
- CUDA == 11.7
- Python == 3.9
- torch == 1.13.0+cu117
- biopython == 1.83
Please be mindful of version compatibility during your actual setup.
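As a quick way to compare your environment with the reference versions above, something like the following sketch may help. It uses only the standard library; `REFERENCE` and `compare_versions` are illustrative names of ours, and a version mismatch does not necessarily mean the code will fail.

```python
from importlib import metadata

# Reference versions from the list above; keys are pip distribution names.
REFERENCE = {"torch": "1.13.0+cu117", "biopython": "1.83"}

def compare_versions(reference: dict) -> dict:
    """Map each package to (installed version or None, reference version)."""
    report = {}
    for name, wanted in reference.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, wanted)
    return report

for name, (installed, wanted) in compare_versions(REFERENCE).items():
    status = "ok" if installed == wanted else "differs"
    print(f"{name}: installed={installed}, reference={wanted} ({status})")
```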
- Bash Version:
sudo bash Encoder.sh tmp/pdb 20
- tmp/pdb: Data Directory
- 20: Alphabet Size You Want
The result is saved as seqs.csv in the tmp_20 folder. We have also trained models with alphabet sizes 40, 80, 128, and 512 for you to choose from.
- Bash Version:
sudo bash learnAlphabet.sh 20 100 data/pdbs_train.txt data/pdbs_val.txt tmp_20/
- 20: Alphabet Size You Want
- 100: Seed for Model Selection
- data/pdbs_train.txt: Training Data File (See Data Preprocess)
- data/pdbs_val.txt: Validation Data File (See Data Preprocess)
- tmp_20/: Output Directory
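If you want to train several alphabet sizes in a row, the invocation above can be scripted. The sketch below only builds the argument lists, following the argument order shown above; `build_train_command` is a hypothetical helper rather than part of this repository, the `tmp_<size>/` output naming is an assumption that mirrors `tmp_20/`, and `sudo` is left out.

```python
import shlex

def build_train_command(alphabet_size, seed=100,
                        train_file="data/pdbs_train.txt",
                        val_file="data/pdbs_val.txt"):
    """Build the learnAlphabet.sh argument list for one alphabet size."""
    out_dir = f"tmp_{alphabet_size}/"  # assumed naming, mirroring tmp_20/
    return ["bash", "learnAlphabet.sh", str(alphabet_size), str(seed),
            train_file, val_file, out_dir]

for size in (20, 40):
    cmd = build_train_command(size)
    # Print the command; to execute it, pass cmd to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Printing the commands first makes it easy to check paths and output directories before starting a long training run.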
- Local: Step-by-Step Training with One Seed
- Train
train_vqvae_local.py
- Encode
encode_pdbs_local.py
- Evaluate
create_submat2.py
- Train
- (Optional) Make sure your protein data is in tmp/pdb, or run the following command in your terminal to download example data for testing.
if [ ! -d tmp/pdb ]; then curl https://wwwuser.gwdg.de/~compbiol/foldseek/scop40pdb.tar.gz | tar -xz -C tmp; fi
- Compile ssw_test in your terminal
if [ ! -f tmp/ssw_test ]; then git clone --depth 1 https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library tmp/ssw; (cd tmp/ssw/src && make); cp tmp/ssw/src/ssw_test tmp/ssw_test; fi
- Create Training Data
create_vqvae_training_data_local.py