Data preprocessing with coarse graining does not seem to work #31

Open
stratisMarkou opened this issue Jun 26, 2023 · 1 comment

@stratisMarkou

Following the README instructions, I downloaded the CrossDocked dataset, unzipped it, and am trying to run the preprocessing script on it with and without the --ca_only flag.

Running

python process_crossdock.py my_data_dir --no_H

runs without errors, but running

python process_crossdock.py .data --no_H --ca_only

fails, giving the error

KeyError "'R' not in amino acid dict (.data/crossdocked_pocket10/WNK1_HUMAN_202_483_0/5tf9_A_rec_5wdy_a6s_lig_tt_min_0_pocket10.pdb, .data/crossdocked_pocket10/WNK1_HUMAN_202_483_0/5tf9_A_rec_5wdy_a6s_lig_tt_min_0.sdf)" WNK1_HUMAN_202_483_0/5tf9_A_rec_5wdy_a6s_lig_tt_min_0_pocket10.pdb WNK1_HUMAN_202_483_0/5tf9_A_rec_5wdy_a6s_lig_tt_min_0.sdf
#failed: 10: 100%|█████████████| 10/10 [00:00<00:00, 128.31it/s]
Traceback (most recent call last):
  File "/home/stratis/repos/DiffSBDD/process_crossdock.py", line 364, in <module>
    lig_coords = np.concatenate(lig_coords, axis=0)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: need at least one array to concatenate

It looks like in the second case the script fails to find certain entries in the amino acid dict and skips every protein-ligand complex, which leaves lig_coords as an empty list that can't be concatenated.

Looking at the dataset_params dictionary, there seem to be two sets of preprocessing parameters, crossdock_full and crossdock. Changing line 24 in the preprocessing script from dataset_info = dataset_params['crossdock_full'] to dataset_info = dataset_params['crossdock'] and running the preprocessing with --ca_only works without any errors, but I'm not sure the resulting data is correctly preprocessed.

Is there something wrong with the preprocessing script, or am I doing something wrong on my side?
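For anyone trying to reproduce the failure mode outside the repository, here is a minimal sketch of what seems to be happening. The two encoders and the encode_pockets helper are hypothetical stand-ins, not the actual contents of dataset_params or process_crossdock.py; the only assumption is that every complex whose lookup raises a KeyError is skipped, so nothing is left to concatenate.

```python
import numpy as np

# Hypothetical stand-ins, NOT the real contents of dataset_params.
full_atom_encoder = {"C": 0, "N": 1, "O": 2, "S": 3}  # keyed by element symbol
residue_encoder = {aa: i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}  # keyed by one-letter residue code


def encode_pockets(pockets, encoder):
    """Encode each pocket, skipping any pocket that contains an unknown key."""
    encoded, n_failed = [], 0
    for residues in pockets:
        try:
            encoded.append(np.array([encoder[r] for r in residues]))
        except KeyError:
            n_failed += 1  # mirrors the '#failed' counter in the progress bar
    return encoded, n_failed


# Two toy pockets given as one-letter residue codes (R = arginine).
pockets = [list("RKD"), list("GRL")]

# With the element-keyed encoder every pocket is skipped ...
encoded_full, n_failed_full = encode_pockets(pockets, full_atom_encoder)

# ... so concatenating the empty list raises the ValueError from the traceback.
try:
    np.concatenate(encoded_full, axis=0)
    error_message = None
except ValueError as err:
    error_message = str(err)

# With the residue-keyed encoder nothing is skipped and concatenation works.
encoded_ca, n_failed_ca = encode_pockets(pockets, residue_encoder)
coords = np.concatenate(encoded_ca, axis=0)

print(n_failed_full, error_message)
print(n_failed_ca, coords.shape)
```

If this matches what the script does, the ValueError is only a symptom: the real problem is that the lookup table does not match the pocket representation selected by --ca_only.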

@arneschneuing
Owner

Hi Stratis,
I think the process_crossdock.py file is indeed outdated and should be updated. As far as I can tell, your solution should be fine as a temporary fix because dataset_params['crossdock'] contains the correct amino acid types required for the coarse-grained model (maybe @yuanqidu can confirm). We will try to upload a correct version as soon as possible.
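Until an updated script is available, one way to avoid editing line 24 by hand each time would be to pick the parameter set from the --ca_only flag. The following is only a sketch under assumptions: the dataset_params values are placeholders, the positional argument name basedir and the helper select_dataset_info are hypothetical, and only the keys crossdock_full and crossdock and the flags --no_H and --ca_only come from this thread.

```python
import argparse

# Placeholder table: the keys are the ones named in this thread,
# the values are illustrative only and not the real parameter sets.
dataset_params = {
    "crossdock_full": {"level": "full-atom"},
    "crossdock": {"level": "C-alpha"},
}


def select_dataset_info(ca_only: bool) -> dict:
    """Return the parameter set that matches the requested pocket representation."""
    key = "crossdock" if ca_only else "crossdock_full"
    return dataset_params[key]


parser = argparse.ArgumentParser()
parser.add_argument("basedir")  # hypothetical name for the data directory argument
parser.add_argument("--no_H", action="store_true")
parser.add_argument("--ca_only", action="store_true")

# Parse an explicit argument list so the sketch runs without a real command line.
args = parser.parse_args([".data", "--no_H", "--ca_only"])
dataset_info = select_dataset_info(args.ca_only)
print(dataset_info["level"])
```

Tying the choice to the flag keeps the lookup table and the pocket representation from drifting apart, which appears to be what triggered the KeyError here.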
Sorry for the inconvenience!
