Access to .npy datasets #1

Open
preritt opened this issue May 14, 2022 · 10 comments
@preritt

preritt commented May 14, 2022

Hi,
Thank you for releasing the package!
I wanted to check the procedure to access the offline datasets. It seems these are not part of the repo. I am not sure if I am missing something.

For example, I get the following error when calling:
task = design_bench.make('ChEMBL-ResNet-v0')
FileNotFoundError: [Errno 2] No such file or directory:
/chembl-GI50-CHEMBL1964047/chembl-y-2.npy'

Thank you!

@brandontrabucco
Member

brandontrabucco commented May 16, 2022

Hello preritt,

Thanks for your interest in the benchmark. If you would like to download the entire benchmark at once to access the raw .npy files, they are available in the GCP bucket referenced at the following link:

https://github.com/rail-berkeley/design-bench/blob/new-api/design_bench/disk_resource.py#L7

This post may be of interest if you are not familiar with gsutil:

https://stackoverflow.com/questions/58581873/how-to-download-an-entire-bucket-in-gcp
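
As a rough sketch of the bulk download, assuming gsutil is installed and authenticated: the helper below only builds the command. The bucket URL is a placeholder, not the real bucket name (that is defined in the disk_resource.py file linked above), and `build_download_command` and the destination directory are hypothetical names.

```python
def build_download_command(bucket_url, dest_dir):
    """Build a gsutil command that recursively copies a bucket to dest_dir.

    -m runs the copy in parallel and -r recurses into the bucket.
    """
    if not bucket_url.startswith("gs://"):
        raise ValueError("expected a gs:// URL, got: " + bucket_url)
    return ["gsutil", "-m", "cp", "-r", bucket_url, dest_dir]


# Placeholder values: substitute the bucket named in disk_resource.py.
command = build_download_command("gs://your-bucket-name", "design_bench_data")
print(" ".join(command))
```

Passing the list to `subprocess.run(command, check=True)` would start the actual download.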

Generally speaking, the dataset files are downloaded as needed from GCP when design_bench.make is called. Could you share the full script producing the error, and the full stack trace?

Warm regards,
Brandon

@preritt
Author

preritt commented May 21, 2022

Hi Brandon,

Sorry for the delayed response.
Thanks for the information!
Here is the code I used

import design_bench

# task = design_bench.make('TGFP-Transformer-v0')
# task = design_bench.make('TFBind8-Exact-v0')
task = design_bench.make('ChEMBL-ResNet-v0')

This is the error

Traceback (most recent call last):

  File "/BerkleyDesignBenchVer01/testBerkleyV1.py", line 12, in <module>
    task = design_bench.make('ChEMBL-ResNet-v0')

  File "/BerkleyDesignBenchV1/lib/python3.7/site-packages/design_bench/registration.py", line 328, in make
    oracle_kwargs=oracle_kwargs, **kwargs)

  File "/BerkleyDesignBenchV1/lib/python3.7/site-packages/design_bench/registration.py", line 157, in make
    oracle_kwargs=oracle_kwargs, **kwargs)

  File "BerkleyDesignBenchV1/lib/python3.7/site-packages/design_bench/registration.py", line 111, in make
    oracle_kwargs=oracle_kwargs_final, **kwargs)

  File "BerkleyDesignBenchV1/lib/python3.7/site-packages/design_bench/task.py", line 245, in __init__
    dataset = import_name(dataset)(**kwargs)

  File "BerkleyDesignBenchV1/lib/python3.7/site-packages/design_bench/datasets/discrete/chembl_dataset.py", line 310, in __init__
    soft_interpolation=soft_interpolation, **kwargs)

  File "/BerkleyDesignBenchV1/lib/python3.7/site-packages/design_bench/datasets/discrete_dataset.py", line 279, in __init__
    super(DiscreteDataset, self).__init__(*args, **kwargs)

  File "/BerkleyDesignBenchV1/lib/python3.7/site-packages/design_bench/datasets/dataset_builder.py", line 470, in __init__
    for i, y in enumerate(self.iterate_samples(return_x=False)):

  File "/BerkleyDesignBenchV1/lib/python3.7/site-packages/design_bench/datasets/dataset_builder.py", line 865, in iterate_samples
    return_x=return_x, return_y=return_y):

  File "/BerkleyDesignBenchV1/lib/python3.7/site-packages/design_bench/datasets/dataset_builder.py", line 762, in iterate_batches
    y_shard_data = self.get_shard_y(shard_id)

  File "BerkleyDesignBenchV1/lib/python3.7/site-packages/design_bench/datasets/dataset_builder.py", line 566, in get_shard_y
    return np.load(self.y_shards[shard_id].disk_target)

  File "BerkleyDesignBenchV1/lib/python3.7/site-packages/numpy/lib/npyio.py", line 416, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))

FileNotFoundError: [Errno 2] No such file or directory: 'BerkleyDesignBenchV1/lib/python3.7/site-packages/design_bench_data/chembl-GI50-CHEMBL1964047/chembl-y-2.npy'
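
A minimal sketch for narrowing down an error like the one above: it checks whether the shard named in the FileNotFoundError exists and lists the .npy files sitting next to it, which can show whether the download never started or stopped partway. `describe_shard` is a hypothetical helper, not part of design_bench, and the demo uses made-up file names in a throwaway directory.

```python
import os
import tempfile


def describe_shard(path):
    """Report whether a shard file exists and which .npy files sit beside it."""
    folder = os.path.dirname(path)
    if not os.path.isdir(folder):
        return {"exists": False, "folder_exists": False, "siblings": []}
    siblings = sorted(
        name for name in os.listdir(folder) if name.endswith(".npy"))
    return {
        "exists": os.path.isfile(path),
        "folder_exists": True,
        "siblings": siblings,
    }


# Demo on a temporary directory with placeholder shard files.
with tempfile.TemporaryDirectory() as root:
    folder = os.path.join(root, "example-dataset")
    os.makedirs(folder)
    for name in ("example-y-0.npy", "example-y-1.npy"):
        open(os.path.join(folder, name), "wb").close()
    report = describe_shard(os.path.join(folder, "example-y-2.npy"))
    print(report)
```

In practice the argument would be the full path quoted in the FileNotFoundError message.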

I'll try the GCP method and get back in case of error.

Thank you so much for your response!

@brandontrabucco
Member

brandontrabucco commented May 21, 2022

Could you try calling design_bench.make on a ChEMBL task with the following format:

https://github.com/rail-berkeley/design-bench/blob/new-api/design_bench/__init__.py#L809

For example, design_bench.make("ChEMBL_MCHC_CHEMBL3885882_MorganFingerprint-RandomForest-v0")

@preritt
Author

preritt commented May 22, 2022

I tried the following:
task = design_bench.make("ChEMBL_MCHC_CHEMBL3885882_MorganFingerprint-RandomForest-v0")
However, I got the following error now.

Traceback (most recent call last):

  File "condaEnvs/BerkleyDesignBenchV1/lib/python3.7/site-packages/design_bench/registration.py", line 201, in spec
    return self.task_specs[task_name]

KeyError: 'ChEMBL_MCHC_CHEMBL3885882_MorganFingerprint-RandomForest-v0'


During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "perspectaTestsVer2/perspectaV1/myCodesV9Della/BerkleyDesignBenchVer01/testBerkleyV1.py", line 13, in <module>
    task =design_bench.make("ChEMBL_MCHC_CHEMBL3885882_MorganFingerprint-RandomForest-v0")

  File "condaEnvs/BerkleyDesignBenchV1/lib/python3.7/site-packages/design_bench/registration.py", line 328, in make
    oracle_kwargs=oracle_kwargs, **kwargs)

  File "condaEnvs/BerkleyDesignBenchV1/lib/python3.7/site-packages/design_bench/registration.py", line 155, in make
    return self.spec(task_name).make(

  File "condaEnvs/BerkleyDesignBenchV1/lib/python3.7/site-packages/design_bench/registration.py", line 232, in spec
    UNKNOWN_MESSAGE.format(task_name))

ValueError: No registered task with name: ChEMBL_MCHC_CHEMBL3885882_MorganFingerprint-RandomForest-v0
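
The traceback shows the registry looking the name up in a `task_specs` dictionary and failing with a KeyError. As a hedged sketch of how to find near-miss names, the helper below takes any iterable of registered names and returns the closest matches. `suggest_task_names` is hypothetical, and how to reach the registry's `task_specs` from user code is an assumption not confirmed in this thread, so the demo uses a short list of names that appear above rather than a real registry dump.

```python
import difflib


def suggest_task_names(requested, registered_names, limit=3):
    """Return registered names that closely resemble the requested one."""
    return difflib.get_close_matches(
        requested, list(registered_names), n=limit, cutoff=0.6)


# Example names taken from this thread, not an actual registry listing.
registered = ["ChEMBL-ResNet-v0", "TFBind8-Exact-v0", "TGFP-Transformer-v0"]
print(suggest_task_names("ChEMBL-ResNet-v1", registered))
```

An empty result suggests the installed version does not register anything resembling the requested task, which points to a version mismatch rather than a typo.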

@brandontrabucco
Member

Could you check which version number of the benchmark you have installed?

@preritt
Author

preritt commented May 22, 2022

It is 2.0.12

design-bench 2.0.12 pypi_0 pypi

@brandontrabucco
Member

The latest version of the benchmark is 2.0.20. Could you try that version?
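
A small sketch of the comparison behind this suggestion, assuming purely numeric dotted versions like the two mentioned in this thread. `version_tuple` and `is_older` are hypothetical helpers. Comparing integer tuples avoids the string-ordering trap in which "2.0.9" would sort after "2.0.20".

```python
def version_tuple(version):
    """Turn a dotted version such as '2.0.12' into (2, 0, 12)."""
    return tuple(int(part) for part in version.split("."))


def is_older(installed, required):
    """True when the installed version is numerically older than required."""
    return version_tuple(installed) < version_tuple(required)


print(is_older("2.0.12", "2.0.20"))  # True
```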

@preritt
Author

preritt commented May 22, 2022

I have the correct version now:

design-bench 2.0.20 pypi_0 pypi

Not sure why, but now I get an import error when using:
import design_bench

runcell(0, '/BerkleyDesignBenchVer01/testBerkleyV1.py')
Traceback (most recent call last):

  File "/BerkleyDesignBenchVer01/testBerkleyV1.py", line 8, in <module>
    import design_bench

  File "condaEnvs/BerkleyDesignBenchV1/lib/python3.7/site-packages/design_bench/__init__.py", line 766, in <module>
    feature_extractor=MorganFingerprintFeatures(dtype=np.int32),

  File "condaEnvs/BerkleyDesignBenchV1/lib/python3.7/site-packages/design_bench/oracles/feature_extractors/morgan_fingerprint_features.py", line 74, in __init__
    os.path.join(DATA_DIR, 'smiles_vocab.txt'))

  File "condaEnvs/BerkleyDesignBenchV1/lib/python3.7/site-packages/deepchem/feat/smiles_tokenizer.py", line 89, in __init__
    self.max_len_single_sentence = self.max_len - 2

AttributeError: 'SmilesTokenizer' object has no attribute 'max_len'

@brandontrabucco
Member

Ah, this can happen if an incompatible version of deepchem is installed. Can you try installing the version of deepchem listed here: https://github.com/brandontrabucco/design-baselines/blob/master/requirements.txt#L29

I'm not sure if that's the only package that may need an update, so perhaps check the whole requirements file.
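
One way to check the whole requirements file, sketched under assumptions: the helpers below parse `name==version` pins and report every package whose installed version differs. `parse_pins` and `find_mismatches` are hypothetical names, the package names and versions in the demo are invented placeholders (not the actual pins in the linked file), and lines using other specifiers such as `>=` are skipped.

```python
def parse_pins(requirements_text):
    """Collect 'name==version' pins, skipping comments and blank lines."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        # Drop trailing comments and environment markers.
        line = raw_line.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins, installed):
    """Return {name: (pinned, installed_or_None)} for every disagreement."""
    return {
        name: (wanted, installed.get(name))
        for name, wanted in pins.items()
        if installed.get(name) != wanted
    }


# Placeholder data for illustration only.
pins = parse_pins("# demo\nexample-a==1.0.0\nexample-b==2.1.0  # pinned\n")
installed = {"example-a": "1.0.0", "example-b": "2.0.0"}
print(find_mismatches(pins, installed))
```

Because `pip freeze` also prints `name==version` lines for most packages, its output can be fed through `parse_pins` to build the `installed` mapping.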

@preritt
Author

preritt commented May 22, 2022

Thanks a lot! I did a pip install on the requirements and it resolved the issue.
