- **2024-2** We've released ChatCell, a new paradigm that leverages natural language to make single-cell analysis more accessible and intuitive. Please visit our homepage and GitHub page for more information.
- **2024-1** Our paper *Domain-Agnostic Molecular Generation with Chemical Feedback* is accepted by ICLR 2024.
- **2024-1** Our paper *Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models* is accepted by ICLR 2024.
- **2023-10** We open-source MolGen-7b, which now supports de novo molecule generation!
- **2023-6** We open-source KnowLM, a knowledgeable LLM framework with pre-training and instruction fine-tuning code (supports multi-machine, multi-GPU setups).
- **2023-6** We release Mol-Instructions, a large-scale biomolecule instruction dataset for large language models.
- **2023-5** Our paper proposing KANO (Knowledge graph-enhanced molecular contrAstive learning with fuNctional prOmpt), which exploits fundamental domain knowledge in both pre-training and fine-tuning, is published in Nature Machine Intelligence.
- **2023-4** We provide an NLP-for-science paper list at https://github.com/zjunlp/NLP4Science_Papers.
- **2023-3** We release our pre-trained and fine-tuned models on 🤗 Hugging Face: MolGen-large and MolGen-large-opt.
- **2023-2** We provide a demo on 🤗 Hugging Face Spaces.
To run the code, you can configure dependencies by restoring our environment:

```shell
conda env create -f MolGen/environment.yml -n $Your_env_name$
```

and then activate it:

```shell
conda activate $Your_env_name$
```
You can download the pre-trained and fine-tuned models via Hugging Face: MolGen-large and MolGen-large-opt.
Moreover, the dataset used for downstream tasks can be found here.
The expected structure of the files is:

```
moldata
├── checkpoint
│   ├── molgen.pkl            # pre-trained model
│   ├── syn_qed_model.pkl     # fine-tuned model for QED optimization on synthetic data
│   ├── syn_plogp_model.pkl   # fine-tuned model for p-logP optimization on synthetic data
│   ├── np_qed_model.pkl      # fine-tuned model for QED optimization on natural product data
│   └── np_plogp_model.pkl    # fine-tuned model for p-logP optimization on natural product data
├── finetune
│   ├── np_test.csv           # natural product test data
│   ├── np_train.csv          # natural product train data
│   ├── plogp_test.csv        # synthetic test data for p-logP optimization
│   ├── qed_test.csv          # synthetic test data for QED optimization
│   └── zinc250k.csv          # synthetic train data
├── generate                  # generated molecules
├── output                    # molecule candidates
└── vocab_list
    └── zinc.npy              # SELFIES alphabet
```
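If you want to prepare this layout before downloading the files, a small stdlib-only helper can scaffold the empty directories (this script is illustrative and not part of the repo; the checkpoints and CSVs must still be downloaded separately):

```python
from pathlib import Path

# Directory layout expected by the training and generation scripts.
# The names mirror the tree above.
LAYOUT = [
    "moldata/checkpoint",
    "moldata/finetune",
    "moldata/generate",
    "moldata/output",
    "moldata/vocab_list",
]

def scaffold(root: str = ".") -> None:
    """Create the empty moldata directory tree under `root`."""
    for rel in LAYOUT:
        Path(root, rel).mkdir(parents=True, exist_ok=True)
```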
- First, preprocess the fine-tuning dataset by generating candidate molecules with the pre-trained model. The preprocessed data will be stored in the folder `output`.

```shell
cd MolGen
bash preprocess.sh
```
- Then apply the self-feedback paradigm for fine-tuning. The fine-tuned model will be stored in the folder `checkpoint`.

```shell
bash finetune.sh
```
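Conceptually, the self-feedback step ranks the model's own candidates by a chemical property score and steers fine-tuning toward the better ones. A toy, stdlib-only sketch of that selection step (the candidates, scores, and function below are illustrative, not the repo's actual code; the real pipeline scores SELFIES strings with chemistry toolkits):

```python
def select_feedback(candidates, scores, keep_ratio=0.5):
    """Rank self-generated candidates by property score and keep the
    top fraction as preferred targets for the next fine-tuning round."""
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    n_keep = max(1, int(len(ranked) * keep_ratio))
    return [mol for mol, _ in ranked[:n_keep]]

# Made-up SELFIES-like strings with QED-like scores in [0, 1].
mols = ["[C][C][O]", "[C][=C][C]", "[C][O][C]", "[N][C][C]"]
qed_like = [0.61, 0.47, 0.83, 0.55]
print(select_feedback(mols, qed_like))  # → ['[C][O][C]', '[C][C][O]']
```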
To generate molecules, run the script below and set `checkpoint_path` to choose between the pre-trained model and the fine-tuned model.

```shell
cd MolGen
bash generate.sh
```
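To inspect the generated candidates afterwards, the CSVs under `output` can be read with the standard library. The column names and inline sample below are assumptions for illustration; check the headers `generate.sh` actually writes:

```python
import csv
import io

# Inline stand-in for a candidates CSV under moldata/output.
# The "smiles" and "plogp" column names are assumed, not verified.
SAMPLE = """smiles,plogp
CCO,-0.23
c1ccccc1,1.69
"""

def load_candidates(fp):
    """Yield (molecule, score) pairs from a candidates CSV file object."""
    for row in csv.DictReader(fp):
        yield row["smiles"], float(row["plogp"])

for mol, score in load_candidates(io.StringIO(SAMPLE)):
    print(mol, score)
```

For a real run, replace `io.StringIO(SAMPLE)` with `open("moldata/output/<your_file>.csv")`.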
We conduct experiments on well-known benchmarks to confirm MolGen's optimization capabilities, encompassing penalized logP, QED, and molecular docking properties. For detailed experimental settings and analysis, please refer to our paper.
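As a reminder of one objective, penalized logP is commonly defined as logP minus the synthetic-accessibility score minus the number of rings longer than six atoms. A minimal sketch from precomputed descriptors (in practice these come from a chemistry toolkit such as RDKit, which this snippet deliberately does not call):

```python
def penalized_logp(logp: float, sa_score: float, n_long_cycles: int) -> float:
    """Penalized logP = logP - SA score - count of cycles with > 6 atoms.
    All inputs are precomputed descriptors, normally obtained from RDKit."""
    return logp - sa_score - n_long_cycles

# Example with made-up descriptor values:
print(round(penalized_logp(2.5, 3.1, 1), 2))  # → -1.6
```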
*(Result figures: optimization performance on the penalized logP, QED, and molecular docking benchmarks; see the paper for the full-resolution versions.)*
If you use or extend our work, please cite the paper as follows:
```bibtex
@inproceedings{fang2023domain,
  author    = {Yin Fang and
               Ningyu Zhang and
               Zhuo Chen and
               Xiaohui Fan and
               Huajun Chen},
  title     = {Domain-Agnostic Molecular Generation with Chemical Feedback},
  booktitle = {{ICLR}},
  publisher = {OpenReview.net},
  year      = {2024},
  url       = {https://openreview.net/pdf?id=9rPyHyjfwP}
}
```