This repository contains the code used in the study *What Information is Necessary and Sufficient to Predict Materials Properties using Machine Learning?*. It builds on four materials property prediction machine learning (ML) framework repositories: MEGNet, CGCNN, CrabNet, and Roost.
Please cite the following work if you use CompStruct:
@misc{https://doi.org/10.48550/arxiv.2206.04968,
doi = {10.48550/ARXIV.2206.04968},
url = {https://arxiv.org/abs/2206.04968},
author = {Tian, Siyu Isaac Parker and Walsh, Aron and Ren, Zekun and Li, Qianxiao and Buonassisi, Tonio},
keywords = {Materials Science (cond-mat.mtrl-sci), Computational Physics (physics.comp-ph), FOS: Physical sciences},
title = {What Information is Necessary and Sufficient to Predict Materials Properties using Machine Learning?},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}
We recommend building a separate virtual environment for each ML framework to avoid package conflicts. Our installation instructions are based on those in the respective framework repositories. We first introduce how to create virtual environments and then give instructions for installing each framework.
Ways to create virtual environments with Anaconda1

- Download and install Anaconda.
- Navigate to the CompStruct repository directory (from above).
- Open Anaconda prompt in this directory.
- If there is a conda environment `.yml` file, run the following command from Anaconda prompt to automatically create an environment and install the packages it lists: `conda env create --file *.yml` (the environment name is defined in the `*.yml` file). Otherwise, run the following command to create a new conda environment with a specific Python version first, and install packages later: `conda create -n your_env_name python=3.x`, where the environment name and Python version should be filled in, e.g., `conda create -n megnet python=3.6`.
- Run one of the following commands from Anaconda prompt, depending on your operating system, to activate the environment (here named `crabnet` as an example): `conda activate crabnet` or `source activate crabnet`. Upon activation, this environment can be used for installing the required packages and then running the code.

For more information about creating, managing, and working with conda environments, please consult the relevant help page.
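For concreteness, a conda environment `.yml` file of the kind referenced above has the following general shape. This is an illustrative sketch only, not the actual `conda-env_crabnet.yml` shipped with this repository; the channels and package list are placeholders:

```yaml
name: crabnet            # the name picked up by `conda activate crabnet`
channels:
  - pytorch
  - defaults
dependencies:
  - python=3.8
  - pip
  - pip:
      - numpy            # placeholder; see the actual .yml for the real package list
```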
Before installing any of the following packages, clone or download the CompStruct repository and navigate to the CompStruct repository directory in your Anaconda prompt.
- Build a Python 3.7 virtual environment.
- Once the environment is activated, run `pip install -r requirements_megnet.txt`.

Avoid using `pip install megnet` as instructed in the MEGNet installation guide, because we have modified the MEGNet source code and included the modified code in this repository. Our modified code is based on MEGNet version 1.2.8.
- Build a Python 3.6 virtual environment.
- Once the environment is activated, run `pip install -r requirements_cgcnn.txt`.

The included CGCNN code is a modified version of the CGCNN repository as of Dec 3, 2021.
First Option

- Run `conda env create --file conda-env_crabnet.yml`, or `conda env create --file conda-env-cpuonly_crabnet.yml` if you only have a CPU and no GPU in your system. This creates a virtual environment named `crabnet` and installs the required packages.
- Activate the built environment by running one of the following commands from Anaconda prompt, depending on your operating system: `conda activate crabnet` or `source activate crabnet`.
Second Option

- Build a Python 3.8 virtual environment.
- Once the environment is activated, open `conda-env_crabnet.yml` and `pip install` all the packages listed there.
IMPORTANT - if you want to reproduce Figures 1 and 2 of the CrabNet publication:2

The PyTorch built-in function for computing multi-headed attention defaults to averaging the attention matrix across all heads. Thus, to obtain the per-head attention information, we have to edit a bit of PyTorch's source code so that the individual attention matrices are returned.
To properly export the attention heads from the PyTorch `nn.MultiheadAttention` implementation within the transformer encoder layer, you will need to manually modify some of the source code of the PyTorch library. This applies to PyTorch v1.6.0, v1.7.0, and v1.7.1 (and potentially to other, untested versions).

For this, open the file
`C:\Users\{USERNAME}\Anaconda3\envs\{ENVIRONMENT}\Lib\site-packages\torch\nn\functional.py`
(where `USERNAME` is your Windows user name and `ENVIRONMENT` is your conda environment name; if you followed the steps above, it should be `crabnet`).
At the end of the function definition of `multi_head_attention_forward` (line numbers may differ slightly):

```python
L4011 def multi_head_attention_forward(
# ...
# ... [some lines omitted]
# ...
L4291     if need_weights:
L4292         # average attention weights over heads
L4293         attn_output_weights = attn_output_weights.view(bsz, num_heads, tgt_len, src_len)
L4294         return attn_output, attn_output_weights.sum(dim=1) / num_heads
L4295     else:
L4296         return attn_output, None
```
Change the line

```python
return attn_output, attn_output_weights.sum(dim=1) / num_heads
```

to:

```python
return attn_output, attn_output_weights
```
This prevents the attention values from being returned as an average over all heads, and instead returns each head's attention matrix individually. For more information, see:
- pytorch/pytorch#34537
- pytorch/pytorch#32590
- https://discuss.pytorch.org/t/getting-nn-multiheadattention-attention-weights-for-each-head/72195/
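To see concretely what the patch changes, here is a small pure-Python illustration (no PyTorch required; the 2x2 matrices are made-up values) of how summing per-head attention matrices and dividing by the number of heads, as the unpatched line does, collapses all head-specific information:

```python
# Two hypothetical attention heads for a sequence of length 2
# (rows: target positions, columns: source positions).
head_0 = [[1.0, 0.0],
          [0.0, 1.0]]
head_1 = [[0.0, 1.0],
          [1.0, 0.0]]
heads = [head_0, head_1]
num_heads = len(heads)

# What the unpatched line computes: sum over heads / num_heads.
averaged = [
    [sum(h[i][j] for h in heads) / num_heads for j in range(2)]
    for i in range(2)
]
print(averaged)   # [[0.5, 0.5], [0.5, 0.5]] -- the two distinct heads are indistinguishable

# What the patched line returns: the per-head matrices, untouched.
print(heads[0])   # [[1.0, 0.0], [0.0, 1.0]]
```

The averaged matrix is uniform even though the two heads attend to completely different positions, which is why the per-head matrices are needed to reproduce the attention figures.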
The included CrabNet code is a modified version of the CrabNet repository as of Dec 3, 2021.
First Option

- Run `conda env create --file conda-env_roost.yml`. This creates a virtual environment named `roost` and installs the required packages.
- Activate the built environment by running one of the following commands from Anaconda prompt, depending on your operating system: `conda activate roost` or `source activate roost`.
If you are not using `cudatoolkit` version 11.1 or do not have access to a GPU, this setup will not work for you. In that case, please check the PyTorch and PyTorch-Scatter pages for how to install the core packages, and then install the remaining requirements as detailed in `requirements_roost.txt`.
Second Option

- Build a Python 3.8 virtual environment.
- Once the environment is activated, run `pip install -r requirements_roost.txt --find-links https://data.pyg.org/whl/torch-1.9.0+cu111.html`.

The extra `--find-links` caters for the installation of PyTorch Scatter. This link is for `pytorch` version 1.9.0 and `cudatoolkit` version 11.1, the defaults in the Roost installation. If you do not have access to a GPU or are not using `cudatoolkit` version 11.1, refer to PyTorch Scatter for how to change `cu111` to `cpu` (CPU only), `cu102` (`cudatoolkit` version 10.2), etc. in the `--find-links` link.
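For example, a CPU-only installation would use the `cpu` wheel index instead. The URL pattern below follows the PyTorch Scatter installation page; please verify it there before running:

```shell
pip install -r requirements_roost.txt --find-links https://data.pyg.org/whl/torch-1.9.0+cpu.html
```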
The included Roost code is a modified version of the Roost repository as of Dec 3, 2021.
To get the results shown in Figures 3, S1, S2, and S3 for the various datasets, run the main results. Before running each main result, activate the respective environment and navigate to the CompStruct repository directory. Obtain the datasets first before running any results.
- Download the compressed data file `data.tar.gz` from https://figshare.com/articles/dataset/data_tar_gz/20161235.
- Move `data.tar.gz` to the CompStruct repository directory.
- Run `tar -xvf data.tar.gz` in Anaconda prompt after navigating into the CompStruct repository directory.

The datasets will automatically appear in the folder `data` after uncompressing with `tar`.
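If the `tar` command is not available on your system, the same extraction can be done with Python's standard library. This is a minimal sketch assuming `data.tar.gz` has been downloaded into the CompStruct repository directory:

```python
import tarfile

def extract_dataset(archive_path="data.tar.gz", dest="."):
    """Unpack the gzipped tarball; the datasets then appear under dest/data."""
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)

# usage, from the CompStruct repository directory:
# extract_dataset()
```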
| Frameworks | Run Options |
|---|---|
| MEGNet | Run `python main_megnet.py` using default parameters<br>Run `python main_megnet.py --help` to see the parameters available for toggling |
| CGCNN | Run `python main_cgcnn.py` using default parameters<br>Run `python main_cgcnn.py --help` to see the parameters available for toggling |
| Roost | Run `python main_roost.py --train --evaluate` using default parameters<br>Run `python main_roost.py --help` to see the parameters available for toggling |
| CrabNet | Run `python main_crabnet.py` using default parameters<br>Run `python main_crabnet.py --help` to see the parameters available for toggling |
Run results are stored in the respective framework folders in `results`, trained models in the respective folders in `models`, and predicted vs. actual property plots in the respective folders in `plots`. The `models` and `plots` folders will only appear after the `main_*.py` scripts are run. Currently the `results` folder hosts results from the runs in the study; these will be replaced if the main scripts are run.
Run the `.py` files for the respective figures in the `publication figures` folder.

- Navigate to the `publication figures` folder.
- Run `python figure_3.py` to generate Figure 3, and the other scripts to generate the other figures.
| Scripts | Description |
|---|---|
| `main_megnet.py` | Main files to run for training the various ML frameworks |
| `main_cgcnn.py` | |
| `main_roost.py` | |
| `main_crabnet.py` | |
| `requirements_megnet.txt` | Installation file for MEGNet |
| `requirements_cgcnn.txt` | Installation file for CGCNN |
| `requirements_roost.txt` | Installation files for Roost |
| `conda-env_roost.yml` | |
| `conda-env_crabnet.yml` | Installation files for CrabNet |
| `conda-env-cpuonly_crabnet.yml` | |
| Folders | Description |
|---|---|
| `megnet` | Modified MEGNet code |
| `cgcnn` | Modified CGCNN code |
| `roost` | Modified Roost code |
| `crabnet` | Modified CrabNet code |
| `embeddings` | Hosts one-hot embeddings used by the various frameworks |
| `data` | Hosts saved datasets used by the various frameworks. All data were queried from the Materials Project on Nov 26, 2021. Only appears after downloading and uncompressing `data.tar.gz` from figshare according to Obtain datasets. |
| `utils` | Hosts auxiliary functions |
| `results` | Hosts saved results from the runs in the study; these will be replaced once the `main_*.py` scripts are run. Inside each framework folder, the `.pickle` files record the respective data segregation, property, and score, where score is (MAE, reference MAE3). Inside the `prediction` folder in each framework folder, the `.pickle` files record `y_train`, `y_train_hat`, `y_val`, `y_val_hat`, `y_test`, and `y_test_hat`. |
| `publication figure` | Hosts scripts for generating publication figures |
The code was primarily written by Siyu Isaac Parker Tian, under the supervision of Tonio Buonassisi and Qianxiao Li.
Footnotes
- Explanations of this section borrow from the CrabNet installation instructions. ↩
- The reference MAE is calculated by taking every predicted property in the test set to be the mean of the actual property values in the test set, following the convention used in the Matbench paper. ↩
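The reference-MAE convention described in the footnote above can be sketched in plain Python (the numbers below are made-up illustrative values, not data from the study):

```python
from statistics import mean

def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

def reference_mae(y_true):
    """MAE obtained by predicting the test-set mean for every test sample."""
    baseline = mean(y_true)
    return mae(y_true, [baseline] * len(y_true))

y_test = [1.0, 2.0, 3.0, 6.0]        # actual properties (illustrative)
y_test_hat = [1.5, 2.0, 2.5, 5.0]    # model predictions (illustrative)
print(mae(y_test, y_test_hat))       # 0.5
print(reference_mae(y_test))         # 1.5  (mean of y_test is 3.0)
```

A model is only informative if its MAE beats the reference MAE; here 0.5 < 1.5, so the illustrative model does better than the mean-baseline.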