Authors: Ian Pendleton, Michael Tynes, Aaron Dharna
Science Contact: jschrier .at. fordham.edu, ian .at. pendletonian.com
Technical Debugging: vshekar .at. haverford.edu, gcattabrig .at. haverford.edu
Retrieves experiment files from supported locations and processes them into intermediary JSON files on the user's local machine. The generated JSON files are used to produce a 2D CSV of the data in a format
compatible with most machine learning software (e.g., scikit-learn). Additional configuration is required to map the existing
data structures to headers that match the user's desired configuration. These mappings are typically trivial for computer
scientists, but may be more challenging for non-domain experts or individuals unfamiliar with manipulating dataframes. The
dataset is augmented with chemical calculations such as concentrations, temperatures derived from models of plate temperature,
and other empirical observations. In the final steps the dataset is supplemented with chemical features and calculations derived from ChemAxon, RDKit, and local datasets saved to this repository. Additional information on how to control the generation of _feat_ and _calc_
columns can be found in the user documentation here.
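As a sketch of how the exported 2D CSV can be consumed downstream, the _{category}_ prefixes make it easy to slice input and output columns with pandas. The column names below are hypothetical stand-ins; the real headers depend on the dataset and on the dataset_rename.json mapping:

```python
import pandas as pd

# Hypothetical rows standing in for a generated report CSV (e.g. perovskitedata.csv);
# the actual column names depend on the dataset and dataset_rename.json mapping.
df = pd.DataFrame({
    "name": ["exp_0", "exp_1"],
    "_rxn_temperatureC": [45.0, 60.0],
    "_feat_molweight": [461.0, 123.9],
    "_out_crystalscore": [4, 1],  # assumed example of an output column
})

# Select model inputs by their _{category}_ prefixes, and a target column.
X = df[[c for c in df.columns if c.startswith(("_rxn_", "_feat_"))]]
y = df["_out_crystalscore"]
print(list(X.columns))  # ['_rxn_temperatureC', '_feat_molweight']
```

A frame selected this way can be passed directly to scikit-learn estimators as `fit(X, y)`.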
The original ESCALATE publication can be found here.
User documents, relating to a complete cycle of escalate, can be found here.
This build process has been tested on MacOS High Sierra (10.13.5), MacOS Catalina (10.15.3), Ubuntu Bionic Beaver (18.04), and Windows 10 (version 1909 OS Build 18363.418)
Windows Users: Please note that while Windows has been tested, it is not the recommended operating system. Everything is more challenging: the installation is messier, logging is limited, and the file-system interaction is more brittle.
- Create a new Python 3.8 environment in conda and activate it:

  conda create -n escalate_report python=3.8
  conda activate escalate_report
- Install the latest version of the pip package manager:

  conda install pip
- Then install the requirements (still in escalate_report):

  pip install -r requirements.txt
- Then install the conda-dependent pieces:

  conda install -c conda-forge rdkit
- Execute:

  conda update conda
  conda env create -f environment.yml

  The conda env create command will automatically create an escalate_report environment.
Pip install the following Python packages prior to use:
- pandas, json, numpy, gspread, pydrive, cerberus, google-api-python-client==1.7.4, xlrd, xlwt, tqdm, pytest

  conda install -c conda-forge rdkit
Please report any failures of the above steps to the repo admins.
- Download the securekey files and move them into the root folder (./, a.k.a. the current working directory, or ESCALATE_report-master/ if downloaded from git). Do not distribute these keys! (Contact a dev for access.)
- Ensure that the files 'client_secrets.json' and 'creds.json' are both present in the root folder. The correct folder for these keys is the one that contains the runme.py script.
- Stop here if you don't want to use ChemAxon for feature generation. RDKit and the available ESCALATE features will still be generated.
- Note: ESCALATE will throw warnings if ChemAxon features are implemented in type_command.csv; these can be ignored if that is the desired functionality.
- Download and install ChemAxon JChemSuite and obtain a ChemAxon license (free for academic use).
- Follow the installation instructions found on ChemAxon's website. Be sure to note the location of the JChemSuite installation (i.e., ~/opt/chemaxon/jchemsuite/bin on Linux or /Applications/JChemSuite/bin/ on MacOSX).
- There are also docs on license installation using a graphical user interface (GUI) here: https://docs.chemaxon.com/display/docs/Licenses.html
- You will need to specify the location of your ChemAxon installation at the bottom of ./expworkup/devconfig.py.
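For illustration only, the entry at the bottom of ./expworkup/devconfig.py might look like the following; the actual variable name is an assumption here, so check the file itself for the exact key:

```python
# Hypothetical sketch of the ChemAxon path setting at the bottom of
# ./expworkup/devconfig.py -- the real variable name may differ.
cxcalc_path = "/Applications/JChemSuite/bin/"  # MacOSX example from above
# cxcalc_path = "~/opt/chemaxon/jchemsuite/bin"  # Linux example
print(cxcalc_path)
```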
Currently supported google_drive_target_name (user-defined folder names):
- MIT Data: MIT_PVLab
- HC and LBL Data: 4-Data-WF3_Iodide, 4-Data-WF3_Alloying, 4-Data-Bromides, 4-Data-Iodides
- Development: dev
A more detailed instruction manual, including videos overviewing how to operate the code, can be found in the ESCALATE user manual.
Definitions
<my_local_folder>: the name of the folder where files should be created. This will be created automatically by ESCALATE_report if it does not exist. The specified name will also be used for the final exported CSV (i.e., if <my_local_folder> is perovskitedata, perovskitedata.csv will be generated).
<google_drive_target_name>: one or more of the available datasets. See examples below.
- You can always get runtime information by executing:

  python runme.py --help
- To execute a normal run with ChemAxon, RDKit, and ESCALATE calcs (see the installation instructions above for more details):

  python runme.py <my_local_folder> -d <google_drive_target_name>
- To improve the clarity of column headers, specify them in the dataset_rename.json file. All columns can be viewed in an initial run by executing:

  python runme.py <my_local_folder> -d <google_drive_target_name> --raw 1
- Columns that do not conform to the _{category}_ naming scheme (e.g., _feat_, _rxn_) will be omitted unless --raw 1 is enabled!
  - A list of the columns not conforming to the naming scheme will be exported to './<my_local_folder>/logging/UNNAMED_REPORT_COLUMNS.txt'.
  - The USER can specify an appropriate name in dataset_rename.json.
  - To see all columns with naming taken directly from the datasource, use: --raw 1
  - Conflicting namespaces will be purged!
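As an illustration, a dataset_rename.json entry maps a raw datasource header to a conforming _{category}_ name. The column names and the exact schema below are hypothetical; see the USER docs for the authoritative format:

```json
{
    "raw_temperature_c": "_rxn_temperatureC",
    "raw_organic_molarity": "_rxn_M_organic"
}
```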
- Significant flexibility is enabled for _feat_ (via type_command.csv) and _calc_ (via ./utils/calc_command.py) specification. For examples, discussion, and limitations of these specifications, please see the USER docs.
  - _calc_ generation can be skipped by using the --disablecalcs True flag on the CLI.
  - To speed up calc and feature development, the first portion of the code can be skipped by:
    - Running the code with --offline 1
    - After the first iteration completes, running future instances with --offline 2
- A file named <my_local_folder>.csv will contain the 2D CSV of the dataset using the configured headers from the data or the mapping developed for the lab. The data/ folder will contain the generated JSONs.
- Intermediate dataframes can be exported in bulk by specifying:

  python runme.py <my_local_folder> -d <google_drive_target_name> --debug 1
To add additional target directories, please see the how-to guide here. If you would like to add these to the existing datasets, please issue a git merge request after you add the necessary information.
More detailed instructions can be found in the ESCALATE user manual.
If you are using Windows 10, please follow these instructions on what you will need to set up your environment. Consider using Ubuntu or WSL instead!
- Ensure that the versioned data repo and escalation are installed.
- Create an issue on the versioned repo with a new crank-number.
- python runme.py <my_local_folder> -d <google_drive_target_name> -v <crank-number>
- This will generate files for upload to the versioned data repo with the names:
  - <crank-number>.<dataset-name>.csv
  - <crank-number>.<dataset-name>.index.csv
- Move these files to the /pathto/versioned-dataset/data/perovskite/<my_local_folder> folder.
- Follow the Readme.md instructions for versioned-datasets.
- python runme.py <my_local_folder> -d <google_drive_target_name> -v <crank-number> -s <state-set_file_name.csv>
- Follow steps 5-6 above.
- python runme.py 4-Data-Iodides -d 4-Data-Iodides
- python runme.py 4-Data-Iodides -d 4-Data-Iodides 4-Data-WF3_Iodide 4-Data-WF3_Alloying
- python runme.py dev -d dev --debug 1 --raw 1 --offline 1
- python runme.py perovskitedata -d 4-Data-Iodides --verdata 0111 --state example.csv
- FAQs
- Troubleshooting Help: please send the log file, any terminal output, and a brief explanation to ipendlet .at. haverford.edu for help.
- Tutorials