ETL
Discussion of CLI option --etl True or --etl 1
Last updated on May 26th, 2020. Report V1.12
The intention of this toggle is to provide easy export for tables of data. The tables included are those requested by Gary Cattabriga with the intention of importing into the ESCALATE V3 database.
The two options which can be used to generate these files are as follows:
- --debug True : exports all dataframe intermediates as CSV files prefixed with 'REPORT_', using default names
- --etl True : removes the header and footer from the 'REPORT_' CSV files exported with the --debug option. If debug is false, this option does nothing.
These messages can be accessed by running python runme.py --help
The following is a brief description of the files that are generated when the options above are toggled on (true). For each primary table, the summary gives its name, a brief description of its contents, and its intended use.
- index : inchikeys
- columns : features
Purpose : This set of data is used to assign features to experiments (merged on 'name' / 'runUID') via the REPORT_UID_LOADTABLE.csv
Other Notes : columns can include things like 'XXPASSTHROUGHXX', smiles, smiles_standardize, and chemical types. These are supplementary data to the features. All but the XXPASSTHROUGHXX columns are removed prior to curation of the final dataframe. Passthrough columns hold data which are not features, but which are renamed downstream:
- ex. XXPASSTHROUGHXX_I_count --> _raw_inorganic_0_i_count
- index : name (runUIDs), ex. 2019-04-19T14_28_09.238360+00_00_LBL_A1
- columns : unique inchikeys, ex. inchikey_0
Purpose : Provide a list of unique inchikeys of compounds used in the course of a single experiment (no duplicates). This is excellent for merging the feature table.
Other Notes : None
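The merge between the per-run inchikey table and the feature table can be sketched as follows. This is a minimal, dependency-free illustration rather than the report code itself: the feature names (`feat_a`, `feat_b`), the second inchikey, and the helper `features_for_run` are hypothetical, and the real pipeline works on dataframes.

```python
# Hypothetical stand-in for the feature table: index = inchikey, columns = features.
feature_table = {
    "QHJPGANWSLEMTI-UHFFFAOYSA-N": {"feat_a": 1.0, "feat_b": 2.0},
    "PLACEHOLDER-INCHIKEY-B": {"feat_a": 3.0, "feat_b": 4.0},  # made-up key
}

# Hypothetical stand-in for the per-run table: index = name (runUID),
# columns = inchikey_0, inchikey_1, ... (one unique inchikey per column).
run_table = {
    "2019-04-19T14_28_09.238360+00_00_LBL_A1": {
        "inchikey_0": "QHJPGANWSLEMTI-UHFFFAOYSA-N",
        "inchikey_1": "PLACEHOLDER-INCHIKEY-B",
        "inchikey_2": "",  # assumption: unused slots are blank
    },
}


def features_for_run(run_row, features):
    """Look up the feature row for every non-empty inchikey column of one run."""
    merged = {}
    for column, inchikey in run_row.items():
        if inchikey:
            # Unknown inchikeys map to an empty feature dict instead of raising.
            merged[column] = features.get(inchikey, {})
    return merged


merged_features = {
    name: features_for_run(row, feature_table) for name, row in run_table.items()
}
```

Because each run lists every inchikey only once, the lookup never attaches the same feature row twice to a single experiment.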
- index : None (by default), though InChI Key (ID) is likely the best unique index target
- columns : all of the chemical details
Purpose : Stores the data associated with each inchikey from a particular lab. The report code used the UID scheme to infer the laboratory origin of each experiment folder.
Other Notes : <lab> is defined in the devconfig.py file
- index : name (runUIDs), ex. 2019-04-19T14_28_09.238360+00_00_LBL_A1
- columns : all of the measured properties from all experiments in the dataset.
- index : multiindexed on [name, inchikey]. This means that each unique inchikey in a run/experiment adds an additional row to the dataframe
- columns : mmol of the runUID (name) + inchikey in the experiment. Units are 'mmol'; type is float
- index : name (runUIDs), ex. 2019-04-19T14_28_09.238360+00_00_LBL_A1
- columns : mmol of a particular inchikey. column headers are only the inchikey value, but the units in the table are mmol (i.e., column header == QHJPGANWSLEMTI-UHFFFAOYSA-N)
Other Notes : variation on the REPORT_MMOL_CALCS.csv. Instead of having the inchikeys as part of the index, the mmol of each inchikey is in the column
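The relationship between the two mmol layouts (one row per [name, inchikey] pair versus one column per inchikey) can be sketched as a long-to-wide pivot. This is an illustrative, stdlib-only sketch: the helper `long_to_wide`, the second run name, the placeholder inchikey, and the fill value for chemicals absent from a run are all assumptions, not details taken from the report code.

```python
from collections import defaultdict

# Long layout: one (name, inchikey, mmol) record per multiindex row.
long_rows = [
    ("2019-04-19T14_28_09.238360+00_00_LBL_A1", "QHJPGANWSLEMTI-UHFFFAOYSA-N", 1.5),
    ("2019-04-19T14_28_09.238360+00_00_LBL_A1", "PLACEHOLDER-INCHIKEY-B", 0.5),
    ("hypothetical_run_B", "QHJPGANWSLEMTI-UHFFFAOYSA-N", 2.0),
]


def long_to_wide(rows, fill=0.0):
    """Pivot (name, inchikey, mmol) records into {name: {inchikey: mmol}}.

    `fill` is used for inchikeys that do not occur in a run; 0.0 is an
    assumption here, the exported table may use a different placeholder.
    """
    inchikeys = sorted({inchikey for _, inchikey, _ in rows})
    wide = defaultdict(lambda: dict.fromkeys(inchikeys, fill))
    for name, inchikey, mmol in rows:
        wide[name][inchikey] = mmol
    return dict(wide)


wide_table = long_to_wide(long_rows)
```

In the wide result every run has the same set of inchikey columns, which is what makes that layout convenient for per-chemical comparisons across experiments.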
- index : name (runUIDs), ex. 2019-04-19T14_28_09.238360+00_00_LBL_A1
- columns : _raw_<type>_<instance #>_molarity, ex. _raw_acid_0_molarity or _raw_inorganic_0_molarity. Units are molarity and each instance is a unique chemical in the experiment.
Other Notes : The instance number is not maintained between experiments. An inchikey could first appear in 'instance 0' of one experiment and be assigned to a different instance number in another.
- index : name (runUIDs), ex. 2019-04-19T14_28_09.238360+00_00_LBL_A1
- columns : _raw_<type>_molarity, which sums all chemicals of the same type into a single column
Other Notes : even if there are multiple unique inchikeys of the same type the values will be summed
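The step from the per-instance columns (`_raw_<type>_<instance #>_molarity`) to the per-type columns (`_raw_<type>_molarity`) can be sketched as a sum grouped on the type part of the column name. This is a hedged illustration, not the report's implementation: the helper `sum_by_type` is hypothetical, and the regular expression assumes that `<type>` consists of letters only.

```python
import re
from collections import defaultdict

# Assumption: <type> is purely alphabetic and <instance #> is a non-negative integer.
INSTANCE_COLUMN = re.compile(r"^_raw_(?P<type>[A-Za-z]+)_(?P<instance>\d+)_molarity$")


def sum_by_type(row):
    """Collapse _raw_<type>_<n>_molarity columns of one run into _raw_<type>_molarity."""
    totals = defaultdict(float)
    for column, value in row.items():
        match = INSTANCE_COLUMN.match(column)
        if match:  # columns that do not follow the pattern are ignored
            totals[f"_raw_{match.group('type')}_molarity"] += value
    return dict(totals)


# Example values are made up for illustration.
example_row = {
    "_raw_acid_0_molarity": 1.0,
    "_raw_inorganic_0_molarity": 0.5,
    "_raw_inorganic_1_molarity": 0.25,
}
summed_row = sum_by_type(example_row)
```

Grouping on the type alone also sidesteps the fact that instance numbers are not stable between experiments.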
- index : multiindexed on [name, inchikey]. This means that each unique inchikey in a run/experiment adds an additional row to the dataframe
- columns : molarity of the runUID (name) + inchikey in the experiment. Units are 'molarity'; type is float
Other Notes : as of report v1.12 the molarity calculation uses the SolUD model derived volumes. This can be found in the publication (link pending)
- index : name (runUIDs), ex. 2019-04-19T14_28_09.238360+00_00_LBL_A1
- columns : molarity of a particular inchikey. Column headers are only the inchikey value, but the units in the table are molarity (i.e., column header == QHJPGANWSLEMTI-UHFFFAOYSA-N)
Other Notes : as of report v1.12 the molarity calculation uses the SolUD model derived volumes. This can be found in the publication (link pending)
The following serves as a guide to columns generated with the calc_command.py tool. Each calculation will export a <shortname>.csv file; the more calculations entered in the calc_command.py file during run execution, the more of these files will be generated. A few examples are described in detail to provide an idea of the potential variation in the csv files.
More details about how to use calc_command.py can be found in the report section of the user manual. These descriptions cover the structure of the tables generated from each function.
Simplest Arithmetic Calc: _RAW_MOLFRACTION_ACID.csv
Calc generated from a user defined function: _CALC_ACID_SOLVENT_AVERAGE_HANSEN_DELTAP.csv
Calc generated from a user defined function calling on columns from a regex: _FEAT_HALIDE_ELECTRONEGATIVITY.csv