Ian Pendleton edited this page May 30, 2020 · 2 revisions

Discussion of CLI option --etl True or --etl 1

Last updated on May 26th, 2020. Report V1.12

Summary of ETL Process

The intention of this toggle is to provide easy export for tables of data. The tables included are those requested by Gary Cattabriga with the intention of importing into the ESCALATE V3 database.

The two options which can be used to generate these files are as follows:

  • --debug True : exports all intermediate dataframes as CSV files with default names prefixed with 'REPORT_'

  • --etl True : removes the header and footer from the 'REPORT_' CSV files exported with the --debug option. If --debug is false, this option does nothing.

These messages can be accessed by running python runme.py --help

Table Overview

The following is a brief description of the files that are generated when the options above are toggled on (True). The summary lists each primary table generated, what it contains, and its intended use.

REPORT_INCHI_FEATURES_TABLE.csv

  • index : inchikeys
  • columns : features

Purpose : This set of data is used to assign features to experiments (merged on 'name' / 'runUID') via the REPORT_UID_LOADTABLE.csv

Other Notes : features can include things like 'XXPASSTHROUGHXX', smiles, smiles_standardize, and chemical types. These are supplementary data to the features. All but the XXPASSTHROUGHXX are removed prior to curation of the final dataframe. Passthrough columns are data which are not features, but are renamed downstream:

  • ex. XXPASSTHROUGHXX_I_count --> _raw_inorganic_0_i_count

REPORT_UID_LOADTABLE.csv

  • index : name (runUIDs), ex. 2019-04-19T14_28_09.238360+00_00_LBL_A1
  • columns : unique inchikeys, ex. inchikey_0

Purpose : Provide a list of unique inchikeys of compounds used in the course of a single experiment (no duplicates). This is excellent for merging the feature table.

Other Notes : None
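
As a rough illustration of the merge described above, the following sketch reshapes a load table into one row per (run, inchikey) pair and joins the feature table onto it. The tiny dataframes, the inchikey strings, the 'feat_x' column and the attach_features helper are all hypothetical stand-ins, not output from the report code; real tables would be read from the exported CSV files.

```python
import pandas as pd

# Hypothetical stand-in for REPORT_UID_LOADTABLE.csv: one row per runUID,
# one column per unique inchikey slot (inchikey_0, inchikey_1, ...).
load_table = pd.DataFrame(
    {
        "inchikey_0": ["AAA-KEY", "AAA-KEY"],
        "inchikey_1": ["BBB-KEY", None],
    },
    index=pd.Index(["run_A1", "run_A2"], name="name"),
)

# Hypothetical stand-in for REPORT_INCHI_FEATURES_TABLE.csv:
# indexed on inchikey, one column per feature.
features = pd.DataFrame(
    {"feat_x": [1.0, 2.0]},
    index=pd.Index(["AAA-KEY", "BBB-KEY"], name="inchikey"),
)


def attach_features(load_table, features):
    """Return one row per (run, inchikey) pair with the features attached."""
    long = (
        load_table.reset_index()
        .melt(id_vars="name", var_name="slot", value_name="inchikey")
        .dropna(subset=["inchikey"])  # drop empty inchikey slots
    )
    # Left join keeps every run/inchikey pair even if features are missing.
    return long.merge(features.reset_index(), on="inchikey", how="left")


merged = attach_features(load_table, features)
print(merged)
```

A left join is used here so that a run is never silently dropped when an inchikey has no entry in the feature table; whether that is the desired behavior depends on the downstream import.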

REPORT_LBL_INVENTORY.csv (or REPORT_<lab>_INVENTORY.csv)

  • index : None (by default), though InChI Key (ID) is likely the best unique index target
  • columns : all of the chemical details

Purpose : Stores the data associated with each inchikey from a particular lab. The report code uses the UID scheme to infer the laboratory origin of each experiment folder.

Other Notes : <lab> is defined in the devconfig.py file

REPORT_MEASURES.csv

  • index : name (runUIDs), ex. 2019-04-19T14_28_09.238360+00_00_LBL_A1
  • columns : all of the measured properties from all experiments in the dataset.

REPORT_MMOL_CALCS.csv

  • index : multiindexed on [name, inchikey]. This means that each unique inchikey in a run/experiment adds an additional row to the dataframe.
  • columns : mmol of the runUID (name) + inchikey in the experiment. Units are 'mmol'; type is float.

REPORT_MMOL_INCHICOLS_CALCS.csv

  • index : name (runUIDs), ex. 2019-04-19T14_28_09.238360+00_00_LBL_A1
  • columns : mmol of a particular inchikey. Column headers are only the inchikey value, but the units in the table are mmol (e.g., column header == QHJPGANWSLEMTI-UHFFFAOYSA-N)

Other Notes : A variation on REPORT_MMOL_CALCS.csv. Instead of having the inchikeys as part of the index, the mmol of each inchikey is in its own column.
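
The relationship between the two mmol tables can be sketched with pandas as a long-to-wide reshape. This is only an illustration under assumed names: the value column name 'mmol', the run names and the inchikey strings are hypothetical, and the report code may build the wide table differently.

```python
import pandas as pd

# Hypothetical stand-in for REPORT_MMOL_CALCS.csv:
# multiindexed on [name, inchikey], one row per inchikey per run.
long_mmol = pd.DataFrame(
    {"mmol": [0.5, 1.5, 0.7]},
    index=pd.MultiIndex.from_tuples(
        [("run_A1", "AAA-KEY"), ("run_A1", "BBB-KEY"), ("run_A2", "AAA-KEY")],
        names=["name", "inchikey"],
    ),
)

# Move the inchikey index level into the columns, giving the layout
# described for REPORT_MMOL_INCHICOLS_CALCS.csv: one row per run,
# one column per inchikey.
wide_mmol = long_mmol["mmol"].unstack("inchikey")
print(wide_mmol)
```

In this sketch a run that does not contain a given inchikey gets NaN in that column; how the exported file represents such cells is not specified here.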

REPORT_MOLARITY_BYTYPE_BYINSTANCE_CALCS.csv

  • index : name (runUIDs), ex. 2019-04-19T14_28_09.238360+00_00_LBL_A1
  • columns : _raw_<type>_<instance #>_molarity, ex. _raw_acid_0_molarity or _raw_inorganic_0_molarity. Units are molarity, and each instance is a unique chemical in the experiment.

Other Notes : The instance number is not maintained between experiments. An inchikey could first appear in 'instance 0' and later be assigned to 'instance 1'.

REPORT_MOLARITY_BYTYPE_CALCS.csv

  • index : name (runUIDs), ex. 2019-04-19T14_28_09.238360+00_00_LBL_A1
  • columns : _raw_<type>_molarity, which sums all instances of the same type into a single column

Other Notes : Even if there are multiple unique inchikeys of the same type, their values are summed into the one column for that type.
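
The by-type summation can be sketched as a column-wise group and sum over the by-instance table. The dataframe values, the run names and the sum_by_type helper below are hypothetical; only the column naming pattern (_raw_<type>_<instance #>_molarity) is taken from the descriptions above.

```python
import re

import pandas as pd

# Hypothetical stand-in for REPORT_MOLARITY_BYTYPE_BYINSTANCE_CALCS.csv.
by_instance = pd.DataFrame(
    {
        "_raw_acid_0_molarity": [1.0, 2.0],
        "_raw_inorganic_0_molarity": [0.5, 0.0],
        "_raw_inorganic_1_molarity": [0.25, 0.75],
    },
    index=pd.Index(["run_A1", "run_A2"], name="name"),
)

# Matches _raw_<type>_<instance #>_molarity and captures <type>.
PATTERN = re.compile(r"^_raw_(?P<type>.+)_\d+_molarity$")


def sum_by_type(df):
    """Sum all instance columns of the same type into _raw_<type>_molarity."""
    type_cols = {}
    for col in df.columns:
        match = PATTERN.match(col)
        if match:
            new_name = f"_raw_{match.group('type')}_molarity"
            type_cols.setdefault(new_name, []).append(col)
    # Row-wise sum across the instance columns belonging to each type.
    return pd.DataFrame(
        {new: df[cols].sum(axis=1) for new, cols in type_cols.items()}
    )


by_type = sum_by_type(by_instance)
print(by_type)
```

Because instance numbers are not stable between experiments, summing per type in this way gives columns that can be compared across runs, at the cost of losing which inchikey contributed each value.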

REPORT_MOLARITY_CALCS.csv

  • index : multiindexed on [name, inchikey]. This means that each unique inchikey in a run/experiment adds an additional row to the dataframe.
  • columns : molarity of the runUID (name) + inchikey in the experiment. Units are 'molarity'; type is float.

Other Notes : As of report v1.12, the molarity calculation uses the SolUD model derived volumes. This can be found in the publication (link pending).

REPORT_MOLARITY_INCHICOLS_CALCS.csv

  • index : name (runUIDs), ex. 2019-04-19T14_28_09.238360+00_00_LBL_A1
  • columns : molarity of a particular inchikey. Column headers are only the inchikey value, but the units in the table are molarity (e.g., column header == QHJPGANWSLEMTI-UHFFFAOYSA-N)

Other Notes : As of report v1.12, the molarity calculation uses the SolUD model derived volumes. This can be found in the publication (link pending).

The following serves as a guide to columns generated with the calc_command.py tool. Each calculation exports a <shortname>.csv file, so the more calculations entered in the calc_command.py file for a run, the more of these files are generated. A few examples are described in detail to give an idea of the potential variation in the CSV files.

More details about how to use calc_command.py can be found in the report section of the user manual. These descriptions cover the structure of the tables generated from each function.

Simplest Arithmetic Calc: _RAW_MOLFRACTION_ACID.csv

Calc generated from a user defined function: _CALC_ACID_SOLVENT_AVERAGE_HANSEN_DELTAP.csv

Calc generated from a user defined function calling on columns from a regex: _FEAT_HALIDE_ELECTRONEGATIVITY.csv