Build Data Models
Data models are built with the CellBase CLI. Running the build command without arguments prints the required options and the usage:
cellbase/build/bin$ ./cellbase.sh build
The following options are required: -d, --data -i, --input
Usage: cellbase.sh build [options]
Options:
-a, --assembly STRING Name of the assembly, if empty the first assembly in configuration.json will be used
--common STRING Directory where common multi-species data will be downloaded, this is mainly protein and expression
data [<OUTPUT>/common]
-C, --config STRING CellBase configuration.json file. Have a look at
cellbase/cellbase-core/src/main/resources/configuration.json for an example
* -d, --data STRING Comma separated list of data to build: genome, gene, disgenet, hpo, variation, cadd, regulation,
protein, conservation, drug, clinvar, cosmic and GWAS CAatalog. 'all' build everything.
-h, --help Display this help and exit [false]
* -i, --input STRING Input directory with the downloaded data sources to be loaded
-L, --log-level STRING Set the logging level, accepted values are: debug, info, warn, error and fatal [info]
-o, --output STRING Output directory where the JSON data models are saved [/tmp]
-s, --species STRING Name of the species to be built, valid format include 'Homo sapiens' or 'hsapiens' [Homo sapiens]
-v, --verbose BOOLEAN [Deprecated] Set the level of the logging [false]
The build process integrates data from the different sources into the corresponding data models. For example, to build all human (GRCh37) data models from the files in the /tmp/data/cellbase/v4/homo_sapiens_grch37/ directory created in section Download Sources, and save the result in /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/:
cellbase/build/bin$ mkdir /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb
cellbase/build/bin$ ./cellbase.sh build -a GRCh37 --common /tmp/data/cellbase/v4/common/ -d all -i /tmp/data/cellbase/v4/homo_sapiens_grch37/ -o /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/ -s hsapiens
Note: building the whole CellBase dataset may require up to 16GB of RAM and take up to 24h, depending on the hardware.
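Since -d accepts a comma separated list, a build can be limited to the data types that are actually needed. The following is a minimal sketch, assuming the same directory layout as the example above. The build_subset function and the CELLBASE_SH and BASE variables are hypothetical names introduced here for illustration; only the command-line options come from the usage text.

```shell
# Sketch: build only selected data models for human GRCh37.
# CELLBASE_SH and BASE are hypothetical variables, not part of CellBase.
CELLBASE_SH="${CELLBASE_SH:-./cellbase.sh}"
BASE="${BASE:-/tmp/data/cellbase/v4}"

build_subset() {
    local data="$1"                              # e.g. "genome,gene"
    local species_dir="$BASE/homo_sapiens_grch37"

    # -p creates parent directories and does not fail if the directory exists
    mkdir -p "$species_dir/mongodb"

    "$CELLBASE_SH" build -a GRCh37 --common "$BASE/common/" \
        -d "$data" -i "$species_dir/" -o "$species_dir/mongodb/" -s hsapiens
}

# Run from cellbase/build/bin, for example:
# build_subset genome,gene
```

Setting CELLBASE_SH=echo before calling build_subset prints the arguments instead of starting a build, which is a cheap way to check the paths first.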
After the build completes, your output directory should look like this:
cellbase/build/bin$ ls /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/
cadd.json.gz
clinvar.json.gz
conservation_10.json.gz
conservation_11.json.gz
conservation_12.json.gz
conservation_13.json.gz
conservation_14.json.gz
conservation_15.json.gz
conservation_16.json.gz
conservation_17.json.gz
conservation_18.json.gz
conservation_19.json.gz
conservation_1.json.gz
conservation_20.json.gz
conservation_21.json.gz
conservation_22.json.gz
conservation_2.json.gz
conservation_3.json.gz
conservation_4.json.gz
conservation_5.json.gz
conservation_6.json.gz
conservation_7.json.gz
conservation_8.json.gz
conservation_9.json.gz
conservation_M.json.gz
conservation_X.json.gz
conservation_Y.json.gz
cosmic.json.gz
gene.json.gz
genome_info.json
genome_sequence.json.gz
protein.json.gz
protein_protein_interaction.json.gz
prot_func_pred_chr_10.json.gz
prot_func_pred_chr_11.json.gz
prot_func_pred_chr_12.json.gz
prot_func_pred_chr_13.json.gz
prot_func_pred_chr_14.json.gz
prot_func_pred_chr_15.json.gz
prot_func_pred_chr_16.json.gz
prot_func_pred_chr_17.json.gz
prot_func_pred_chr_18.json.gz
prot_func_pred_chr_19.json.gz
prot_func_pred_chr_1.json.gz
prot_func_pred_chr_20.json.gz
prot_func_pred_chr_21.json.gz
prot_func_pred_chr_22.json.gz
prot_func_pred_chr_2.json.gz
prot_func_pred_chr_3.json.gz
prot_func_pred_chr_4.json.gz
prot_func_pred_chr_5.json.gz
prot_func_pred_chr_6.json.gz
prot_func_pred_chr_7.json.gz
prot_func_pred_chr_8.json.gz
prot_func_pred_chr_9.json.gz
prot_func_pred_chr_MT.json.gz
prot_func_pred_chr_X.json.gz
prot_func_pred_chr_Y.json.gz
regulatory_region.json.gz
variation_chr10.json.gz
variation_chr11.json.gz
variation_chr12.json.gz
variation_chr13.json.gz
variation_chr14.json.gz
variation_chr15.json.gz
variation_chr16.json.gz
variation_chr17.json.gz
variation_chr18.json.gz
variation_chr19.json.gz
variation_chr1.json.gz
variation_chr20.json.gz
variation_chr21.json.gz
variation_chr22.json.gz
variation_chr2.json.gz
variation_chr3.json.gz
variation_chr4.json.gz
variation_chr5.json.gz
variation_chr6.json.gz
variation_chr7.json.gz
variation_chr8.json.gz
variation_chr9.json.gz
variation_chrMT.json.gz
variation_chrX.json.gz
variation_chrY.json.gz
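The listing above can be checked with a short script instead of by eye. This is only a sketch: check_build_output is a hypothetical helper, and it tests a subset of the file names shown above (the per-chromosome files are left out for brevity).

```shell
# Sketch: report which expected build outputs are missing or empty.
# check_build_output is a hypothetical helper, not part of CellBase.
check_build_output() {
    local dir="$1"
    local missing=0
    local f
    for f in cadd.json.gz clinvar.json.gz cosmic.json.gz gene.json.gz \
             genome_info.json genome_sequence.json.gz protein.json.gz \
             protein_protein_interaction.json.gz regulatory_region.json.gz; do
        # -s is true only if the file exists and has a size greater than zero
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Example:
# check_build_output /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/
```

The function returns a non-zero status when any file is missing or empty, so it can gate the next step in a script.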
If the build was successful, you can proceed to loading the data models into the database: Load Data Models.