Build Data Models

javild edited this page Apr 13, 2016 · 5 revisions

Data models are built with the CellBase CLI. Running the build command without arguments prints its usage:

cellbase/build/bin$ ./cellbase.sh build
The following options are required: -d, --data -i, --input 

Usage:   cellbase.sh build [options]

Options:
      -a, --assembly       STRING     Name of the assembly, if empty the first assembly in configuration.json will be used 
          --common         STRING     Directory where common multi-species data will be downloaded, this is mainly protein and expression 
                                      data [<OUTPUT>/common] 
      -C, --config         STRING     CellBase configuration.json file. Have a look at 
                                      cellbase/cellbase-core/src/main/resources/configuration.json for an example 
    * -d, --data           STRING     Comma separated list of data to build: genome, gene, disgenet, hpo, variation, cadd, regulation, 
                                      protein, conservation, drug, clinvar, cosmic and GWAS CAatalog. 'all' build everything. 
      -h, --help                      Display this help and exit [false]
    * -i, --input          STRING     Input directory with the downloaded data sources to be loaded 
      -L, --log-level      STRING     Set the logging level, accepted values are: debug, info, warn, error and fatal [info]
      -o, --output         STRING     Output directory where the JSON data models are saved [/tmp]
      -s, --species        STRING     Name of the species to be built, valid format include 'Homo sapiens' or 'hsapiens' [Homo sapiens]
      -v, --verbose        BOOLEAN    [Deprecated] Set the level of the logging [false]

The build process integrates data from the different sources into the corresponding data models. For example, the following commands build all human (GRCh37) data models, reading the files from the /tmp/data/cellbase/v4/homo_sapiens_grch37/ directory created in the Download Sources section and saving the result in /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/:

cellbase/build/bin$ mkdir -p /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb
cellbase/build/bin$ ./cellbase.sh build -a GRCh37 --common /tmp/data/cellbase/v4/common/ -d all -i /tmp/data/cellbase/v4/homo_sapiens_grch37/ -o /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/ -s hsapiens
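
Since -d takes a comma-separated list, a subset of the data models can be built instead of 'all'. The sketch below only prints the command so that it can be inspected before running. The helper name build_command is hypothetical (it is not part of CellBase), the paths are the ones from the example above, and only options listed in the usage text are used.

```shell
# Paths taken from the example above; adjust them to your own layout.
BASE=/tmp/data/cellbase/v4
INPUT="$BASE/homo_sapiens_grch37"
OUTPUT="$INPUT/mongodb"

# Hypothetical helper (not part of CellBase): print the build command for a
# comma-separated list of data names, e.g. "genome,gene".
build_command() {
    local data=$1
    printf './cellbase.sh build -a GRCh37 --common %s/common/ -d %s -i %s/ -o %s/ -s hsapiens\n' \
        "$BASE" "$data" "$INPUT" "$OUTPUT"
}

# Print the command that would build only the genome and gene data models.
build_command genome,gene
```

Once the printed command looks right, it can be pasted into a shell opened in cellbase/build/bin.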

Note: building the whole CellBase dataset may require up to 16 GB of RAM and take up to ~24 h, depending on the hardware.
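
Given the memory requirement, it may be worth checking the machine before starting a full build. The following is a minimal sketch for Linux, assuming /proc/meminfo is available. The function has_enough_ram is a hypothetical helper, not part of CellBase, and it compares total memory only, so it says nothing about how much memory is actually free or how the Java process is configured.

```shell
# Hypothetical helper (not part of CellBase): succeed if the MemTotal value in
# a meminfo-style file is at least the requested number of GiB.
# Usage: has_enough_ram <GiB> [meminfo file, default /proc/meminfo]
has_enough_ram() {
    local need_gib=$1 meminfo=${2:-/proc/meminfo} total_kb
    # MemTotal is reported in kB, e.g. "MemTotal:       32780136 kB".
    total_kb=$(awk '/^MemTotal:/ {print $2}' "$meminfo" 2>/dev/null)
    [ -n "$total_kb" ] && [ "$total_kb" -ge $((need_gib * 1024 * 1024)) ]
}

# Warn (but do not abort) when the machine reports less than 16 GiB in total.
if ! has_enough_ram 16; then
    echo "warning: less than 16 GiB of RAM detected; a full build may fail" >&2
fi
```

Note that MemTotal is usually slightly lower than the installed memory, so a machine sold as 16 GB can still trigger the warning.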

After the build process completes, your output directory should look like this:

cellbase/build/bin$ ls /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/
cadd.json.gz
clinvar.json.gz
conservation_10.json.gz
conservation_11.json.gz
conservation_12.json.gz
conservation_13.json.gz
conservation_14.json.gz
conservation_15.json.gz
conservation_16.json.gz
conservation_17.json.gz
conservation_18.json.gz
conservation_19.json.gz
conservation_1.json.gz
conservation_20.json.gz
conservation_21.json.gz
conservation_22.json.gz
conservation_2.json.gz
conservation_3.json.gz
conservation_4.json.gz
conservation_5.json.gz
conservation_6.json.gz
conservation_7.json.gz
conservation_8.json.gz
conservation_9.json.gz
conservation_M.json.gz
conservation_X.json.gz
conservation_Y.json.gz
cosmic.json.gz
gene.json.gz
genome_info.json
genome_sequence.json.gz
protein.json.gz
protein_protein_interaction.json.gz
prot_func_pred_chr_10.json.gz
prot_func_pred_chr_11.json.gz
prot_func_pred_chr_12.json.gz
prot_func_pred_chr_13.json.gz
prot_func_pred_chr_14.json.gz
prot_func_pred_chr_15.json.gz
prot_func_pred_chr_16.json.gz
prot_func_pred_chr_17.json.gz
prot_func_pred_chr_18.json.gz
prot_func_pred_chr_19.json.gz
prot_func_pred_chr_1.json.gz
prot_func_pred_chr_20.json.gz
prot_func_pred_chr_21.json.gz
prot_func_pred_chr_22.json.gz
prot_func_pred_chr_2.json.gz
prot_func_pred_chr_3.json.gz
prot_func_pred_chr_4.json.gz
prot_func_pred_chr_5.json.gz
prot_func_pred_chr_6.json.gz
prot_func_pred_chr_7.json.gz
prot_func_pred_chr_8.json.gz
prot_func_pred_chr_9.json.gz
prot_func_pred_chr_MT.json.gz
prot_func_pred_chr_X.json.gz
prot_func_pred_chr_Y.json.gz
regulatory_region.json.gz
variation_chr10.json.gz
variation_chr11.json.gz
variation_chr12.json.gz
variation_chr13.json.gz
variation_chr14.json.gz
variation_chr15.json.gz
variation_chr16.json.gz
variation_chr17.json.gz
variation_chr18.json.gz
variation_chr19.json.gz
variation_chr1.json.gz
variation_chr20.json.gz
variation_chr21.json.gz
variation_chr22.json.gz
variation_chr2.json.gz
variation_chr3.json.gz
variation_chr4.json.gz
variation_chr5.json.gz
variation_chr6.json.gz
variation_chr7.json.gz
variation_chr8.json.gz
variation_chr9.json.gz
variation_chrMT.json.gz
variation_chrX.json.gz
variation_chrY.json.gz
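
Beyond comparing file names against the listing above, the compressed files can be tested for integrity, since an interrupted build may leave truncated archives. This is a minimal sketch using gzip -t; check_build_output is a hypothetical helper, not part of CellBase, and it checks only that each archive decompresses, not that its JSON content is complete.

```shell
# Hypothetical helper (not part of CellBase): test every *.json.gz file in a
# directory with `gzip -t` and report the ones that fail.
# Returns 0 only if at least one file exists and all files pass.
check_build_output() {
    local dir=$1 failed=0 f
    for f in "$dir"/*.json.gz; do
        # With no matches the pattern stays literal; treat that as a failure.
        if [ ! -e "$f" ]; then
            echo "no .json.gz files found in $dir" >&2
            return 1
        fi
        if ! gzip -t "$f" 2>/dev/null; then
            echo "corrupt or truncated: $f" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

For the example in this page it would be called as check_build_output /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb.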

If the build was successful, you can proceed to load the data models into the database: Load Data Models.