This is a tutorial for (1) automatically functionally annotating the variants of each participating whole-genome/whole-exome sequencing (WGS/WES) study and integrating the functional annotations with the genotype data using FAVORannotator, (2) generating study-specific variant summary statistics of each participating WGS/WES study using MetaSTAARlite Worker, and (3) performing association meta-analysis of WGS/WES studies using MetaSTAARlite. The software prerequisites, dependencies and installation can be found in the MetaSTAARlite package.
Pre-step of association meta-analysis using MetaSTAARlite (same as STAARpipeline)
R/Bioconductor package SeqArray provides functions to convert the genotype data (in VCF/BCF/PLINK BED/SNPRelate format) to SeqArray GDS format. For more details on usage, please see the R/Bioconductor package SeqArray [manual]. A wrapper for the seqVCF2GDS
/seqBCF2GDS
function in the SeqArray package can be found here (Credit: Michael R. Brown and Jennifer A. Brody).
R package gds2bgen provides functions to convert the genotype data (in BGEN format) to SeqArray GDS format. For more details on usage, please see the R package gds2bgen. An example for the seqBGEN2GDS
function in the gds2bgen package can be found here (Credit: Xiuwen Zheng).
Note 1: As a file integrity check, it is expected that variant in the GDS file can be uniquely identified based on its CHR-POS-REF-ALT combination. That is, there shouldn't be two variants in the GDS file with identical CHR-POS-REF-ALT records. It is also expected that the physical positions of variants in the GDS file (of each chromosome) should be sorted in ascending order.
Note 2: After the (study-specific) GDS file is generated, there is supposed to be a channel in the GDS file (default is annotation/filter
) where all variants passing the quality control (QC) should be labeled as "PASS"
. If there is no such channel for a given post-QC GDS file (where all variants in the GDS file are pass variants), one can create a new channel in the GDS file by setting the value of all variants as "PASS"
. An example script can be found here. Then, in all scripts of MetaSTAARlite, QC_label <- "annotation/filter"
should be updated to QC_label <- "annotation/info/QC_label"
.
FAVORannotator (CSV version 1.0.0) depends on the xsv software and the FAVOR database in CSV format. Please install the xsv software and download the FAVOR essential database CSV files from FAVOR website (under the "FAVORannotator" tab's top panel, 31.2 GB for chr1 CSV) or Harvard Dataverse before using FAVORannotator (CSV version 1.0.0).
The following steps are for the widely used operating system (Ubuntu) on a virtual machine.
- Install Rust and Cargo:
$ curl https://sh.rustup.rs -sSf | sh
- Source the environment:
$ source $HOME/.cargo/env
- Install xsv using Cargo:
$ cargo install xsv
Script: Varinfo_gds.R
Input: GDS files of each chromosome and the FAVOR database information FAVORdatabase_chrsplit.csv. For more details, please see the R script.
Output: CSV files of the variants list. For each chromosome, the number of CSV files is listed in FAVORdatabase_chrsplit.csv.
Note: The physical positions of variants in the GDS file (of each chromosome) should be sorted in ascending order.
Script: Annotate.R
Input: CSV files of the variants list to be annotated, the FAVOR database information FAVORdatabase_chrsplit.csv,
the FAVOR database, and the directory xsv software. For more details, please see the R script.
Anno_chrXX.csv
: a CSV file containing annotated variants list of chromosome XX.Anno_chrXX_STAARpipeline.csv
: a CSV file containing the variants list with annotations required for STAARpipeline of chromosome XX. The annotations in this file is a subset ofAnno_chrXX.csv
.
Script: gds2agds.R
Input: GDS files and the CSV files of annotated variants list (Anno_chrXX.csv
or Anno_chrXX_STAARpipeline.csv
). For more details, please see the R script.
Note: FAVORannotator also supports the database in SQL format. Please see the FAVORannotator tutorial for detailed usage of FAVORannotator (SQL version).
R package FastSparseGRM provides functions and a pipeline to efficiently calculate genetic principal components (PCs) and the ancestry-adjusted sparse genetic relatedness matrix (GRM). It accounts for population heterogeneity using genetic PCs which are automatically calculated as part of the pipeline. The genetic PCs can be used as fixed effect covariates to account for the population stratification and the sparse GRM can be used to model the random effects to account for the sample relatedness in a mixed effects phenotype-genotype association testing model implemented in MetaSTAARlite. For more details on usage, please see the R package FastSparseGRM.
Script: Association_Analysis_PreStep.r
agds_dir.Rdata
: a vector containing directory of GDS/aGDS files of all chromosomes.Annotation_name_catalog.Rdata
: a data frame containing the annotation name and the corresponding channel name in the aGDS file. Alternatively, one can skip this part in the R script by providingAnnotation_name_catalog.csv
with the same information. An example ofAnnotation_name_catalog.csv
can be found here.
STAARpipeline_Null_Model.r
fits the STAAR null model using the STAARpipeline package.STAARpipeline_Null_Model_GENESIS.r
fits the null model using the GENESIS package and convert it to the STAAR null model using the STAARpipeline package.
Input: Phenotype data and sparse genetic relatedness matrix. For more details, please see the R scripts.
Step 2.1: Generate variant summary statistics for individual (single-variant) analysis using MetaSTAARlite Worker
Generate and store variant summary statistics (score statistics) using MetaSTAARlite Worker.
Note: The number of output files is the summation of the column "individual_analysis_num" for the object in jobs_num.csv
(please use this file for all studies), which is 294.
Step 2.2: Generate variant summary statistics for gene-centric coding analysis using MetaSTAARlite Worker
Script: MetaSTAARlite_worker_Gene_Centric_Coding.r and MetaSTAARlite_worker_Gene_Centric_Coding_Long_Masks.r
Generate and store variant summary statistics (score statistics, functional annotations, and sparse weighted LD matrices) for coding rare variants using the MetaSTAARlite package. The gene-centric coding analysis provides five functional categories to aggregate coding rare variants of each protein-coding gene: (1) putative loss of function (stop gain, stop loss and splice) RVs, (2) missense RVs, (3) disruptive missense RVs, (4) putative loss of function and disruptive missense RVs, and (5) synonymous RVs.
MetaSTAARlite_worker_Gene_Centric_Coding.r
generates and stores variant summary statistics for all protein-coding genes across the genome. There are 379 jobs using this script.MetaSTAARlite_worker_Gene_Centric_Coding_Long_Masks.r
generates and stores variant summary statistics for some specific long masks, and might require larger memory compared toMetaSTAARlite_worker_Gene_Centric_Coding.r
. There are 2 jobs using this script.
Note: For rare variant meta-analysis (e.g. combined MAF < 1%), one can set cov_maf_cutoff = 0.05
(by default) when generating sparse weighted LD matrices for each study.
Step 2.3: Generate variant summary statistics for gene-centric noncoding analysis using MetaSTAARlite Worker
Script: MetaSTAARlite_worker_Gene_Centric_Noncoding.r, MetaSTAARlite_worker_Gene_Centric_Noncoding_Long_Masks.r, MetaSTAARlite_worker_Gene_Centric_ncRNA.r and MetaSTAARlite_worker_Gene_Centric_ncRNA_Long_Masks.r
Generate and store variant summary statistics (score statistics, functional annotations, and sparse weighted LD matrices) for noncoding rare variants using the MetaSTAARlite package. The gene-centric noncoding meta-analysis provides eight functional categories of regulatory regions to aggregate noncoding rare variants: (1) promoter RVs overlaid with CAGE sites, (2) promoter RVs overlaid with DHS sites, (3) enhancer RVs overlaid with CAGE sites, (4) enhancer RVs overlaid with DHS sites, (5) untranslated region (UTR) RVs, (6) upstream region RVs, (7) downstream region RVs, and (8) noncoding RNA (ncRNA) RVs.
MetaSTAARlite_worker_Gene_Centric_Noncoding.r
generates and stores variant summary statistics for all protein-coding genes across the genome. There are 379 jobs using this script.MetaSTAARlite_worker_Gene_Centric_Noncoding_Long_Masks.r
generates and stores variant summary statistics for some specific long masks, and might require larger memory compared toMetaSTAARlite_worker_Gene_Centric_Noncoding.r
. There are 8 jobs using this script.MetaSTAARlite_worker_Gene_Centric_ncRNA.r
generates and stores variant summary statistics for ncRNA genes across the genome. There are 222 jobs using this script.MetaSTAARlite_worker_Gene_Centric_ncRNA_Long_Masks.r
generates and stores variant summary statistics for some specific long masks, and might require larger memory compared toMetaSTAARlite_worker_Gene_Centric_ncRNA
. There is 1 job using this script.
Output: 387 * 2 = 774 Rdata files with the user-defined names for protein-coding genes and 223 * 2 = 446 Rdata files with the user-defined names for ncRNA genes.
Note: For rare variant meta-analysis (e.g. combined MAF < 1%), one can set cov_maf_cutoff = 0.05
(by default) when generating sparse weighted LD matrices for each study.
Generate and store variant summary statistics (score statistics, functional annotations, and sparse weighted LD matrices) for user-defined custom masks of rare variants using the MetaSTAARlite package.
Input: aGDS files, the STAAR null model, and the custom mask definition file. For more details, please see the R script.
An example file for a list of custom masks (5-column "CHR-POS-REF-ALT-MaskName" format) in the LDLR locus is given in custom_mask_LDLR.csv
.
Note: For rare variant meta-analysis (e.g. combined MAF < 1%), one can set cov_maf_cutoff = 0.05
(by default) when generating sparse weighted LD matrices for each study. Users may want to perform parallel computation by splitting the analysis of custom masks into multiple jobs.
Perform single-variant meta-analysis for common and low-frequency variants across the genome using the MetaSTAARlite package.
Input: Variant summary statistics files from Step 2.1 for each participating study. For more details, please see the R script.
Note: The number of output files is the summation of the column "individual_analysis_num" for the object in jobs_num.csv
, which is 294.
Perform gene-centric meta-analysis for coding rare variants using the MetaSTAARlite package. The gene-centric coding meta-analysis provides five functional categories to aggregate coding rare variants of each protein-coding gene: (1) putative loss of function (stop gain, stop loss and splice) RVs, (2) missense RVs, (3) disruptive missense RVs, (4) putative loss of function and disruptive missense RVs, and (5) synonymous RVs.
MetaSTAARlite_Gene_Centric_Coding.r
performs gene-centric coding meta-analysis for all protein-coding genes across the genome. There are 379 jobs using this script.MetaSTAARlite_Gene_Centric_Coding_Long_Masks.r
performs gene-centric coding meta-analysis for some specific long masks, and might require larger memory compared toMetaSTAARlite_Gene_Centric_Coding.R
. There are 2 jobs using this script.
Input: Variant summary statistics files from Step 2.2 for each participating study. For more details, please see the R scripts.
Script: MetaSTAARlite_Gene_Centric_Noncoding.r, MetaSTAARlite_Gene_Centric_Noncoding_Long_Masks.r, MetaSTAARlite_Gene_Centric_ncRNA.r and MetaSTAARlite_Gene_Centric_ncRNA_Long_Masks.r
Perform gene-centric meta-analysis for noncoding rare variants using the MetaSTAARlite package. The gene-centric noncoding meta-analysis provides eight functional categories of regulatory regions to aggregate noncoding rare variants: (1) promoter RVs overlaid with CAGE sites, (2) promoter RVs overlaid with DHS sites, (3) enhancer RVs overlaid with CAGE sites, (4) enhancer RVs overlaid with DHS sites, (5) untranslated region (UTR) RVs, (6) upstream region RVs, (7) downstream region RVs, and (8) noncoding RNA (ncRNA) RVs.
MetaSTAARlite_Gene_Centric_Coding.r
performs gene-centric noncoding meta-analysis for all protein-coding genes across the genome. There are 379 jobs using this script.MetaSTAARlite_Gene_Centric_Coding_Long_Masks.r
performs gene-centric noncoding meta-analysis for some specific long masks, and might require larger memory compared toMetaSTAARlite_Gene_Centric_Coding.r
. There are 8 jobs using this script.MetaSTAARlite_Gene_Centric_ncRNA.r
performs gene-centric noncoding meta-analysis for ncRNA genes across the genome. There are 222 jobs using this script.MetaSTAARlite_Gene_Centric_ncRNA_Long_Masks.r
performs gene-centric noncoding meta-analysis for some specific long masks, and might require larger memory compared toMetaSTAARlite_Gene_Centric_ncRNA
. There is 1 job using this script.
Input: Variant summary statistics files from Step 2.3 for each participating study. For more details, please see the R scripts.
Output: 387 Rdata files with the user-defined names for protein-coding genes and 223 Rdata files with the user-defined names for ncRNA genes.
Script: MetaSTAARlite_Custom_Mask.r
Perform meta-analysis for user-defined custom masks of rare variants using the MetaSTAARlite package.
Input: Variant summary statistics files from Step 2.C for each participating study. For more details, please see the R script.
Note: Users may want to perform parallel computation by splitting the analysis of custom masks into multiple jobs.
Step 5.0 (Optional): Select independent variants from a known variants list to be used in conditional meta-analysis
An example file for a list of known total cholesterol associated variants (4-column "CHR-POS-REF-ALT" format) in the LDLR locus is given in known_loci_LDLR.csv
.
Note: It is typically assumed that variants in known_loci
for conditional analysis are present in each participating study of the meta-analysis, which implies that the MAFs for these variants would not be extremely rare.
Summarize single-variant meta-analysis results.
Input: Individual analysis results generated by MetaSTAARlite. For more details, please see the R script.
Step 5.1.2: Generate variant summary statistics for individual (single-variant) conditional analysis using MetaSTAARlite Worker
Generate and store variant summary statistics (score statistics and conditional LD matrices) using MetaSTAARlite Worker.
Input: aGDS files, the STAAR null model and a list of known variants. For more details, please see the R script.
Perform conditional meta-analysis of unconditionally significant variants by adjusting a list of known variants.
Input: Variant summary statistics files from Step 5.1.2 for each participating study and (significant) individual analysis results from Step 5.1.1. For more details, please see the R script.
Summarize gene-centric coding meta-analysis results.
Input: Gene-centric coding analysis results generated by MetaSTAARlite. For more details, please see the R script.
Step 5.2.2: Generate variant summary statistics for gene-centric coding conditional analysis using MetaSTAARlite Worker
Generate and store variant summary statistics (score statistics, functional annotations, sparse weighted LD matrices, and conditional LD matrices) for coding rare variants using MetaSTAARlite Worker.
Input: aGDS files, the STAAR null model, and a list of known variants. For more details, please see the R script.
Perform conditional meta-analysis of unconditionally significant coding masks by adjusting a list of known variants.
Input: Variant summary statistics files from Step 5.2.2 for each participating study. For more details, please see the R script.
Summarize gene-centric noncoding meta-analysis results.
Input: Gene-centric noncoding analysis results generated by MetaSTAARlite. For more details, please see the R script.
Step 5.3.2: Generate variant summary statistics for gene-centric noncoding conditional analysis using MetaSTAARlite Worker
Script: MetaSTAARlite_worker_Gene_Centric_Noncoding_cond.r and MetaSTAARlite_worker_Gene_Centric_ncRNA_cond.r
Generate and store variant summary statistics (score statistics, functional annotations, sparse weighted LD matrices, and conditional LD matrices) for noncoding rare variants using MetaSTAARlite Worker.
Input: aGDS files, the STAAR null model, and a list of known variants. For more details, please see the R scripts.
Perform conditional meta-analysis of unconditionally significant noncoding masks by adjusting a list of known variants.
Input: Variant summary statistics files from Step 5.3.2 for each participating study. For more details, please see the R scripts.
Note: Users may develop custom scripts to summarize the analysis results by following Step 5.2.1/5.3.1, as the number of output files from Step C may vary.
Step 5.C.2: Generate variant summary statistics for custom mask conditional analysis using MetaSTAARlite Worker
Generate and store variant summary statistics (score statistics, functional annotations, sparse weighted LD matrices, and conditional LD matrices) for user-defined custom masks of rare variants using MetaSTAARlite Worker.
Input: aGDS files, the STAAR null model, the custom mask definition file, and a list of known variants. For more details, please see the R script.
Script: MetaSTAARlite_Custom_Mask_cond.r
Perform conditional meta-analysis of unconditionally significant user-defined custom masks by adjusting a list of known variants.