Skip to content
Xudong Zou edited this page Apr 5, 2022 · 36 revisions

Welcome to the 3aQTL_pipe wiki!

Before starting the pipeline

Download test dataset

NOTE: The RNA-seq data used in current pipeline were aligned to human reference genome version 19 (hg19), users reuse this pipeline in their own dataset should download correct version of reference genome consistent with their RNA-seq data.

Prepare the environment

1. Download and install the following softwares or scripts

wget -c https://repo.anaconda.com/archive/Anaconda3-5.3.0-Linux-x86_64.sh
bash Anaconda3-5.3.0-Linux-x86_64.sh
conda install -c r r
conda install -c conda-forge r-dplyr
conda install -c bioconda Bioconductor-impute
git clone https://github.com/3UTR/DaPars2.git
conda install -c bioconda bedtools
conda install -c bioconda vcftools
conda install -c bioconda plink=1.90
conda install -c bioconda samtools
conda install -c bioconda r-peer
conda install -c bioconda matrixeqtl
# start R
R
# use the following R function to install the latest susieR
> remote::install_github("stephenslab/susieR")
git clone https://github.com/Xu-Dong/3aQTL_pipe.git

Quick Start

Required data list before starting this pipe

Data description Example Data
RNA-seq alignment files (bam files) The 89 bam files in GEUVADIS RNA-seq project (CEU subpopulation)
Genotype data in VCF format The VCF files in GEUVADIS RNA-seq project
A text file contains all samples sample_list.txt (in the 3aQTL_pipe repository)
A reference genome annotation (BED) hg19_refseq_whole_gene.bed
ID mapping between Refseq ID and gene symbol hg19_refseq_transcript2geneSymbol.txt
A text file contains VCF file(s) vcf_list.txt
A tab-delimited text file contains known covariates, e.g. gender known_covariates.txt
  1. Download 3'aQTL_pipe and prepare required data files
# -- 1. Use git to get a clone of 3'aQTL_pipe
~$ git clone https://github.com/Xu-Dong/3aQTL_pipe.git
~$ cd 3aQTL_pipe

# -- 2. Move/copy the data in above table to 3aQTL_pipe 
If users want to test the pipeline on GEUVADIS dataset ("CEU" sub-population), you can use the "hg19_refseq_whole_gene.bed", "hg19_refseq_transcript2geneSymbol.txt" in the github repository and prepare a "sample_list.txt" and a "vcf_list.txt" for the dataset according to the example files we provided in the github repository.

Note: the bam files in "sample_list.txt" and VCF file(s) in "vcfList.txt" should including path. 
For example:
sample1_id  /path/to/bam/sample1_id.bam # change "/path/to/bam/" to the location where you put bam files

/path/to/vcf/testdat.vcf # change "/path/to/vcf/" to the location where you put vcf files
  1. Start the pipe
# 2.1 -- change directory to 3aQTL_pipe and check the required data files existing
cd 3aQTL_pipe

# 2.2 -- prepare input data for Dapars2
bash ./src/bam2wig.sh sample_list.txt 8 
# the second integer (8) denotes the number of threads will be used for parallel processing

bash ./src/extract_sequencing_depth.sh sample_list.txt ./wig 8 wigFile_and_readDepth.txt
# four parameters denote: a text file contains all samples and bam files, the location of bedgraph files, number of threads, output file

python ./src/DaPars_Extract_Anno.py -b hg19_refseq_whole_gene.bed -s hg19_refseq_transcript2geneSymbol.txt -o refseq_3utr_annotation.bed

cat refseq_3utr_annotation.bed | cut -f 1|sort|uniq |grep -v “MT” > chrList.txt 

python ./src/generate_configure_for_Dapars2.py  \
--annotation_3utr refseq_3utr_annotation.bed \
--wigFile_depth wigFile_and_readDepth.txt \
--coverage_threshold 15 \
--threads 8

# 2.3 -- run Dapars2 analysis on each chromosome
python ./src/Dapars2_Multi_Sample.py Dapars2_running_configure.txt chr1 # repeat this command with other chromosomes or do it parallelly

# 2.3 -- alternatives: run Dapars2 analysis on all chromosomes 
python ./src/DaPars2_Multi_Sample_Multi_Chr.py Dapars2_running_configure.txt chrList.txt # process all chromosomes in "chrList.txt"

# 2.4 -- merge Dapars2 results on individual chromosomes into one
Rscript ./src/merge_dapars2_res_by_chr.R Dapars2_out sample_list.txt chrList.txt # "Dapars2_out" is the prefix of Dapars2 output setting by "generate_configure_for_dapars2.py" in step 2.2


# 2.5 -- prepare genotype matrix for Matrix-eQTL
bash ./src/prepare_genotype_matrix_for_matrixEQTL.sh vcf_list.txt 0.05 sample_list.txt hg19 
# "0.05" denotes the MAF cutoff for filtering common variants

# 2.6 -- prepare phenotype matrix for Matrix-eQTL
Rscript ./src/3UTR_impute_peer.R known_covariates.txt 5 
# If no known covariates provided, use "NA" for the first parameter, the second parameter denotes the top N PCA components as covariates

# 2.7 -- harmonize genotype matrix, phenotype matrix, and covariates matrix
Rscript ./src/prepare_input_files_matrixeqtl.R

# 2.8 -- prepare 3'UTR location file and SNP location file for Matrix-eQTL
python ./src/extract_3UTR_location.py --dapars_res Dapars2_res.all_chromosomes.txt --output ./Matrix_eQTL/3UTR_location.txt
python ./src/extract_SNP_location.py --genotype_bed ./Matrix_eQTL/genotype_matrix.bed --output ./Matrix_eQTL/snp_location.txt

# 2.9 -- run Matrix-eQTL to mapping associations between common genetic variants and APA 
Rscript ./src/run_Matrix_eQTL.R 1e6 1e-2 1e-5 
# the three parameters denote: window size; p value threshold for cis-3'aQTLs; p value threshold for trans-3'aQTLs

# boxplot of a significant association in IRF5
Rscript ./src/QTL_plot.R chr7_128640188_A_G_hg19 NM_001347928.2|IRF5|chr7|+

# 3.0 -- prepare input for susieR
bash ./src/prepare_finemapping_input.sh -e ./src -w 1000000 -q 0.05 
# 

# 3.1 -- run susieR for fine-mapping
bash ./src/Fine_mapping_by_susieR.sh -w 1000000 -p 0.1 -L 10 -V 0.2 -t 8 

# 3.2 -- merge fine mapping results of individual genes
Rscript ./src/merge_susieR_res.R ./FineMapping aGenes.loc_1000000.txt  

# two output files will be generated: “susieR_res.all_genes.txt” and “susieR_res.stat.txt”

Docker image for 3aQTL_pipe

We have built a docker image in name of "3aqtl_pipe" for the whole pipeline includes all source codes of scripts involved in this pipeline. The docker image "3aqtl_pipe" has been pushed into Docker hub, users can pull down and run a new container from this image and use the pipeline through the created container.

Quick guide

We assumed users have docker installed in the server. If not, please contact the administrator of the server to install docker at first.

  1. Pull down the docker image "3aqtl_pipe"
docker pull lilab123/3aqtl_pipe:miniv4 # “miniv4" denotes the tag info
docker image ls # list all images including "lilab123/3aqtl_pipe:miniv4
  1. Run a docker container from the pulled docker image Before creating a container, prepare all bam files and vcf files in a directory (e.g. "bam_vcf", could be two sub directories within "bam_vcf").
docker run -it --name="3aqtl_container" -v /local/path/to/bam_vcf:/home/bam_vcf 3aqtl_pipe:miniv4 /bin/bash
# -v option creates volume for directory in local host and it will be map to the directory in container "/home/bam_vcf" (will be created if not exits)
# the initial location of the container was set to /home and  directory
ls # list the contents in current location,you will see a "3aQTL_pipe" directory and a "bam_vcf" directory

"3aQTL_pipe" contains all source codes of 3'aQTL pipeline

  1. Create a "sample_list.txt" and "vcf_list.txt" in "3aQTL_pipe" directory The two files contains samples and corresponding location of bam files, and location of vcf files, therefore they should be created after run a container.

Here is an example way to do this:

#we assume the bam files and vcf files are located in the directory "bam_vcf" in the container initial location, if not you need to change 
# the location in commands below

cd 3aQTL_pipe
for i in `ls ../bam_vcf/*.bam`;do sample=${i##*/};sample=${sample%%.*};echo -e "$sample\t$i";done > sample_list.txt
ls ../bam_vcf/*.vcf* > vcf_list.txt

After this, users can jump to "Quick Start" to go through the whole pipeline step by step.

Examples of output

1. Example of APA quantification across samples

An example of format of Dapars2 output can be found in the wiki of Dapars2

2. Example of 3'aQTL mapping by Matrix-eQTL

Matrix-eQTL will report the association statistics of tested SNP-Gene pair in tab-delimited text file as shown below:

image

3. Example of fine-mapping results

image

Authors

Xudong Zou, Ruofan Ding, Wenyan Chen, Gao Wang, Shumin Cheng, Wei Li, Lei Li

Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China

Citation

  • Code and Execution:

[Ref TBD]

  • The first 3'aQTL atlas of human tissues:

An atlas of alternative polyadenylation quantitative trait loci contributing to complex trait and disease heritability

Lei Li, Kai-Lieh Huang, Yipeng Gao, Ya Cui, Gao Wang, Nathan D. Elrod, Yumei Li, Yiling Elaine Chen, Ping Ji, Fanglue Peng, William K. Russell, Eric J. Wagner & Wei Li. Nature Genetics,53,994-1005 (2021). DOI:https://doi.org/10.1038/s41588-021-00864-5

https://www.nature.com/articles/s41588-021-00864-5

Contact

For any issues, please create a GitHub Issue.