NGS workflow
This is a series of Next generation sequencing (illumina short reads sequencing) workflow created by Snakemake
Table of Contents
- Setup
- Usage
- Bulk RNA Sequencing workflow
- Whole Genome Sequencing workflow
- Whole Genome Bisulfite Sequencing workflow
- Chromatin Immunoprecipitation (ChIP) sequencing workflow
- ATAC-seq workflow
- HiC workflow
- RNA Immunoprecipitation (RIP) sequencing workflow
- Ribo sequencing workflow
- Single cell multi-omics workflow
All workflow all have its requirements softwares which were shown in environment.yaml. You could install via:
conda install --file environment.yaml
For dry use
snakemake -s snakemake.smk -np
Run workflow
snakemake -s snakemake.smk -c 4
-c: Default use 4 cores
This is a RNA-seq workflow (from fastq.gz to count)
It provide two forms of RNA-seq workflow (mapped by STAR or Hisat2).
The diagram of workflow was shown in folder Flow diagram
Before the RNA-seq, the genome index should be built first.
The bulid method was shown:
STAR:
STAR
--runMode genomeGenerate
--genomeDir ~/STARindex
--runThreadN 40
--genomeFastaFiles chr1.fa
--sjdbGTFfile chr1.gtf
--sjdbOverhang 149
If your reads length is much shorter than 150 bp, your should change --sjdbOverhang to reads length - 1
Hisat2
hisat2_extract_splice_sites.py chr1.gtf > splice.txt
hisat2_extract_exons.py chr1.gtf > exons.txt
hisat2-build -p 40 chr1.fa --ss splice.txt --exon exons.txt chr1_index
If you don't have a high-quality reference genome for your species or want to identify more transcripts such as long non-coding RNA, using stringtie after mapping is suggested.
This is a Whole genome sequencing workflow (from fastq.gz to vcf.gz)
Note: This vcf.gz needs further quality control in further study.
If you install GATK with conda, please check the version of gatk via:
gatk --version
if your GATK version is lower than 4.0, please install gatk manually from github:https://github.com/broadinstitute/gatk/releases
Note: GTAK rely on java environment, if you don't have java, should use install java first:
conda instll openjdk -y
Before the WGS, the genome index should be built first by bwa, samtools,and GATK:
bwa index genome.fa
samtools faidx genome.fa
gatk CreateSequenceDictionary -R genome.fa -O genome.dict
This is a Whole genome bisulfite sequencing workflow (from fastq.gz to bed and other downsteam formats).
There are two types of workflow: map by bwa or by bismark, you can choose one of the workflow.
For bwa-meth:
bwa-meth is available at https://github.com/brentp/bwa-meth, please insatall it and its dependencies first
Then bulid index:
bwameth.py index genome.fa
The final output file has two formats: methylKit and cytosine_report. The detailed description of the output file format were in https://github.com/dpryan79/MethylDackel
For bismark:
bismark is available at conda and https://github.com/FelixKrueger/Bismark, Genome index also need to build first:
bismark_genome_preparation ./genome/
Before running this snakemake, there are many parameters that need to change in the snakemake file, such as the aligner, the mismatch number, and whether the strand specific library, etc. you should read the document of bismark carefully.
This is a ChIP-seq workflow (from fastq.gz to peak). It was also suitable for CUT&RUN and CUT&Tag experiments. Before the ChIP-seq workflow, the genome index of bowtie2 should be built first.
bowtie2-build genome.fa genome
There are also many parameters need change in snakemake file. Its depends on your ChIP experiment research object (Transcription factors or Histone modification), research species (Genome size) and replication numbers. you should read the paper of MACS2 carefully.
Deeptools was recommended to generate the meta-plot and heatmap on a reference point or certain genomic regions.
HOMER and MEME suite were recommended to perform peak annotaion and TF Motif enrichment and analysis.
ChromHMM was recommended to peroform chromatin state segmentation.
This is a ATAC-seq workflow (from fastq.gz to peak).
The workflow of the ATAC-seq is similar to the ChIP-seq, so the genome index was the same, the only difference is Peak calling (ATAC-seq without input sample).
This workflow is also appropriate for other open chromatin sequencing methods such as Dnase-Seq, MNase-Seq, and FAIRE-Seq, just might need to change the parameters of MACS2.
The genome blacklist could be found in Blacklist
After Peak calling, To get all peaks information and for downsteam analysis, you should use the following code to generate reproducible peak:
indentifyReproduciblePeaks.R
Because for multi sample, it is recommended to merge peaks with iterative Overlap Peak Merging Procedure, which was first introduced in Corces & Granja et. al. Science 2018 and recommonded in Grandi et al. Nat Protoc 2022, the code was modified from ArchR.
The differential peaks could calculated with DEseq2, edgeR, DiffBind or THOR. (The identification of differential peaks is still controversial and you should read the literature carefully).
HINT and TOBIAS recommended to find the TF footprints
This a HiC workflow (from fastq to hic file)
In HiC data, i use bwa to align reads to reference genome, the genome index is the same in whole genome sequence.
Then we need to generate the enzyme cutting site from genome.
python digest_genome.py -r dpnii -o hg38_digest.bed hg38.fa
awk -F '\t' '{print $1, $2, $3}' hg38_digest.bed > hg38_digest2.bed
The type of enzyme is depend on the HiC experiment, the built in enzyme is mboi: ["^GATC"], dpnii: ["^GATC"], bglii: ["A^GATCT"],hindiii: ["A^AGCTT"], arima: ["G^ANTC"].
Most scripts were modified form renlab
The downsteam analysis could preformed by Juicer
This is a RIP-seq workflow (from fastq.gz to peak), which is also suit for m6a-seq(meRIP)
The mapping method was consistent with RNA-seq and the Call peak method was consistent with to ChIP-seq.
For identification of differential peaks, exomePeak2 was recommended.
The workflow is rely on the package RiboMiner
- Generate the longest longest transcript annotation file
prepare_transcripts -g hg38.gtf -f hg38.fa -o hg38_ribo_utils
OutputTranscriptInfo -c hg38_ribo_utils/transcripts_cds.txt
-g ~/Desktop/hg38/hg38_transcript/genes.gtf
-f hg38_ribo_utils/transcripts_sequence.fa
-o ./hg38_ribo_utils/hg38_longest.transcripts.info.txt -O ./hg38_ribo_utils/hg38_all.transcripts.info.txt
GetProteinCodingSequence -i hg38_ribo_utils/transcripts_sequence.fa
-c ./hg38_ribo_utils/hg38_longest.transcripts.info.txt -o hg38_ribo_utils/hg38 --mode whole --table 1
- Build tRNA, rRNA index and longest transcript index
bowtie2-build Homo_sapiens_trRNA.fa hg38_trRNA
hisat2-build hg38_transcript_sequences.fa hg38_longest_transcript
- The snakemake workflow will generated the basic results, for the downstream analysis you should refer the document of RiboMiner
This is workflow is used for upsteam analysis of Single cell RNA-seq and Single cell ATAC-seq.
Currently only support the 10X genomics sequencing platform.
Before the analyse, you should download Cell Ranger form the website of 10X genomics.Cell Ranger was an integrated software for Single cell RNA-seq, Single cell ATAC-seq and 10X multiome.
If your used public data, you should rename you fastq following the naming rules of cellranger.
Then index your genome:
cellranger mkref
--genome genome
--fasta genome.fa
--genes genome_annotation.gtf
--nthreads=40
cellranger-arc mkref
--config=hg38_cellranger_arc.config --nthreads=75
The index of cellranger-arc is also suitable for scATAC-seq
For scRNA-seq:
cellranger count
--id sample_ID
--sample sample_ID
--localcores 40
--nosecondary
--fastqs fastq_folder/
--transcriptome genome_index
For scATAC-seq:
cellranger-atac count
--id=sample_name
--reference=genome_index
--fastqs=fastq_folder/
--localcores=40
For multiome:
cellranger-arc count --id=sample_ID
--reference=genome_index
--libraries=/home/jdoe/runs/libraries.csv
--localcores=40
--localmem=256
For other types of single cell data, follow the tutorials of STARsolo and sinto
If you famililar with R, Seurat, monocle, and ArchR framework were recommended to perform downstream analysis of singlecell multi-omics data.
If you famililar with python, Scanpy, scVI, and SnapATAC2 framework were recommended to perform downstream analysis of singlecell multi-omics data.