Skip to content

yuanlizhanshi/NGS_Workflow

Repository files navigation

NGS workflow

This is a series of Next generation sequencing (illumina short reads sequencing) workflow created by Snakemake

Table of Contents

Setup

All workflow all have its requirements softwares which were shown in environment.yaml. You could install via:

conda install --file environment.yaml

Usage

For dry use

snakemake -s snakemake.smk -np

Run workflow

snakemake -s snakemake.smk -c 4

-c: Default use 4 cores

Bulk RNA Sequencing workflow

This is a RNA-seq workflow (from fastq.gz to count)

It provide two forms of RNA-seq workflow (mapped by STAR or Hisat2).

The diagram of workflow was shown in folder Flow diagram

Before the RNA-seq, the genome index should be built first.

The bulid method was shown:

STAR:

STAR
--runMode genomeGenerate
--genomeDir ~/STARindex
--runThreadN 40
--genomeFastaFiles chr1.fa
--sjdbGTFfile chr1.gtf
--sjdbOverhang 149

If your reads length is much shorter than 150 bp, your should change --sjdbOverhang to reads length - 1

Hisat2

hisat2_extract_splice_sites.py chr1.gtf > splice.txt
hisat2_extract_exons.py chr1.gtf > exons.txt
hisat2-build -p 40 chr1.fa --ss splice.txt --exon exons.txt chr1_index

If you don't have a high-quality reference genome for your species or want to identify more transcripts such as long non-coding RNA, using stringtie after mapping is suggested.


Whole Genome Sequencing workflow

This is a Whole genome sequencing workflow (from fastq.gz to vcf.gz)
Note: This vcf.gz needs further quality control in further study.

If you install GATK with conda, please check the version of gatk via:

gatk --version

if your GATK version is lower than 4.0, please install gatk manually from github:https://github.com/broadinstitute/gatk/releases

Note: GTAK rely on java environment, if you don't have java, should use install java first:

conda instll openjdk -y

Before the WGS, the genome index should be built first by bwa, samtools,and GATK:

bwa index genome.fa
samtools faidx genome.fa
gatk CreateSequenceDictionary -R genome.fa -O genome.dict


Whole Genome Bisulfite Sequencing workflow

This is a Whole genome bisulfite sequencing workflow (from fastq.gz to bed and other downsteam formats).
There are two types of workflow: map by bwa or by bismark, you can choose one of the workflow.

For bwa-meth:
bwa-meth is available at https://github.com/brentp/bwa-meth, please insatall it and its dependencies first
Then bulid index:

bwameth.py index genome.fa

The final output file has two formats: methylKit and cytosine_report. The detailed description of the output file format were in https://github.com/dpryan79/MethylDackel

For bismark:
bismark is available at conda and https://github.com/FelixKrueger/Bismark, Genome index also need to build first:

bismark_genome_preparation ./genome/

Before running this snakemake, there are many parameters that need to change in the snakemake file, such as the aligner, the mismatch number, and whether the strand specific library, etc. you should read the document of bismark carefully.


Chromatin Immunoprecipitation (ChIP) sequencing workflow

This is a ChIP-seq workflow (from fastq.gz to peak). It was also suitable for CUT&RUN and CUT&Tag experiments. Before the ChIP-seq workflow, the genome index of bowtie2 should be built first.

bowtie2-build genome.fa genome

There are also many parameters need change in snakemake file. Its depends on your ChIP experiment research object (Transcription factors or Histone modification), research species (Genome size) and replication numbers. you should read the paper of MACS2 carefully.

Deeptools was recommended to generate the meta-plot and heatmap on a reference point or certain genomic regions.

HOMER and MEME suite were recommended to perform peak annotaion and TF Motif enrichment and analysis.

ChromHMM was recommended to peroform chromatin state segmentation.


ATAC-seq workflow

This is a ATAC-seq workflow (from fastq.gz to peak).

The workflow of the ATAC-seq is similar to the ChIP-seq, so the genome index was the same, the only difference is Peak calling (ATAC-seq without input sample).

This workflow is also appropriate for other open chromatin sequencing methods such as Dnase-Seq, MNase-Seq, and FAIRE-Seq, just might need to change the parameters of MACS2.

The genome blacklist could be found in Blacklist

After Peak calling, To get all peaks information and for downsteam analysis, you should use the following code to generate reproducible peak:

indentifyReproduciblePeaks.R

Because for multi sample, it is recommended to merge peaks with iterative Overlap Peak Merging Procedure, which was first introduced in Corces & Granja et. al. Science 2018 and recommonded in Grandi et al. Nat Protoc 2022, the code was modified from ArchR.

The differential peaks could calculated with DEseq2, edgeR, DiffBind or THOR. (The identification of differential peaks is still controversial and you should read the literature carefully).
HINT and TOBIAS recommended to find the TF footprints


HiC workflow

This a HiC workflow (from fastq to hic file)

In HiC data, i use bwa to align reads to reference genome, the genome index is the same in whole genome sequence.

Then we need to generate the enzyme cutting site from genome.

python digest_genome.py -r dpnii -o hg38_digest.bed hg38.fa
awk -F '\t' '{print $1, $2, $3}' hg38_digest.bed > hg38_digest2.bed

The type of enzyme is depend on the HiC experiment, the built in enzyme is mboi: ["^GATC"], dpnii: ["^GATC"], bglii: ["A^GATCT"],hindiii: ["A^AGCTT"], arima: ["G^ANTC"].

Most scripts were modified form renlab

The downsteam analysis could preformed by Juicer


RNA Immunoprecipitation (RIP) sequencing workflow

This is a RIP-seq workflow (from fastq.gz to peak), which is also suit for m6a-seq(meRIP)

The mapping method was consistent with RNA-seq and the Call peak method was consistent with to ChIP-seq.
For identification of differential peaks, exomePeak2 was recommended.


Ribo sequencing workflow

The workflow is rely on the package RiboMiner

  1. Generate the longest longest transcript annotation file

prepare_transcripts -g hg38.gtf -f hg38.fa -o hg38_ribo_utils

OutputTranscriptInfo -c hg38_ribo_utils/transcripts_cds.txt
-g ~/Desktop/hg38/hg38_transcript/genes.gtf
-f hg38_ribo_utils/transcripts_sequence.fa
-o ./hg38_ribo_utils/hg38_longest.transcripts.info.txt -O ./hg38_ribo_utils/hg38_all.transcripts.info.txt

GetProteinCodingSequence -i hg38_ribo_utils/transcripts_sequence.fa
-c ./hg38_ribo_utils/hg38_longest.transcripts.info.txt -o hg38_ribo_utils/hg38 --mode whole --table 1

  1. Build tRNA, rRNA index and longest transcript index

bowtie2-build Homo_sapiens_trRNA.fa hg38_trRNA

hisat2-build hg38_transcript_sequences.fa hg38_longest_transcript

  1. The snakemake workflow will generated the basic results, for the downstream analysis you should refer the document of RiboMiner

Single cell multi-omics workflow

This is workflow is used for upsteam analysis of Single cell RNA-seq and Single cell ATAC-seq. Currently only support the 10X genomics sequencing platform.
Before the analyse, you should download Cell Ranger form the website of 10X genomics.Cell Ranger was an integrated software for Single cell RNA-seq, Single cell ATAC-seq and 10X multiome. If your used public data, you should rename you fastq following the naming rules of cellranger.
Then index your genome:

cellranger mkref
--genome genome
--fasta genome.fa
--genes genome_annotation.gtf
--nthreads=40

cellranger-arc mkref
--config=hg38_cellranger_arc.config --nthreads=75

The index of cellranger-arc is also suitable for scATAC-seq

For scRNA-seq:

cellranger count
--id sample_ID
--sample sample_ID
--localcores 40
--nosecondary
--fastqs fastq_folder/
--transcriptome genome_index

For scATAC-seq:

cellranger-atac count
--id=sample_name
--reference=genome_index
--fastqs=fastq_folder/
--localcores=40

For multiome:

cellranger-arc count --id=sample_ID
--reference=genome_index
--libraries=/home/jdoe/runs/libraries.csv
--localcores=40
--localmem=256

For other types of single cell data, follow the tutorials of STARsolo and sinto

If you famililar with R, Seurat, monocle, and ArchR framework were recommended to perform downstream analysis of singlecell multi-omics data.

If you famililar with python, Scanpy, scVI, and SnapATAC2 framework were recommended to perform downstream analysis of singlecell multi-omics data.

About

Some NGS workflow created by snakemake

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published