NB - this code / pipeline remains mostly untouched since it was first written by me in 2013 / 14! It was designed for a specific purpose and the intention was for it to be used by trained individuals, i.e., NOT general usage. It would be designed differently were it written today. Placing it here on GitHub is primarily to archive it. Although it runs AOK in its current state, anyone re-using this code should be aware of this and make changes as you see fit.
Automated next generation DNA sequencing analysis pipeline 'suited' for clinical tests, with >99.9% sensitivity to Sanger sequencing at read-depth>18 over target regions over interest.
This clinical-grade analysis pipeline, ClinicalGradeDNAseq, is a watered-down and modified version of the work by Blighe, Beauchamp, and colleagues at Sheffield Diagnostic Genetics Service, Sheffield Children's National Health Service (NHS) Foundation Trust, Sheffield, UK, and their efforts to introduce a clinical-grade next generation sequencing (NGS) analysis pipeline fully validated against Sanger di-deoxy sequencing.
The pipeline is built using open source programs mixed with customised scripts. A wrapper script manages command line parameters and then executes the master analysis script. Control is then returned to the wrapper, where results files are transferred to a remote server via SSH/sFTP. A master and concise log is kept, with date- and time-stamps.
The unique feature of the analysis pipeline that increases sensitivity to Sanger sequencing is in the variant calling step, where a final aligned BAM is split into 3 'sub-BAMs' of 75%, 50%, and 25% random reads. Variants are then called on all 4 BAMs, after which a consensus VCF is produced.
This pipeline has been modified from the original version and proceeds in a 8-step process:
- Adaptor and read quality trimming - TrimGalore! (Krueger F), FastQC (Andrews S), cutadapt (Martin M, 2011)
- Alignment - bwa mem (Li & Durbin, 2009)
- Marking and removing PCR duplicates - Picard (Broad Institute of MIT and Harvard), SAMtools (Li et al., 2009)
- Remove low mapping quality reads - SAMtools (Li et al., 2009)
- QC - SAMtools (Li et al., 2009), custom scripts
- Downsampling / random read sampling - Picard (Broad Institute of MIT and Harvard)
- Variant calling - SAMtools/BCFtools (Li et al., 2009)
- Annotation - Variant Effect predictor (McLaren et al., 2016)
- Paired-end FASTQ files
- Chromosome ordered BED file in hg19 / hg38 (BED files can be sorted with sort -k1,1V -k2,2n)
- Global installations of Cutadapt and unix2dos (included in dos2unix)
- Valid credentials for returning results files to remote server (username / password) via SSH/sFTP
-
Run the ‘PipelineWrapper’ wrapper script, which will check command-line parameters, execute the master script, and then return results files via SSH/sFTP to remote server. Use the following parameters:
- FASTQ mate-pair 1 (absolute file path)
- FASTQ mate-pair 2 (absolute file path)
- Reference genome FASTA (absolute file path)
- Run number (e.g. Plate6, Plate7, etc.) (alphanumeric)
- Patient ID (alphanumeric)
- BED file (absolute file path)
- Minimum quality for bases at read ends, below which bases will be cut (integer)
- Minimum allowed read length (integer)
- Adaptor for trimming off read ends ('illumina' / 'nextera' / 'small_rna')
- Minimum read depth for calling a variant (integer)
- Minimum allowed mapping quality (integer)
- Stringency for calling variants ('relaxed' / 'normal')
- Directory where results will be output (absolute file path)
- User initials (alphanumeric)
- *_AnalysisLog.txt - analysis log (short)
- Master.log - analysis log (comprehensive)
- *_R1_001.fastq.gz_trimming_report.txt - details on base and read trimming for mate-pair 1
- *_R1_001_val_1_fastqc.html - FastQC report for mate-pair 1 (after trimming)
- *_R2_001.fastq.gz_trimming_report.txt - details on base and read trimming for mate-pair 2
- *_R2_001_val_2_fastqc.html - FastQC report for mate-pair 2 (after trimming)
- *_Alignment.txt - alignment metrics
- *_ReadsOffTarget.txt - number of reads falling outside regions specified in BED file
- *_PCRDuplicates.txt - details on identified PCR duplicates
- *_CoverageTotal.bedgraph - coverage for all mapped locations (contiguous bases at same read depth are merged into regions)
- *_MeanCoverageBED.bedgraph - mean read depth for each region specified in supplied BED file
- *_PerBaseDepthBED.bedgraph - per base read depth for each base in each region specified in supplied BED file
- *_Aligned_Sorted_PCRDuped_FiltMAPQ.bam - aligned BAM file with sorted reads, PCR duplicates removed, and reads below mapping quality threshold removed
- *_Aligned_Sorted_PCRDuped_FiltMAPQ.bam.bai - index for above BAM file
- *_Final.vcf - final VCF file
- *_AnnotationVEP.html - HTML report of variant annotation, with consequences for all known transcript isoforms
- *_AnnotationVEP.tsv - as above but in tab-separated values (TSV) format
- PipelineWrapper.sh, line 120: /home/ubuntu/pipeline/AnalysisMasterVersion1.sh "${Read1}" "${Read2}" ... - absolute path filename for AnalysisMasterVersion1.sh
- PipelineWrapper.sh, line 127: remoteDir="/remote/SAMBA/share/" - Remote server directory to which results files will be transferred via SSH/sFTP
- PipelineWrapper.sh, line 139, 150: sshpass -e sftp [email protected] << ! - Remote server IP address or host name to which results files will be transferred via SSH/sFTP
- AnalysisMasterVersion1.sh, lines 25-35 - root directories (absolute paths) of required programs
- Andrews S, FastQC, https://www.bioinformatics.babraham.ac.uk/projects/fastqc/, last accessed 28th August 2017.
- Broad Institute of MIT and Harvard, Picard, http://broadinstitute.github.io/picard/, last accessed 28th August 2017
- Krueger F, Trim Galore!, https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/, last accessed 28th August 2017.
- Li H and Durbin R (2009), Fast and accurate short read alignment with Burrows-Wheeler transform, Bioinformatics 25(14): 1754–1760.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup (2009), The Sequence Alignment/Map format and SAMtools, Bioinformatics 25(16):2078-9.
- Martin M (2011), Cutadapt removes adapter sequences from high-throughput sequencing reads, EMBnet.journal 17(1): 10-12.
- McLaren W, Gil L, Hunt S, Riat H, Ritchie G, Thormann A, Flicek P, Cunningham F (2016), The Ensembl Variant Effect Predictor, Genome Biology 17:122.
- Kevin Blighe (Sheffield Children's NHS Foundation Trust)
- Nick Beauchamp (Sheffield Children's NHS Foundation Trust)
- Darren Grafham (Sheffield Children's NHS Foundation Trust)
- Sheffield Diagnostics Genetics Service