Source code for the key steps (including quality control of metagenomic raw reads and phage contig indentification) of VirMiner, which is available at http://sbb.hku.hk/VirMiner/.
The souce code was integrated into two parts: VirMiner PipelineForQC and VirMiner PipelineForVirContigIndentification
Here we provide the command-line perl scripts to quality control for metagenomic data, which was used in VirMiner. It can be used to process raw reads of metagenomic samples in FASTQ format by removing the adapters, low quality reads, bases or PCR duplicates.
Linux/macOS: A version of Perl is already installed
Windows: You may need to install one of the versions available at perl.org.
After installation, run in linux system or terminal apps, such as Terminal on macOS, and Command Prompt on Windows.
command-line perl scripts: ./scripts/PipelineForQC/fqc.pl
Run perl ./scripts/PipelineForQC/fqc.pl -h
to see the parameters of this command line.
$ perl ./scripts/PipelineForQC/fqc.pl -h
Usage for single: $0 $command -i s_1_IDX1_1.fastq -o s_1_IDX1
for Paired: $0 $command -f s_1_IDX1_1.fastq -r s_1_IDX1_2.fastq -p -o s_1_IDX1
$command <str> : adpter,quality,adp_qual and duplication
Options:
-i <str> : Fastq file (Single)
-f <str> : Forward fastq file (Paired-End)
-r <str> : Reverse fastq file (Paired-End)
-a <str> : Adapter sequence (Default: GATCGGAAGA)
-c <Ns> : Discard sequences lower than N quality. (Default: 20)
-q <Ns> : Quality format 33/64 (Default: 33)
-l <Ns> : Discard sequences shorter than N nucleotides. (Default: 30)
-o <str> : Output prefix
-p : Paired-End Model (Default: Single)
-v : Prints version of the program
-h : Prints this usage summary
A sample "run" command:
input metagenomic raw reads in pair-end FASTQ format (test_1.fastq, test_2.fastq):
perl fqc.pl all -p -f test_1.fastq -r test_2.fastq -o test_qc
Output clean reads after quality control : test_qc_1.fastq
and test_qc_2.fastq
.
input metagenomic raw reads in single FASTQ format (test_fastq):
perl fqc.pl all -i test.fastq -o test_qc
Output clean reads after quality control : test_qc.fastq
.
Before running this pipeline to identify phage contigs, you need to prepare input files: 1) fasta file of assembled contigs; 2) the clean reads in pair-end FASTQ format (refers to the output file of VirMiner pipelineForQC).
1.rpsblast (version 2.2.26)
2.Diamond
3.KOBAS (version 2.0) (Note: This software is required for KO annotation. Please follow "install.txt" in "kobas-2.1.1.tar.gz" to install kobas and copy "/your/path/to/kobas/scripts/annotate.py" to "/your/path/to/PipelineForVirContigIdentification/bin" folder.)
4.hmmsearch
5.bwa (version 0.7.12)
6.BLASTP (version 2.2.26)
7.R package randomForest
8.GeneMark
1.CDD
2.updated POG database (uPOGs)
3.viral hallmark
4.viral protein family
5.KO
6.pre-built_random_forest_model
You can download the files of databases from here (http://sbb.hku.hk/VirMiner/databases/).
You can download the test samples (including P5E0_test, P5E7_test etc.) from VirMiner website (http://147.8.185.62/VirMiner/tasks/exampleData/quality_control/) then place them in VirMiner/data/quality_control/
. For your better understanding, I will take P5E0_test as an example below to show how to run the scripts.
Firstly, create a folder named "VirMiner" under you own directory using the comand line:
mkdir VirMiner
Create a folder named "bin" under VirMiner folder using the comand line:
mkdir VirMiner/bin
Then put all the scripts of PipelineForVirContigIdentification in the "bin" folder, it should be like this:
Then make a new folder named "database" under the VirMiner folder (VirMiner/database
) using the command line:
mkdir VirMiner/database
Placing these files of databases in /VirMiner/database/
, you should have these files in your database folder:
Create a folder named "data" under VirMiner folder for deposing your data using the comand line:
mkdir VirMiner/data
Create the following folders under data folder to deposit the output files for each step using the comand line:
mkdir VirMiner/data/quality_control
mkdir VirMiner/data/genome_assembly
mkdir VirMiner/data/gene_prediction
mkdir VirMiner/data/functional_annotation
mkdir VirMiner/data/POG_2016_annotation
mkdir VirMiner/data/average_depth_relative_abundance
mkdir VirMiner/data/viral_contig_identification
Option 1. Clean reads in pair-end FASTQ format and the assembled contigs in FASTA format (PLEASE NOTE THAT AVOID SPACES IN YOUR CONTIG IDs):
Firstly you need to change the pair-end FASTQ file names and make it ended with "_qc_1.fastq" or "_qc_2.fastq",for example, "P5E0_test_qc_1.fastq" and "P5E0_test_qc_2.fastq". Then place them in VirMiner/data/quality_control/
. It should be like this:
Secondly, create a folder named "sample_name.assembly.idba" under genome_assembly folder.
mkdir VirMiner/data/genome_assembly/sample_name.assembly.idba
For example, if you have a sample named "P5E0", the command could be used like this:
mkdir VirMiner/data/genome_assembly/P5E0_test.assembly.idba
Then rename your contig file to "contig.fa" and put it in the directory: VirMiner/data/genome_assembly/sample_name.assembly.idba
, it should be like this:
Option 2. Clean reads in pair-end FASTQ format only:
Firstly you need to change the pair-end FASTQ file names and make it ended with "_qc_1.fastq" or "_qc_2.fastq",for example, "P5E0_test_qc_1.fastq" and "P5E0_test_qc_2.fastq". Then place them in VirMiner/data/quality_control/
in this case, you can choose IDBA_UD to do genome assembly using the command_line (if your pair-end FASTQ file named "P5E0_test_qc_1.fastq" and "P5E0_test_qc_2.fastq"):
/your/path/to/VirMiner/bin/fq2fa --merge /your/path/to/VirMiner/data/quality_control/P5E0_test_qc_1.fastq /your/path/to/VirMiner/data/quality_control/P5E0_test_qc_2.fastq /your/path/to/VirMiner/data/quality_control/P5E0_test_qc.fa
/your/path/to/VirMiner/idba_ud --min_contig 300 --mink 20 --maxk 101 --step 10 -r /your/path/to/VirMiner/data/quality_control/P5E0_test_qc.fa -o /your/path/to/VirMiner/data/genome_assembly/P5E0_test.assembly.idba --pre_correction
Notice: you may change the setting of --maxk to your raw read length.
In the folder VirMiner/data/gene_prediction
:
- The predicted gene in GFF format, which showed information of the start and end of predicted genes in contigs.
- The protein sequences of predicted genes.
- The number of predicted genes on each contig.
In the folder VirMiner/data/functional_annotation
:
- Genes annotated to KO groups.
- Genes annotated to Pfam groups.
- Genes annotated to viral protein families.
- Genes identified as viral hallmark genes.
- The number and the percentage of predicted genes annotated to KO groups on each contig.
- The number and the percentage of predicted genes annotated to Pfam groups on each contig.
- The number and the percentage of predicted genes annotated to viral protein families on each contig.
- The number of identified viral hallmark genes on each contig.
In the folder VirMiner/data/POG_2016_annotation
:
- Genes annotated to general POGs.
- Genes annotated to POGs with high VQ (VQ >0.8) that could be considered as virus-specific.
In the folder VirMiner/data/average_depth_relative_abundance
:
- The mapped reads count for each contig.
- The average depth for each contig.
In the folder VirMiner/data/viral_contig_identification
:
- The metrics table including functional information like KO, pfam, viral hallmark, viral protein families etc. and other metrics characterizing each contig such as average depth, which is used for phage contigs identification.
- The extracted of all the above metrics for predicted phage contigs.
- The sequence of predicted phage contigs in FASTA format.
1 If you have clean reads in pair-end FASTQ format only as your input file:
As mentioned above, firstly you can choose IDBA_UD to do genome assembly.
A sample "run" command:
Assume you have prepared these input files: /your/path/to/VirMiner/data/quality_control/P5E0_test_qc_1.fastq
and /your/path/to/VirMiner/data/quality_control/P5E0_test_qc_2.fastq
/your/path/to/VirMiner/bin/fq2fa --merge /your/path/to/VirMiner/data/quality_control/P5E0_test_qc_1.fastq /your/path/to/VirMiner/data/quality_control/P5E0_test_qc_2.fastq /your/path/to/VirMiner/data/quality_control/P5E0_test_qc.fa
/your/path/to/VirMiner/idba_ud --min_contig 300 --mink 20 --maxk 101 --step 10 -r /your/path/to/VirMiner/data/quality_control/P5E0_test_qc.fa -o /your/path/to/VirMiner/data/genome_assembly/P5E0_test.assembly.idba --pre_correction
cd /your/path/to/VirMiner/data
sh /your/path/to/VirMiner/bin/Pipeline_For_Viral_Contig_Indentification.sh P5E0_test
2 If you already have clean reads in pair-end FASTQ format and the assembled contigs in FASTA format as your input files:
A sample "run" command:
Assume you have prepared these input files: /your/path/to/VirMiner/data/genome_assembly/P5E0_test.assembly.idba/contig.fa
,/your/path/to/VirMiner/data/quality_control/P5E0_test_qc_1.fastq
and /your/path/to/VirMiner/data/quality_control/P5E0_test_qc_2.fastq
cd /your/path/to/VirMiner/data
sh /your/path/to/VirMiner/bin/Pipeline_For_Viral_Contig_Indentification.V2.sh P5E0_test