nf-core/viralrecon is an open-source Nextflow pipeline that performs most of the bioinformatics steps that users with viral amplicon data typically need: read quality control, alignment, variant calling, consensus-sequence calling, lineage classification, and so on. Rather than writing our own pipeline for these tasks, we downloaded viralrecon and configured it according to the following specifications.
Depending on what resources were available, we ran viralrecon on input FASTQ files either on the University of Wisconsin-Madison's Center for High Throughput Computing (CHTC) cluster or on a local Apple Mac Studio owned by David H. O'Connor's lab.
CHTC maintains a high-performance computing cluster that uses the Slurm scheduler, which viralrecon supports. To run the pipeline on this manuscript's samples there, we first made a directory in the cluster's scratch space, downloaded viralrecon, and transferred the input files, like so:
- First, connect to the cluster with
ssh <USERNAME>@spark-login.chtc.wisc.edu
- Change into a scratch working directory with
cd /scratch/<USERNAME>/<WORKING_DIR>
- Download the viralrecon files with
git clone --branch 2.6.0 https://github.com/nf-core/viralrecon.git .
- Transfer the FASTQ files onto the cluster (run this from the machine that holds the FASTQs) with
rsync -azvP <FASTQ_DIR> <USERNAME>@spark-login.chtc.wisc.edu:/scratch/<USERNAME>/<WORKING_DIR>
- Alongside the FASTQs, we also included the CSV samplesheet that viralrecon requires as input. This samplesheet must have the following column headers, representing the sample identifier, the "R1" FASTQ, and the associated "R2" FASTQ for each sample:
sample,fastq_1,fastq_2
- Finally, we transferred the BED-formatted file of primer coordinates that is included in the repo's viralrecon_setup directory.
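The samplesheet described above can be written by hand, but for many samples it may be easier to generate it. The following is a minimal sketch, not the script we used: the function name make_samplesheet is hypothetical, and it assumes paired files are named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz, which may not match your sequencer's naming scheme.

```shell
#!/bin/sh
# Sketch: build a viralrecon Illumina samplesheet from paired FASTQ files.
# ASSUMPTION: files are named <sample>_R1.fastq.gz / <sample>_R2.fastq.gz;
# adjust the suffixes below if your files are named differently.
make_samplesheet() {
    fastq_dir="$1"
    # Header row required by viralrecon for Illumina input
    echo "sample,fastq_1,fastq_2"
    for r1 in "$fastq_dir"/*_R1.fastq.gz; do
        # If nothing matched, the glob stays literal; skip it
        [ -e "$r1" ] || continue
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        if [ ! -e "$r2" ]; then
            echo "warning: no R2 mate for $r1, skipping" >&2
            continue
        fi
        sample="$(basename "$r1" _R1.fastq.gz)"
        echo "${sample},${r1},${r2}"
    done
}
```

For example, make_samplesheet <FASTQ_DIR> > <SAMPLESHEET_NAME>.csv writes one row per complete R1/R2 pair and warns on standard error about any R1 file without a mate.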
Separately, we installed nextflow and singularity in the user's home directory using conda. To set up conda on a cluster yourself, we recommend following the official Miniconda installation instructions. In short, download and run the miniconda3 installer with:
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
Then, initialize conda with:
~/miniconda3/bin/conda init bash
Make conda available in the command line with:
echo 'export PATH=$PATH:~/miniconda3/bin' >> ~/.bashrc
And finally, use conda to install nextflow and singularity:
conda install -c bioconda nextflow singularity
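Before launching a run, it can help to confirm that both tools are actually on the PATH in a fresh login shell. The helper below is only an illustrative sketch; check_tools is a hypothetical name and is not part of conda, Nextflow, or viralrecon.

```shell
#!/bin/sh
# Sketch: confirm that required command-line tools are available on PATH.
# Prints each missing tool to stderr and returns 1 if any is missing.
check_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

Running check_tools nextflow singularity after opening a new shell should print nothing if the installation above succeeded.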
Being a Nextflow pipeline, viralrecon makes heavy use of configuration files. To configure viralrecon to run via the Slurm scheduler, we used the following configuration file:
process {
    // container/environment settings
    singularity.enabled = true
    singularity.cacheDir = "work/singularity/"
    // slurm job settings
    executor = "slurm"
    clusterOptions = "--partition=shared"
    memory = '64 GB'
    // lower iVar threshold settings to retain more low-frequency variants
    withName: 'IVAR_VARIANTS' {
        ext.args = '-t 0.01 -q 20 -m 100'
    }
}
This file is also available as slurm.config in the repo's viralrecon_setup directory. Overall, the config file enables the Singularity container engine instead of the default Docker engine, sets the executor to Slurm, tells the CHTC cluster to use its shared resources partition, and requests 64 gigabytes of RAM per individual task. We also lower the minimum variant frequency threshold for the pipeline's variant caller, iVar, to 0.01 (-t 0.01) and require a minimum read depth of 100 (-m 100).
Finally, to run the pipeline on all of the assembled input, we used the following command for Illumina reads. The config file, the samplesheet, and the primer BED file are user-provided:
nextflow run . \
    -config viralrecon-hpc.config \
    -profile singularity \
    --input <SAMPLESHEET_NAME>.csv \
    --outdir results \
    --platform illumina \
    --protocol amplicon \
    --genome 'MN908947.3' \
    --primer_bed DIRECTwithBoosterAprimersfinal.bed \
    --primer_left_suffix '_LEFT' \
    --primer_right_suffix '_RIGHT' \
    --skip_assembly
And the following for Nanopore reads. Here, the samplesheet and the FASTQ folder are user-provided, and --primer_set_version 1200 is the user-provided setting for the Midnight primer set:
nextflow run . \
    -c run_configs/29759/viralrecon-hpc.config \
    --input <SAMPLESHEET_NAME>.csv \
    --outdir results \
    --platform nanopore \
    --genome 'MN908947.3' \
    --primer_set_version 1200 \
    --fastq_dir <FASTQ_DIR>/ \
    --artic_minion_caller medaka \
    --artic_minion_medaka_model r941_min_high_g360
Note that, as per the viralrecon documentation, the Nanopore samplesheet format is a bit different, e.g.:
sample,barcode
21X983255,1
70H209408,2
49Y807476,3
70N209581,4
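Because the Nanopore samplesheet refers to samples by barcode number rather than by file path, a mismatch between the samplesheet and the FASTQ folder is easy to miss. The sketch below checks each barcode against the folder before a run. It assumes the folder passed to --fastq_dir contains one subdirectory per barcode named barcode01, barcode02, and so on; verify that layout against your own basecaller output. The function name check_barcodes is hypothetical.

```shell
#!/bin/sh
# Sketch: check that every barcode listed in a Nanopore samplesheet has a
# matching subdirectory in the FASTQ folder.
# ASSUMPTION: subdirectories are named barcode01, barcode02, ...
# ASSUMPTION: barcodes in the samplesheet are plain integers without leading zeros.
check_barcodes() {
    samplesheet="$1"
    fastq_dir="$2"
    rc=0
    # Read rows from a here-document (not a pipe) so rc survives the loop
    while IFS=, read -r sample barcode; do
        [ -n "$sample" ] || continue
        dir="$(printf '%s/barcode%02d' "$fastq_dir" "$barcode")"
        if [ ! -d "$dir" ]; then
            echo "missing: $dir (sample $sample)" >&2
            rc=1
        fi
    done <<EOF
$(tail -n +2 "$samplesheet" | tr -d '\r')
EOF
    return "$rc"
}
```

For example, check_barcodes <SAMPLESHEET_NAME>.csv <FASTQ_DIR> returns a non-zero status and names each sample whose barcode folder is absent.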
All of the above steps were broadly similar for setup on the Mac Studio, with a few exceptions:
- Configuration was much simpler; see the file mac_studio.config:
process {
    // lower iVar threshold settings to retain more low-frequency variants
    withName: 'IVAR_VARIANTS' {
        ext.args = '-t 0.01 -q 20 -m 100'
    }
}
- We used the Docker container engine instead of Singularity; Docker must be manually installed on the Mac Studio.
- More of the input-file transfer was done by hand in the Finder file explorer rather than with the rsync command-line utility.
- Only Illumina reads were processed locally.
As such, our viralrecon run commands on the Mac Studio looked as follows (the samplesheet and the primer BED file are user-provided):
nextflow run . \
    --input <SAMPLESHEET_NAME>.csv \
    --outdir results \
    --platform illumina \
    --protocol amplicon \
    --genome 'MN908947.3' \
    --primer_bed DIRECTwithBoosterAprimersfinal.bed \
    --primer_left_suffix '_LEFT' \
    --primer_right_suffix '_RIGHT' \
    --skip_assembly