This is a Snakemake workflow, based on the obitools suite of programs, that analyzes DNA metabarcoding data.
Sequence analysis is performed with the obitools (Boyer et al. 2016) and sumaclust (Mercier et al. 2013) through a Snakemake pipeline (Mölder et al. 2021).
In order to run the workflow, the following languages/programs are required:
Please note that the workflow currently runs only on Unix systems.
Clone the repository:
git clone https://github.com/AnneSoBen/obitools_workflow.git
The repository contains five folders:
- config/: contains the configuration file of the Snakemake workflow (config.yaml). This is where the value of the options for the various commands used is defined.
- log/: where log files of each rule are written.
- resources/: where you should download/copy your raw data (cf. Download your data).
- results/: where all output files are written.
- workflow/: contains the Snakemake workflow (Snakefile), the configuration file of the submission parameters on the cluster (cluster.yaml) and the script to submit the workflow on the cluster (sub_smk.sh).
Download or copy your data into the resources/ folder. Three files are required:
- forward and reverse fastq files
- the corresponding ngsfilter file
They should be named prefix_R1.fastq, prefix_R2.fastq and prefix_ngsfilter.tab, and placed in a subfolder whose name is the prefix of the files (see Example).
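
The naming convention above can be checked before launching the workflow. The following is a minimal sketch, not part of the repository: `check_inputs` is a hypothetical helper name, and it only assumes the layout described above (one subfolder per prefix, containing the three files). It is written for a POSIX shell, so its variables are not local to the function.

```shell
# Hypothetical helper (not part of the workflow): report whether the three
# input files expected for one library prefix are present.
# Usage: check_inputs <resources_dir> <prefix>
# Example: check_inputs resources wolf_diet && echo "inputs OK"
check_inputs() {
    resources_dir=$1
    prefix=$2
    missing=0
    for suffix in _R1.fastq _R2.fastq _ngsfilter.tab; do
        file="$resources_dir/$prefix/$prefix$suffix"
        # -e follows symbolic links, so linked files count as present
        if [ ! -e "$file" ]; then
            echo "missing: $file" >&2
            missing=1
        fi
    done
    # exit status 0 when all three files exist, 1 otherwise
    return "$missing"
}
```

Missing files are listed on standard error, and the exit status can be used to stop a submission script early.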
Before running the workflow, the configuration file (config/config.yaml) has to be edited. The parameters that can be set are listed in the table below:
parameter | description | concerned rule(s) | default value | comment |
---|---|---|---|---|
tomerge | whether to merge libraries before dereplication | merge_demultiplex | FALSE | should be set to 'TRUE' if you analyse several libraries that you want to merge |
resourcesfolder | relative path to the folder containing resource files (fastq files and ngsfilter) | split_fastq, demultiplex | ../resources | should not be changed, unless you want to rename the folder |
resultsfolder | relative path to the folder where output files will be written | all | ../results | should not be changed, unless you want to rename the folder |
fastqfiles | prefix of the name of the resource fastq files and ngsfilter | all | wolf_diet | must be changed to match the prefix of your file names |
mergedfile | prefix of the name of the output files if tomerge=TRUE | merge_demultiplex, split_fasta, derepl, merge_derepl, basicfilt, clustering, merge_clust, tab_format | wolf_diet | must be changed to the prefix you want for the merged file names |
split_fastq:nfiles | number of files to create when splitting fastq files for pairing | split_fastq | 2 | should be changed according to the size of your dataset: the bigger it is, the more you will want to split your initial files - useful only on multi-threaded systems |
minscore | minimum alignment score required for pairing | alifilt | 40.00 | set according to Taberlet et al. 2018 |
split_fasta:nfiles | number of files to create when splitting demultiplexed fasta files for dereplication | split_fasta | 2 | should be changed according to the size of your dataset: the bigger it is, the more you will want to split your initial file(s) |
minlength | minimum sequence length (in bp) | basicfilt | 80 | must be changed according to the minimum length expected for your barcode |
mincount | minimum number of reads per unique sequence | basicfilt | 1 | it's up to you! |
minsim | similarity threshold for clustering | clustering | 0.97 | it's up to you! |
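
Since the nfiles values depend on the size of your dataset, it can help to know how many reads a library contains before choosing them. The sketch below is an illustration only: `count_reads` is a hypothetical helper, and it assumes uncompressed FASTQ files with the usual four lines per record (no wrapped sequence lines). It does not suggest a value for nfiles; that choice remains yours.

```shell
# Hypothetical helper: count the reads in an uncompressed FASTQ file,
# assuming four lines per record.
# Usage: count_reads <fastq_file>
count_reads() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}
```

For example, `count_reads resources/wolf_diet/wolf_diet_R1.fastq` prints the number of forward reads of the example library once the files are in place.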
If you run the workflow on a SLURM cluster, you must also check the workflow/cluster.yaml file, which sets the resources available for each rule.
Then, run the workflow:
cd workflow
conda activate snakemake
snakemake -c1 --use-conda
Alternatively, you can run the workflow with a single command on a SLURM cluster by submitting the sub_smk.sh file:
cd workflow
sbatch sub_smk.sh
If you want to test the workflow, download the toy dataset from the obitools tutorial (https://pythonhosted.org/OBITools/wolves.html) into the resources/ folder:
wget -O resources/wolf_tutorial.zip https://pythonhosted.org/OBITools/_downloads/wolf_tutorial.zip
unzip resources/wolf_tutorial.zip -d resources/
mv resources/wolf_tutorial resources/wolf_diet
rm resources/wolf_tutorial.zip
Rename the files to fit the template described above (or create symbolic links):
cd resources/wolf_diet
ln -s wolf_F.fastq wolf_diet_R1.fastq
ln -s wolf_R.fastq wolf_diet_R2.fastq
ln -s wolf_diet_ngsfilter.txt wolf_diet_ngsfilter.tab
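
For your own data, the three links above can be generalized. The following is a sketch under stated assumptions: `link_inputs` is a hypothetical helper (it does not exist in the repository), the library folder is named after the file prefix, and the three source files already sit in that folder.

```shell
# Hypothetical helper: create the three symbolic links the workflow expects
# inside a library folder whose name is the file prefix.
# Usage: link_inputs <library_dir> <forward_fastq> <reverse_fastq> <ngsfilter>
# The three file names are given relative to <library_dir>.
link_inputs() {
    library_dir=$1
    prefix=$(basename "$library_dir")
    (
        # run in a subshell so the caller's working directory is unchanged
        cd "$library_dir" || exit 1
        ln -s "$2" "${prefix}_R1.fastq" &&
        ln -s "$3" "${prefix}_R2.fastq" &&
        ln -s "$4" "${prefix}_ngsfilter.tab"
    )
}
```

With the toy dataset, `link_inputs resources/wolf_diet wolf_F.fastq wolf_R.fastq wolf_diet_ngsfilter.txt` would create the same links as the commands above. The function fails if a link already exists.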
You should get this directory and file structure:
tree
.
├── config
│   └── config.yaml
├── LICENSE
├── log
├── README.md
├── resources
│   └── wolf_diet
│       ├── db_v05_r117.fasta
│       ├── embl_r117.ndx
│       ├── embl_r117.rdx
│       ├── embl_r117.tdx
│       ├── wolf_diet_ngsfilter.tab -> wolf_diet_ngsfilter.txt
│       ├── wolf_diet_ngsfilter.txt
│       ├── wolf_diet_R1.fastq -> wolf_F.fastq
│       ├── wolf_diet_R2.fastq -> wolf_R.fastq
│       ├── wolf_F.fastq
│       └── wolf_R.fastq
├── results
└── workflow
    ├── cluster.yaml
    ├── Snakefile
    └── sub_smk.sh
Note that the name of the subfolder containing your source files (fastq and ngsfilter files) should be the prefix of the files.
The config.yaml file is already set up for this dataset.
Now run the workflow:
cd ../../workflow/
conda activate snakemake
snakemake -c1 --use-conda
You may want to merge libraries, for example if technical replicates are split across different libraries. To do so, set the value of "tomerge" in the config/config.yaml file to TRUE and list the prefixes of your library files in the same file, for example:
tomerge:
  TRUE
resourcesfolder:
  ../resources/
resultsfolder:
  ../results/
fastqfiles:
  - myfirstlibfileprefix
  - mysecondlibfileprefix
mergedfile:
  mymergedlibs
The source files of each library should be in separate subfolders. For example:
└── resources
    ├── myfirstlibfileprefix
    │   ├── myfirstlibfileprefix_ngsfilter.tab
    │   ├── myfirstlibfileprefix_R1.fastq
    │   └── myfirstlibfileprefix_R2.fastq
    └── mysecondlibfileprefix
        ├── mysecondlibfileprefix_ngsfilter.tab
        ├── mysecondlibfileprefix_R1.fastq
        └── mysecondlibfileprefix_R2.fastq
Two ngsfilter files are needed: resources/myfirstlibfileprefix/myfirstlibfileprefix_ngsfilter.tab and resources/mysecondlibfileprefix/mysecondlibfileprefix_ngsfilter.tab.
The value of "mergedfile" is the prefix given to the merged files, from the dereplication step to the end of the workflow.
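
When many libraries are merged, the "fastqfiles" list can be derived from the subfolders of resources/ instead of being typed by hand. The following is a minimal sketch: `list_libraries` is a hypothetical helper, and it assumes that every subfolder of the resources folder is a library named after its file prefix. Its output is meant to be reviewed and pasted into config/config.yaml, not written there automatically.

```shell
# Hypothetical helper: print a "fastqfiles" entry listing every subfolder
# of the resources folder, one library prefix per line.
# Usage: list_libraries <resources_dir>
list_libraries() {
    echo "fastqfiles:"
    for dir in "$1"/*/; do
        # skip the unexpanded pattern when there is no subfolder
        [ -d "$dir" ] || continue
        echo "  - $(basename "$dir")"
    done
}
```

Run from the repository root, `list_libraries resources` prints the subfolder names in the order the shell expands the glob.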
You may want to clean up potential molecular artifacts: have a look at the R package metabaR!
Thanks to Lucie Zinger, Frédéric Boyer, Céline Mercier and Clément Lionnet for their help with the obitools! Also thanks to the ECOFEED project for funding the development of the first version of this workflow.
Anne-Sophie Benoiston. (2022). AnneSoBen/obitools_workflow: v1.0.2. GitHub. https://doi.org/10.5281/zenodo.6676577.
🚩 Don't forget to cite this repository if you use it for your research 🙂
Boyer, F., Mercier, C., Bonin, A., Bras, Y. L., Taberlet, P., & Coissac, E. (2016). obitools: A unix-inspired software package for DNA metabarcoding. Molecular Ecology Resources, 16(1), 176‑182.
Mercier, C., Boyer, F., Bonin, A., & Coissac, E. (2013). SUMATRA and SUMACLUST: fast and exact comparison and clustering of sequences. In Programs and Abstracts of the SeqBio 2013 workshop. Abstract (pp. 27-29).
Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., Sochat, V., ... & Köster, J. (2021). Sustainable data analysis with Snakemake. F1000Research, 10.
Zinger, L., Lionnet, C., Benoiston, A. S., Donald, J., Mercier, C., & Boyer, F. (2021). metabaR: an R package for the evaluation and improvement of DNA metabarcoding data quality. Methods in Ecology and Evolution, 12(4), 586-592.