Authors: Martin Hölzer, Maximilian Arlt, Eva Aßmann
Protein-coding annotation extension using additional homology searches against larger databases.
The HyPro tool extends common protein-coding annotations made with Prokka using additional homology searches. The approach currently takes a GFF input file, extracts the sequences of hypothetical proteins and searches them against a selected database (available are UniProtKB, UniRef50, UniRef90, UniRef100 and Protein DB) to find homologs. For searching, MMseqs2 is used, which offers fast and accurate sequence comparison.
The tool has been tested with Conda (v4.10.1) and Docker Engine (v20.10.7).
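The extraction step described above can be illustrated with a minimal sketch. It is not HyPro's actual code: the function name `hypothetical_cds_ids`, the locus tags and the example records are made up for illustration, and it assumes a Prokka-style GFF3 file in which unannotated CDS features carry `product=hypothetical protein` and the genome sequence follows a `##FASTA` marker.

```python
from urllib.parse import unquote


def hypothetical_cds_ids(gff_text):
    """Return the ID attribute of every CDS whose product is 'hypothetical protein'."""
    ids = []
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # assumption: the embedded genome sequence starts here
        if not line or line.startswith("#"):
            continue  # skip blank lines, pragmas and comments
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue  # GFF3 feature lines have exactly nine tab-separated columns
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        product = unquote(attrs.get("product", "")).strip().lower()
        if product == "hypothetical protein":
            ids.append(attrs.get("ID", ""))
    return ids


# Hypothetical two-gene annotation, built with explicit tabs for readability.
example = "\n".join([
    "##gff-version 3",
    "\t".join(["contig1", "Prodigal:002006", "CDS", "1", "300", ".", "+", "0",
               "ID=ABC_00001;product=hypothetical protein"]),
    "\t".join(["contig1", "Prodigal:002006", "CDS", "400", "900", ".", "-", "0",
               "ID=ABC_00002;product=DNA polymerase III"]),
    "##FASTA",
    ">contig1",
    "ATGC",
])

print(hypothetical_cds_ids(example))  # ['ABC_00001']
```

The returned IDs could then be used to pull the matching protein sequences from the annotation's faa file before the homology search.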
- main.nf : main script to be called by the user
- prokka_annotation.nf : process for de novo annotation of input fasta
- mmseqs2.nf : process for running additional annotation of input fasta using input query DB
- update_prokka.nf : process for extending the prokka annotation with annotations found during the MMseqs2 process
To use HyPro you only need Nextflow and Conda or Docker/Singularity installed.
After installing, for example, Nextflow and Conda, you can either clone this repository to run HyPro or just use Nextflow's pull functionality. We also provide a test genome. You can use different Nextflow configuration profiles to run the pipeline on different system settings and configurations (e.g. switching from Conda to Docker as the backend software packaging engine):
nextflow pull hoelzer-lab/hypro
nextflow run hoelzer-lab/hypro -r 0.0.4 -profile local,conda --fasta ~/.nextflow/assets/hoelzer-lab/hypro/test/data/GCF_000471025.2_ASM47102v2_genomic.fna
- -profile local,conda : using conda environments for prokka, MMseqs2 and mygene (see configs/conda.config)
- -profile local,docker : using docker containers for prokka, MMseqs2 and mygene (see configs/container.config)
When running HyPro multiple times, the tool first looks for an existing DB of type --database. If none is found, it downloads/builds the specified DB and stores it in nextflow-autodownload-databases/. Alternatively, you may give HyPro a path to an existing DB created earlier: simply pass it to the --customdb parameter. In this case, do not forget to specify the DB type that you use in --database!
nextflow run hoelzer-lab/hypro -r 0.0.4 -profile local,conda --fasta ~/.nextflow/assets/hoelzer-lab/hypro/test/data/GCF_000471025.2_ASM47102v2_genomic. --database uniprotkb --customdb some/path/to/uniprotkb
It is also possible to run HyPro on more than one input genome by using the --list parameter. Instead of passing a single fasta file to --fasta, you specify the path to a csv file with two columns per line (sample id, fasta file path) and set the --list flag to true.
nextflow run hoelzer-lab/hypro -r 0.0.4 -profile local,conda --fasta test/input.csv --list true --database uniprotkb --customdb some/path/to/uniprotkb
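A sample sheet for --list can be written by hand or generated. The following is a minimal sketch, not part of HyPro: the helper `sample_sheet` and the genome paths are hypothetical, and it assumes the csv has no header row and that the file name without its extension is an acceptable sample id.

```python
import csv
import io
from pathlib import PurePosixPath


def sample_sheet(fasta_paths):
    """Return csv text with one 'sample id,fasta path' row per input genome."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for path in fasta_paths:
        # Assumption: the file name without its last extension serves as sample id.
        sample_id = PurePosixPath(path).stem
        writer.writerow([sample_id, path])
    return buffer.getvalue()


# Hypothetical input genomes.
sheet = sample_sheet(["genomes/a.fna", "genomes/b.fasta"])
print(sheet, end="")
# a,genomes/a.fna
# b,genomes/b.fasta
```

Writing the returned text to a file such as test/input.csv gives the two-column layout that --fasta expects when --list is set to true.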
Parameter | Type | Description |
---|---|---|
--help | | Show help message and exit. |
--fasta | String | Path to the input genome fasta that shall be annotated. Input can also be a list of multiple fasta files formatted as a .csv file with two columns (sample id and file path) (see --list). |
--list | Boolean | Specify whether input is given as a list of files. Default: false |
--database | String | Specify the target DB to search for annotation extension. Currently available options: uniprotkb, uniref50, uniref90, uniref100, pdb. Note that searching the UniRef DBs will significantly extend the runtime of HyPro. Default: uniprotkb |
--custom-db | String | Specify a path to an existing DB. If no DB is found, HyPro will build it. Requires an according --database configuration. |
--output | String | Specify PATH to a directory. HyPro will generate the output structure to PATH. Default: results |
--modus | String | Choose the modus of HyPro: search all hypothetical proteins (full) or leave out those that gained partial annotation (restricted). The distinction between fully un-annotated and partially annotated hypothetical proteins was observed for UniProt annotations. Options: full (default), restricted |
--threads | Integer | Define the number of threads used by MMseqs2 search and convertalis. Default: 1 |
--prokka | String | Control parameters for prokka, e.g. if running HyPro on a bacterial genome that does not follow the standard code. |
Parameter | Type | Description |
---|---|---|
--evalue | Float | Include sequence matches with < e-value threshold into the profile. Requires a FLOAT >= 0.0. Default: 0.1 |
--min-aln-len | Integer | Specify the minimum alignment length as INT in range 0 to MAX aln length. Default: 0 |
--pident | Float | List only matches above this sequence identity for clustering. Enter a FLOAT between 0 and 1.0. Default: 0.0 |
HyPro loads all necessary data for the extension process automatically. It stores all needed information in the --output PATH.
For each input fasta it creates a folder with the following files and directories:
- prokka.tar.gz - stores the prokka annotation for your input genome
- mmseqs2_run_*/ - stores the MMseqs2 output and the extended prokka annotation. The folder includes one file and two subdirectories:
  - mmseqs2_outs/ - stores the mmseqs search results in tab-separated format (one is the MMseqs2 output while the other contains bit-score-filtered unique hits)
  - prokka_restored_updated/ - all extended files from prokka will be stored here (currently: gff, ffn, faa, gbk)
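To make the difference between the raw and the unique hit table concrete, here is a minimal sketch. It is an assumption about what "bit-score-filtered unique hits" means (one best-scoring hit per query), not HyPro's actual filter; the function name `best_hit_per_query` and the example hits are hypothetical. It relies on the default 12-column MMseqs2 convertalis layout, in which the query is the first column and the bit score the last.

```python
def best_hit_per_query(tsv_text):
    """Map each query id to its highest-scoring hit (a list of 12 column strings)."""
    best = {}
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        cols = line.split("\t")
        query, bits = cols[0], float(cols[11])  # column 12 = bit score
        # Keep the first hit seen for a query unless a later one scores higher.
        if query not in best or bits > float(best[query][11]):
            best[query] = cols
    return best


# Hypothetical hits: query, target, fident, alnlen, mismatch, gapopen,
# qstart, qend, tstart, tend, evalue, bits.
rows = [
    ["protA", "sp1", "0.90", "100", "10", "0", "1", "100", "1", "100", "1e-30", "150"],
    ["protA", "sp2", "0.95", "100", "5", "0", "1", "100", "1", "100", "1e-40", "210"],
    ["protB", "sp3", "0.50", "80", "40", "0", "1", "80", "1", "80", "1e-05", "80"],
]
tsv = "\n".join("\t".join(row) for row in rows)

unique = best_hit_per_query(tsv)
print(sorted((query, hit[1]) for query, hit in unique.items()))
# [('protA', 'sp2'), ('protB', 'sp3')]
```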
Note: HyPro will save the MMseqs2 outputs in blast-like format (tsv) with a unique name composed of the DB you used and the chosen alignment parameters. For example, mmseqs2_out_dbuniprotkb_e0.1_a0_p0.0.tsv and mmseqs2_out_dbuniprotkb_e0.1_a0_p0.0_unique.tsv will be stored in mmseqs2_run_dbuniprotkb_e0.1_a0_p0.0/mmseqs2_outs. These names denote the results of an MMseqs2 run on the UniProtKB DB with an e-value cut-off of 0.1, a minimum alignment length of 0 nt and a percent identity of 0 % (those are the default alignment parameters).
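The naming scheme can be reproduced with a short sketch, for example to locate the results of a particular run from a script. The helper `output_names` is hypothetical and only mirrors the pattern visible in the example names above; how HyPro formats other parameter values (e.g. very small e-values) is not covered here.

```python
def output_names(database="uniprotkb", evalue=0.1, min_aln_len=0, pident=0.0):
    """Compose run folder and tsv names following the pattern in the example above."""
    label = f"db{database}_e{evalue}_a{min_aln_len}_p{pident}"
    return {
        "run_dir": f"mmseqs2_run_{label}",
        "hits": f"mmseqs2_out_{label}.tsv",
        "unique_hits": f"mmseqs2_out_{label}_unique.tsv",
    }


names = output_names()  # default alignment parameters
print(names["run_dir"])      # mmseqs2_run_dbuniprotkb_e0.1_a0_p0.0
print(names["hits"])         # mmseqs2_out_dbuniprotkb_e0.1_a0_p0.0.tsv
print(names["unique_hits"])  # mmseqs2_out_dbuniprotkb_e0.1_a0_p0.0_unique.tsv
```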
A summary file containing the most relevant information on the latest HyPro run with the given MMseqs2 parameterization is also stored in --output PATH.
Additionally, the log files for each HyPro process and the Nextflow execution reports are saved in nextflow-run-infos/. For processes that need to be run with every input sample, the log files are structured into folders named after the input fasta.
Program/Package | Version | Note |
---|---|---|
python | 3.7 | Might also work for other python3 versions |
pandas | 0.25.2 | Might also work for other versions |
mygene | 3.1.0 | Automatically installed when running HyPro on a conda or docker engine. Might also work for other versions. |
mmseqs2 | 10.6d92c | Automatically installed when running HyPro on a conda or docker engine. DEPRECATED: Install in conda environment |
prokka (recommended) | 1.14.6 | Used for de novo annotation of test data of chlamydia. Automatically installed when running HyPro on a conda or docker engine. |