ECTyper is a standalone, versatile serotyping module for Escherichia coli. It supports both fasta (assembled) and fastq (raw reads) file formats.
The tool couples convenient species identification to a quality control module, producing a complete, transparent E. coli serotyping report suitable for reference laboratories.
- python >= 3.5
- bcftools >= 1.8
- blast == 2.7.1
- seqtk >= 1.2
- samtools >= 1.8
- bowtie2 >= 2.3.4.1
- mash >= 2.0
- biopython >= 1.70
- pandas >= 0.23.1
- requests >= 2.0
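Before installing, it can help to confirm that the external command-line tools listed above are reachable. The sketch below is only an illustration and is not part of ECTyper. The executable names are assumptions (for example, the BLAST package is assumed to provide `blastn`), and the sketch does not check versions.

```python
import shutil

# Assumed executable names for the non-Python dependencies listed above.
REQUIRED_TOOLS = ["bcftools", "blastn", "seqtk", "samtools", "bowtie2", "mash"]


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All external tools were found on PATH.")
```

The `which` parameter exists only so that the lookup can be replaced in tests.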
- If you do not have a conda environment, get and install miniconda or anaconda:
```
bash miniconda.sh -b -p $HOME/miniconda
echo ". $HOME/miniconda/etc/profile.d/conda.sh" >> ~/.bashrc
source ~/.bashrc
```
- Install the conda package from the bioconda channel:
conda install -c bioconda ectyper
The second option is to install from source.
- Install dependencies. On Ubuntu, run:
apt install samtools bowtie2 mash bcftools ncbi-blast+ seqtk
- Install Python dependencies via pip:
pip3 install pandas biopython
- Clone the repository and, optionally, check out a particular release (e.g. v1.0.0):
git clone https://github.com/phac-nml/ecoli_serotyping.git
cd ecoli_serotyping
git checkout v1.0.0 #optionally checkout release version
- Install ectyper:
python3 setup.py install
- Put the fasta/fastq files for serotyping analysis in one folder (concatenate paired raw read files if you would like them to be considered a single entity)
- Run the tool:
ectyper -i [file path] -o [output_dir]
- View the results on the console or in the output file:
cat [output folder]/output.tsv
- `ectyper -i ecoliA.fasta` for a single file
- `ectyper -i ecoliA.fasta -o output_dir` for a single file, results stored in `output_dir`
- `ectyper -i ecoliA.fasta,ecoliB.fastq,ecoliC.fna` for multiple files (comma-delimited)
- `ectyper -i ecoli_folder` for a folder (all files in the folder will be checked by the tool)
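The three input forms above (single file, comma-delimited list, folder) can be pictured with the following sketch. `expand_input` is a hypothetical helper written for illustration; it is not ECTyper's own code and may differ from how the tool actually resolves its `-i` argument.

```python
from pathlib import Path


def expand_input(spec):
    """Expand an `-i` style value into a list of file paths.

    Hypothetical helper: a comma splits the value into items, and any
    item that is an existing directory is replaced by the files in it.
    """
    files = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # ignore empty items such as a trailing comma
        path = Path(item)
        if path.is_dir():
            files.extend(sorted(p for p in path.iterdir() if p.is_file()))
        else:
            files.append(path)
    return files


for path in expand_input("ecoliA.fasta,ecoliB.fastq,ecoliC.fna"):
    print(path)
```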
usage: ectyper [-h] [-V] -i INPUT [-c CORES] [-opid PERCENTIDENTITYOTYPE]
[-hpid PERCENTIDENTITYHTYPE] [-oplen PERCENTLENGTHOTYPE]
[-hplen PERCENTLENGTHHTYPE] [--verify] [-o OUTPUT] [-r REFSEQ] [-s] [--debug]
[--dbpath DBPATH]
ectyper v1.0 database v1.0 Prediction of Escherichia coli serotype from raw reads or assembled
genome sequences. The default settings are recommended.
optional arguments:
-h, --help show this help message and exit
-V, --version show program's version number and exit
-i INPUT, --input INPUT
Location of E. coli genome file(s). Can be a single file, a comma-
separated list of files, or a directory
-c CORES, --cores CORES
The number of cores to run ectyper with
-opid PERCENTIDENTITYOTYPE, --percentIdentityOtype PERCENTIDENTITYOTYPE
Percent identity required for an O antigen allele match [default 90]
-hpid PERCENTIDENTITYHTYPE, --percentIdentityHtype PERCENTIDENTITYHTYPE
Percent identity required for an H antigen allele match [default 95]
-oplen PERCENTLENGTHOTYPE, --percentLengthOtype PERCENTLENGTHOTYPE
Percent length required for an O antigen allele match [default 95]
-hplen PERCENTLENGTHHTYPE, --percentLengthHtype PERCENTLENGTHHTYPE
Percent length required for an H antigen allele match [default 50]
--verify Enable E. coli species verification
-o OUTPUT, --output OUTPUT
Directory location of output files
-r REFSEQ, --refseq REFSEQ
Location of pre-computed MASH RefSeq sketch. If provided, genomes
identified as non-E. coli will have their species identified using MASH.
For best results the pre-sketched RefSeq archive
https://gembox.cbcb.umd.edu/mash/refseq.genomes.k21s1000.msh is
recommended
-s, --sequence Prints the allele sequences if enabled as the final columns of the
output
--debug Print more detailed log including debug messages
--dbpath DBPATH Path to a custom database of O and H antigen alleles in JSON format. Check
Data/ectyper_database.json for more information
ECTyper requires only minimal options to run (-i and -o) but allows extensive configuration to accommodate a wide variety of typing scenarios.
Parameter | Explanation | Usage scenario |
---|---|---|
-opid | Minimum %identity threshold for an O antigen match | Poor coverage of O antigen genes or exploratory work (recommended value is 90) |
-opcov | Minimum %coverage threshold for a valid match against reference O antigen alleles | Poor coverage of O antigen genes where the user wants an O antigen call regardless (recommended value is 95) |
-hpid | Minimum %identity threshold for an H antigen match | Poor coverage of H antigen genes or exploratory work (recommended value is 95) |
-hpcov | Minimum %coverage threshold for a valid match against reference H antigen alleles | Poor coverage of H antigen genes where the user wants an H antigen call regardless (recommended value is 95) |
--verify | Verify the species of the input and run the QC module, which reports on the reliability of the result and any typing issues | User is not sure whether the sample is E. coli and wants to know whether the serotype prediction is of sufficient quality for reporting purposes |
-r | Specify a custom MASH sketch of reference genomes to be used for species inference | User has a newly assembled genome that is not available in the NCBI RefSeq database. Make sure to add metadata to assembly_summary_refseq.txt and provide a custom accession number that starts with the GCF_ prefix |
--dbpath | Provide a custom, appended database of O and H antigen reference alleles in JSON format, following the structure and field names of the default database ectyper_alleles_db.json | User wants to add new alleles to the alleles database to improve typing performance |
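As an illustration of combining these options, the sketch below assembles an ectyper command line from Python without running it. The helper name `build_ectyper_command` and its keyword arguments are assumptions made for this example. Only flags that appear in the help text above (`-i`, `-o`, `-opid`, `-hpid`, `--verify`, `-c`) are used.

```python
import shlex


def build_ectyper_command(input_path, output_dir, opid=None, hpid=None,
                          verify=False, cores=None):
    """Return an ectyper argument list (hypothetical helper)."""
    cmd = ["ectyper", "-i", str(input_path), "-o", str(output_dir)]
    if opid is not None:
        cmd += ["-opid", str(opid)]   # O antigen %identity threshold
    if hpid is not None:
        cmd += ["-hpid", str(hpid)]   # H antigen %identity threshold
    if verify:
        cmd.append("--verify")        # species verification and QC
    if cores is not None:
        cmd += ["-c", str(cores)]
    return cmd


cmd = build_ectyper_command("ecoli_folder", "output_dir",
                            opid=90, hpid=95, verify=True)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in cmd))
```

The returned list can be passed to `subprocess.run` once ectyper is installed.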
To make the results and typing metrics easier to interpret, the following QC codes were developed.
These codes make it possible to quickly separate "reportable" from "non-reportable" samples. The QC module is tightly linked to the ECTyper allele database, specifically to the MinPident and MinPcov fields.
For each reference allele, minimum %identity and %coverage values were determined as a function of potential "cross-talk" between antigens (i.e. multiple potential antigen calls at a given setting).
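The per-allele threshold idea can be sketched as follows. This is a simplified illustration, not ECTyper's QC code: the dictionary layout and the numeric values are made up, and requiring both thresholds to be met at once is an assumption of this example. Only the field names MinPident and MinPcov come from the text above.

```python
def meets_allele_thresholds(allele, pident, pcov):
    """Check a hit against one allele's minimum %identity and %coverage.

    `allele` is a hypothetical dict holding the MinPident and MinPcov
    fields; `pident` and `pcov` are the observed values for the hit.
    """
    return pident >= allele["MinPident"] and pcov >= allele["MinPcov"]


# Made-up threshold values, for illustration only.
example_allele = {"MinPident": 95.0, "MinPcov": 90.0}

print(meets_allele_thresholds(example_allele, pident=99.9, pcov=100.0))
print(meets_allele_thresholds(example_allele, pident=92.0, pcov=100.0))
```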
The QC module covers the following serotyping scenarios. More scenarios might be added in future versions depending on user needs.
QC flag | Explanation |
---|---|
PASS (REPORTABLE) | Both O and H antigen alleles meet min %identity or %coverage thresholds (ensuring no antigen cross-talk) and a single antigen is predicted for each of O and H |
FAIL (-:- TYPING) | Sample is E. coli and O and H antigens are not typed. Serotype: -:- |
WARNING MIXED O-TYPE | A mixed O antigen call is predicted, requiring wet-lab confirmation |
WARNING (WRONG SPECIES) | A sample is non-E. coli (e.g. E. albertii, Shigella, etc.) based on RefSeq assemblies |
WARNING (-:H TYPING) | A sample is E. coli and O antigen is not predicted (e.g. -:H18) |
WARNING (O:- TYPING) | A sample is E. coli and H antigen is not predicted (e.g. O17:-) |
WARNING (O NON-REPORT) | O antigen alleles do not meet min %identity or %coverage thresholds |
WARNING (H NON-REPORT) | H antigen alleles do not meet min %identity or %coverage thresholds |
WARNING (O and H NON-REPORT) | Neither O nor H antigen alleles meet min %identity or %coverage thresholds |
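One way to use the QC flags above is to split a results table into reportable and flagged samples. The sketch below is a minimal example under stated assumptions: it expects tab-delimited text with Name and QC columns, treats any QC value starting with PASS as reportable, and uses made-up sample rows.

```python
import csv
import io


def split_by_qc(tsv_text):
    """Return (reportable, flagged) sample names from tab-delimited text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    reportable, flagged = [], []
    for row in reader:
        # Assumption: only flags beginning with "PASS" are reportable.
        target = reportable if row["QC"].startswith("PASS") else flagged
        target.append(row["Name"])
    return reportable, flagged


# Made-up rows for illustration; a real run would read the output file.
example = (
    "Name\tSerotype\tQC\n"
    "sampleA\tO157:H7\tPASS (REPORTABLE)\n"
    "sampleB\t-:H18\tWARNING (-:H TYPING)\n"
)
reportable, flagged = split_by_qc(example)
print("Reportable:", reportable)
print("Flagged:", flagged)
```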
ECTyper produces a concise output designed for easy interpretation and reporting. ECTyper v1.0 serotyping results are available in a tab-delimited output.tsv file consisting of the 16 columns listed below:
- Name: Sample name (usually a unique identifier)
- Species: the species column provides valuable species identification information in case of inadvertent sample contamination or mislabelling events
- O-type: O antigen
- H-type: H antigen
- Serotype: Predicted O and H antigen(s)
- QC: The Quality Control value summarizing the overall quality of prediction
- Evidence: Total number of alleles used to call both the O and H antigens
- GeneScores: ECTyper O and H antigen gene scores in 0 to 1 range
- AllelesKeys: Best matching ECTyper database allele keys used to call the serotype
- GeneIdentities(%): %identity values of the query alleles
- GeneCoverages(%): %coverage values of the query alleles
- GeneContigNames: the contig names where the query alleles were found
- GeneRanges: genomic coordinates of the query alleles
- GeneLengths: allele lengths of the query alleles
- Database: database release version and date
- Warnings: any additional warnings linked to the quality control status or any other error message(s).
Selected columns from a typical ECTyper report are shown below.
Name | Species | Serotype | Evidence | QC | GeneScores | AlleleKeys | GeneIdentities(%) | GeneCoverages(%) | GeneContigNames | GeneRanges | GeneLengths | Database | Warnings |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15-520 | Escherichia coli | O174:H21 | Based on 3 allele(s) | PASS (REPORTABLE) | wzx:1; wzy:1; fliC:1; | O104-5-wzx-origin;O104-13-wzy;H7-6-fliC-origin; | 100;100;100; | 100;100;100; | contig00049;contig00001;contig00019; | 22302-23492;178-1290;6507-8264; | 1191;1113;1758; | v1.0 (2020-05-07) | - |
EC20151709 | Escherichia coli | O157:H43 | Based on 3 allele(s) | PASS (REPORTABLE) | wzx:1;wzy:0.999;fliC:1 | O157-5-wzx-origin;O157-9-wzy-origin;H43-1-fliC-origin; | 100;99.916;99.934; | 100;100;100; | contig00002;contig00002;contig00003; | 62558-63949;64651-65835;59962-61467; | 1392;1185;1506; | v1.0 (2020-05-07) | - |
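Several report columns pack one value per gene into a semicolon-delimited string. The sketch below shows one way to unpack them into a per-gene summary, using values from the EC20151709 row above. It assumes that the items in GeneScores, GeneIdentities(%) and GeneCoverages(%) appear in the same gene order; the helper names are hypothetical.

```python
def split_field(value):
    """Split a semicolon-delimited column value, dropping empty items."""
    return [item.strip() for item in value.split(";") if item.strip()]


def per_gene_summary(row):
    """Combine score, %identity and %coverage for each gene in a row.

    Assumes the three columns list their values in the same gene order.
    """
    pairs = [item.split(":") for item in split_field(row["GeneScores"])]
    identities = split_field(row["GeneIdentities(%)"])
    coverages = split_field(row["GeneCoverages(%)"])
    return {
        gene: {"score": float(score),
               "identity": float(identity),
               "coverage": float(coverage)}
        for (gene, score), identity, coverage in zip(pairs, identities, coverages)
    }


row = {
    "GeneScores": "wzx:1;wzy:0.999;fliC:1",
    "GeneIdentities(%)": "100;99.916;99.934;",
    "GeneCoverages(%)": "100;100;100;",
}
summary = per_gene_summary(row)
for gene in ("wzx", "wzy", "fliC"):
    print(gene, summary[gene])
```

If a gene name occurs more than once in a row, later entries would overwrite earlier ones in this sketch.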
Resource | Description | Type |
---|---|---|
PyPI | PyPI package that can be installed via the pip utility | Terminal |
Conda | Conda package available from BioConda channel | Terminal |
Docker | Images containing completely initialized ECTyper with all dependencies | Terminal |
Singularity | Images containing completely initialized ECTyper with all dependencies | Terminal |
GitHub | Install source code as any Python package | Terminal |
Galaxy ToolShed | Galaxy wrapper available for installation on a private/public instance | Web-based |
Galaxy Europe | Galaxy public server to execute your analysis from anywhere | Web-based |
IRIDA plugin | IRIDA instances can easily install it as an additional pipeline | Web-based |