AgrVATE is a tool for rapid identification of the Staphylococcus aureus agr locus type. It also reports possible variants in the agr operon.
AgrVATE accepts an S. aureus genome assembly as input and performs a kmer search using an Agr-group specific kmer database to assign the Agr-group. The agr operon is then extracted using in-silico PCR and variants are called using an Agr-group specific reference operon.
Please cite the following paper if you use AgrVATE in your research. Thank you!
Raghuram V, Alexander AM, Loo HQ, Petit RA 3rd, Goldberg JB, Read TD. Species-Wide Phylogenomics of the Staphylococcus aureus Agr Operon Revealed Convergent Evolution of Frameshift Mutations. Microbiol Spectr. 2022 Jan 19;10(1):e0133421. doi: 10.1128/spectrum.01334-21. Epub ahead of print. PMID: 35044202; PMCID: PMC8768832.
Please see the PREREQUISITES section for all the tools required to run AgrVATE. For ease of use, I recommend installing AgrVATE using Conda.
conda create -n agrvate -c bioconda agrvate
conda activate agrvate
This will install all necessary dependencies EXCEPT Usearch. Due to Usearch's license, it cannot be provided with the conda installation. Please download and extract usearch11.0.667 (osx32 or linux32) from the USEARCH download site (www.drive5.com) and add it to your PATH.
For example (Use the version appropriate for your operating system):
curl "https://www.drive5.com/downloads/usearch11.0.667_i86linux32.gz" --output usearch11.0.667_i86linux32.gz #Downloads usearch binary
gunzip usearch11.0.667_i86linux32.gz #Decompresses usearch binary
chmod 755 usearch11.0.667_i86linux32 #Changes permissions to executable
cp ./usearch11.0.667_i86linux32 $(dirname "$(which agrvate)") #Copies usearch binary to the same directory as agrvate
NOTE: Currently, only the 32-bit version of usearch is free to use. This version is not supported by WSL or macOS (post-Catalina). Therefore, it is recommended to use AgrVATE on Linux machines or older versions of macOS. If you are unable to run usearch, use the -m option to run MUMmer instead (IN BETA). However, please note that if there are large insertions/deletions in the agr operon, MUMmer can split the alignment in two, so the extracted agr operon will not be intact. In that case, frameshift detection using Snippy may miss these indels.
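The fallback described in the note can be scripted. The following is a minimal sketch, not part of AgrVATE: the helper name `agrvate_search_flag` is hypothetical, and the default binary name is an assumption taken from the download example above. It only checks whether the binary is on your PATH, not whether a 32-bit binary can actually execute on your system.

```shell
# Hypothetical helper (not part of AgrVATE): print "-m" when the usearch
# binary is not found on PATH, so that agrvate can fall back to MUMmer.
# The default binary name is an assumption based on the download example.
agrvate_search_flag() {
    if command -v "${1:-usearch11.0.667_i86linux32}" >/dev/null 2>&1; then
        printf ''            # usearch found: no extra flag needed
    else
        printf '%s\n' '-m'   # usearch missing: use MUMmer dnadiff (beta)
    fi
}

# Example, left commented so the sketch can be sourced without running agrvate:
# agrvate -i filename.fasta $(agrvate_search_flag)
```

If you are on macOS, pass the name of the osx32 binary as the first argument instead of relying on the default.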
- Usearch 32 bit linux
  Robert C. Edgar, Search and clustering orders of magnitude faster than BLAST, Bioinformatics, Volume 26, Issue 19, 1 October 2010, Pages 2460–2461, https://doi.org/10.1093/bioinformatics/btq461
- NCBI blast+
  Camacho, C., Coulouris, G., Avagyan, V. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009). https://doi.org/10.1186/1471-2105-10-421
- Snippy
  Seemann T (2015). Snippy: fast bacterial variant calling from NGS reads. https://github.com/tseemann/snippy
- MUMmer
  S. Kurtz. et al (2004). Versatile and open software for comparing large genomes. Genome Biology, R12. https://doi.org/10.1186/gb-2004-5-2-r12
- HMMER
  S.R. Eddy. Biological sequence analysis using profile hidden Markov models. http://hmmer.org/
- SeqKit
  Shen W, Le S, Li Y, Hu F (2016) SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS ONE 11(10): e0163962. https://doi.org/10.1371/journal.pone.0163962
- Databases folder for agr group typing and variant calling
  - DREME
    DREME is not required for AgrVATE but it was used to build the kmer database for Agr-group typing (gp1234_motifs_all.fasta)
    Timothy L. Bailey, DREME: motif discovery in transcription factor ChIP-seq data, Bioinformatics, Volume 27, Issue 12, 15 June 2011, Pages 1653–1659, https://doi.org/10.1093/bioinformatics/btr261

      agrvate_databases/
      ├── agrD_hmm.hmm
      ├── agrD_hmm.hmm.h3f
      ├── agrD_hmm.hmm.h3i
      ├── agrD_hmm.hmm.h3m
      ├── agrD_hmm.hmm.h3p
      ├── agr_operon_primers.fa
      ├── gp1234_motifs_all.fasta
      └── references
          ├── gp1-operon_ref.gbk
          ├── gp2-operon_ref.gbk
          ├── gp3-operon_ref.gbk
          └── gp4-operon_ref.gbk
          └── mummer_ref_operon.fna
agrvate -i filename.fasta [options]
- FLAGS:
  - -i: Input S. aureus genome in FASTA format [alternate: --input]
  - -t: Does agr typing only (skips agr operon extraction and frameshift detection) [alternate: --typing-only]
  - -m: Uses MUMmer dnadiff instead of usearch [alternate: --mummer]
  - -f: Force overwrite existing results directory [alternate: --force]
  - -d: Path to agrvate_databases (Not required if installed using Conda) [alternate: --databases]
  - -h: Print this help message and exit [alternate: --help]
  - -v: Print version and exit [alternate: --version]
AgrVATE supports a single FASTA file as input, but the file can be a multi-fasta file. To run multiple S. aureus genomes, it is recommended to keep them as separate files in a common directory.
For example:
ls fasta_files/* | xargs -I {} agrvate -i {} [options]
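As an alternative to piping `ls` into `xargs`, a plain loop avoids parsing `ls` output and copes with unusual file names. This is only a sketch under stated assumptions: `run_agrvate_batch` and the `AGRVATE_CMD` override are hypothetical names introduced here for illustration, not AgrVATE features.

```shell
# Hypothetical wrapper (not part of AgrVATE): run agrvate once per file in a
# directory. AGRVATE_CMD is an assumed override, useful for dry runs
# (for example AGRVATE_CMD=echo); it defaults to "agrvate".
run_agrvate_batch() {
    batch_dir="$1"
    shift                                # remaining arguments are passed to agrvate
    for fasta in "$batch_dir"/*; do
        [ -f "$fasta" ] || continue      # skip subdirectories and an unmatched glob
        ${AGRVATE_CMD:-agrvate} -i "$fasta" "$@" || echo "failed: $fasta" >&2
    done
}

# Example, left commented so the sketch can be sourced without running agrvate:
# run_agrvate_batch fasta_files -f
```

Because a failure on one genome is only reported on standard error, the loop continues with the remaining files.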
A new directory with suffix -results will be created where all the following files can be found.
NOTE: There are 15 possible kmers for each agr group per genome. The analyses will continue even if only one kmer matches a given agr-group, but it should be noted that < 5 kmers matching leads to a low confidence agr-group call. Col 3 in fasta-summary.tab shows the number of kmers matched.
- fasta-summary.tab:
  - col 1: Filename
  - col 2: Agr group (gp1/gp2/gp3/gp4). 'u' means unknown. If multiple agr groups were found (col 5 = m), the displayed agr group is the majority/highest confidence.
  - col 3: Match score for agr group (maximum 15; 0 means untypeable; < 5 means low confidence)
  - col 4: Canonical or non-canonical agrD (1 means canonical; 0 means non-canonical; u means unknown)
  - col 5: If multiple agr groups were found, likely due to multiple S. aureus isolates in sequence (s means single, m means multiple, u means unknown)
  - col 6: Number of frameshifts found in CDS of extracted agr operon (Column is 'u' if agr operon was not extracted)
If multiple assemblies are run, use this command from the parent directory to output a consolidated summary table for all samples:
awk 'FNR==1 && NR!=1 { while (/^#/) getline; } 1 {print}' ./*-results/*-summary.tab > filename.tab
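Once a summary table exists, the low confidence rule (col 3 < 5) can be applied with a short filter. This is a minimal sketch, not part of AgrVATE: `low_confidence_calls` is a hypothetical helper, and it assumes the table is tab-separated with header lines starting with `#`, as the consolidation command suggests.

```shell
# Hypothetical helper (not part of AgrVATE): list samples whose agr-group
# call is low confidence or untypeable.
# Assumes a tab-separated summary table whose header line(s) start with '#'.
low_confidence_calls() {
    awk -F'\t' '
        /^#/ { next }                    # skip header lines
        $3 ~ /^[0-9]+$/ && $3 < 5 {      # col 3: kmer match score (maximum 15)
            print $1 "\t" $2 "\t" $3     # filename, agr group, match score
        }
    ' "$1"
}

# Example:
# low_confidence_calls filename.tab
```

Samples with a score of 0 (untypeable) are listed as well, since they also fall below the threshold.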
- fasta-agr_gp.tab:
  - col 1: Assembly Contig ID
  - col 2: ID of matched agr group kmer
  - col 3: evalue
  - col 4: Percentage identity of match
  - col 5: Start position of kmer alignment on input sequence
  - col 6: End position of kmer alignment on input sequence
- fasta-agr_operon_frameshifts.tab:
  Frameshift mutations in CDS of extracted agr operon detected by Snippy. An agr-group specific reference sequence is used to call variants.
  - col 1: Filename
  - col 2: Position on agr operon compared to reference
  - col 3: Type of frameshift
  - col 4: Effect of mutation
  - col 5: Gene
- fasta-blastn-log.txt:
  Standard output of ncbi blastn
- fasta-agr_operon.fna:
  Agr operon extracted from in-silico PCR using USEARCH -SEARCH_PCR in fasta format
- fasta-hmm.tab:
  Tabular output of nhmmer. This file is present only if the agr group is untypeable.
- fasta-hmm-log.txt:
  Standard output of nhmmer. This file is present only if the agr group is untypeable.
- fasta-pcr-log.tab:
  Standard output of USEARCH -SEARCH_PCR
- fasta-snippy_log.txt:
  Standard output of Snippy
- fasta-snippy/
  All output files of Snippy
- fasta-mummer_log.txt:
  Standard output of MUMmer dnadiff
- fasta-mummer/
  All output files of MUMmer dnadiff
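When many genomes have been processed, the per-sample frameshift tables (fasta-agr_operon_frameshifts.tab, col 5 = gene) can be tallied per gene. The following is a sketch only: `frameshifts_per_gene` is a hypothetical helper, and it assumes the tables are tab-separated with header lines starting with `#`. If your header does not start with `#`, its fifth column will be counted as a gene.

```shell
# Hypothetical helper (not part of AgrVATE): count frameshifts per gene
# across one or more frameshift tables.
# Assumes tab-separated files whose header line(s) start with '#'.
frameshifts_per_gene() {
    awk -F'\t' '
        /^#/ || NF < 5 { next }          # skip headers and short lines
        { count[$5]++ }                  # col 5: gene
        END { for (gene in count) print gene "\t" count[gene] }
    ' "$@" | sort
}

# Example, run from the parent directory of the -results folders:
# frameshifts_per_gene ./*-results/*-agr_operon_frameshifts.tab
```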
An error report summary file with suffix -error-report.tab will be created in the working directory.
The error report file does not contain any results. It merely shows which steps of the pipeline ran (pass) and which steps did not (fail).
- pass: Does not necessarily mean a result was obtained; it only means the step completed successfully.
- fail: Does not necessarily mean there was an error; it only means that step was not performed. However, possible causes of error for each column are mentioned below.

The columns are ordered by how the processes are carried out, i.e., col 1 is the first step and col 7 is the last. If one column shows fail, the programme exited at that step and therefore the remaining columns will also show fail.
- error-report.tab:
  - col 1: Input name - the argument supplied to the -i flag
  - col 2: Input check - If fail, the input did not pass the valid fasta file criteria
  - col 3: Databases check - If fail, the databases folder or the path to the databases was not valid.
  - col 4: Outdir check - If fail, the results directory already exists and couldn't be overwritten. Use flag -f or --force.
  - col 5: Agr typing - If fail, the Agr typing kmer search could not be performed. Check if blastn is installed correctly.
  - col 6: Operon check - If fail, in-silico PCR was not performed by usearch or agr operon search was not performed by mummer. Check if usearch/mummer is installed correctly.
  - col 7: Snippy check - If fail, agr operon frameshift detection was not performed. Check if snippy is installed correctly.
If multiple assemblies are run, use this command from the parent directory to output a consolidated report table for all samples:
awk 'FNR==1 && NR!=1 { while (/^#/) getline; } 1 {print}' ./*-error-report.tab > filename.tab
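Because every column after the first fail also shows fail, the first failing column identifies the step where a run stopped. The sketch below reports that step for each sample. It is not part of AgrVATE: `first_failed_step` is a hypothetical helper, and it assumes a tab-separated report whose header lines start with `#` and whose cells contain exactly pass or fail.

```shell
# Hypothetical helper (not part of AgrVATE): print each input name together
# with the first step that shows "fail" in an error report table.
# Assumes a tab-separated file whose header line(s) start with '#'.
first_failed_step() {
    awk -F'\t' '
        BEGIN {
            # Step names follow the column descriptions for error-report.tab
            step[2] = "Input check";  step[3] = "Databases check"
            step[4] = "Outdir check"; step[5] = "Agr typing"
            step[6] = "Operon check"; step[7] = "Snippy check"
        }
        /^#/ { next }                    # skip header lines
        {
            for (i = 2; i <= 7; i++) {
                if ($i == "fail") { print $1 "\t" step[i]; next }
            }
        }
    ' "$1"
}

# Example:
# first_failed_step filename.tab
```

Samples that passed every step produce no output. Keep in mind that fail only means a step was not performed, so a reported step is where to start looking rather than proof of an error.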
- Vishnu Raghuram