- BacQuerya
- BacQuerya-processing
- Installation
- Snakemake pipeline
- Local instances of BacQuerya
- Specifying elastic parameters and secrets
- Contributors
BacQuerya is a search engine that aims to consolidate and present all publicly available genomic metadata for bacterial pathogens. BacQuerya is currently in beta, so it may be unstable in some circumstances and houses only S. pneumoniae genomic metadata at this time.
The BacQuerya-processing pipeline sources genomic metadata by accession ID from public repositories including NCBI GenBank, BioSample, the European Nucleotide Archive and the Sequence Read Archive. Metadata from each of these sources is extracted and combined into a JSON document that is then indexed using Elastic cloud (https://www.elastic.co) and is searchable from the BacQuerya website (https://github.com/bacpop/BacQuerya).
Scripts in this repository may be used either with the included snakemake pipeline or individually using the helper scripts.
To install BacQuerya-processing from source, run:
git clone https://github.com/bacpop/BacQuerya-processing.git
conda create -n snakemake --file=environment.yml
conda activate snakemake
conda install snakemake
Parameters for the automated Snakemake pipeline can be adjusted by modifying the config.yml file, or from the command line. An example command to run the retrieve_ena_read_metadata rule with 1 Snakemake core and n_cpu set to 7 threads would be:
snakemake --cores 1 --config n_cpu=7 retrieve_ena_read_metadata
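When the pipeline is launched from a script, the same kind of command can be assembled programmatically. The sketch below is an illustration only: build_snakemake_command is a hypothetical helper, not part of BacQuerya-processing. It places the rule name before --config, a cautious choice made for this example because --config accepts several key=value entries.

```python
def build_snakemake_command(rule, cores=1, config=None):
    """Return the snakemake argument list for running a single rule."""
    # The rule (target) goes directly after --cores so that it cannot be
    # read as one of the key=value entries that follow --config.
    command = ["snakemake", "--cores", str(cores), rule]
    if config:
        # --config takes key=value pairs that override values in config.yml
        command.append("--config")
        command.extend(f"{key}={value}" for key, value in config.items())
    return command


if __name__ == "__main__":
    cmd = build_snakemake_command(
        "retrieve_ena_read_metadata", cores=1, config={"n_cpu": 7}
    )
    print(" ".join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) to start the run.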
Retrieves information of interest from NCBI GenBank by accession ID.
Inputs:
- accession_file: Filepath of a "\n" separated list of BioSample or assembly accession IDs for assemblies available through NCBI GenBank or RefSeq.
- attribute: Retrieve assembly sequences, functional annotations or assembly statistics (specified as genomes, annotation or assembly-stats).
- email: An email address to specify for Entrez programmatic access.
- output: Output directory name for retrieved attributes.
- threads: Number of threads for retrieval. Entrez allows up to 3 queries/second without an API key and 10 queries/second with an API key. To specify an API key, see Local instances of BacQuerya.
Equivalent shell command:
python extract_entrez_information-runner.py -s <accession_file> -e <email> --threads <threads> -o <output> -a <attribute>
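The Entrez limits quoted above translate into a minimum delay between requests. The following is a minimal sketch of that arithmetic, not the code used by extract_entrez_information-runner.py; the names entrez_min_interval and Throttle are hypothetical.

```python
import threading
import time


def entrez_min_interval(api_key=None):
    """Minimum seconds between Entrez requests: 3/s without a key, 10/s with one."""
    queries_per_second = 10 if api_key else 3
    return 1.0 / queries_per_second


class Throttle:
    """Spaces successive wait() calls at least `interval` seconds apart."""

    def __init__(self, interval):
        self.interval = interval
        self._last = None
        # The lock makes the throttle safe to share between worker threads.
        self._lock = threading.Lock()

    def wait(self):
        with self._lock:
            now = time.monotonic()
            if self._last is not None:
                remaining = self.interval - (now - self._last)
                if remaining > 0:
                    time.sleep(remaining)
            self._last = time.monotonic()
```

Each worker thread would call wait() on one shared Throttle before issuing a request, so the combined request rate stays under the limit however many threads are used.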
Predicts genes in all assemblies in an input directory and outputs predicted annotations in GFF3 format.
Inputs:
- genome_dir: Directory containing assemblies.
- output: Output directory name for predicted annotations.
- threads: Number of threads for prediction.
Converts publicly available functional annotations in GFF3 format to Prokka format for direct input into Panaroo.
Inputs:
- assembly_directory: Directory of uncompressed assembly sequences.
- index_file: Filepath of a JSON file storing index numbers (of the form {"isolateIndexNo": 0, "geneIndexNo": 0, "predictedIndexNo": 0}).
- output: Output directory name for reformatted annotations.
- threads: Number of threads for reformatting.
- annotation_directory: Directory of uncompressed functional annotations in GFF3 format. If specified, existing annotations are added to the reformatted annotations.
- prodigal_directory: Directory of functional annotation files output by prodigal. If specified, prodigal-predicted annotations are added to the reformatted annotations. If annotation_directory is also specified, predicted annotations supplement the existing annotations.
Equivalent shell command:
python panaroo_clean_inputs-runner.py -a <annotation_directory> -g <assembly_directory> -p <prodigal_directory> --index-file <index_file> -o <output> --threads <threads>
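The index file holds three counters in the form given above. As a sketch, a fresh file could be created and checked as follows. The helper names are hypothetical, and starting every counter at 0 for a first run is an assumption based on the example values shown.

```python
import json
from pathlib import Path

# Keys taken from the index file format described in the inputs above.
INDEX_KEYS = ("isolateIndexNo", "geneIndexNo", "predictedIndexNo")


def write_initial_index(path):
    """Write an index file with every counter at 0 (assumed starting point)."""
    index = {key: 0 for key in INDEX_KEYS}
    Path(path).write_text(json.dumps(index))
    return index


def read_index(path):
    """Load an index file and check that the three expected keys are present."""
    index = json.loads(Path(path).read_text())
    missing = [key for key in INDEX_KEYS if key not in index]
    if missing:
        raise KeyError(f"index file is missing keys: {missing}")
    return index
```

The path written by write_initial_index can then be supplied as <index_file> in the commands that take an --index-file or -i argument.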
Runs panaroo (https://github.com/gtonkinhill/panaroo) on prokka-formatted functional annotation files.
Inputs:
- input_directory: Directory of Prokka-formatted functional annotation files.
- output: Directory for the panaroo outputs.
- threads: Number of threads for running panaroo.
Equivalent shell command:
panaroo -i <input_directory>/*.gff -o <output> --clean-mode sensitive -t <threads>
Generates a JSON file of metadata for isolates with assemblies.
Inputs:
- assembly_stats_directory: Directory of uncompressed assembly statistics downloaded from NCBI GenBank.
- assembly_directory: Directory of uncompressed assemblies downloaded from NCBI GenBank.
- index_file: Filepath of a JSON file storing index numbers (of the form {"isolateIndexNo": 0, "geneIndexNo": 0, "predictedIndexNo": 0}).
- output: Output directory for isolate metadata files.
- email: An email address to specify for Entrez programmatic access.
- threads: Number of threads for converting assembly statistics to JSON.
- previous_run: Directory storing previous snakemake outputs.
Outputs:
- isolateAssemblyAttributes.json: JSON file of isolate assembly metadata.
- biosampleIsolatePairs.json: JSON file of isolate name keys and BioSample accession ID values.
- indexIsolatePairs.json: JSON file of isolate name keys and isolate index number values.
Equivalent shell command:
python extract_assembly_stats-runner.py -a <assembly_stats_directory> -g <assembly_directory> -i <index_file> -o <output> -e <email> --previous-run <previous_run> --threads <threads>
Generates a JSON file of metadata for isolates with reads.
Inputs:
- accession_file: Filepath of a "\n" separated list of ERR or ERS accession IDs for assemblies available through the ENA.
- index_file: Filepath of a JSON file storing index numbers (of the form {"isolateIndexNo": 0, "geneIndexNo": 0, "predictedIndexNo": 0}).
- output: Output directory for isolate metadata files.
- email: An email address to specify for Entrez programmatic access.
- threads: Number of threads for converting assembly statistics to JSON.
- previous_run: Directory storing previous snakemake outputs.
Outputs:
- fastq_links.txt: "\n" separated list of read sequence download URLs.
- isolateReadAttributes.json: JSON file of isolate metadata.
Equivalent shell command:
python extract_read_metadata-runner.py -s <accession_file> -r ena -i <index_file> -o <output> -e <email> --previous-run <previous_run> --threads <threads>
Multithreaded download of read sets.
Inputs:
- fastq_links: Filepath of a "\n" separated list of read sequence download URLs.
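A multithreaded download of a "\n" separated URL list can be sketched with a thread pool. This is an illustration under assumptions, not the pipeline's own implementation: the function names, the default of 4 threads and the use of urllib are choices made for this example, and each URL is assumed to include its scheme (for example http:// or ftp://).

```python
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def read_links(links_file):
    """Return the non-empty lines of a newline-separated list of URLs."""
    with open(links_file) as handle:
        return [line.strip() for line in handle if line.strip()]


def fetch(url, output_dir):
    """Download one URL into output_dir and return the local file path."""
    target = os.path.join(output_dir, os.path.basename(urlparse(url).path))
    urllib.request.urlretrieve(url, target)
    return target


def download_all(links_file, output_dir, threads=4, fetch_one=fetch):
    """Download every URL listed in links_file using a pool of worker threads."""
    os.makedirs(output_dir, exist_ok=True)
    urls = read_links(links_file)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() returns results in the same order as the input URLs
        return list(pool.map(lambda url: fetch_one(url, output_dir), urls))
```

Threads suit this task because the work is dominated by waiting on the network rather than by computation. The fetch_one argument exists so that the download step can be swapped out, for instance for a retrying variant.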
Extracts gene metadata for assemblies from a panaroo output and adds genes per isolate to a JSON file of isolate metadata.
Inputs:
- assembly_directory: Directory of uncompressed assemblies downloaded from NCBI GenBank.
- annotation_directory: Directory of uncompressed functional annotations in GFF3 format.
- graph_directory: Directory of a panaroo graph constructed from the annotations in the annotation_directory.
- isolate_metadata: Directory of isolate metadata output by extract_assembly_stats.py.
- index_metadata: Whether to index gene metadata directly in the elastic index from this script (True or False).
- output: Output directory for gene metadata files.
- threads: Number of threads to annotate "query isolates".
- index_name: Name of elastic gene metadata index.
- run_type: Whether this is a "reference" or "query" run (see reference vs query runs).
- update: Update the input panaroo outputs with the new gene names and annotations.
Equivalent shell command:
python extract_genes-runner.py -s <assembly_directory> -a <annotation_directory> -g <graph_directory> -m <isolate_metadata> -i <index_metadata> -o <output> --threads <threads> --index-name <index_name> --prev-dir previous_run --run-type <run_type> [--update]
Generates alignments for the genes extracted by the extract_genes rule, using the updated panaroo graph.
Inputs:
- graph_directory: Directory of a panaroo graph constructed from the annotations in the annotation_directory and updated by the extract_genes rule.
- gene_metadata: Directory output by the extract_genes rule. Requires panarooPairs.json to ensure gene names are consistent across outputs.
- output: Output directory for the aligned genes.
- threads: Number of threads for alignment.
Equivalent shell command:
python generate_alignments-runner.py --graph-dir <graph_directory> --extracted-genes <gene_metadata> --output-dir <output> --threads <threads>
Indexes isolate assembly and read metadata in a searchable elastic index.
Inputs:
- isolate_asssembly_metadata: Filepath of the JSON file of isolate assembly metadata.
- isolate_read_metadata: Directory of isolate read metadata output by the retrieve_ena_read_metadata rule.
- index_name: Name of elastic isolate metadata index.
- gene_metadata: Directory output by the extract_genes rule. Required to ensure that the genes each isolate contains have been added to the isolate assembly metadata.
Equivalent shell command:
python index_isolate_attributes-runner.py -f <isolate_asssembly_metadata> -e <isolate_read_metadata> -i <index_name> -g <gene_metadata>
Indexes gene metadata in a searchable elastic index.
Inputs:
- gene_metadata: Directory containing extracted gene metadata. If using the snakemake pipeline, this will be previous_run/extracted_genes, as the metadata for current and previous snakemake runs will have been merged.
- graph_directory: Directory of a panaroo graph constructed from the annotations in the annotation_directory and updated by the extract_genes rule. If using the snakemake pipeline, this will be previous_run/panaroo_output, as the metadata for current and previous snakemake runs will have been merged.
- output: Output directory for the constructed COBS index.
- kmer_length: K-mer length at which to construct the COBS index (default=31).
- threads: Number of threads to write fasta files for COBS input.
- index_name: Name of elastic isolate metadata index.
- elastic-index: Index genes in elastic gene index. If excluded, only a COBS gene index is constructed.
Equivalent shell command:
python index_gene_features-runner.py -t gene -i <gene_metadata> -g <graph_directory> -o <output> --kmer-length <kmer_length> --threads <threads> --index <index_name> [--elastic-index]
Users can run the BacQuerya Snakemake pipeline as either a reference or a query run by setting run_type:
snakemake --cores 1 --config run_type=<run_type>
This option determines whether the genomic information in the current run is used to update the existing panaroo output of a previous reference run. If reference is specified, panaroo is run on the isolates in the current run and this output is merged with the previous output to become the new reference panaroo output. If query is selected, Prokka-formatted functional annotations are integrated into the reference output one by one to identify genes, but the reference panaroo output is not updated with the genomic information from these isolates.
BacQuerya-processing primarily has been designed to populate the indices for the BacQuerya website but we appreciate our processing and indexing pipeline may have use cases beyond our original intentions. This may include locally hosting indices, allowing users to index and search through sensitive genomic data not currently in the public domain. To do this, we anticipate that a number of scripts will require some customisation depending on the use case and we would be happy to answer any questions you may have on setting up one of these private instances. However, local instances of BacQuerya sourcing publicly available data can be set up relatively easily.
Elasticsearch is easy to install, free for local installations and interacts with the processing pipeline through the same API as the Elastic cloud indices we use with BacQuerya. Elasticsearch is available for download from https://www.elastic.co/downloads/elasticsearch. Uncompress the download and add the following lines to the end of the config/elasticsearch.yml file within the bundle.
http.cors:
  enabled: true
  allow-origin: /https?:\/\/localhost(:[0-9]+)?/
To run Elasticsearch at http://localhost:9200, enter the bundle directory and start the Elastic instance by running:
- on Windows
bin\elasticsearch.bat
- on macOS/Linux
bin/elasticsearch
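Once the instance has started, it can be useful to confirm that it responds before running any indexing scripts. The check below is a sketch and not part of BacQuerya-processing. It relies on the Elasticsearch root endpoint returning a JSON document with a "version" field, and the function name is hypothetical. If security is enabled on your installation, an unauthenticated request is rejected and the function returns False.

```python
import json
import urllib.error
import urllib.request


def elasticsearch_is_up(url="http://localhost:9200", timeout=5.0):
    """Return True if an Elasticsearch node answers at `url` with its info document."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            info = json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status or a non-JSON body
        return False
    return isinstance(info, dict) and "version" in info


if __name__ == "__main__":
    print("Elasticsearch reachable:", elasticsearch_is_up())
```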
Our Elastic parameters and API keys are not available for public use. To make BacQuerya-processing communicate with your local Elasticsearch instance, you must therefore create a secrets.py file in the BacQuerya_processing directory and define the following parameters for export:
- ELASTIC_API_URL: The Elasticsearch endpoint. This is http://localhost:9200 for local instances.
- ELASTIC_ISOLATE_API_ID: The API ID for an API key with write access for your elastic isolate index.
- ELASTIC_ISOLATE_API_KEY: The API key with write access for your elastic isolate index.
- ELASTIC_GENE_API_ID: The API ID for an API key with write access for your elastic gene index.
- ELASTIC_GENE_API_KEY: The API key with write access for your elastic gene index.
To create API keys for your indices, see https://www.elastic.co/guide/en/elasticsearch/reference/current/security-api-create-api-key.html. Users may also define an ENTREZ_API_KEY here to increase the number of possible Entrez queries per second.
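A secrets.py for a local instance might look like the sketch below. Only the variable names and the local endpoint come from the description above; every other value is a placeholder to be replaced with your own credentials.

```python
# secrets.py: placed in the BacQuerya_processing directory

# Elasticsearch endpoint; http://localhost:9200 for a local instance
ELASTIC_API_URL = "http://localhost:9200"

# ID and key of an API key with write access to the isolate index (placeholders)
ELASTIC_ISOLATE_API_ID = "<your-isolate-api-id>"
ELASTIC_ISOLATE_API_KEY = "<your-isolate-api-key>"

# ID and key of an API key with write access to the gene index (placeholders)
ELASTIC_GENE_API_ID = "<your-gene-api-id>"
ELASTIC_GENE_API_KEY = "<your-gene-api-key>"

# Optional: raises the Entrez limit from 3 to 10 queries/second (placeholder)
ENTREZ_API_KEY = "<your-entrez-api-key>"
```

Because this file contains credentials, it is sensible to keep it out of version control.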
BacQuerya-processing was developed by Daniel Anderson and John Lees.