Skip to content

Commit

Permalink
Merge pull request #84 from oxfordmmm/feat/use-grumpy
Browse files Browse the repository at this point in the history
Feat/use grumpy
  • Loading branch information
JeremyWesthead authored Aug 28, 2024
2 parents aed3e1f + 9af9d08 commit 81f9d6d
Show file tree
Hide file tree
Showing 11 changed files with 448 additions and 1,786 deletions.
29 changes: 11 additions & 18 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,31 +12,26 @@ Provides a library of functions for use within scripts, as well as a CLI tool fo
API reference for developers, and CLI instructions can be found here: https://oxfordmmm.github.io/gnomonicus/
## Usage
```
usage: gnomonicus [-h] --vcf_file VCF_FILE --genome_object GENOME_OBJECT [--catalogue_file CATALOGUE_FILE] [--ignore_vcf_filter] [--progress] [--output_dir OUTPUT_DIR] [--json] [--fasta FASTA] [--minor_populations MINOR_POPULATIONS]
usage: gnomonicus [-h] [-v] --vcf_file VCF_FILE --genome_object GENOME_OBJECT [--catalogue_file CATALOGUE_FILE] [--ignore_vcf_filter] [--output_dir OUTPUT_DIR] [--json] [--csvs CSVS [CSVS ...]] [--debug]
[--resistance_genes] --min_dp MIN_DP
options:
-h, --help show this help message and exit
-v, --version show program's version number and exit
--vcf_file VCF_FILE the path to a single VCF file
--genome_object GENOME_OBJECT
the path to a compressed gumpy Genome object or a genbank file
the path to a genbank file
--catalogue_file CATALOGUE_FILE
the path to the resistance catalogue
--ignore_vcf_filter whether to ignore the FILTER field in the vcf (e.g. necessary for some versions of Clockwork VCFs)
--progress whether to show progress using tqdm
--output_dir OUTPUT_DIR
Directory to save output files to. Defaults to wherever the script is run from.
--json Flag to create a single JSON output as well as the CSVs
--fasta FASTA Use to output a FASTA file of the resultant genome. Specify either 'fixed' or 'variable' for fixed length and variable length FASTA respectively.
--minor_populations MINOR_POPULATIONS
Path to a line separated file containing genome indices of minor populations.
```

## Helper usage
As the main script can utilise pickled `gumpy.Genome` objects, there is a supplied helper script. This converts a Genbank file into a pickled gumpy.Genome for significant time saving.
Due to the security implications of the pickle module, **DO NOT SEND/RECEIVE PICKLES**. This script should be used on a host VM before running nextflow to avoid reinstanciation.
Supports gzip compression to reduce file size significantly (using the `--compress` flag).
```
usage: gbkToPkl FILENAME [--compress]
--csvs CSVS [CSVS ...]
Types of CSV to produce. Accepted values are [variants, mutations, effects, predictions, all]. `all` produces all of the CSVs
--debug Whether to log debugging messages to the log. Defaults to False
--resistance_genes Flag to filter mutations and variants to only include genes present in the resistance catalogue
--min_dp MIN_DP Minimum depth for a variant to be considered in the VCF. Below this value, rows are interpreted as null calls.
```

## Install
Expand All @@ -58,9 +53,6 @@ A Docker image should be built on releases. To open a shell with gnomonicus inst
docker run -it oxfordmmm/gnomonicus:latest
```

## Updating
If a `gnomonicus` update changes the version of `gumpy` used, an `OutdatedGumpyException` will be thrown if using a pickled `Genome` object which was instantiated with the old version. This can be fixed by simply re-instantiating the `Genome` object by passing a genbank file once.

## Notes
When generating mutations, in cases of synonymous amino acid mutation, the nucelotides changed are also included. This can lead to a mix of nucleotides and amino acids for coding genes, but these are excluded from generating effects unless specified in the catalogue. This means that the default rule of `gene@*= --> S` is still in place regardless of the introduced `gene@*?` which would otherwise take precedence. For example:
```
Expand All @@ -87,13 +79,14 @@ When generating mutations, in cases of synonymous amino acid mutation, the nucel
'PHENOTYPE': 'S'
}
],
}
```
The nucelotide variation is included in the the `MUTATIONS`, but explictly removed from the `EFFECTS` unless it is specified within the catalogue.
In order for this variation to be included, a line in the catalogue of `S@F2F&S@t6c` would have to be present.

## User stories

1. As a bioinformatician, I want to be able to run `gnomonicus` on the command line, passing it (i) a GenBank file (or pickled `gumpy.Genome` object), (ii) a resistance catalogue and (iii) a VCF file, and get back `pandas.DataFrames` of the genetic variants, mutations, effects and predictions/antibiogram. The latter is for all the drugs described in the passed resistance catalogue.
1. As a bioinformatician, I want to be able to run `gnomonicus` on the command line, passing it (i) a GenBank file ~~(or pickled `gumpy.Genome` object)~~, (ii) a resistance catalogue and (iii) a VCF file, and get back `pandas.DataFrames` of the genetic variants, mutations, effects and predictions/antibiogram. The latter is for all the drugs described in the passed resistance catalogue.

2. As a GPAS developer, I want to be able to embed `gnomonicus` in a Docker image/NextFlow pipeline that consumes the outputs of [tb-pipeline](https://github.com/Pathogen-Genomics-Cymru/tb-pipeline) and emits a structured, well-designed `JSON` object describing the genetic variants, mutations, effects and predictions/antibiogram.

Expand Down
38 changes: 0 additions & 38 deletions bin/gbkToPkl

This file was deleted.

99 changes: 22 additions & 77 deletions bin/gnomonicus
Original file line number Diff line number Diff line change
Expand Up @@ -8,12 +8,11 @@ import logging
import os
import time

import gumpy
import grumpy
import piezo

import gnomonicus
from gnomonicus import (
loadGenome,
populateEffects,
populateMutations,
populateVariants,
Expand All @@ -36,7 +35,7 @@ if __name__ == "__main__":
parser.add_argument(
"--genome_object",
required=True,
help="the path to a compressed gumpy Genome object or a genbank file",
help="the path to a genbank file",
)
parser.add_argument(
"--catalogue_file",
Expand All @@ -50,12 +49,6 @@ if __name__ == "__main__":
default=False,
help="whether to ignore the FILTER field in the vcf (e.g. necessary for some versions of Clockwork VCFs)",
)
parser.add_argument(
"--progress",
action="store_true",
default=False,
help="whether to show progress using tqdm",
)
parser.add_argument(
"--output_dir",
required=False,
Expand All @@ -69,18 +62,6 @@ if __name__ == "__main__":
default=False,
help="Flag to create a single JSON output as well as the CSVs",
)
parser.add_argument(
"--fasta",
required=False,
default=None,
help="Use to output a FASTA file of the resultant genome. Specify either 'fixed' or 'variable' for fixed length and variable length FASTA respectively. Alternatively, use 'both' to produce both",
)
parser.add_argument(
"--minor_populations",
required=False,
default=None,
help="Path to a line separated file containing genome indices of minor populations.",
)
parser.add_argument(
"--csvs",
required=False,
Expand All @@ -100,18 +81,12 @@ if __name__ == "__main__":
action="store_true",
help="Flag to filter mutations and variants to only include genes present in the resistance catalogue",
)
parser.add_argument(
"--fasta_adjudication",
required=False,
default=None,
help="Path to a FASTA file to use for null rule adjudication. This returns matches for null-call rules in cases of corresponding `N` values in a FASTA file.",
)
parser.add_argument(
"--min_dp",
type=int,
required=False,
required=True,
default=None,
help="Minimum depth for a variant to be considered in the VCF. Below this value, rows are interpreted as null calls. Should allow for skipping FASTA adjudication in cases of low depth while adding coverage data.",
help="Minimum depth for a variant to be considered in the VCF. Below this value, rows are interpreted as null calls.",
)
options = parser.parse_args()

Expand Down Expand Up @@ -154,55 +129,45 @@ if __name__ == "__main__":
make_effects_csv = "effects" in options.csvs
make_prediction_csv = "predictions" in options.csvs

if options.csvs is None and options.json is False and options.fasta is None:
if options.csvs is None and options.json is False:
# No files will be created so warn the user
print(
"[WARNING]: No outputs selected. No files will be created. For help, try `gnomonicus --help`"
)
logging.warning("No outputs selected. No files will be created")

# Get reference genome
reference = loadGenome(options.genome_object, options.progress)
reference = grumpy.Genome(options.genome_object)
logging.debug("Loaded reference genome")

# Check if we have minor population indices
if options.minor_populations is not None:
# Load from line separated file
with open(options.minor_populations) as f:
minor_indices_ = [line.strip() for line in f]
minor_indices = []
# Try converting line by line so we can produce a nice error if invalid
for idx, index in enumerate(minor_indices_):
try:
minor_indices.append(int(index))
except Exception as e:
raise ValueError(f"Invalid index on line {idx}: {index}") from e
else:
minor_indices = []
# Build the mutated genome using gumpy
vcf = gumpy.VCFFile(
# Build the mutated genome
vcf = grumpy.VCFFile(
options.vcf_file,
ignore_filter=options.ignore_vcf_filter,
minor_population_indices=minor_indices,
min_dp=options.min_dp,
options.ignore_vcf_filter,
options.min_dp,
)
sample = reference + vcf
logging.debug("Applied the VCF to the reference")

# Get resistance catalogue
if options.catalogue_file:
resistanceCatalogue = piezo.ResistanceCatalogue(
options.catalogue_file, prediction_subset_only=options.resistance_genes
)
logging.debug("Loaded resistance catalogue")
minor_type = gnomonicus.get_minority_population_type(resistanceCatalogue)
else:
resistanceCatalogue = None
logging.info(
"No resistance catalogue provided, producing variants and mutations only"
)
# Default to COV if no resistance catalogue is provided
# not that it needs to be used, but minor populations are built into grumpy
minor_type = grumpy.MinorType.COV

sample = grumpy.mutate(reference, vcf)
logging.debug("Applied the VCF to the reference")

# Get the GenomeDifference for extracting genome level mutations
diff = reference - sample
diff = grumpy.GenomeDifference(reference, sample, minor_type)
logging.debug("Got the genome difference")

# Complain if there are no variants
Expand All @@ -217,11 +182,12 @@ if __name__ == "__main__":
diff,
make_variants_csv,
options.resistance_genes,
sample,
catalogue=resistanceCatalogue,
)
logging.debug("Populated and saved variants.csv")

mutations, referenceGenes, minor_errors = populateMutations(
mutations = populateMutations(
vcfStem,
options.output_dir,
diff,
Expand All @@ -244,13 +210,10 @@ if __name__ == "__main__":
options.output_dir,
resistanceCatalogue,
mutations,
referenceGenes,
vcfStem,
make_effects_csv,
make_prediction_csv,
fasta=options.fasta_adjudication,
reference=reference,
sample_genome=sample,
reference,
make_mutations_csv=make_mutations_csv,
)
logging.debug("Populated and saved effects.csv and predictions.csv")
Expand Down Expand Up @@ -304,22 +267,4 @@ if __name__ == "__main__":
options.vcf_file,
options.genome_object,
options.catalogue_file,
minor_errors,
)

if options.fasta and options.fasta.lower() in ["fixed", "variable"]:
fixed = options.fasta.lower() == "fixed"
# Write the resultant fasta file
sample.save_fasta(
os.path.join(options.output_dir, vcfStem + "-" + options.fasta + ".fasta"),
fixed_length=fixed,
)
elif options.fasta and options.fasta.lower() == "both":
sample.save_fasta(
os.path.join(options.output_dir, vcfStem + "-fixed.fasta"),
fixed_length=True,
)
sample.save_fasta(
os.path.join(options.output_dir, vcfStem + "-variable.fasta"),
fixed_length=False,
)
)
9 changes: 4 additions & 5 deletions bin/gnomonicus-table-update
Original file line number Diff line number Diff line change
Expand Up @@ -4,8 +4,9 @@
import argparse
import piezo
import pandas
import grumpy

from gnomonicus import loadGenomeAndGenes, populateEffects
from gnomonicus import populateEffects

if __name__ == "__main__":
# Argparser setup
Expand All @@ -14,7 +15,7 @@ if __name__ == "__main__":
parser.add_argument(
"--genome_object",
required=True,
help="Path to a compressed gumpy Genome object or a genbank file",
help="Path to a genbank file",
)
parser.add_argument(
"--catalogue_file",
Expand All @@ -27,7 +28,6 @@ if __name__ == "__main__":
options = parser.parse_args()

mutations = pandas.read_csv(options.mutations)
reference, reference_genes = loadGenomeAndGenes(options.genome_object, options.progress)
resistanceCatalogue = piezo.ResistanceCatalogue(options.catalogue_file)

# Pull out original VCF stem from mutations filename
Expand All @@ -38,10 +38,9 @@ if __name__ == "__main__":
options.output,
resistanceCatalogue,
mutations,
reference_genes,
vcf_stem,
True,
True,
reference=reference,
grumpy.Genome(options.genome_object),
append=True
)
Loading

0 comments on commit 81f9d6d

Please sign in to comment.