Merge pull request #84 from oxfordmmm/feat/use-grumpy

Feat/use grumpy
oxfordmmm · Aug 28, 2024 · 81f9d6d · 81f9d6d
2 parents aed3e1f + 9af9d08
commit 81f9d6d
Show file tree

Hide file tree

Showing 11 changed files with 448 additions and 1,786 deletions.
diff --git a/README.md b/README.md
@@ -12,31 +12,26 @@ Provides a library of functions for use within scripts, as well as a CLI tool fo
 API reference for developers, and CLI instructions can be found here: https://oxfordmmm.github.io/gnomonicus/ 
 ## Usage
 ```
-usage: gnomonicus [-h] --vcf_file VCF_FILE --genome_object GENOME_OBJECT [--catalogue_file CATALOGUE_FILE] [--ignore_vcf_filter] [--progress] [--output_dir OUTPUT_DIR] [--json] [--fasta FASTA] [--minor_populations MINOR_POPULATIONS]
+usage: gnomonicus [-h] [-v] --vcf_file VCF_FILE --genome_object GENOME_OBJECT [--catalogue_file CATALOGUE_FILE] [--ignore_vcf_filter] [--output_dir OUTPUT_DIR] [--json] [--csvs CSVS [CSVS ...]] [--debug]
+                  [--resistance_genes] --min_dp MIN_DP
 
 options:
   -h, --help            show this help message and exit
+  -v, --version         show program's version number and exit
   --vcf_file VCF_FILE   the path to a single VCF file
   --genome_object GENOME_OBJECT
-                        the path to a compressed gumpy Genome object or a genbank file
+                        the path to a genbank file
   --catalogue_file CATALOGUE_FILE
                         the path to the resistance catalogue
   --ignore_vcf_filter   whether to ignore the FILTER field in the vcf (e.g. necessary for some versions of Clockwork VCFs)
-  --progress            whether to show progress using tqdm
   --output_dir OUTPUT_DIR
                         Directory to save output files to. Defaults to wherever the script is run from.
   --json                Flag to create a single JSON output as well as the CSVs
-  --fasta FASTA         Use to output a FASTA file of the resultant genome. Specify either 'fixed' or 'variable' for fixed length and variable length FASTA respectively.
-  --minor_populations MINOR_POPULATIONS
-                        Path to a line separated file containing genome indices of minor populations.
-```
-
-## Helper usage
-As the main script can utilise pickled `gumpy.Genome` objects, there is a supplied helper script. This converts a Genbank file into a pickled gumpy.Genome for significant time saving.
-Due to the security implications of the pickle module, **DO NOT SEND/RECEIVE PICKLES**. This script should be used on a host VM before running nextflow to avoid reinstanciation.
-Supports gzip compression to reduce file size significantly (using the `--compress` flag).
-```
-usage: gbkToPkl FILENAME [--compress]
+  --csvs CSVS [CSVS ...]
+                        Types of CSV to produce. Accepted values are [variants, mutations, effects, predictions, all]. `all` produces all of the CSVs
+  --debug               Whether to log debugging messages to the log. Defaults to False
+  --resistance_genes    Flag to filter mutations and variants to only include genes present in the resistance catalogue
+  --min_dp MIN_DP       Minimum depth for a variant to be considered in the VCF. Below this value, rows are interpreted as null calls.
 ```
 
 ## Install
@@ -58,9 +53,6 @@ A Docker image should be built on releases. To open a shell with gnomonicus inst
 docker run -it oxfordmmm/gnomonicus:latest
 ```
 
-## Updating
-If a `gnomonicus` update changes the version of `gumpy` used, an `OutdatedGumpyException` will be thrown if using a pickled `Genome` object which was instantiated with the old version. This can be fixed by simply re-instantiating the `Genome` object by passing a genbank file once.
-
 ## Notes
 When generating mutations, in cases of synonymous amino acid mutation, the nucelotides changed are also included. This can lead to a mix of nucleotides and amino acids for coding genes, but these are excluded from generating effects unless specified in the catalogue. This means that the default rule of `gene@*= --> S` is still in place regardless of the introduced `gene@*?` which would otherwise take precedence. For example:
 ```
@@ -87,13 +79,14 @@ When generating mutations, in cases of synonymous amino acid mutation, the nucel
               'PHENOTYPE': 'S'
           }
       ],
+  }
 ```
 The nucelotide variation is included in the the `MUTATIONS`, but explictly removed from the `EFFECTS` unless it is specified within the catalogue.
 In order for this variation to be included, a line in the catalogue of `S@F2F&S@t6c` would have to be present.
 
 ## User stories
 
-1. As a bioinformatician, I want to be able to run `gnomonicus` on the command line, passing it (i) a GenBank file (or pickled `gumpy.Genome` object), (ii) a resistance catalogue and (iii) a VCF file, and get back `pandas.DataFrames` of the genetic variants, mutations, effects and predictions/antibiogram. The latter is for all the drugs described in the passed resistance catalogue.
+1. As a bioinformatician, I want to be able to run `gnomonicus` on the command line, passing it (i) a GenBank file ~~(or pickled `gumpy.Genome` object)~~, (ii) a resistance catalogue and (iii) a VCF file, and get back `pandas.DataFrames` of the genetic variants, mutations, effects and predictions/antibiogram. The latter is for all the drugs described in the passed resistance catalogue.
 
 2. As a GPAS developer, I want to be able to embed `gnomonicus` in a Docker image/NextFlow pipeline that consumes the outputs of [tb-pipeline](https://github.com/Pathogen-Genomics-Cymru/tb-pipeline) and emits a structured, well-designed `JSON` object describing the genetic variants, mutations, effects and predictions/antibiogram.
 

diff --git a/bin/gbkToPkl b/bin/gbkToPkl
diff --git a/bin/gnomonicus b/bin/gnomonicus
@@ -8,12 +8,11 @@ import logging
 import os
 import time
 
-import gumpy
+import grumpy
 import piezo
 
 import gnomonicus
 from gnomonicus import (
-    loadGenome,
     populateEffects,
     populateMutations,
     populateVariants,
@@ -36,7 +35,7 @@ if __name__ == "__main__":
     parser.add_argument(
         "--genome_object",
         required=True,
-        help="the path to a compressed gumpy Genome object or a genbank file",
+        help="the path to a genbank file",
     )
     parser.add_argument(
         "--catalogue_file",
@@ -50,12 +49,6 @@ if __name__ == "__main__":
         default=False,
         help="whether to ignore the FILTER field in the vcf (e.g. necessary for some versions of Clockwork VCFs)",
     )
-    parser.add_argument(
-        "--progress",
-        action="store_true",
-        default=False,
-        help="whether to show progress using tqdm",
-    )
     parser.add_argument(
         "--output_dir",
         required=False,
@@ -69,18 +62,6 @@ if __name__ == "__main__":
         default=False,
         help="Flag to create a single JSON output as well as the CSVs",
     )
-    parser.add_argument(
-        "--fasta",
-        required=False,
-        default=None,
-        help="Use to output a FASTA file of the resultant genome. Specify either 'fixed' or 'variable' for fixed length and variable length FASTA respectively. Alternatively, use 'both' to produce both",
-    )
-    parser.add_argument(
-        "--minor_populations",
-        required=False,
-        default=None,
-        help="Path to a line separated file containing genome indices of minor populations.",
-    )
     parser.add_argument(
         "--csvs",
         required=False,
@@ -100,18 +81,12 @@ if __name__ == "__main__":
         action="store_true",
         help="Flag to filter mutations and variants to only include genes present in the resistance catalogue",
     )
-    parser.add_argument(
-        "--fasta_adjudication",
-        required=False,
-        default=None,
-        help="Path to a FASTA file to use for null rule adjudication. This returns matches for null-call rules in cases of corresponding `N` values in a FASTA file.",
-    )
     parser.add_argument(
         "--min_dp",
         type=int,
-        required=False,
+        required=True,
         default=None,
-        help="Minimum depth for a variant to be considered in the VCF. Below this value, rows are interpreted as null calls. Should allow for skipping FASTA adjudication in cases of low depth while adding coverage data.",
+        help="Minimum depth for a variant to be considered in the VCF. Below this value, rows are interpreted as null calls.",
     )
     options = parser.parse_args()
 
@@ -154,55 +129,45 @@ if __name__ == "__main__":
         make_effects_csv = "effects" in options.csvs
         make_prediction_csv = "predictions" in options.csvs
 
-    if options.csvs is None and options.json is False and options.fasta is None:
+    if options.csvs is None and options.json is False:
         # No files will be created so warn the user
         print(
             "[WARNING]: No outputs selected. No files will be created. For help, try `gnomonicus --help`"
         )
         logging.warning("No outputs selected. No files will be created")
 
     # Get reference genome
-    reference = loadGenome(options.genome_object, options.progress)
+    reference = grumpy.Genome(options.genome_object)
     logging.debug("Loaded reference genome")
 
-    # Check if we have minor population indices
-    if options.minor_populations is not None:
-        # Load from line separated file
-        with open(options.minor_populations) as f:
-            minor_indices_ = [line.strip() for line in f]
-            minor_indices = []
-            # Try converting line by line so we can produce a nice error if invalid
-            for idx, index in enumerate(minor_indices_):
-                try:
-                    minor_indices.append(int(index))
-                except Exception as e:
-                    raise ValueError(f"Invalid index on line {idx}: {index}") from e
-    else:
-        minor_indices = []
-    # Build the mutated genome using gumpy
-    vcf = gumpy.VCFFile(
+    # Build the mutated genome
+    vcf = grumpy.VCFFile(
         options.vcf_file,
-        ignore_filter=options.ignore_vcf_filter,
-        minor_population_indices=minor_indices,
-        min_dp=options.min_dp,
+        options.ignore_vcf_filter,
+        options.min_dp,
     )
-    sample = reference + vcf
-    logging.debug("Applied the VCF to the reference")
 
     # Get resistance catalogue
     if options.catalogue_file:
         resistanceCatalogue = piezo.ResistanceCatalogue(
             options.catalogue_file, prediction_subset_only=options.resistance_genes
         )
         logging.debug("Loaded resistance catalogue")
+        minor_type = gnomonicus.get_minority_population_type(resistanceCatalogue)
     else:
         resistanceCatalogue = None
         logging.info(
             "No resistance catalogue provided, producing variants and mutations only"
         )
+        # Default to COV if no resistance catalogue is provided
+        # not that it needs to be used, but minor populations are built into grumpy
+        minor_type = grumpy.MinorType.COV
+
+    sample = grumpy.mutate(reference, vcf)
+    logging.debug("Applied the VCF to the reference")
 
     # Get the GenomeDifference for extracting genome level mutations
-    diff = reference - sample
+    diff = grumpy.GenomeDifference(reference, sample, minor_type)
     logging.debug("Got the genome difference")
 
     # Complain if there are no variants
@@ -217,11 +182,12 @@ if __name__ == "__main__":
         diff,
         make_variants_csv,
         options.resistance_genes,
+        sample,
         catalogue=resistanceCatalogue,
     )
     logging.debug("Populated and saved variants.csv")
 
-    mutations, referenceGenes, minor_errors = populateMutations(
+    mutations = populateMutations(
         vcfStem,
         options.output_dir,
         diff,
@@ -244,13 +210,10 @@ if __name__ == "__main__":
             options.output_dir,
             resistanceCatalogue,
             mutations,
-            referenceGenes,
             vcfStem,
             make_effects_csv,
             make_prediction_csv,
-            fasta=options.fasta_adjudication,
-            reference=reference,
-            sample_genome=sample,
+            reference,
             make_mutations_csv=make_mutations_csv,
         )
         logging.debug("Populated and saved effects.csv and predictions.csv")
@@ -304,22 +267,4 @@ if __name__ == "__main__":
             options.vcf_file,
             options.genome_object,
             options.catalogue_file,
-            minor_errors,
-        )
-
-    if options.fasta and options.fasta.lower() in ["fixed", "variable"]:
-        fixed = options.fasta.lower() == "fixed"
-        # Write the resultant fasta file
-        sample.save_fasta(
-            os.path.join(options.output_dir, vcfStem + "-" + options.fasta + ".fasta"),
-            fixed_length=fixed,
-        )
-    elif options.fasta and options.fasta.lower() == "both":
-        sample.save_fasta(
-            os.path.join(options.output_dir, vcfStem + "-fixed.fasta"),
-            fixed_length=True,
-        )
-        sample.save_fasta(
-            os.path.join(options.output_dir, vcfStem + "-variable.fasta"),
-            fixed_length=False,
-        )
+        )
diff --git a/bin/gnomonicus-table-update b/bin/gnomonicus-table-update
@@ -4,8 +4,9 @@
 import argparse
 import piezo
 import pandas
+import grumpy
 
-from gnomonicus import loadGenomeAndGenes, populateEffects
+from gnomonicus import populateEffects
 
 if __name__ == "__main__":
     # Argparser setup
@@ -14,7 +15,7 @@ if __name__ == "__main__":
     parser.add_argument(
         "--genome_object",
         required=True,
-        help="Path to a compressed gumpy Genome object or a genbank file",
+        help="Path to a genbank file",
     )
     parser.add_argument(
         "--catalogue_file",
@@ -27,7 +28,6 @@ if __name__ == "__main__":
     options = parser.parse_args()
 
     mutations = pandas.read_csv(options.mutations)
-    reference, reference_genes = loadGenomeAndGenes(options.genome_object, options.progress)
     resistanceCatalogue = piezo.ResistanceCatalogue(options.catalogue_file)
 
     # Pull out original VCF stem from mutations filename
@@ -38,10 +38,9 @@ if __name__ == "__main__":
         options.output,
         resistanceCatalogue,
         mutations,
-        reference_genes,
         vcf_stem,
         True,
         True,
-        reference=reference,
+        grumpy.Genome(options.genome_object),
         append=True
     )