improvements to the Data module #19

Open
16 of 17 tasks
aryarm opened this issue Mar 20, 2022 · 2 comments
Labels: enhancement (New feature or request)

aryarm commented Mar 20, 2022

input files

would be nice if we could support the following inputs

  • Path objects representing paths to the files
    • including gzipped files (those ending in .gz)
  • sys.stdout and sys.stdin
  • TextIO objects
    • This definitely won't be possible for the Genotypes class but we could do it for the Phenotypes and Covariates classes?

one strategy would be to create a function in the Data abstract class that could detect each of these cases and handle them appropriately?

  • we should also ensure that most of the classes can work appropriately on streams of data
    • and rewrite Genotypes.read to allow it to read data line by line
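
As a rough sketch of the "detect each of these cases" strategy, a single helper could normalize every kind of input into a text handle. The name open_text and the "-" shorthand for stdin/stdout are assumptions made for illustration, not part of the existing Data class:

```python
import gzip
import sys
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator, TextIO, Union

Source = Union[str, Path, TextIO]


@contextmanager
def open_text(source: Source, mode: str = "r") -> Iterator[TextIO]:
    """Yield a text handle for a path, a gzipped path, "-", or an open stream.

    Hypothetical helper: the name and the "-" convention are assumptions.
    """
    if mode not in ("r", "w"):
        raise ValueError("mode must be 'r' or 'w'")
    # Already-open streams (sys.stdin, sys.stdout, StringIO, ...) pass through.
    # We did not open them, so we do not close them either.
    if hasattr(source, "read") or hasattr(source, "write"):
        yield source
        return
    # "-" is a common shorthand for stdin (reading) or stdout (writing)
    if str(source) == "-":
        yield sys.stdin if mode == "r" else sys.stdout
        return
    path = Path(source)
    # gzip.open needs the "t" suffix to return text rather than bytes
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, mode + "t") as handle:
        yield handle
```

A read method written against this helper would only ever see a TextIO object, so it could consume the data line by line regardless of where it came from. This would not cover binary formats such as PGEN.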

informative warnings

  • would also be nice if we could warn users when the regions or samples that they provided match zero variants
    • and tell them to check that the chromosome prefix matches up (ex: "chr1" vs "1"), or attempt to fix it ourselves
  • for all warnings and errors, use the Logger module instead of raising assertions?
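
One possible shape for the zero-variants warning, using Python's standard logging module. The function names, the logger name, and the idea of returning a suggested contig name are all assumptions for illustration:

```python
import logging
from typing import Iterable, Optional

logger = logging.getLogger("data")  # hypothetical logger name


def toggle_chr_prefix(chrom: str) -> str:
    """Turn "chr1" into "1" and "1" into "chr1"."""
    return chrom[3:] if chrom.startswith("chr") else "chr" + chrom


def check_region(
    region_chrom: str,
    file_chroms: Iterable[str],
    n_variants: int,
    log: logging.Logger = logger,
) -> Optional[str]:
    """Warn if a region matched zero variants.

    Returns the contig name with its prefix toggled when that name exists
    in the file (a likely prefix mismatch), otherwise None.
    """
    if n_variants > 0:
        return None
    suggestion = toggle_chr_prefix(region_chrom)
    if suggestion in set(file_chroms):
        log.warning(
            "No variants found on '%s', but the file contains '%s'. "
            "Check the chromosome prefix.",
            region_chrom,
            suggestion,
        )
        return suggestion
    log.warning(
        "No variants found on '%s'. Check your regions and samples.",
        region_chrom,
    )
    return None
```

The caller can decide whether to retry with the suggested name or just surface the warning, which keeps the "fix it ourselves" behavior optional.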

additional classes

  • for covariates (as a table of samples x covariates)

filtering of variants

  • by whether they're multi-allelic
  • automatically by the subset of samples contained in the intersection of the genotype and phenotype files
    • note that this might be something we should only do within the code that utilizes the data module (for ex: happler)
  • by MAF
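
The multi-allelic and MAF filters could be sketched with numpy as below. This assumes a genotype array of shape (samples, variants, 2) holding integer allele codes (0 = REF, 1 = ALT, larger values for additional alleles) with no missing calls. The layout and the function names are assumptions, not the module's actual API:

```python
import numpy as np


def multiallelic_mask(gts: np.ndarray) -> np.ndarray:
    """True for each variant that carries any allele code above 1."""
    return (gts > 1).any(axis=(0, 2))


def minor_allele_freq(gts: np.ndarray) -> np.ndarray:
    """MAF per variant, assuming biallelic 0/1 codes and no missing calls."""
    alt_freq = gts.sum(axis=(0, 2)) / (2 * gts.shape[0])
    return np.minimum(alt_freq, 1 - alt_freq)


def filter_variants(gts: np.ndarray, min_maf: float = 0.0):
    """Drop multi-allelic variants, then variants with MAF below min_maf.

    Returns the filtered array and the boolean mask of kept variants.
    """
    keep = ~multiallelic_mask(gts)
    maf = np.zeros(gts.shape[1])
    # MAF is only defined here for the biallelic variants
    maf[keep] = minor_allele_freq(gts[:, keep])
    keep &= maf >= min_maf
    return gts[:, keep], keep
```

Returning the mask alongside the data makes it possible to apply the same filter to the variant metadata.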

subclasses for different kinds of genotyping data

or just some way to type-hint the specific kind that you need

  • phased vs no restriction on phasing
  • biallelic vs no restriction on allele number
    • filterable for above a certain MAF (only applies to biallelic)
  • contains TRs (potentially handled by trtools - see "support for a TR-based GenotypesPLINK class", #73)

new functions

  • iterate() - a generator function that reads the file line by line and yields named tuples, where each entry corresponds to a property of the class but holds the values for just a single row
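
A minimal sketch of what iterate() might look like for a two-column, tab-separated phenotype file. The PhenotypeRow name, its fields, and the file layout are assumptions for illustration:

```python
import io
from collections import namedtuple
from typing import Iterator, TextIO

# Hypothetical record type: one field per class property, for a single row
PhenotypeRow = namedtuple("PhenotypeRow", ["sample", "value"])


def iterate(handle: TextIO) -> Iterator[PhenotypeRow]:
    """Lazily yield one record per line instead of loading the whole file."""
    for line in handle:
        line = line.rstrip("\n")
        # skip blank lines and comment/header lines
        if not line or line.startswith("#"):
            continue
        sample, value = line.split("\t")
        yield PhenotypeRow(sample, float(value))


rows = list(iterate(io.StringIO("#sample\tvalue\nHG00096\t0.5\nHG00097\t1.25\n")))
```

Because the generator only holds one line at a time, it also works on streams such as sys.stdin.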

aryarm changed the title from "allow for different file inputs in data module" to "allow for different types of file object inputs in data module" on Mar 21, 2022
aryarm added the enhancement (New feature or request) label on Mar 21, 2022
aryarm changed the title from "allow for different types of file object inputs in data module" to "improvements to the Data module" on Mar 27, 2022
aryarm self-assigned this on Apr 1, 2022

aryarm commented May 14, 2022

For the Genotypes class

  • create some way to quickly obtain the index of a variant based on its ID
    • add a dictionary
  • remove aaf from Genotypes.variants
    • it was never really useful to begin with, anyway - just a bad idea from the start
    • we can make it into a method, instead
  • also, remove the Genotypes.to_MAC() method
  • add subset() function to the Genotypes class
    • by default, return a new Genotypes instance unless the inplace parameter is set to True
  • numpy-based subsetting and indexing
    • implement __getitem__()
    • implement __setitem__()
    • implement __delitem__()?
  • a method to generate fake Genotypes
  • Compare genotypes read when _prephased=True and _prephased=False in the GenotypesPLINK class. Figure out why they're different
  • create a QC method for running all of the QC steps?
  • reduce memory usage by explicitly freeing memory after loading every chunk of a PGEN file
    Right after this line of code, add the following lines (the gc module must be imported):
    del data
    gc.collect()
    
    (source and source)
    Update: current progress on this
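
To make a few of the Genotypes ideas concrete (the ID-to-index dictionary, subset() with an inplace parameter, and numpy-style indexing), here is a toy stand-in. ToyGenotypes, its attribute names, and its 2D (samples, variants) layout are assumptions for illustration and do not mirror the real class:

```python
import copy

import numpy as np


class ToyGenotypes:
    """Illustrative stand-in, not the real Genotypes class."""

    def __init__(self, data, samples, variant_ids):
        self.data = np.asarray(data)  # shape: (samples, variants)
        self.samples = tuple(samples)
        self.variant_ids = tuple(variant_ids)
        self._rebuild_index()

    def _rebuild_index(self):
        # variant ID -> column index, so lookups avoid a linear search
        self._var_idx = {vid: i for i, vid in enumerate(self.variant_ids)}

    def index_of(self, variant_id):
        return self._var_idx[variant_id]

    def subset(self, variants, inplace=False):
        """Keep only the given variant IDs, in the given order.

        Returns a new instance unless inplace is True (then returns None).
        """
        idx = [self._var_idx[v] for v in variants]
        target = self if inplace else copy.copy(self)
        target.data = self.data[:, idx]  # fancy indexing makes a copy
        target.variant_ids = tuple(variants)
        target._rebuild_index()
        return None if inplace else target

    def __getitem__(self, key):
        # defer to numpy so slices, masks and index arrays all work
        return self.data[key]
```

The dictionary has to be rebuilt whenever the variants change, which is one reason to route every change through subset().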

For the Phenotypes class

  • allow for storing multiple phenotypes in the Phenotypes object
  • support PLINK2-style .pheno files
    • writing
    • reading
  • make the Covariates class into a subclass of the Phenotypes object
  • add a method to generate fake Phenotypes
  • add subset() function for choosing a subset of samples
  • change the type of the samples argument for the Phenotypes class to a set (see "refactor: data.Phenotypes samples parameter to be of type set", #152)
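
Reading and writing multiple phenotypes might look roughly like this. The sketch assumes only the simplest PLINK2-style layout: tab-separated, a header line beginning with "#IID", and one numeric column per phenotype. PLINK2 accepts other header forms (such as an FID column) that are not handled here, and the function names are hypothetical:

```python
import io

import numpy as np


def read_pheno(handle):
    """Parse a "#IID"-style .pheno stream into (samples, names, data)."""
    header = handle.readline().rstrip("\n").split("\t")
    if header[0] != "#IID":
        raise ValueError("expected the header to start with '#IID'")
    names = tuple(header[1:])
    samples, rows = [], []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if fields == [""]:
            continue  # skip blank lines
        samples.append(fields[0])
        rows.append([float(x) for x in fields[1:]])
    # reshape keeps the array 2D even when there are zero samples
    data = np.array(rows, dtype=np.float64).reshape(len(samples), len(names))
    return tuple(samples), names, data


def write_pheno(handle, samples, names, data):
    """Write samples x phenotypes back out in the same layout."""
    handle.write("\t".join(("#IID",) + tuple(names)) + "\n")
    for sample, row in zip(samples, data):
        handle.write("\t".join([sample] + [repr(float(x)) for x in row]) + "\n")


text = "#IID\theight\tbmi\nHG1\t1.5\t22.0\nHG2\t-0.5\t30.25\n"
samples, names, data = read_pheno(io.StringIO(text))
```

Storing the phenotypes as a samples x phenotypes array is also what would let Covariates reuse the same code as a subclass.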

For the Haplotype and Haplotypes classes

  • use Genotypes.subset() within the transform() methods
    • remove the samples arguments from each of the transform() methods
  • do not require an empty GenotypesRefAlt class as input to Haplotypes.transform()
  • add subset() function for choosing a subset of haplotypes after loading them
  • require that the haplotypes parameter of the Haplotypes.read() method be a set instead of a list

For the Haplotype and Variant classes

  • make it easier to extend the classes, so that extras don't have to be declared ahead of time?
    • pros: it would make it easier to read files with multiple different sets of extra fields
    • cons: it puts the burden of handling all of that on us, which could potentially be difficult to take on in the future

For the Haplotype class

  • create a method that will update the start and end coordinates according to the stored variants
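
Such a method could be as small as the sketch below. The dataclass fields and the name update_coords are assumptions for illustration, not the real Haplotype and Variant definitions:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Variant:
    start: int
    end: int
    id: str


@dataclass
class Haplotype:
    chrom: str
    start: int
    end: int
    id: str
    variants: List[Variant] = field(default_factory=list)

    def update_coords(self) -> None:
        """Shrink or grow start/end to span exactly the stored variants."""
        if not self.variants:
            return  # nothing to derive the coordinates from
        self.start = min(v.start for v in self.variants)
        self.end = max(v.end for v in self.variants)


hap = Haplotype("1", 0, 1000, "H1", [Variant(10, 11, "rs1"), Variant(42, 45, "rs2")])
hap.update_coords()
```

Leaving the coordinates untouched when there are no variants is a design choice. Raising an error would be equally defensible.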


aryarm commented Mar 12, 2024

we also discussed some potential breaking changes to the classes in the data library which we would like to implement this summer:

  • we could change the __init__ methods to accept the values of the class properties as parameters
  • and then the read and write methods would take file names as input, instead
  • and then we could remove the load method and add a check method instead which can run the other checks (like check_maf, etc)

this idea originally arose in #49
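
To illustrate how the three proposed changes could fit together, here is a toy class. Making read() a classmethod is an assumption (the proposal only says that read and write would take file names), as are the class name, the two-column file layout, and the specific checks:

```python
import numpy as np


class ToyPhenotypes:
    """Illustrative stand-in for the proposed API, not the current classes."""

    def __init__(self, data, samples):
        # __init__ accepts the property values directly
        self.data = np.asarray(data, dtype=float)
        self.samples = tuple(samples)

    @classmethod
    def read(cls, fname):
        """Build an instance from a two-column, tab-separated file."""
        samples, values = [], []
        with open(fname) as f:
            for line in f:
                sample, value = line.rstrip("\n").split("\t")
                samples.append(sample)
                values.append(float(value))
        return cls(values, samples)

    def write(self, fname):
        with open(fname, "w") as f:
            for sample, value in zip(self.samples, self.data):
                f.write(f"{sample}\t{value}\n")

    def check(self):
        """Run the validation that load() would otherwise trigger."""
        if len(self.samples) != len(self.data):
            raise ValueError("samples and data differ in length")
        if len(set(self.samples)) != len(self.samples):
            raise ValueError("duplicate sample IDs")
        if np.isnan(self.data).any():
            raise ValueError("missing phenotype values")
```

One benefit of this shape is that objects can be constructed directly in tests, without writing a file first.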

another idea:

  • change the __iter__ methods to yield chunk_size variants at a time instead of only one at a time. This could be useful for folks who don't want to load the entire genotype matrix into memory all at once
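
The chunking logic itself could be sketched as follows. iter_chunks is a hypothetical name, and the example slices an array that is already in memory purely to show the iteration pattern. A real implementation would read each chunk from the file to get the memory savings:

```python
from typing import Iterator

import numpy as np


def iter_chunks(data: np.ndarray, chunk_size: int = 1000) -> Iterator[np.ndarray]:
    """Yield blocks of up to chunk_size variants (columns) at a time."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, data.shape[1], chunk_size):
        # the last chunk may be smaller than chunk_size
        yield data[:, start : start + chunk_size]


gts = np.arange(2 * 5).reshape(2, 5)  # 2 samples x 5 variants
shapes = [chunk.shape for chunk in iter_chunks(gts, chunk_size=2)]
```

Setting chunk_size to 1 would yield one variant per step, like the current one-at-a-time iteration, though each chunk keeps its variant axis.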
