Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: a dosage command for storing TR motif counts in PGEN files #221

Open
2 tasks
aryarm opened this issue Sep 15, 2023 · 2 comments
Open
2 tasks

feat: a dosage command for storing TR motif counts in PGEN files #221

aryarm opened this issue Sep 15, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

@aryarm
Copy link
Member

aryarm commented Sep 15, 2023

Note: This idea originated from the bottom of #73 (comment)

Once PGEN files can store multiallelic dosages, we should consider creating a command called dosage that can append dosage info computed using haptools' Genotypes classes to an existing PGEN file:

  • It can use GenotypesPLINKTR to compute the genotypes, sum them to compute the dosages (and possibly divide by 2 so that they remain in the range [0, 2]?) and then write them to the existing file using PgenWriter.append_dosages_batch(). It should only change the PGEN file. The PVAR and PSAM files can remain the same
  • If --out is specified, we should create a new PGEN file instead of overwriting the existing one. In the former case, we should also copy (or symlink?) the PVAR and PSAM files.

Once we do all of this, we should be able to use PLINK2 to analyze TRs! For example, we could do the following:

  1. Convert a TR VCF into a PGEN file
    plink2 --vcf-half-call m --make-pgen 'pvar-cols=vcfheader,qual,filter,info' --vcf input.vcf --out output
    
  2. Run haptools dosage to add dosages to the PGEN file
    haptools dosage output.pgen
    
  3. Use the PGEN file as input to plink2. For example, we could use it with --r[2]-[un]phased, --ld-*, --pca, --glm, --freq, --clump, or --maf/--mac. Quoting from the documentation:

    Dosages are always used when present

@aryarm
Copy link
Member Author

aryarm commented Aug 10, 2024

In Tara's script for computing the dosages, she uses the dosage values from Beagle, stored in the AP1 and AP2 fields. Could we extend the data.GenotypesTR library to handle those also?

Yes, in theory, we could just override the data.GenotypesTR._return_data method and have it load dosages from the AP1 and AP2 fields rather than using variant.GetLengthGenotypes()

And then we could implement the data.GenotypesPLINKTR.write_variants and data.GenotypesPLINKTR.write methods to handle writing the dosages

Would this be desirable in any way? Probably not. We don't do anything fancy in the data.GenotypesTR method that would make this faster. In fact, if anything, it would probably be more inconvenient (not less) because we would need to handle batching the writing of the PGEN file in chunks to avoid memory overruns.

@aryarm
Copy link
Member Author

aryarm commented Aug 12, 2024

Update:
Melissa has actually been working on a TRRecord.GetDosages method which we could just call instead of TRRecord.GetLengthGenotypes on this line. That should make it super easy to add support for dosages in haptools. And once the annotaTR tool is implemented in TRTools, we probably won't need to try to write PGEN files with dosages anymore - but we might want to read them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant