Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

* in snp is not allowed in MASH pipeline rds_to_vcf #551

Open
rfeng2023 opened this issue Mar 13, 2023 · 6 comments
Open

* in snp is not allowed in MASH pipeline rds_to_vcf #551

rfeng2023 opened this issue Mar 13, 2023 · 6 comments

Comments

@rfeng2023
Copy link
Contributor

image

There are some "*" (which means a deletion) in snp file will cause the error

Error in .Call2("new_XStringSet_from_CHARACTER", class(x0), elementType(x0), : key 42 (char '*') not in lookup table

That may be caused by Biostrings::DNAStringSet(nea), which only allow the input chr as "ACTG".

@rfeng2023 rfeng2023 changed the title * in snp is not allowed in rds_to_vcf * in snp is not allowed in MASH pipeline rds_to_vcf Mar 13, 2023
@gaow
Copy link
Contributor

gaow commented Mar 13, 2023

Well, other than a more technical solution on this, a general question is @hsun3163 when we "harmonize " deletion data, is there a way to harmonize it as eg

GC C

as opposed to:

G *

?

@hsun3163
Copy link
Collaborator

Well, other than a more technical solution on this, a general question is @hsun3163 when we "harmonize " deletion data, is there a way to harmonize it as eg

GC C

as opposed to:

G *

?

One immediate challenge for this is that, the C for GC C is not readily available when all we have is G *

@gaow
Copy link
Contributor

gaow commented Mar 15, 2023

As discussed with @rfeng2023 I think we are going to use the standard N symbol for "any basepair". ...

@rfeng2023
Copy link
Contributor Author

As discussed with @rfeng2023 I think we are going to use the standard N symbol for "any basepair". ...

should we replace it in the rds_to_vcf process or in the input file (which may need to be fixed in merged.sumstat.vcf)?

@gaow
Copy link
Contributor

gaow commented Mar 16, 2023

This is also a good question. @hsun3163 did you invent the G * convention or bcftools had it? If this is bcftools behavior then we better leave it as is. If it is @hsun3163 's choice then yes we may want to use GN N instead of G * because * is not a standard symbol for sequence data.

@hsun3163
Copy link
Collaborator

hsun3163 commented Mar 16, 2023

They were not introduced by me, instead, they were inherited from the raw vcf.gz file, as indicated below:

hs3163@node96:/mnt/vast/hpc/csg/xqtl_workflow_testing/finalizing/output/data_preprocessing/genotype$ zcat DEJ_11898_B01_GRM_WGS_2017-05-15_21.recalibrated_variants.xqtl_protocol_data.add_chr.add_chr.leftnorm.filtered.vcf.gz | grep "*"  | cut -f 1,2,3,4,5,6 | head
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##bcftools_filterCommand=filter -i 'GT="hom" | TYPE="snp" & GT="het" & (FORMAT/AD[*:1])/(FORMAT/AD[*:0] + FORMAT/AD[*:1]) >= 0.15 | TYPE="indel" & GT="het" & (FORMAT/AD[*:1])/(FORMAT/AD[*:0] + FORMAT/AD[*:1]) >= 0.2'; Date=Wed Oct 19 15:56:29 2022
chr21   9540614 chr21:9540614:G:*       G       *       3256.49
chr21   9550890 chr21:9550890:A:*       A       *       11424.9
chr21   9553296 chr21:9553296:G:*       G       *       3665.16

Changing the * to n for only mash output may make the future comparisons between mash output and the output of other parts of our analysis pipeline difficult.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants