Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strange genotype selection #322

Closed
mbhall88 opened this issue Jan 19, 2023 · 10 comments
Closed

Strange genotype selection #322

mbhall88 opened this issue Jan 19, 2023 · 10 comments

Comments

@mbhall88
Copy link
Member

mbhall88 commented Jan 19, 2023

I have a weird situation where pandora is selecting ref genotype when it should clearly be the alt - the alt has the highest likelihood.

gid     330     .       TGCCATTGGCGATAGCGCGGCCGGA       GGCTACGTCACGCACATTTGCCGGA       .       .       VC=PH_SNPs;GRAPHTYPE=NESTED     GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      0:7,0:5,0:10,0:9,0:57,0:47,0:0.375,1:-7.56853,-83.262:75.6935
gid     351     .       CGGA    CGGC,TGGA       .       .       VC=PH_SNPs;GRAPHTYPE=NESTED     GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      0:2,9,0:2,8,0:0,10,0:0,9,0:10,39,0:9,35,0:0.75,0,1:-95.4097,-21.0618,-124.709:74.3479

The position in question is 351. However, for some reason make_prg has not merged it with the variant above it - which happens to also span position 351. BUT, I am using --local genotyping, so pandora shouldn't be trying to make genotypes compatible in this scenario?

Here are the same positions in the consensus VCF

gid     330     .       TGCCATTGGCGATAGCGCGGCCGGA       GGCTACGTCACGCACATTTGCCGGA       .       .       VC=PH_SNPs;GRAPHTYPE=NESTED     GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS .:7,0:5,0:10,0:9,0:57,0:47,0:0.375,1
gid     351     .       CGGA    CGGC,TGGA       .       .       VC=PH_SNPs;GRAPHTYPE=NESTED     GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS 1:2,9,0:2,8,0:0,10,0:0,9,0:10,39,0:9,35,0:0.75,0,1

Any thoughts on what's happened here?

@iqbal-lab
Copy link
Collaborator

  1. This is weird, it has the highest likelihood
  2. Pandora does try to make things compatible with local genotyping as much as possible - the reason we abandoned it was that it became too hacky/messy to try and properly bolt on compatibility to local genotyping, and yet if you don't have compatibility, you can have conflicting genotypes.

I don't have a better suggestion than putting a breakpoint in and seeing what happens.

If you have time for a call next week, i'd be up for a chat as I don't remember/understand why you are using local genotyping, i forget the reason. I think it is/was because of minor alleles. Isn't your drprg minor analysis basically driven by looking at the coverage on both alleles, and could work the same on local and global genotyping? sorry for this basic question, have lost track

@iqbal-lab
Copy link
Collaborator

Sorry, was going to discuss with leandro, but he was teaching and then I was distracted. Will talk to him tomorrow!

@mbhall88
Copy link
Member Author

mbhall88 commented Jan 19, 2023

I don't remember/understand why you are using local genotyping, i forget the reason. I think it is/was because of minor alleles.

Motivation and justification.

Isn't your drprg minor analysis basically driven by looking at the coverage on both alleles, and could work the same on local and global genotyping?

Yes, but I still rely on the genotype. I only dig into null genotypes if they cross a start codon, and then, I'm just looking to see if we lose the gene (as a patch until #316 is sorted). I'm not really keen to ignore genotype and implement my own genotyping model when pandora has one already...

I dug into this a little more and the only reason I can see if genotype compatibility. I've removed the allele as POS 330 from the graph too as that is what was causing the problem and it only occurs in one sample that we build the PRG from (out of ~150 samples) and it doesn't contain any common resistance variants.

@iqbal-lab
Copy link
Collaborator

Fair enough, I'll chew on this. ,

@leoisl
Copy link
Collaborator

leoisl commented Jan 19, 2023

Sorry for the delay! I think what is happening in the genotyped VCF:

gid     330     .       TGCCATTGGCGATAGCGCGGCCGGA       GGCTACGTCACGCACATTTGCCGGA       .       .       VC=PH_SNPs;GRAPHTYPE=NESTED     GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      0:7,0:5,0:10,0:9,0:57,0:47,0:0.375,1:-7.56853,-83.262:75.6935
gid     351     .       CGGA    CGGC,TGGA       .       .       VC=PH_SNPs;GRAPHTYPE=NESTED     GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      0:2,9,0:2,8,0:0,10,0:0,9,0:10,39,0:9,35,0:0.75,0,1:-95.4097,-21.0618,-124.709:74.3479

with local genotyping is the following:

  1. pandora looks for the highest confidence call among all variants, and select that. That would be 330 calling ref (GT CONF 75.6935). This allele is TGCCATTGGCGATAGCGCGGCCGGA, spanning until pos 355;
  2. The next variant overlaps the 4 last bases of this first variant (these 4 last bases being CGGA), and has lower confidence than the first variant, so the choice on the first variant takes priority. Thus pandora makes the call on this second variant compatible with the first choice, forcing the call to be CGGA, i.e. also the ref.

IIRC, it is sort of a greedy algorithm, working from the highest to the lowest confidence calls, and making lower-confidence calls compatible with higher-confidence calls... I guess @rmcolq could correct us in case of any (or many) mistakes!

@mbhall88
Copy link
Member Author

That makes sense. I just thought that because I'm using local genotyping it would override that process?

@iqbal-lab
Copy link
Collaborator

That is how local genotyping works, or rather that's how Pandora tries to make local genotypes not self inconsistent.

@leoisl
Copy link
Collaborator

leoisl commented Jan 19, 2023

I think local genotyping is this process! Global genotyping will get the ML path and, for each site, will look at the most likely allele. If it agrees with the ML path (i.e. at site 1, ML path calls alt 1 and alt 1 is also the most likely) then it calls alt 1. Otherwise, it calls .. Local genotyping will instead work from the highest-confidence call towards the lowest one, possibly leaving the ML path, the goal is to make the lower-confidence calls compatible with higher-confidence calls if they are incompatible (either calling one of the alleles or nullifying the call)

@leoisl
Copy link
Collaborator

leoisl commented Jan 19, 2023

Sorry, did not see Zam's message before commenting, which is the same as mine, but much better summarised 😂 I think this is what happens, but also to be honest, last time I actually checked the genotyping code was some 2 years ago...

@mbhall88
Copy link
Member Author

Ohhh true. Yes, sorry I forgot global would just make it null. Thanks for clearing this up!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants