Issue while choosing the reference path for genotyping #329

Open
AmayAgrawal opened this issue May 26, 2023 · 4 comments

Comments

@AmayAgrawal

Hi,

I am facing an issue with the reference path that pandora uses for genotyping variants. It appears to use a less frequently supported path as the reference instead of the most frequently supported one. I will try to explain with a simple example:

Suppose I am using 100 strains for my analysis. First, I did the pan-genome analysis and used the MSAs to build the pan-genome reference graphs (PRGs). Next, I used these PRGs to genotype the variants in these 100 strains using pandora. Now suppose that, in the pan-genome graph of a particular locus (say gene A), there are 3 possible paths at a particular position (say 300). If I understand correctly, the path supported by the majority of the 100 strains should be chosen as the reference, but this was not the case. As a result, for the SNP I was looking for (say C 300 T, where 'C' is the ref and 'T' is the alt allele), pandora actually chooses 'T' as ref and 'C' as alt. I saw in one of the currently open issues that Pandora heavily undermaps (#325). Could it be choosing the less frequent path because of this, or am I misunderstanding something?

@iqbal-lab
Collaborator

  1. Yes, this is possible. Pandora needs to make a "global" choice of a path from one end of the gene to the other. Sometimes there are lots of reads forcing a path one way across the graph, and this takes the path "a long way away vertically" from a bubble deep in the graph where there is a lot of coverage for one allele. If there is no way to make a single path consistent with all of that, it does what it can based on dynamic programming.

Suppose the MSA looks like
xxxxxAxxxxxx
xxxxxCxxxxxx
xxyyyyyyyyxx
If there is very low coverage on the x's and lots on the y's, you get forced onto the bottom path, and the A/C choice becomes irrelevant/ignored.

  2. It's hard to comment more without concrete data; I expect it's not pandora undermapping, but I can't tell.
    Would you like to share more details?
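
To make point 1 concrete, here is a minimal sketch of how a single best path through a toy graph can bypass a well-covered allele. It is an illustration only, not pandora's actual scoring, which this thread does not describe. The node names, the coverage numbers, and the "coverage times length" score are all assumptions made up for the example; the graph mirrors the three-row MSA above.

```python
# Toy graph for the three-row MSA: each node is (sequence, mean coverage).
# All coverage values are hypothetical.
NODES = {
    "start": ("xx", 20),
    "top_left": ("xxx", 2),
    "A": ("A", 30),        # well-covered allele inside the bubble
    "C": ("C", 1),
    "top_right": ("xxxx", 2),
    "bottom": ("yyyyyyyy", 25),
    "end": ("xx", 20),
}
EDGES = {
    "start": ["top_left", "bottom"],
    "top_left": ["A", "C"],
    "A": ["top_right"],
    "C": ["top_right"],
    "top_right": ["end"],
    "bottom": ["end"],
}
# A topological order of the DAG, source first and sink last.
ORDER = ["start", "top_left", "bottom", "A", "C", "top_right", "end"]


def node_weight(name):
    """Assumed score of a node: mean coverage times sequence length."""
    seq, cov = NODES[name]
    return cov * len(seq)


def best_path(order, edges):
    """Dynamic programming for the highest-scoring source-to-sink path."""
    score = {name: float("-inf") for name in order}
    back = {name: None for name in order}
    source, sink = order[0], order[-1]
    score[source] = node_weight(source)
    for name in order:
        for succ in edges.get(name, []):
            candidate = score[name] + node_weight(succ)
            if candidate > score[succ]:
                score[succ] = candidate
                back[succ] = name
    # Trace back from the sink to recover the chosen path.
    path = []
    node = sink
    while node is not None:
        path.append(node)
        node = back[node]
    path.reverse()
    return path, score[sink]


path, total = best_path(ORDER, EDGES)
sequence = "".join(NODES[name][0] for name in path)
print(path, total, sequence)
```

With these made-up numbers the chosen path runs through the y's, so the A/C bubble is never visited even though 'A' has the highest per-base coverage of any node.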

@AmayAgrawal
Author

Hi,
I have uploaded a zip folder at this drive link (https://nubes.helmholtz-berlin.de/s/R8SHBsT8yDmeca4) which contains all the files required to reproduce the issue that I am talking about. The zip folder includes a 'README' file, which explains all the steps and the files in the folder.

Let me know if you have any further questions.

@iqbal-lab
Collaborator

Omg we have not replied to you! So sorry @AmayAgrawal, we will return to this after the Xmas vacation.

@AmayAgrawal
Author

No worries. It would be nice if you could look at this now.
