Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

output interpretation #66

Open
KabitaBaral1 opened this issue Oct 30, 2019 · 3 comments
Open

output interpretation #66

KabitaBaral1 opened this issue Oct 30, 2019 · 3 comments
Assignees

Comments

@KabitaBaral1
Copy link

Dear Shorah developers,

First, I would like to say that you have a great program that is very useful for local haplotype reconstruction, especially if we try to genotype across a larger region while using illumina, and multiple reads are spanning over a gene marker. Thanks for sharing to the community! Nevertheless, I was wondering if you could help me interpret my output. After running Shorah with a malaria sample, I got this in my coverage.txt file:

w-msp1_plasmodb_pf3d7-135-335.reads.fas msp1_plasmodb_pf3d7     135     335     2
w-msp1_plasmodb_pf3d7-202-402.reads.fas msp1_plasmodb_pf3d7     202     402     2
w-msp1_plasmodb_pf3d7-269-469.reads.fas msp1_plasmodb_pf3d7     269     469     1
w-msp1_plasmodb_pf3d7-403-603.reads.fas msp1_plasmodb_pf3d7     403     603     384
w-msp1_plasmodb_pf3d7-470-670.reads.fas msp1_plasmodb_pf3d7     470     670     1116
w-msp1_plasmodb_pf3d7-537-737.reads.fas msp1_plasmodb_pf3d7     537     737     1041
w-msp1_plasmodb_pf3d7-604-804.reads.fas msp1_plasmodb_pf3d7     604     804     1115
w-msp1_plasmodb_pf3d7-671-871.reads.fas msp1_plasmodb_pf3d7     671     871     1569
w-msp1_plasmodb_pf3d7-738-938.reads.fas msp1_plasmodb_pf3d7     738     938     1462
w-msp1_plasmodb_pf3d7-805-1005.reads.fas        msp1_plasmodb_pf3d7     805     1005    1649
w-msp1_plasmodb_pf3d7-872-1072.reads.fas        msp1_plasmodb_pf3d7     872     1072    241
w-msp1_plasmodb_pf3d7-939-1139.reads.fas        msp1_plasmodb_pf3d7     939     1139    1261
w-msp1_plasmodb_pf3d7-1006-1206.reads.fas       msp1_plasmodb_pf3d7     1006    1206    147

Does this mean each numeric column represents the coverage of each haplotype that was inferred by shotgun local approach after looking at each window file? If so, would this mean I have 3 haplotypes? I was also looking at the reconstruction haplotype of each set of window reads, and I also noticed how in some cases the posterior probability was 0.99 with as little as 2 reads. I read in one paper that it is recommended to apply a 95% cutoff on shorah posterior? Should I apply something like this on read number as well? I am asking because in the 949-1139 reads haplotype reconstruction there were 8 haplotypes over 95%, but this was not the case in the rest of the windows. Am I missing something important here?

Thanks in advance for such an useful program.

@dcm9123
Copy link

dcm9123 commented Oct 31, 2019

I believe that the first column represents the start read of that window, then the second column would be the end of the window, and the last one would be the average coverage of it. Regarding the posterior probability, I think that they mention what you say... a 95% is a good cutoff, but I am not sure if the number of reads should be filtered as well... I was having the exact same question regarding that. I also believe the hap0...hap'n' are correlated across the window files, meaning that hap0 in the first window would be the same hap0 in your 7th window, for example... In the end you are doing haplotype reconstruction. Regarding everything else I am not sure... I was having the same question about the reads and estimating the frequency.

@DrYak DrYak self-assigned this Nov 1, 2019
@DrYak
Copy link
Member

DrYak commented Nov 1, 2019

Hello dear users,

I got this in my coverage.txt file:

I believe that the first column represents the start read of that window, then the second column would be the end of the window, and the last one would be the average coverage of it

Yup, I confirm it : it's window position (should also match the name of the window) and number of reads inside that window.
Beware of region with lower coverage, the quality of the output won't be great.
Did you run ShoRAH stand-alone, or as part of a pipeline (like our V-pipe ) ?

I was also looking at the reconstruction haplotype of each set of window reads, and I also noticed how in some cases the posterior probability was 0.99 with as little as 2 reads. I read in one paper that it is recommended to apply a 95% cutoff on shorah posterior?

Regarding the posterior probability, I think that they mention what you say... a 95% is a good cutoff, but I am not sure if the number of reads should be filtered as well... I was having the exact same question regarding that.

I was having the same question about the reads and estimating the frequency.

@ozagordi , one of the devs of ShoRAH had back then started working on some tool to post process the output of local haplotypes.
In addition to filtering on the posterior probability (> 0.9) he did also suggest a cut of for number of reads (> 5 reads). Indeed, haplotypes generated from 2 reads don't make that much sense.

We're also featuring similar filtering feature in a visualization tool I'm working on.

I am asking because in the 949-1139 reads haplotype reconstruction there were 8 haplotypes over 95%, but this was not the case in the rest of the windows. Am I missing something important here?

Well, I would say two things:

  • Indeed you could do some sanity cut-offs (not consider low-probability sequences and also remove these with only a couple of reads).
  • The windows are overlapping, normally you should have three windows overlapping at each position (as long as the read coverage permits it, of course). ShoRAH will always try comparing 3 windows (if available) when looking for SNVs (see the corresponding CSV file). It might also be good to check if your large number of haplotype have sequences that each match similar sequence in the neighbouring regions.

Depending on the above, three possible interpretations:

  • It's actually a fluke, the algorithm detected just a couple of high-quality haplotypes that are also seen in overlapping windows, and a bunch of extra low-quality haplotypes that are basically just garbage (with an extremely poor average coverage).
  • If instead, even after filtering out the low-quality haplotypes and comparing with neighbours, you still have a higher number of haplotypes: You actually pin-pointed a high-variability region. There might be a region in the middle of a window where there are more variant appearing (are there such region in that gene of the genome of the Plasmodia family? Med school has been ages ago for me, and I don't remember clearly. Specially since it's not something I've encountered neither in the clinic nor in research).
  • It might be a technical problem:wetlab manipulation error, the QC and trimming part of the pipeline not done optimally, etc.

(Also, it could be a bug, if your results seem completely bonkers, even more likely if they are differently broken across successive runs. I was thinking I had weeded out the memory corruption bugs, but I might have missed something. In that case, I would be interested in your data if possible)

I also believe the hap0...hap'n' are correlated across the window files, meaning that hap0 in the first window would be the same hap0 in your 7th window, for example...

Nope. No, no, no.

Each window is processed independently by a different diri_sampler job. The dirichlet sampler only sees and processes the segments of reads which lay within its own window's limits (that's why the code has no trouble scaling to longer genomes). The numbers are assigned arbitrarily to the classes as they are built. They are not even stable accross runs with different random values. Even less so across different windows.

The only ways to match haplotypes between windows is to compare the sequences in the overlapping parts, and/or look which haplotype in different windows share the same sequences as their contributing reads.

I am working currently on a visualisation tool for whichi I explore leveraging the above.

In the end you are doing haplotype reconstruction.

Note that recent version only do local haplotype reconstruction.
The global reconstruction part has been deprecated.

As an example we're currently relying on a different 3rd party engine - Savage - by default in our our V-pipe, and offer additional alternative engines, such as our Haploclique and we have PhD students working on further solutions.
Global haplotype reconstruction is still an ongoing research field.

I would like to say that you have a great program that is very useful for local haplotype reconstruction
Thanks in advance for such an useful program.

We're happy you find our software useful. (Some of the original developers have moved onto other jobs but still lurk around this git, I'll definitely transmit it to them)

@KabitaBaral1
Copy link
Author

Hello,

Thank you for such detail response! It is much clearer to me now! Yes, indeed, I am working with a high variable gene-marker regions used for haplotyping and keep track of these haplotypes across different regions and patients, so it does make sense I am getting up to 8 haplotypes in a region. I've noticed some of my regions agree with high number of haplotypes, whereas the ends of the marker are more 'conserved', detecting up to 1 or 2. We were also dealing with some controls were we have only one strain per cultured sample, but up to 3, 4 or 5 haplotypes show up in the support/ folder. I am guessing even though I have one strain, I can end up with polyclonality in my cultured samples, and that could explain it.

Thanks for clearing up the hap0, hap1, hap2, I was thinking the same as @dcastaneda5, but it's good to know it is not the same haplotype ID across each window, so I don't have to keep track of it.

I was not aware of the V-Pipe that you mention! I was running shoRAH as a stand alone program for haplotyping, but in the future I'll use V-Pipe! and yes, I noticed that global analysis has been discontinued for this software and I realize how hard it is to reconstruct it from scratch, especially if I am dealing with length-polymorphic markers. Thanks for what you guys are doing, and for clearing this up for me! Much appreciated!

Kabita Baral

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants