Runtime for pandora compare #323
Comments
Hey @AmayAgrawal , sorry for my delay.
I think from reading your messages, you are on step 1. Could you please confirm? If yes, you'd know how many samples you've mapped with a command like The maximum number of strains I believe we have used The current way around this is to increase the number of threads, if possible. We are working to scale
Hi @leoisl, thanks for the explanation. As you predicted, I am currently in the first step, where the reads from each sample are mapped to the PanRG to infer the most likely path through each PRG. I am running the analysis with 64 threads and am halfway through the samples after two weeks, so I expect to wait about the same amount of time again for it to finish. Yes, it would be nice to have functionality in the future that can handle thousands of samples quickly.
Sorry it is currently too slow! Pandora is inherently parallelisable by sample when mapping, and then by gene when doing 'compare', so it could in principle run very fast on a cluster. I don't know if you have access to one. But we need to make a number of updates to enable this, and we have to finish our current changes first. As Leandro said, right now we're modifying it so that RAM use stays controlled even with enormous pan-genomes. This will take several weeks to finish. Apologies for the delay.
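The by-sample parallelism mentioned above can be illustrated with a generic fan-out pattern. This is only a sketch: it deliberately uses no real pandora command line, the `{name}` and `{reads}` placeholders and both function names are invented for illustration, and nothing in this thread confirms that per-sample outputs can currently be fed into 'compare'.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def build_commands(template, samples):
    """Fill a command template once per sample.

    `template` is a list of argument strings; "{name}" and "{reads}" are
    placeholders invented for this sketch, not part of any real tool.
    """
    return {
        name: [arg.format(name=name, reads=reads) for arg in template]
        for name, reads in samples.items()
    }


def run_per_sample(commands, max_parallel=4):
    """Run one independent command per sample; return {sample: exit code}."""
    def run(item):
        name, cmd = item
        # check=False: collect failures instead of aborting the whole batch
        return name, subprocess.run(cmd, check=False).returncode

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return dict(pool.map(run, commands.items()))


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; swap in whatever
    # per-sample command applies to your own setup.
    template = [sys.executable, "-c", "print('mapping {name} from {reads}')"]
    samples = {"s1": "s1.fq.gz", "s2": "s2.fq.gz"}
    print(run_per_sample(build_commands(template, samples)))
```

On a cluster the same idea would usually be expressed as one scheduler job per sample rather than local threads; the thread pool here only keeps the example self-contained.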
Hi, I have a small question regarding this analysis. So as I mentioned earlier, I started the
From the above log, it can be seen that no output was logged for almost two weeks (the last entry was written on 13th Feb), and nothing was written to the output folder either. So I think pandora was in the last step, building the final multi-sample VCF file, when it was killed automatically. First, do you think it was in this step? Second, now that it was killed and I do not have a final VCF file: I saw that there is a VCFs_genotyped folder containing a VCF file for each PRG, so I was thinking of combining these individual per-PRG VCF files into one final VCF file. Will this work? Or does pandora store checkpoints so that I can resume the analysis from this particular step? I can't run the whole thing again, as it took over a month to reach this stage of the analysis.
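One rough way to see how far such a run got is to count the per-PRG files that were written. The sketch below assumes the files sit somewhere under the VCFs_genotyped folder and end in `.vcf`; both the suffix and the function name are assumptions, not documented pandora behaviour. A count only shows how many files exist, not whether each one is complete.

```python
from pathlib import Path


def count_prg_vcfs(folder, suffix=".vcf"):
    """Return (non_empty, empty) counts of files ending in `suffix` under `folder`."""
    non_empty = 0
    empty = 0
    # rglob searches subdirectories too, since the folder layout is unknown
    for path in Path(folder).rglob(f"*{suffix}"):
        if path.is_file():
            if path.stat().st_size > 0:
                non_empty += 1
            else:
                empty += 1
    return non_empty, empty
```

Comparing the non-empty count against the number of PRGs in the PanRG gives a hint of whether the per-gene stage had finished before the job was killed.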
Hey @AmayAgrawal , sorry for the huge delay in answering you.
Maybe, but I don't think the resulting files would be complete, so I would not recommend that. Pandora also does not store checkpoints for resuming an analysis later. The bottom line is that pandora is still not scalable to thousands of samples. We are working right now to scale it to millions of genes, and the next step is to scale to thousands of samples. We will let you know when this feature is implemented.
Sorry for this. I have one hack that would allow you to progress for now.
Hi,
I am using `pandora compare` to call variants after building the pan-genome reference graphs and indexing them using `pandora index`. Although I am using multiple threads, it processes one strain at a time and takes a long time to genotype a single strain, and I have more than 2000 strains to genotype. Is there any way to parallelise it so that it runs on multiple strains at a time? Or is there any other way to genotype all samples quickly?