Step 3 Ipyrad Running Time Too Long #587
12M reads is a lot, and paired-end data increases runtime. The progress bar in step 3 indicates the number of samples that have completed clustering, so 0% after 20 hours only means that none of the samples have finished, not that there has been no progress in clustering. What is the read length? 150bp? You can somewhat improve runtime by increasing the -t (threading) option. Did you inspect the fastqc results for a few of your samples? If there are lots of low-quality bases in either R1 or R2, you can remove these with the trim_reads parameter. What is the clust_threshold? Too strict a clust_threshold will increase runtime (it increases the number of reads that do not match an existing cluster, and therefore the size of the seed file). Finishing a run with 2 samples and then looking at the results will give more clues about modifications to improve performance. For now, check these suggestions and let me know if you have any questions.
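For concreteness, here is a minimal sketch of how these checks might look on the command line. The file names, parameter numbering, and flag values below are illustrative assumptions, not values taken from this thread:

    # Inspect per-base quality of a couple of raw read files (placeholder names)
    fastqc sample1_R1.fastq.gz sample1_R2.fastq.gz

    # Params-file lines worth reviewing (numbering assumes the standard ipyrad
    # params layout; confirm values and syntax against your own file and the docs):
    # 0.85            ## [14] [clust_threshold]: Clustering threshold for de novo assembly
    # 5, -5, 5, -5    ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2)

    # Re-run step 3 with -c (total cores) and -t (threads per multi-threaded job)
    ipyrad -p params-test.txt -s 3 -c 40 -t 4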
The cluster I am using is undergoing maintenance this week, so I cannot currently access my fastqc results. I am trimming the first and last five bases, and could increase this if need be. It is 150x2 GBS reads, and the clustering threshold is set at the default 0.85. I am curious about the -t function; it looks like it distributes the jobs across cores more evenly? So if I only have 2 samples, each would be given its own core for this step? And if I have 40 cores and 81 samples, then each core could handle two samples if I set -t to 40? Or even with 2 samples, if I set -t to the number of cores I have running, will it more efficiently divvy those two samples among those 40?

    pairgbs    ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
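As an illustration of the question above: ipyrad's help text describes -t as tuning the threading of multi-threaded binaries (such as vsearch during clustering), so with only 2 samples on a 40-core allocation a higher -t keeps otherwise-idle cores busy. The exact core-to-thread mapping below is an assumption, and the file name is a placeholder:

    # With only 2 samples, most of 40 single-threaded workers would sit idle during
    # clustering; raising -t lets each sample's clustering job use several threads.
    ipyrad -p params-test.txt -s 3 -c 40 -t 20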
@isaacovercast The threading did seem to help a bit, and I increased the RAM to 1.5TB on 40 cores. It ran to about 54% completion of the clustering portion of step 3 in 120 hours. Our IT department has a 120 hour limit on HPC use that they are going to waive for me, but I just want to make sure I have done all I can on my end before starting a two-week run. Our IT specialist looked at the job today during the clustering and said that it was using all of the cores, but only 40GB of RAM during the period he observed it. I see that ipyrad uses 75% of the available RAM by default, but 40GB isn't even close to that. Could there be another issue here that I am not seeing?
I am running a subsample (2 individuals) of paired-end GBS data through ipyrad. They each have around 12 million reads, and this is a de novo assembly for a plant species in the Asteraceae with no close relatives to use as a reference. When I get to step 3, the clustering seems to sit at 0% no matter how much time and RAM I throw at it; I have run this step with 160GB of memory for 20 hours with no real headway made. I see that it has started writing the clustering files in my directory, but not to completion before slurm ends the job. To troubleshoot this I tightened down the params file and reran step 2 with the min q score set to 43 (q30), and I hard trimmed the first and last five bp from both forward and reverse reads. This didn't change the number of reads that passed by very much. I understand that this step can take up to two weeks, but my total sample size is 81 with each sample having around 12 million reads, so if just two samples are taking over 20 hours with this much memory, I am concerned I will be looking at a month-long job that takes up a ton of my university's computing resources. Are there other parts of the params file I should consider changing that might help this run more efficiently?
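A hypothetical SLURM submission for such a step-3 run might look like the sketch below; the resource requests and file names are placeholders rather than values confirmed in this thread:

    #!/bin/bash
    #SBATCH --job-name=ipyrad_step3
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=40
    #SBATCH --mem=160G
    #SBATCH --time=120:00:00    # site limit mentioned above; extend if it is waived

    # Run only step 3, matching -c to the cores requested from SLURM
    ipyrad -p params-test.txt -s 3 -c 40 -t 4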