
Step 3 Ipyrad Running Time Too Long #587

Open
Hiers12 opened this issue Dec 18, 2024 · 4 comments

Comments

@Hiers12

Hiers12 commented Dec 18, 2024

I am running a subsample (2 individuals) of paired-end GBS data through ipyrad; each sample has around 12 million reads. This is a de novo assembly for a plant species in the Asteraceae with no close relatives to use as a reference. When I get to step 3, the clustering progress bar sits at 0% no matter how much time and RAM I throw at it. I have run this step with 160 GB of memory for 20 hours with no real headway. I can see that it has started writing clustering files in my directory, but they are not completed before SLURM ends the job.

To troubleshoot, I tightened the params file and reran step 2 with the min Q score set to 43 (Q30), and I hard-trimmed the first and last five bp from both forward and reverse reads. This did not change the number of reads that passed by very much.

I understand that this step can take up to two weeks, but my total sample size is 81, each with around 12 million reads. If just two samples take over 20 hours with this much memory, I am concerned I will be looking at a month-long job that takes up a large share of my university's computing resources. Are there other parts of the params file I should consider changing that might help this run more efficiently?
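For reference, a minimal sketch of the kind of SLURM batch script described above; the environment line, params file name, and resource numbers are placeholders rather than the exact script from this run:

#!/bin/bash
#SBATCH --job-name=ipyrad_step3
#SBATCH --cpus-per-task=40
#SBATCH --mem=160G
#SBATCH --time=20:00:00
#SBATCH --output=ipyrad_step3_%j.log

# activate whatever environment provides ipyrad (site-specific)
source activate ipyrad

# run only step 3 of the assembly on the allocated cores
ipyrad -p params-subsample.txt -s 3 -c 40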

@isaacovercast
Collaborator

12M reads per sample is a lot, and paired-end data increases runtime. The progress bar in step 3 shows the number of samples that have completed clustering, so 0% after 20 hours only means that none of the samples have finished, not that no clustering progress has been made.

What is the read length? 150bp?

You can somewhat improve runtime by increasing -t on the command line (default is 2).
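A sketch of what that looks like on the command line, assuming a params file named params-subsample.txt; -c sets the total cores and -t the threads given to each clustering job:

# step 3 with 40 cores and 4 threads per clustering job
ipyrad -p params-subsample.txt -s 3 -c 40 -t 4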

Did you inspect results of fastqc for a few of your samples? If there are lots of low quality bases in either R1 or R2 you can remove these with trim_reads during step 2. Low quality data will increase runtime.
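A sketch of that check, with placeholder file names; trim_reads is parameter [25] in the params file, and (as in the trimming already applied in this run) negative values trim from the read ends:

# inspect per-base quality for one sample's R1 and R2 files
mkdir -p fastqc_out
fastqc -o fastqc_out sample1_R1_.fastq.gz sample1_R2_.fastq.gz

# if the read ends look poor, trim them via the params file, e.g.
# removing the last 20 bases of both R1 and R2:
# 0, -20, 0, -20  ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2)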

What is the clust_threshold? Too strict a clust_threshold will increase runtime (more reads fail to match an existing cluster, which increases the size of the seed file).

Finishing a run with 2 samples and then looking at the results will give more clues about modifications to increase performance. For now check these suggestions and let me know if you have any questions.

@Hiers12
Author

Hiers12 commented Dec 19, 2024

The cluster I am using is undergoing maintenance this week, so I cannot currently access my fastqc results. I am trimming the first and last five bases and could increase this if need be. The reads are 2x150 bp GBS, and clust_threshold is set at the default 0.85.

I am curious about the -t option: it looks like it distributes jobs across cores more evenly? So if I only have 2 samples, would each be given its own core for this step? If I have 40 cores and 81 samples, could each core handle two samples if I set -t to 40? Or even with 2 samples, if I set -t to the number of cores I have, would it more efficiently divide those two samples among those 40 cores? My current params file is below.

pairgbs ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
GCGC, TAA ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
5 ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
43 ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
6 ## [11] [mindepth_statistical]: Min depth for statistical base calling
6 ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000 ## [13] [maxdepth]: Max cluster depth within samples
0.85 ## [14] [clust_threshold]: Clustering threshold for de novo assembly
0 ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
2 ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
35 ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2 ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
0.05 ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus
0.05 ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus
4 ## [21] [min_samples_locus]: Min # samples per locus for output
0.2 ## [22] [max_SNPs_locus]: Max # SNPs per locus
8 ## [23] [max_Indels_locus]: Max # of indels per locus
0.5 ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus
5, -5, 5, -5 ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)
0, 0, 0, 0 ## [26] [trim_loci]: Trim locus edges (see docs) (R1>, <R1, R2>, <R2)
p, s, l , v, k ## [27] [output_formats]: Output formats (see docs)
## [28] [pop_assign_file]: Path to population assignment file
## [29] [reference_as_filter]: Reads mapped to this reference are removed in step 3

@isaacovercast
Collaborator

-t sets the number of 'threads' to use for clustering in steps 3 and 6. The default is 2, and I would never set it higher than 4 or 6.
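Roughly speaking (this division of work is an assumption based on the description above, not a quote from the docs), -c sets the total cores available to ipyrad and -t sets the threads per clustering job, so about -c / -t samples are clustered concurrently:

# 40 cores with 4 threads per clustering job: roughly 10 samples
# are clustered at the same time, each using 4 threads
ipyrad -p params-full.txt -s 3 -c 40 -t 4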

@Hiers12
Author

Hiers12 commented Jan 3, 2025

@isaacovercast The threading did seem to help a bit, and I increased the RAM to 1.5 TB on 40 cores. It ran to about 54% completion of the clustering portion of step 3 in 120 hours. Our IT department has a 120-hour limit on HPC use that they are going to waive for me, but I want to make sure I have done all I can on my end before starting a two-week run. Our IT specialist looked at the job during clustering and said it was using all of the cores but only about 40 GB of RAM during the period he observed it. I see that ipyrad uses 75% of the available RAM by default, but 40 GB isn't anywhere close to that. Could there be another issue here that I am not seeing?
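One way to check memory use over the whole job, rather than a single observation, is SLURM's accounting output (the job ID here is a placeholder):

# peak resident memory (MaxRSS) and CPU allocation per job step
sacct -j 1234567 --format=JobID,Elapsed,AllocCPUS,MaxRSS,State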
