Multithreading bug when using log-normal sampling #164
Comments
Hey Tony @2tony2, thanks a lot for reporting this. I really appreciate it. I will definitely look into it with my colleagues and we will get back to you. Let me see if I can reproduce the same issue. I'll keep you updated. Btw, I wonder if you have ever tried without setting
Pinging @cheny19 to have her thoughts on this.
Yes, I did use it without the
I think this is because I trained the model on data whose read length distribution falls outside the range of some of the references being tested. From looking at the code, my hunch is that the while loop keeps running forever because the main for loop inside it is exhausted without ever satisfying the loop condition. Ideally you'd want to raise some sort of exception when the for loop is exhausted, to point this out. This is why the log-normal sampling should work fine (and it does, as long as you don't simulate too many reads), since it does not depend on the read length distribution of the original training dataset. I'm not sure whether I tried this specific configuration without the
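The suggestion above (raise an exception once the for loop is exhausted, instead of letting an outer while loop spin forever) could look roughly like the sketch below. The function name, the dictionary of reference lengths and the error message are all hypothetical and only illustrate the pattern; this is not NanoSim's actual code.

```python
def pick_reference(ref_lengths, read_length):
    """Return the name of the first reference that can hold a read of read_length.

    Hypothetical stand-in for the sampling loop discussed above, not NanoSim code.
    ref_lengths maps reference names to their lengths in bp.
    """
    for name, length in ref_lengths.items():
        if length >= read_length:
            return name
    # The for loop was exhausted without finding a match: fail loudly here
    # so that a caller retrying in a while loop cannot hang silently.
    longest = max(ref_lengths.values(), default=0)
    raise ValueError(
        f"no reference is long enough for a read of {read_length} bp "
        f"(longest reference: {longest} bp)"
    )


print(pick_reference({"chr_a": 500, "chr_b": 2000}, 1500))
```

With this shape, a model whose read lengths exceed every reference produces an immediate, descriptive error rather than an endless loop.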
Hi! I was wondering if there had been any updates to this issue? In case it helps, I ran into the same error with the below command:
Essentially, I am trying to simulate reads with a much shorter median read length than the model I trained on (aiming for a median of 1500bp instead of the model median of ~10kbp). The error only occurs if the number of reads (-n) is set to be greater than 32. I think the error has something to do with trying to assign the median and sd because without these flags, the simulations run as expected. Additional tests:
This was done using nanosim v3.1.0 (re-installed yesterday from conda).
Many thanks,
Hi there!
I was running some simulations today with nanosim until I stumbled upon some issue that I thought may be worth pointing out here.
First of all, I was using a custom trained model with the -min, -max, -med and -sd options in simulator.py genome mode, as follows:
simulator.py genome -rg test.fa -c "nanosim_model/testmodel" -n 228 -b "guppy" -s "0.5" -dna_type "linear" -t "8" --fastq -k 6 -o simulated/out.fastq --perfect -min $sequence_min -max $sequence_max -med $sequence_length -sd 0.1
From my understanding of the source code, this samples fragment sizes from a log-normal distribution rather than from the kernel learned from data, which had more desirable properties for my task at hand.
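For readers unfamiliar with this kind of sampling, here is a minimal sketch of drawing read lengths from a log-normal distribution with a chosen median. It assumes the median option maps to exp(mu) and the sd option is the standard deviation of the log lengths; whether NanoSim parameterizes it exactly this way is an assumption. The helper name and the 1000 to 2000 bp window are hypothetical. It uses only the Python standard library; NumPy offers the equivalent `numpy.random.lognormal`.

```python
import math
import random
import statistics


def sample_lognormal_lengths(n, median, sd, seed=0):
    """Draw n read lengths from a log-normal distribution.

    Hypothetical helper, not NanoSim code. The median of a log-normal
    distribution is exp(mu), so mu is set to log(median).
    """
    rng = random.Random(seed)
    mu = math.log(median)
    return [rng.lognormvariate(mu, sd) for _ in range(n)]


lengths = sample_lognormal_lengths(10_000, median=1500, sd=0.1)
# Fraction of draws inside an illustrative min/max window.
in_range = sum(1000 <= x <= 2000 for x in lengths) / len(lengths)
print(statistics.median(lengths), in_range)
```

Because the draws depend only on the requested median and sd, they are independent of the read length distribution of the training data.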
Now, I had it working at 1 and 100 reads but not at 1000. After some testing, the limit seemed to be 228, with the following error popping up when trying 229 reads on 8 threads:
From my interpretation, the following seems to happen: when sampling from log-normal, each thread uses an ndarray which has a limit of 32 numbers. The number of specified threads minus 1 is used for this process, so in this case 7 threads. This means that 228 reads are divided over 7 threads, giving about 32 reads per ndarray (228 / 7 ≈ 32.6), which is right around the ndarray limit. I tested with some different thread counts and this hypothesis seems to hold up.
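The arithmetic behind this hypothesis can be checked with a few lines. The sketch assumes, as described above, that threads - 1 workers share the reads; the function name is hypothetical, and how the simulator assigns the leftover reads is not stated in this report, so the remainder is only reported, not distributed.

```python
def split_reads(n_reads, threads):
    """Return (workers, reads_per_worker, leftover) for an even split.

    Hypothetical helper: assumes threads - 1 workers, per the report above.
    """
    workers = threads - 1
    per_worker, leftover = divmod(n_reads, workers)
    return workers, per_worker, leftover


for n in (228, 229):
    workers, per_worker, leftover = split_reads(n, threads=8)
    print(f"{n} reads / {workers} workers = {per_worker} each, {leftover} left over")
```

Repeating this for other thread counts gives a quick way to predict where the threshold should fall if the hypothesis is right.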
I'm no expert at programming multithreaded applications with numpy so I do not know if this has a straightforward solution, but I just wanted to point this out so you are aware. Maybe this could help?