Non-deterministic results when -ncpus != 1 (mgiza bin) #26
Comments
I doubt anyone will look into it. Why is it a problem? In fact, I'm surprised cpu=1 is deterministic.
Well... since there is no proper documentation I could check, I assumed this was not the expected behaviour. Since you are not surprised by it, am I wrong, and is non-determinism actually expected?
You're right that the results should be either deterministic or non-deterministic regardless of how many threads are used. I don't know the code that well, so don't take my word for it. In my mind, it should be non-deterministic during training due to randomness in word clustering. However, you seem to find it non-deterministic even during inference. That could be an issue. I'm not sure who can come to your rescue; mgiza is abandonware these days. Perhaps @edwardgao, the original author, has some time.
Btw, running the command with your data crashes for me. I'm not sure if that has anything to do with it.
I have run the commands again and they work for me. Have you run cat corpus.fr-en.cooc.1 corpus.fr-en.cooc.2 > corpus.fr-en.cooc ? I had to split the file to be able to upload it to the issue. If you share the log, perhaps I could find out whether something is wrong in my installation.
Hi!
I have been using mgiza and I have noticed that the generated files do not contain the same information across different executions, not even the same number of lines. This happens when -ncpus != 1. I have tested using the same files and changing -ncpus to 1, 2 and 8. Only when -ncpus 1 is provided did the two executions produce exactly the same output files.
Command:
The files have been generated using Bitextor 8.2, with data from this WARC. You can find the files needed to reproduce the results attached to this issue (for corpus.fr-en.cooc.1.zip and corpus.fr-en.cooc.2.zip you will need to decompress them and execute cat corpus.fr-en.cooc.1 corpus.fr-en.cooc.2 > corpus.fr-en.cooc).
input_mgiza.zip
corpus.fr-en.cooc.2.zip
corpus.fr-en.cooc.1.zip
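For anyone who wants to repeat the comparison between two executions, here is a minimal sketch of how the output files could be checked. It assumes each execution wrote its output into a separate directory; the directory names in the usage note are hypothetical, and nothing here depends on mgiza's actual output file names. It only reports, per file name, whether the two runs produced byte-identical content and how the line counts compare.

```python
import hashlib
from pathlib import Path


def file_fingerprint(path):
    """Return (sha256 hex digest, line count) for one file, read as bytes."""
    digest = hashlib.sha256()
    lines = 0
    with open(path, "rb") as handle:
        for line in handle:
            digest.update(line)
            lines += 1
    return digest.hexdigest(), lines


def compare_runs(dir_a, dir_b):
    """Map every file name found in either directory to a status string."""
    names_a = {p.name for p in Path(dir_a).iterdir() if p.is_file()}
    names_b = {p.name for p in Path(dir_b).iterdir() if p.is_file()}
    report = {}
    for name in sorted(names_a | names_b):
        # A file present in only one run is already a difference.
        if name not in names_a or name not in names_b:
            report[name] = "missing in one run"
            continue
        hash_a, lines_a = file_fingerprint(Path(dir_a) / name)
        hash_b, lines_b = file_fingerprint(Path(dir_b) / name)
        if hash_a == hash_b:
            report[name] = "identical"
        else:
            report[name] = f"differs ({lines_a} vs {lines_b} lines)"
    return report
```

Calling compare_runs("run_a", "run_b") (hypothetical directory names, one per execution with the same -ncpus value) and printing the returned dictionary shows at a glance which files differ between the two executions.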