How much memory do I need to process bin file (i.e. GoogleNews-vectors-negative300.bin) #50

Open
usama6832 opened this issue May 21, 2019 · 3 comments

Comments

@usama6832

Hi everyone,

I tried to run "process_data.py" with the same word2vec binary file (i.e. GoogleNews-vectors-negative300.bin), but it didn't work. The process was killed after approximately 30 minutes.

At first I thought it might be a memory problem, but I also tried on a server (256 GB RAM and a 16 GB GPU) and unfortunately got the same result (i.e. the program was killed after running for approximately 30 minutes).

What could the possible reasons be?

Your response will be highly appreciated.

@usama6832 usama6832 changed the title How much RAM memory do I need to process Goole News dataset bin file (i.e. GoogleNews-vectors-negative300.bin) How much memory do I need to process bin file (i.e. GoogleNews-vectors-negative300.bin) May 22, 2019
@GaoZhongqin

Hi @usama6832, did you solve this problem? I am stuck on the same problem right now. Do you have any ideas?

@usama6832
Author

Yes, I solved my problem and successfully ran this file in my data center on a machine with 256 GB RAM. It was perhaps a Python version compatibility problem rather than a memory problem.

@JustinAW

Hello all,

If you are attempting to do this under Python 3 and are running into memory limits, then your issue likely lies in the string processing. Python 2 and Python 3 handle binary files differently: in Python 3, a file opened in binary mode returns bytes rather than str, so any literal compared against data read from it must be a bytes literal, written with a lowercase b prefix, for the comparison to succeed.
Here is an example:

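with open(fname, "rb") as f:
    for line in range(foo):  # foo: placeholder for the number of entries to read
        ch = f.read(1)  # in Python 3 this returns a bytes object of length 1
        if ch == b' ':  # compare against a bytes literal, not a str
            pass  # do something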

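Notice the space ' ' has a b before it: b' '
Without this b, the comparison is always false in Python 3, even when the byte read from the binary file is a space, because a bytes object never compares equal to a str. If that comparison is the exit condition of a loop that collects bytes into a buffer, the loop never terminates and the buffer keeps growing until the process runs out of memory and is killed.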
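
To make the point concrete, here is a minimal sketch of reading one space-terminated token from a binary stream in Python 3. The helper name read_word and its max_len parameter are my own illustrative assumptions, not code taken from process_data.py. It also stops at end of file, since f.read(1) returns an empty bytes object there, which is another way a loop like this can spin forever.

```python
import io


def read_word(f, max_len=1000):
    """Read bytes from binary stream f up to the next space and return them as str.

    read_word and max_len are hypothetical names used for illustration only.
    """
    chars = []
    while True:
        ch = f.read(1)
        if ch == b"":
            # End of file: f.read(1) keeps returning b"" forever, so bail out.
            raise EOFError("stream ended before a space delimiter was found")
        if ch == b" ":
            # Bytes literal: this comparison can succeed in Python 3.
            break
        if ch != b"\n":
            chars.append(ch)
        if len(chars) > max_len:
            # Safety net so a missed delimiter cannot grow the buffer unboundedly.
            raise ValueError("token exceeds max_len bytes")
    return b"".join(chars).decode("utf-8", errors="replace")


if __name__ == "__main__":
    stream = io.BytesIO(b"hello world ")
    print(read_word(stream))
    print(read_word(stream))
    # A bytes object never equals a str in Python 3.
    print(b" " == " ")
```

The max_len guard is optional; I include it only because failing fast is easier to debug than a process that gets killed after 30 minutes.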

Hope this helps.
