a pickle file problem #51

Open
ghost opened this issue Jun 21, 2019 · 3 comments

ghost commented Jun 21, 2019

Hi, @yoonkim
I am a beginner in natural language processing and machine learning. Since the 'GoogleNews-vectors-negative300.bin' file is quite large, all of my attempts to create the pickle file ('mr.p') have failed. Could you give me some advice on creating 'mr.p' with 16GB~32GB of RAM, if you don't mind?

Also, I wonder whether 'mr.p' needs to be processed in chunks to solve the memory problem. (I know little about pickle files.)

Thank you

@GaoZhongqin

Hi, @soohyunee, did you solve this problem? I am stuck on it right now. Do you have any ideas?

ghost commented Sep 30, 2019

Hi, @GaoZhongqin
I didn't solve the 'mr.p' related problems, but this Kaggle kernel helped me build an embedding layer without any trouble.

https://www.kaggle.com/ia1na09/cnn-keras-pretrained-word2vec-yoon-kim-model

If you don't need the 'mr.p' file, I suggest following the approach in this kernel. I hope it helps you as well :)
Thank you

@JustinAW

Hello all,

If you are attempting to do this under Python 3 and are running into memory limits, the issue likely lies in the string processing. Python 2 and Python 3 handle binary files differently: in Python 3, a file opened in "rb" mode returns bytes objects, and a bytes object never compares equal to a str, so any literal compared against that data needs a lowercase b prefix.
Here is an example:

    with open(fname, "rb") as f:
        for _ in range(foo):
            ch = f.read(1)
            if ch == b' ':  # bytes literal, not ' '
                pass  # do something

Notice that the space ' ' has a b before it: b' '
Without this b, the comparison is always false in Python 3, even when the byte read from the file is a space. If that comparison is what ends a loop that accumulates characters, the loop never exits and the buffer keeps growing until memory runs out.
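
To make this more concrete, here is a rough sketch of a Python 3 reader that uses bytes comparisons throughout. It assumes the usual word2vec binary layout (a text header with the vocabulary size and vector dimension, then each word followed by a space and its little-endian float32 values). The function name `load_word_vectors` and its parameters are hypothetical, not taken from the repository. It also stores vectors only for words in the vocabulary you pass in, which should keep memory use far below the size of the full file.

```python
import os
import struct


def load_word_vectors(fname, vocab):
    """Return {word: tuple of floats} for the words in `vocab` (hypothetical helper).

    Assumes the word2vec binary layout: a header line "<vocab_size> <dim>\n",
    then for each entry: word bytes, a space, and dim little-endian float32 values.
    """
    word_vecs = {}
    with open(fname, "rb") as f:
        vocab_size, dim = map(int, f.readline().split())
        binary_len = 4 * dim  # a float32 takes 4 bytes
        for _ in range(vocab_size):
            chars = []
            while True:
                ch = f.read(1)
                if ch == b"":  # end of file: stop instead of looping forever
                    raise EOFError("file ended in the middle of a word")
                if ch == b" ":  # bytes literal, so this can match in Python 3
                    break
                if ch != b"\n":  # skip the newline left after the previous vector
                    chars.append(ch)
            word = b"".join(chars).decode("utf-8", errors="ignore")
            if word in vocab:
                data = f.read(binary_len)
                word_vecs[word] = struct.unpack("<%df" % dim, data)
            else:
                # Skip vectors that are not needed so they never occupy memory.
                f.seek(binary_len, os.SEEK_CUR)
    return word_vecs
```

Calling it as `load_word_vectors('GoogleNews-vectors-negative300.bin', vocab)`, where `vocab` is a set of the words in your dataset, would return only the vectors you actually need.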

Hope this helps.
