OverflowError: Python int too large to convert to C long #2780
Comments
It looks like the problem is at least in the code that loads this dataset, specifically on the line that calls `csv.field_size_limit(sys.maxsize)`. This is where we get the `OverflowError`.
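For what it's worth, that one call already fails in isolation on 64-bit Windows; the snippet below is a minimal sketch of the failure, assuming the packaged loader calls `csv.field_size_limit(sys.maxsize)` as shown above:

```python
import csv
import sys

# On 64-bit Windows, sys.maxsize is 2**63 - 1, but CPython's csv module keeps
# its field size limit in a C long, which is only 32 bits on Windows, so this
# raises: OverflowError: Python int too large to convert to C long
csv.field_size_limit(sys.maxsize)
```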
Attempting a dirty fix based on StackOverflow:

```python
import csv
import sys

# csv.field_size_limit(sys.maxsize)  # the original line, which overflows on Windows
maxInt = sys.maxsize
while True:
    # Shrink the limit by a factor of 10 until the csv module accepts it.
    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt / 10)
```

Waiting for the student to respond whether it worked.
Had to set up a Windows env to test this. The above approach did not work (hangs). The code below works; do you want me to do a PR?
If you want to test it, here is my conda yml.
It'd help to include the full error stack for any error, to better identify the exact files and lines involved. I can't find any such file in this repo. This seems to be one of the not-version-controlled, hard-to-browse, hard-to-review active code files that's dynamically downloaded & run in a manner I consider highly unwise (per #2283). How would someone contribute a PR against such a file?
@gojomo The only way I know to get these files is to use the downloader successfully. When that happens, it places a folder hierarchy in the user's home directory. The patch that works for me:

```python
# csv.field_size_limit(sys.maxsize)
csv.field_size_limit(2147483647)
```

I can send a full stack if needed, but the problem is that one line.
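As an aside, a more portable form of the same patch (just a sketch, not the code as packaged) would cap the limit at `2**31 - 1`, which is exactly the `2147483647` above and the largest value a 32-bit C long can hold:

```python
import csv
import sys

# 2**31 - 1 == 2147483647: the largest value that fits in a 32-bit C long,
# which is what Windows uses for `long` even in 64-bit Python builds.
csv.field_size_limit(min(sys.maxsize, 2**31 - 1))
```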
CC @chaitaliSaini @menshikh-iv, the authors of this code. I see no comment in the code, so I'm not sure what that line was intended to do.
@gojomo as discussed ad nauseam, the dataset releases are immutable by design and you cannot open a PR against them. You can release a new, updated version.
On Linux at least, the default field size limit appears to be 131072:

```python
import csv
print(csv.field_size_limit())
```

My guess is that some of the documents in this dataset have fields longer than that.
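One way to check that guess is to scan the downloaded CSV for its longest field; a sketch (the filename `fake.csv` is a placeholder for wherever the downloaded data actually lives):

```python
import csv
import sys

# Raise the limit first so the scan itself doesn't hit the OverflowError.
csv.field_size_limit(min(sys.maxsize, 2**31 - 1))

longest = 0
# Placeholder path -- substitute the actual location of the downloaded CSV.
with open("fake.csv", newline="", encoding="utf-8", errors="replace") as f:
    for row in csv.reader(f):
        for field in row:
            longest = max(longest, len(field))

print("longest field:", longest)  # compare against the 131072 default
```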
But this isn't a problem in a dataset, it's a problem in executed source code – which could and should be in version control. Why should we be running code in project users' installations that isn't under version-control, hasn't been reviewed, & can't receive fix PRs (from either users or contributors)?
For reproducibility reasons. The same dataset should result in the same output, forever, bugs included. And that implies the same code too. That was the design anyway. In this particular case, changing the packaged code would mean releasing a new, fixed version of the dataset.
I agree: it was confusing to trace this problem back to code that was packaged with the data rather than kept under version control. That said, I would suggest opening another issue specifically for that suggested redesign. A separate issue would preserve the discussion and also invite comments from other users/developers.
I'm definitely open to that. It will need a strong open source contributor though – to make sure the redesign is an actual improvement :)
Exactly - if you want reproducibility, you'd want more things under version control, not less, so you can see what was delivered at any one time – ideally as part of a named version! Here, if someone changed the delivered code, there would be no record of what changed or when.
Problem description
```python
fakeDataset = downloader.load('fake-news')
```

fails with the above error on Windows machines running 64-bit Python 3.7 with gensim 3.8.1.
Steps/code/corpus to reproduce
```python
from gensim import downloader  # gensim's bundled dataset downloader

fakeDataset = downloader.load('fake-news')
```

on a machine with the above configuration.
Versions
Windows-10-10.0.18362
Python 3.7.6
NumPy 1.18.1
SciPy 1.4.1
gensim 3.8.1
FAST_VERSION 0
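For reference, these values look like the output of the usual version-reporting snippet from gensim's issue template; something along these lines (a sketch, not necessarily the exact commands used):

```python
import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec; print("FAST_VERSION", word2vec.FAST_VERSION)
```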
Attempted workaround
I zipped the data directory from a Linux machine and gave it to a student to unzip on their Windows machine. Re-executing the code above failed with the same error, suggesting the problem is not in downloading but in loading the downloaded data. Perhaps there is a bug in unzipping the archive with Python?