OverflowError: Python int too large to convert to C long #2780

Open · aolney opened this issue Apr 2, 2020 · 13 comments

Labels: bug (Issue described a bug) · impact HIGH (Show-stopper for affected users) · reach LOW (Affects only niche use-case users)

Comments

aolney commented Apr 2, 2020

Problem description

fakeDataset = downloader.load('fake-news')

fails with the above error on Windows machines running 64-bit Python 3.7 with gensim 3.8.1.

Steps/code/corpus to reproduce

fakeDataset = downloader.load('fake-news')

on machine with above configuration.

Versions

Windows-10-10.0.18362
Python 3.7.6
NumPy 1.18.1
SciPy 1.4.1
gensim 3.8.1
FAST_VERSION 0

Attempted workaround

I zipped the data directory from a Linux machine and gave it to a student to unzip on their Windows machine. Re-executing the code above failed with the same error, suggesting the problem is not in downloading but in loading the downloaded data. Perhaps there is a bug in unzipping the archive with Python?

piskvorky added the bug, reach LOW, and impact HIGH labels on Apr 2, 2020
aolney (Author) commented Apr 2, 2020

It looks like the problem is, at least in part, in __init__.py in the fake-news data folder.

Specifically, on this line:

csv.field_size_limit(sys.maxsize)

This is where we get OverflowError: Python int too large to convert to C long
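For context, here is a minimal reproduction of the overflow (assuming 64-bit CPython on Windows, where a C long is only 32 bits even though sys.maxsize is 2**63 - 1):

import csv
import sys

print(sys.maxsize)  # 9223372036854775807 on a 64-bit build
try:
    # The limit is converted to a C long internally, so anything
    # above 2**31 - 1 overflows on Windows.
    csv.field_size_limit(sys.maxsize)
except OverflowError as err:
    print(err)  # Python int too large to convert to C long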

aolney (Author) commented Apr 2, 2020

Attempting a dirty fix based on StackOverflow (the snippet needs a loop around the break, and the imports, to run):

import csv
import sys

# csv.field_size_limit(sys.maxsize)
# Shrink the limit until it fits in a C long.
maxInt = sys.maxsize
while True:
    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt / 10)

Waiting for student to respond if it worked.

aolney (Author) commented Apr 2, 2020

Had to set up a Windows env to test this. The above approach did not work (hangs).

The code below works; do you want me to do a PR?

import os
import csv
import sys
from smart_open import smart_open
from gensim.downloader import base_dir


# csv.field_size_limit(sys.maxsize)
csv.field_size_limit(2147483647)  # 2**31 - 1, the largest value a C long can hold on Windows


class Dataset(object):
    def __init__(self, fn):
        self.fn = fn

    def __iter__(self):
        with smart_open(self.fn, 'rb') as infile:
            if sys.version_info[0] == 2:
                reader = csv.DictReader(infile, delimiter=",", quotechar='"')
            else:
                reader = csv.DictReader((line.decode("utf-8") for line in infile), delimiter=",", quotechar='"')
            for row in reader:
                yield dict(row)


def load_data():
    path = os.path.join(base_dir, 'fake-news', "fake-news.gz")
    return Dataset(path)
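For reference, loading should then work as before – a sketch of the expected usage, not separately tested:

import gensim.downloader as downloader

dataset = downloader.load('fake-news')  # should no longer raise OverflowError
first = next(iter(dataset))             # each row comes back as a dict of CSV columns
print(first.keys())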

If you want to test it, here is my conda YAML; save it as environment.yml and run conda env create -f environment.yml:

name: iis4011
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - attrs=19.3.0=py_0
  - backcall=0.1.0=py_0
  - blas=1.0=mkl
  - bleach=3.1.4=pyh9f0ad1d_0
  - ca-certificates=2020.1.1=0
  - certifi=2019.11.28=py37_1
  - cffi=1.14.0=py37ha419a9e_0
  - chardet=3.0.4=py37hc8dfbb8_1006
  - colorama=0.4.3=py_0
  - cpuonly=1.0=0
  - cryptography=2.8=py37hb32ad35_1
  - decorator=4.4.2=py_0
  - defusedxml=0.6.0=py_0
  - entrypoints=0.3=py37hc8dfbb8_1001
  - freetype=2.9.1=ha9979f8_1
  - icc_rt=2019.0.0=h0cc432a_1
  - idna=2.9=py_1
  - importlib-metadata=1.6.0=py37hc8dfbb8_0
  - importlib_metadata=1.6.0=0
  - intel-openmp=2020.0=166
  - ipykernel=5.2.0=py37h5ca1d4c_1
  - ipython=7.13.0=py37hc8dfbb8_2
  - ipython_genutils=0.2.0=py_1
  - jedi=0.15.2=py37_0
  - jinja2=2.11.1=py_0
  - jpeg=9b=hb83a4c4_2
  - json5=0.9.0=py_0
  - jsonschema=3.2.0=py37hc8dfbb8_1
  - jupyter_client=6.1.2=py_0
  - jupyter_core=4.6.3=py37hc8dfbb8_1
  - jupyterlab=1.2.6=pyhf63ae98_0
  - jupyterlab_server=1.1.0=py_0
  - libpng=1.6.37=h2a8f88b_0
  - libsodium=1.0.17=h2fa13f4_0
  - libtiff=4.1.0=h56a325e_0
  - m2w64-gcc-libgfortran=5.3.0=6
  - m2w64-gcc-libs=5.3.0=7
  - m2w64-gcc-libs-core=5.3.0=7
  - m2w64-gmp=6.1.0=2
  - m2w64-libwinpthread-git=5.0.0.4634.697f757=2
  - markupsafe=1.1.1=py37h8055547_1
  - mistune=0.8.4=py37hfa6e2cd_1000
  - mkl=2020.0=166
  - mkl-service=2.3.0=py37hb782905_0
  - mkl_fft=1.0.15=py37h14836fe_0
  - mkl_random=1.1.0=py37h675688f_0
  - msys2-conda-epoch=20160418=1
  - nbconvert=5.6.1=py37_0
  - nbformat=5.0.4=py_0
  - ninja=1.9.0=py37h74a9793_0
  - nodejs=13.10.1=0
  - notebook=6.0.3=py37_0
  - numpy-base=1.18.1=py37hc3f5095_1
  - olefile=0.46=py37_0
  - openssl=1.1.1f=he774522_0
  - pandoc=2.9.2=0
  - pandocfilters=1.4.2=py_1
  - parso=0.6.2=py_0
  - pickleshare=0.7.5=py37hc8dfbb8_1001
  - pillow=7.0.0=py37hcc1f983_0
  - pip=20.0.2=py37_1
  - prometheus_client=0.7.1=py_0
  - prompt-toolkit=3.0.5=py_0
  - ptvsd=4.3.2=py37hfa6e2cd_1
  - pycparser=2.20=py_0
  - pygments=2.6.1=py_0
  - pyopenssl=19.1.0=py_1
  - pyrsistent=0.16.0=py37h8055547_0
  - pysocks=1.7.1=py37hc8dfbb8_1
  - python=3.7.7=h60c2a47_0_cpython
  - python-dateutil=2.8.1=py_0
  - python_abi=3.7=1_cp37m
  - pytorch=1.4.0=py3.7_cpu_0
  - pywin32=227=py37hfa6e2cd_0
  - pywinpty=0.5.7=py37_0
  - pyzmq=19.0.0=py37h8c16cda_1
  - requests=2.23.0=pyh8c360ce_2
  - send2trash=1.5.0=py_0
  - setuptools=46.1.3=py37_0
  - six=1.14.0=py_1
  - sqlite=3.31.1=he774522_0
  - terminado=0.8.3=py37hc8dfbb8_1
  - testpath=0.4.4=py_0
  - tk=8.6.8=hfa6e2cd_0
  - torchvision=0.5.0=py37_cpu
  - tornado=6.0.4=py37hfa6e2cd_0
  - traitlets=4.3.3=py37hc8dfbb8_1
  - urllib3=1.25.7=py37hc8dfbb8_1
  - vc=14.1=h0510ff6_4
  - vs2015_runtime=14.16.27012=hf0eaf9b_1
  - wcwidth=0.1.9=pyh9f0ad1d_0
  - webencodings=0.5.1=py_1
  - wheel=0.34.2=py37_0
  - win_inet_pton=1.1.0=py37_0
  - wincertstore=0.2=py37_0
  - winpty=0.4.3=4
  - xeus=0.23.10=h1ad3211_0
  - xeus-python=0.6.13=py37h5b9e2c8_1
  - xz=5.2.4=h2fa13f4_4
  - zeromq=4.3.2=h6538335_2
  - zipp=3.1.0=py_0
  - zlib=1.2.11=h62dcd97_3
  - zstd=1.3.7=h508b16e_0
  - pip:
    - atomicwrites==1.3.0
    - blis==0.4.1
    - boto3==1.12.35
    - botocore==1.15.35
    - cachetools==4.0.0
    - catalogue==1.0.0
    - cycler==0.10.0
    - cymem==2.0.3
    - docutils==0.15.2
    - funcy==1.14
    - future==0.18.2
    - gensim==3.8.1
    - google-api-core==1.16.0
    - google-auth==1.13.1
    - google-cloud-core==1.3.0
    - google-cloud-storage==1.27.0
    - google-resumable-media==0.5.0
    - googleapis-common-protos==1.51.0
    - jmespath==0.9.5
    - joblib==0.14.1
    - kiwisolver==1.2.0
    - matplotlib==3.2.1
    - more-itertools==8.2.0
    - murmurhash==1.0.2
    - nltk==3.4.5
    - numexpr==2.7.1
    - numpy==1.18.2
    - packaging==20.3
    - pandas==1.0.3
    - plac==1.1.3
    - pluggy==0.13.1
    - preshed==3.0.2
    - protobuf==3.11.3
    - py==1.8.1
    - pyasn1==0.4.8
    - pyasn1-modules==0.2.8
    - pyldavis==2.1.2
    - pyparsing==2.4.6
    - pytest==5.4.1
    - pytz==2019.3
    - rsa==4.0
    - s3transfer==0.3.3
    - scikit-learn==0.22.2.post1
    - scipy==1.4.1
    - smart-open==1.10.0
    - spacy==2.2.4
    - srsly==1.0.2
    - thinc==7.4.0
    - tqdm==4.45.0
    - wasabi==0.6.0
prefix: C:\Users\Andrew Olney\.conda\envs\iis4011

gojomo (Collaborator) commented Apr 3, 2020

It'd help to include the full error stack, to better identify the exact files and lines involved in the error.

I can't find any fake-news directory in this project's source control (nor in gensim-data) – is the problem code actually in some other project/source-tree?

This seems to be one of the not-version-controlled, hard-to-browse, hard-to-review active code files that's dynamically downloaded & run in a manner I consider highly unwise (per #2283). How would someone contribute a PR against such a file?

aolney (Author) commented Apr 3, 2020

@gojomo The only way I know to get these files is to use the downloader successfully. When that happens, it places a folder hierarchy in the user's home directory, under gensim-data (screenshot below). The file in question is __init__.py. The code I pasted above represents the entire contents of that file, with the commented line being the original line where the error was thrown, and the line below it being the new line that fixes the problem. Just to be completely clear, I'm referring to

#csv.field_size_limit(sys.maxsize)
csv.field_size_limit(2147483647)

I can send a full stack if needed, but the problem is sys.maxsize on Windows for Python. I'm sure there are cleaner ways to solve the problem than what I have above, but it works.

[Screenshot: gensim-data folder hierarchy in the user's home directory, taken 2020-04-02]

piskvorky (Owner) commented Apr 3, 2020

The __init__.py file is here:
https://github.com/RaRe-Technologies/gensim-data/releases/tag/fake-news

CC @chaitaliSaini @menshikh-iv, the authors of this code. I see no comment in the code, so I'm not sure what that csv.field_size_limit(sys.maxsize) is about – why is it there?

piskvorky (Owner) commented

@gojomo as discussed ad nauseam, the dataset releases are immutable by design and you cannot open a PR against them. You can release a new, updated version.

aolney (Author) commented Apr 3, 2020

On Linux at least, the default field size limit appears to be 131072:

import csv
print(csv.field_size_limit())  # 131072 by default

My guess is that some of the documents in fake-news are longer than that, and that's why the author raised the limit in the first place. However, I haven't played around with different sizes. The value I gave above works, but it hasn't been tuned for memory use, if that's a concern.
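If anyone wants to pin down how large the limit actually needs to be, here is a rough probe – a sketch only, assuming the already-downloaded fake-news.gz, and only approximate because a quoted CSV field can span multiple lines:

import gzip
import os

from gensim.downloader import base_dir

# The longest raw line gives a rough lower bound on the field size
# limit the CSV reader needs for this dataset.
path = os.path.join(base_dir, 'fake-news', 'fake-news.gz')
longest = 0
with gzip.open(path, 'rt', encoding='utf-8') as infile:
    for line in infile:
        longest = max(longest, len(line))
print(longest)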

gojomo (Collaborator) commented Apr 3, 2020

@gojomo as discussed ad nauseam, the dataset releases are immutable by design and you cannot open a PR against them. You can release a new, updated version.

But this isn't a problem in a dataset, it's a problem in executed source code – which could and should be in version control. Why should we be running code in project users' installations that isn't under version-control, hasn't been reviewed, & can't receive fix PRs (from either users or contributors)?

piskvorky (Owner) commented Apr 3, 2020

For reproducibility reasons. The same dataset should produce the same output, forever, bugs included – and that implies the same code, too. That was the design, anyway.

In this particular case, changing csv.field_size_limit() may result in changed results (I assume, haven't checked).

aolney (Author) commented Apr 3, 2020

I agree – it was confusing to trace this problem back to code that was packaged with the data and not under version control.

That said, I would suggest opening another issue specifically for that suggested redesign. A separate issue would preserve the discussion and also invite comments from other users/developers.

piskvorky (Owner) commented Apr 3, 2020

I'm definitely open to that. It will need a strong open source contributor though – to make sure the redesign is an actual improvement :)

gojomo (Collaborator) commented Apr 3, 2020

For reproducibility reasons. The same dataset should result in the same output, forever, bugs included. And that implies the same code too. That was the design anyway.

In this particular case, changing csv.field_size_limit() may result in changed results (I assume, haven't checked).

Exactly - if you want reproducibility, you'd want more things under version control, not less, so you can see what was delivered at any one time – ideally as part of a named version!

Here, if someone changed the __init__.py 'asset' served at https://github.com/RaRe-Technologies/gensim-data/releases/download/fake-news/__init__.py, who could notice that change? If for 2 hours, or 2 days, it were swapped for a malicious file and then changed back to something innocent, who would notice? Are you getting notifications of every asset change there? Is there a persistent log somewhere? I can't find one, but I'd feel somewhat better if there were one, so that you and I could know what code someone who ran downloader.load('fake-news') on April 1 actually executed, compared to some other date. Right now I don't see how that's possible – reproducibility is neither tracked nor achievable in the current practice.
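One pattern that would at least make such swaps detectable – not something gensim-data does today, just a sketch with a placeholder digest – is pinning a SHA-256 checksum of each code asset in version control and verifying it before execution:

import hashlib

# Hypothetical: the expected digest would be recorded in version control
# at release time; this placeholder is not a real value.
PINNED_SHA256 = '<digest recorded at release time>'

def code_asset_is_intact(path):
    # Hash the downloaded file and compare against the pinned digest;
    # callers should refuse to exec the code on any mismatch.
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 16), b''):
            digest.update(chunk)
    return digest.hexdigest() == PINNED_SHA256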
