OverflowError: Python int too large to convert to C long #2780

Open · aolney opened this issue Apr 2, 2020 · 13 comments

Labels: bug (Issue described a bug) · impact HIGH (Show-stopper for affected users) · reach LOW (Affects only niche use-case users)

Comments

aolney commented Apr 2, 2020

Problem description

fakeDataset = downloader.load('fake-news')

fails with the above error on Windows machines running 64-bit Python 3.7 with gensim 3.8.1.

Steps/code/corpus to reproduce

fakeDataset = downloader.load('fake-news')

on machine with above configuration.

Versions

Windows-10-10.0.18362
Python 3.7.6
NumPy 1.18.1
SciPy 1.4.1
gensim 3.8.1
FAST_VERSION 0

Attempted workaround

I zipped the data directory from a Linux machine and gave it to a student to unzip on their Windows machine. Re-executing the code above failed with the same error, suggesting the problem is not in downloading but in loading the downloaded data. Perhaps there is a bug in unzipping the archive with Python?

piskvorky added the bug, reach LOW, and impact HIGH labels on Apr 2, 2020
aolney (Author) commented Apr 2, 2020

It looks like the problem is, at least in part, in __init__.py in the fake-news data folder.

Specifically, on this line:

csv.field_size_limit(sys.maxsize)

This is where we get OverflowError: Python int too large to convert to C long
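For context, here is a minimal reproduction of the overflow (assuming 64-bit CPython on Windows, where a C long is only 32 bits even though sys.maxsize is 2**63 - 1):

import csv
import sys

print(sys.maxsize)  # 9223372036854775807 on a 64-bit build
try:
    # The limit is converted to a C long internally, so anything
    # above 2**31 - 1 overflows on Windows.
    csv.field_size_limit(sys.maxsize)
except OverflowError as err:
    print(err)  # Python int too large to convert to C long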

aolney (Author) commented Apr 2, 2020

Attempting a dirty fix based on StackOverflow (the snippet needs a loop around the break, and the imports, to run):

import csv
import sys

# csv.field_size_limit(sys.maxsize)
# Shrink the limit until it fits in a C long.
maxInt = sys.maxsize
while True:
    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt / 10)

Waiting for student to respond if it worked.

aolney (Author) commented Apr 2, 2020

Had to set up a Windows env to test this. The above approach did not work (hangs).

The code below works; do you want me to do a PR?

import os
import csv
import sys
from smart_open import smart_open
from gensim.downloader import base_dir


# csv.field_size_limit(sys.maxsize)
csv.field_size_limit(2147483647)  # 2**31 - 1, the largest value a C long can hold on Windows


class Dataset(object):
    def __init__(self, fn):
        self.fn = fn

    def __iter__(self):
        with smart_open(self.fn, 'rb') as infile:
            if sys.version_info[0] == 2:
                reader = csv.DictReader(infile, delimiter=",", quotechar='"')
            else:
                reader = csv.DictReader((line.decode("utf-8") for line in infile), delimiter=",", quotechar='"')
            for row in reader:
                yield dict(row)


def load_data():
    path = os.path.join(base_dir, 'fake-news', "fake-news.gz")
    return Dataset(path)
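For reference, loading should then work as before – a sketch of the expected usage, not separately tested:

import gensim.downloader as downloader

dataset = downloader.load('fake-news')  # should no longer raise OverflowError
first = next(iter(dataset))             # each row comes back as a dict of CSV columns
print(first.keys())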

If you want to test it, here is my conda YAML; save it as environment.yml and run conda env create -f environment.yml:

name: iis4011
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - attrs=19.3.0=py_0
  - backcall=0.1.0=py_0
  - blas=1.0=mkl
  - bleach=3.1.4=pyh9f0ad1d_0
  - ca-certificates=2020.1.1=0
  - certifi=2019.11.28=py37_1
  - cffi=1.14.0=py37ha419a9e_0
  - chardet=3.0.4=py37hc8dfbb8_1006
  - colorama=0.4.3=py_0
  - cpuonly=1.0=0
  - cryptography=2.8=py37hb32ad35_1
  - decorator=4.4.2=py_0
  - defusedxml=0.6.0=py_0
  - entrypoints=0.3=py37hc8dfbb8_1001
  - freetype=2.9.1=ha9979f8_1
  - icc_rt=2019.0.0=h0cc432a_1
  - idna=2.9=py_1
  - importlib-metadata=1.6.0=py37hc8dfbb8_0
  - importlib_metadata=1.6.0=0
  - intel-openmp=2020.0=166
  - ipykernel=5.2.0=py37h5ca1d4c_1
  - ipython=7.13.0=py37hc8dfbb8_2
  - ipython_genutils=0.2.0=py_1
  - jedi=0.15.2=py37_0
  - jinja2=2.11.1=py_0
  - jpeg=9b=hb83a4c4_2
  - json5=0.9.0=py_0
  - jsonschema=3.2.0=py37hc8dfbb8_1
  - jupyter_client=6.1.2=py_0
  - jupyter_core=4.6.3=py37hc8dfbb8_1
  - jupyterlab=1.2.6=pyhf63ae98_0
  - jupyterlab_server=1.1.0=py_0
  - libpng=1.6.37=h2a8f88b_0
  - libsodium=1.0.17=h2fa13f4_0
  - libtiff=4.1.0=h56a325e_0
  - m2w64-gcc-libgfortran=5.3.0=6
  - m2w64-gcc-libs=5.3.0=7
  - m2w64-gcc-libs-core=5.3.0=7
  - m2w64-gmp=6.1.0=2
  - m2w64-libwinpthread-git=5.0.0.4634.697f757=2
  - markupsafe=1.1.1=py37h8055547_1
  - mistune=0.8.4=py37hfa6e2cd_1000
  - mkl=2020.0=166
  - mkl-service=2.3.0=py37hb782905_0
  - mkl_fft=1.0.15=py37h14836fe_0
  - mkl_random=1.1.0=py37h675688f_0
  - msys2-conda-epoch=20160418=1
  - nbconvert=5.6.1=py37_0
  - nbformat=5.0.4=py_0
  - ninja=1.9.0=py37h74a9793_0
  - nodejs=13.10.1=0
  - notebook=6.0.3=py37_0
  - numpy-base=1.18.1=py37hc3f5095_1
  - olefile=0.46=py37_0
  - openssl=1.1.1f=he774522_0
  - pandoc=2.9.2=0
  - pandocfilters=1.4.2=py_1
  - parso=0.6.2=py_0
  - pickleshare=0.7.5=py37hc8dfbb8_1001
  - pillow=7.0.0=py37hcc1f983_0
  - pip=20.0.2=py37_1
  - prometheus_client=0.7.1=py_0
  - prompt-toolkit=3.0.5=py_0
  - ptvsd=4.3.2=py37hfa6e2cd_1
  - pycparser=2.20=py_0
  - pygments=2.6.1=py_0
  - pyopenssl=19.1.0=py_1
  - pyrsistent=0.16.0=py37h8055547_0
  - pysocks=1.7.1=py37hc8dfbb8_1
  - python=3.7.7=h60c2a47_0_cpython
  - python-dateutil=2.8.1=py_0
  - python_abi=3.7=1_cp37m
  - pytorch=1.4.0=py3.7_cpu_0
  - pywin32=227=py37hfa6e2cd_0
  - pywinpty=0.5.7=py37_0
  - pyzmq=19.0.0=py37h8c16cda_1
  - requests=2.23.0=pyh8c360ce_2
  - send2trash=1.5.0=py_0
  - setuptools=46.1.3=py37_0
  - six=1.14.0=py_1
  - sqlite=3.31.1=he774522_0
  - terminado=0.8.3=py37hc8dfbb8_1
  - testpath=0.4.4=py_0
  - tk=8.6.8=hfa6e2cd_0
  - torchvision=0.5.0=py37_cpu
  - tornado=6.0.4=py37hfa6e2cd_0
  - traitlets=4.3.3=py37hc8dfbb8_1
  - urllib3=1.25.7=py37hc8dfbb8_1
  - vc=14.1=h0510ff6_4
  - vs2015_runtime=14.16.27012=hf0eaf9b_1
  - wcwidth=0.1.9=pyh9f0ad1d_0
  - webencodings=0.5.1=py_1
  - wheel=0.34.2=py37_0
  - win_inet_pton=1.1.0=py37_0
  - wincertstore=0.2=py37_0
  - winpty=0.4.3=4
  - xeus=0.23.10=h1ad3211_0
  - xeus-python=0.6.13=py37h5b9e2c8_1
  - xz=5.2.4=h2fa13f4_4
  - zeromq=4.3.2=h6538335_2
  - zipp=3.1.0=py_0
  - zlib=1.2.11=h62dcd97_3
  - zstd=1.3.7=h508b16e_0
  - pip:
    - atomicwrites==1.3.0
    - blis==0.4.1
    - boto3==1.12.35
    - botocore==1.15.35
    - cachetools==4.0.0
    - catalogue==1.0.0
    - cycler==0.10.0
    - cymem==2.0.3
    - docutils==0.15.2
    - funcy==1.14
    - future==0.18.2
    - gensim==3.8.1
    - google-api-core==1.16.0
    - google-auth==1.13.1
    - google-cloud-core==1.3.0
    - google-cloud-storage==1.27.0
    - google-resumable-media==0.5.0
    - googleapis-common-protos==1.51.0
    - jmespath==0.9.5
    - joblib==0.14.1
    - kiwisolver==1.2.0
    - matplotlib==3.2.1
    - more-itertools==8.2.0
    - murmurhash==1.0.2
    - nltk==3.4.5
    - numexpr==2.7.1
    - numpy==1.18.2
    - packaging==20.3
    - pandas==1.0.3
    - plac==1.1.3
    - pluggy==0.13.1
    - preshed==3.0.2
    - protobuf==3.11.3
    - py==1.8.1
    - pyasn1==0.4.8
    - pyasn1-modules==0.2.8
    - pyldavis==2.1.2
    - pyparsing==2.4.6
    - pytest==5.4.1
    - pytz==2019.3
    - rsa==4.0
    - s3transfer==0.3.3
    - scikit-learn==0.22.2.post1
    - scipy==1.4.1
    - smart-open==1.10.0
    - spacy==2.2.4
    - srsly==1.0.2
    - thinc==7.4.0
    - tqdm==4.45.0
    - wasabi==0.6.0
prefix: C:\Users\Andrew Olney\.conda\envs\iis4011

gojomo (Collaborator) commented Apr 3, 2020

It'd help to include the full error stack, to better identify the exact files and lines involved in the error.

I can't find any fake-news directory in this project's source control (nor in gensim-data) – is the problem code actually in some other project/source-tree?

This seems to be one of the not-version-controlled, hard-to-browse, hard-to-review active code files that's dynamically downloaded & run in a manner I consider highly unwise (per #2283). How would someone contribute a PR against such a file?

aolney (Author) commented Apr 3, 2020

@gojomo The only way I know to get these files is to use the downloader successfully. When that happens, it places a folder hierarchy in the user's home directory, under gensim-data (screenshot below). The file in question is __init__.py. The code I pasted above represents the entire contents of that file, with the commented line being the original line where the error was thrown, and the line below it being the new line that fixes the problem. Just to be completely clear, I'm referring to

#csv.field_size_limit(sys.maxsize)
csv.field_size_limit(2147483647)

I can send a full stack if needed, but the problem is sys.maxsize on Windows for Python. I'm sure there are cleaner ways to solve the problem than what I have above, but it works.

[Screenshot: gensim-data folder hierarchy in the user's home directory, taken 2020-04-02]

piskvorky (Owner) commented Apr 3, 2020

The __init__.py file is here:
https://github.com/RaRe-Technologies/gensim-data/releases/tag/fake-news

CC @chaitaliSaini @menshikh-iv, the authors of this code. I see no comment in the code, so I'm not sure what that csv.field_size_limit(sys.maxsize) is about – why is it there?

piskvorky (Owner) commented

@gojomo as discussed ad nauseam, the dataset releases are immutable by design and you cannot open a PR against them. You can release a new, updated version.

aolney (Author) commented Apr 3, 2020

On Linux at least, the default field size limit appears to be 131072:

import csv
print(csv.field_size_limit())  # 131072 by default

My guess is that some of the documents in fake-news are longer than that, and that's why the author raised the limit in the first place. However, I haven't played around with different sizes. The value I gave above works, but it hasn't been tuned for memory use, if that's a concern.
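If anyone wants to pin down how large the limit actually needs to be, here is a rough probe – a sketch only, assuming the already-downloaded fake-news.gz, and only approximate because a quoted CSV field can span multiple lines:

import gzip
import os

from gensim.downloader import base_dir

# The longest raw line gives a rough lower bound on the field size
# limit the CSV reader needs for this dataset.
path = os.path.join(base_dir, 'fake-news', 'fake-news.gz')
longest = 0
with gzip.open(path, 'rt', encoding='utf-8') as infile:
    for line in infile:
        longest = max(longest, len(line))
print(longest)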

gojomo (Collaborator) commented Apr 3, 2020

@gojomo as discussed ad nauseam, the dataset releases are immutable by design and you cannot open a PR against them. You can release a new, updated version.

But this isn't a problem in a dataset, it's a problem in executed source code – which could and should be in version control. Why should we be running code in project users' installations that isn't under version-control, hasn't been reviewed, & can't receive fix PRs (from either users or contributors)?

piskvorky (Owner) commented Apr 3, 2020

For reproducibility reasons. The same dataset should produce the same output, forever, bugs included – and that implies the same code, too. That was the design, anyway.

In this particular case, changing csv.field_size_limit() may result in changed results (I assume, haven't checked).

aolney (Author) commented Apr 3, 2020

I agree – it was confusing to trace this problem back to code that was packaged with the data and not under version control.

That said, I would suggest opening another issue specifically for that suggested redesign. A separate issue would preserve the discussion and also invite comments from other users/developers.

piskvorky (Owner) commented Apr 3, 2020

I'm definitely open to that. It will need a strong open source contributor though – to make sure the redesign is an actual improvement :)

gojomo (Collaborator) commented Apr 3, 2020

For reproducibility reasons. The same dataset should result in the same output, forever, bugs included. And that implies the same code too. That was the design anyway.

In this particular case, changing csv.field_size_limit() may result in changed results (I assume, haven't checked).

Exactly - if you want reproducibility, you'd want more things under version control, not less, so you can see what was delivered at any one time – ideally as part of a named version!

Here, if someone changed the __init__.py 'asset' served at https://github.com/RaRe-Technologies/gensim-data/releases/download/fake-news/__init__.py, who could notice that change? If for 2 hours, or 2 days, it were swapped for a malicious file and then changed back to something innocent, who would notice? Are you getting notifications of every asset change there? Is there a persistent log somewhere? I can't find one, but I'd feel somewhat better if there were one, so that you and I could know what code someone who ran downloader.load('fake-news') on April 1 actually executed, compared to some other date. Right now I don't see how that's possible – reproducibility is neither tracked nor achievable in the current practice.
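One pattern that would at least make such swaps detectable – not something gensim-data does today, just a sketch with a placeholder digest – is pinning a SHA-256 checksum of each code asset in version control and verifying it before execution:

import hashlib

# Hypothetical: the expected digest would be recorded in version control
# at release time; this placeholder is not a real value.
PINNED_SHA256 = '<digest recorded at release time>'

def code_asset_is_intact(path):
    # Hash the downloaded file and compare against the pinned digest;
    # callers should refuse to exec the code on any mismatch.
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 16), b''):
            digest.update(chunk)
    return digest.hexdigest() == PINNED_SHA256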
