
dtype (int16) cannot accommodate number of category labels and unexpected keyword argument 'infile' #789

Open
sophiamyang opened this issue Jun 16, 2022 · 2 comments


@sophiamyang

Hi! I think I found two fastparquet bugs. Can I get some help please? Thanks so much in advance!

Bug 1

  • Dask version: 2021.8.1

  • fastparquet version: 0.5.0

  • Python version: Python 3.9.6

  • Operating System: OSX

  • Install method (conda, pip, source): conda

  • Code:

import dask.dataframe as dd

df = dd.read_parquet(
    "s3://anaconda-package-data/conda/monthly/2022/*.parquet",
    storage_options={"anon": True},
)
df.tail()
  • Problem:
    When the data has more category labels (32833) than int16 can represent (max 32767), the data is not read correctly.

(screenshot: fastparquet returns incorrect values for the categorical column)

  • Additional info:
    PyArrow doesn't seem to have this issue.

(screenshot: the same read with pyarrow returns the expected data)
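The failure mode can be reproduced without parquet at all: if category codes are stored as int16, any code above 32767 silently wraps around to a negative value (a standalone numpy sketch, not fastparquet's internals):

```python
import numpy as np

n_labels = 32833  # as in the dataset above: more labels than int16 can index
codes = np.arange(n_labels).astype(np.int16)  # values past 32767 silently wrap

print(codes[32767])  # 32767 -- the largest valid int16 code
print(codes[32768])  # -32768 -- wrapped around, now points at the wrong label
```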

Bug 2

After updating dask and fastparquet, I get a new error.

  • Dask version: 2022.5.0
  • fastparquet version: 0.5.0
  • Python version: Python 3.9.12
  • Operating System: OSX
  • Install method (conda, pip, source): conda

(screenshot: traceback ending in TypeError: unexpected keyword argument 'infile')

@martindurant

For cases where the data was written as categorical but the number of categories is not stored in the metadata, fastparquet has a categories= kwarg in to_pandas. I'm pretty confident you can pass this to read_parquet and it will be forwarded to the right place.
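The reason an explicit count helps is that the dtype of the category codes is sized from the number of labels; pandas does the same thing when building a Categorical (a standalone illustration of that sizing, not fastparquet's internals):

```python
import pandas as pd

# with few labels, pandas stores the category codes as int8
small = pd.Categorical.from_codes([0, 1], categories=["a", "b"])
print(small.codes.dtype)  # int8

# once the label count exceeds what int16 can index (32767),
# the codes are promoted to int32
labels = [f"pkg-{i}" for i in range(40000)]
many = pd.Categorical.from_codes([0, 39999], categories=labels)
print(many.codes.dtype)  # int32
```

So telling fastparquet up front that a column has more than 32767 categories lets it allocate a wide enough code dtype instead of inferring one from the first file alone.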

@martindurant
Copy link
Member

OK, so the case is: the inferred global key-value metadata is taken from the first file, so the total number of categories across the whole five files is under-estimated. The following both work:

import fastparquet
import fsspec

fs = fsspec.filesystem("s3", anon=True)  # anonymous S3 access
pf = fastparquet.ParquetFile(
    "s3://anaconda-package-data/conda/monthly/2022/*.parquet", open_with=fs.open
)
out = pf.to_pandas(categories={"pkg_name": 65000})  # we know it's > 2**15
out = pf.to_pandas(categories={})  # turn off categories altogether

I haven't figured out how to get dask to respect this yet, perhaps @rjzamora knows.
