For cases where the data was written to be categorical, but the number of categories is not stored in the metadata, fastparquet has a categories= kwarg in to_pandas. I'm pretty confident that you can add this in read_parquet and it'll get passed through to the right place.
OK, so the situation is that the global key-value metadata is inferred from the first file only, so the total number of categories across the whole five files is under-estimated. The following both work:
pf = fastparquet.ParquetFile("s3://anaconda-package-data/conda/monthly/2022/*.parquet", open_with=fs.open)
out = pf.to_pandas(categories={"pkg_name": 65000}) # we know it's > 2**15
out = pf.to_pandas(categories={}) # turn off categories altogether
I haven't figured out how to get dask to respect this yet, perhaps @rjzamora knows.
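The intended pass-through might look roughly like the sketch below. dask.dataframe.read_parquet does expose a categories argument, though whether it is respected end-to-end is exactly the open question here; the S3 path and the 65000 figure are taken from the snippet above, and this is untested:

```python
import dask.dataframe as dd

# Sketch, not a verified fix: pass the category-count hint through
# dask's read_parquet so that fastparquet's to_pandas receives it.
ddf = dd.read_parquet(
    "s3://anaconda-package-data/conda/monthly/2022/*.parquet",
    engine="fastparquet",
    categories={"pkg_name": 65000},  # over-estimate of the true count
)
```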
Hi! I think I found two fastparquet bugs. Can I get some help please? Thanks so much in advance!
Bug 1
Dask version: 2021.8.1
fastparquet version: 0.5.0
Python version: Python 3.9.6
Operating System: OSX
Install method (conda, pip, source): conda
Code:
When I have more categories than fit in an int16 (here 32833 > 32768), the data is not read correctly.
PyArrow doesn't seem to have this issue.
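Bug 1 is consistent with category codes being materialized as int16: any code past 32767 wraps around to a negative value. A small pure-Python sketch of the wraparound (to_int16 is a hypothetical helper for illustration, not fastparquet API):

```python
def to_int16(n):
    """Simulate two's-complement int16 wraparound (hypothetical helper,
    not part of fastparquet -- just illustrates the overflow)."""
    return (n + 2**15) % 2**16 - 2**15

print(to_int16(32767))  # 32767: the last code an int16 can represent
print(to_int16(32833))  # -32703: wraps negative, so the category is garbage
```

A negative code is either treated as missing or maps to the wrong category, which matches the "data is not read correctly" symptom.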
Bug 2
After updating dask and fastparquet, I hit a new bug.