
dtype (int16) cannot accommodate number of category labels and unexpected keyword argument 'infile' #789

Open
sophiamyang opened this issue Jun 16, 2022 · 2 comments


@sophiamyang

Hi! I think I found two fastparquet bugs. Can I get some help please? Thanks so much in advance!

Bug 1

  • Dask version: 2021.8.1

  • fastparquet version: 0.5.0

  • Python version: Python 3.9.6

  • Operating System: OSX

  • Install method (conda, pip, source): conda

  • Code:

import dask.dataframe as dd

df = dd.read_parquet(
    "s3://anaconda-package-data/conda/monthly/2022/*.parquet",
    storage_options={"anon": True},
)
df.tail()
  • Problem:
    When the data has more category labels (32833) than int16 can represent (max 32767), the data is not read correctly.

(screenshot: fastparquet returns incorrect values for the categorical column)

  • Additional info:
    PyArrow doesn't seem to have this issue.

(screenshot: the same read with pyarrow returns the expected data)
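The failure mode can be reproduced without parquet at all: if category codes are stored as int16, any code above 32767 silently wraps around to a negative value (a standalone numpy sketch, not fastparquet's internals):

```python
import numpy as np

n_labels = 32833  # as in the dataset above: more labels than int16 can index
codes = np.arange(n_labels).astype(np.int16)  # values past 32767 silently wrap

print(codes[32767])  # 32767 -- the largest valid int16 code
print(codes[32768])  # -32768 -- wrapped around, now points at the wrong label
```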

Bug 2

After updating dask and fastparquet, I get a new error.

  • Dask version: 2022.5.0
  • fastparquet version: 0.5.0
  • Python version: Python 3.9.12
  • Operating System: OSX
  • Install method (conda, pip, source): conda

(screenshot: traceback ending in TypeError: unexpected keyword argument 'infile')

@martindurant

For cases where the data was written as categorical but the number of categories is not stored in the metadata, fastparquet has a categories= kwarg in to_pandas. I'm pretty confident you can pass this to read_parquet and it will be forwarded to the right place.
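The reason an explicit count helps is that the dtype of the category codes is sized from the number of labels; pandas does the same thing when building a Categorical (a standalone illustration of that sizing, not fastparquet's internals):

```python
import pandas as pd

# with few labels, pandas stores the category codes as int8
small = pd.Categorical.from_codes([0, 1], categories=["a", "b"])
print(small.codes.dtype)  # int8

# once the label count exceeds what int16 can index (32767),
# the codes are promoted to int32
labels = [f"pkg-{i}" for i in range(40000)]
many = pd.Categorical.from_codes([0, 39999], categories=labels)
print(many.codes.dtype)  # int32
```

So telling fastparquet up front that a column has more than 32767 categories lets it allocate a wide enough code dtype instead of inferring one from the first file alone.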

@martindurant
Copy link
Member

OK, so the case is: the inferred global key-value metadata is taken from the first file, so the total number of categories across the whole five files is under-estimated. The following both work:

import fastparquet
import fsspec

fs = fsspec.filesystem("s3", anon=True)  # anonymous S3 access
pf = fastparquet.ParquetFile(
    "s3://anaconda-package-data/conda/monthly/2022/*.parquet", open_with=fs.open
)
out = pf.to_pandas(categories={"pkg_name": 65000})  # we know it's > 2**15
out = pf.to_pandas(categories={})  # turn off categories altogether

I haven't figured out how to get dask to respect this yet, perhaps @rjzamora knows.
