Categorical dtype not preserved with fastparquet-write, pyarrow-read #920

Open
zmoon opened this issue Feb 11, 2024 · 2 comments
@zmoon
zmoon commented Feb 11, 2024

Describe the issue: Not sure if this is a fastparquet or pyarrow (or pandas) issue, but I noticed that a column with pandas categorical dtype is read as object dtype if the Parquet file is created by the fastparquet engine and then read by the pyarrow engine. The other three cases preserve the dtype.

Minimal Complete Verifiable Example:

import itertools

import pandas as pd

df = pd.Series(["a", "b", "c"]).rename("cat").astype("category").to_frame()

fn = "cat.parquet"
data = []
for write, read in itertools.product(["pyarrow", "fastparquet"], repeat=2):
    df.to_parquet(fn, engine=write)
    df_ = pd.read_parquet(fn, engine=read)
    data.append((write, read, df_["cat"].dtype))

res = pd.DataFrame(data, columns=["write", "read", "dtype"])
print(res)
         write         read     dtype
0      pyarrow      pyarrow  category
1      pyarrow  fastparquet  category
2  fastparquet      pyarrow    object
3  fastparquet  fastparquet  category

Anything else we need to know?:

Environment:

  • Dask version:
  • Python version: 3.11.3
  • Operating System:
  • Install method (conda, pip, source): pip
  • fastparquet 2024.2.0, pyarrow 15.0.0, pandas 2.2.0

@martindurant
Member

Thanks for notifying me, sounds like a metadata parsing thing. Whilst it should be easy to fix, I'm not sure when I will get to it.

Interestingly, with the fastparquet API, you can always assert that a given column should be a category type with categories=, but I don't think pyarrow can do that.
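Until the metadata handling is sorted out, one engine-independent workaround is to cast the affected columns back to categorical after reading. The following is only a minimal sketch using plain pandas; the helper name `restore_categoricals` is hypothetical and not part of fastparquet, pyarrow, or pandas, and the small frame stands in for the result of the fastparquet-write, pyarrow-read case.

```python
import pandas as pd


def restore_categoricals(df, columns):
    """Return a copy of df with the named columns cast to category dtype.

    Hypothetical helper for illustration; not part of any library.
    """
    out = df.copy()
    for col in columns:
        # Only cast columns that did not come back as categorical
        if not isinstance(out[col].dtype, pd.CategoricalDtype):
            out[col] = out[col].astype("category")
    return out


# Stand-in for the object-dtype frame returned by the failing combination
df_read = pd.DataFrame({"cat": ["a", "b", "c"]})
df_fixed = restore_categoricals(df_read, ["cat"])
print(df_fixed["cat"].dtype)
```

Note that this rebuilds the categories from the values present in the column, so unused categories and any ordering stored in the original dtype are not recovered.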

@martindurant
Member

arrow produces
{'column_indexes': [{'field_name': None, 'metadata': {'encoding': 'UTF-8'}, 'name': None, 'numpy_type': 'object', 'pandas_type': 'unicode'}],
 'columns': [{'field_name': 'cat', 'metadata': {'num_categories': 3, 'ordered': False}, 'name': 'cat', 'numpy_type': 'int8', 'pandas_type': 'categorical'}],
 'creator': {'library': 'pyarrow', 'version': '11.0.0'},
 'index_columns': [{'kind': 'range', 'name': None, 'start': 0, 'step': 1, 'stop': 3}],
 'pandas_version': '2.1.4'}

fastparquet produces
{'column_indexes': [{'field_name': None, 'metadata': {'encoding': 'UTF-8'}, 'name': None, 'numpy_type': 'object', 'pandas_type': 'unicode'}],
 'columns': [{'field_name': 'cat', 'metadata': {'num_categories': 3, 'ordered': False}, 'name': 'cat', 'numpy_type': 'int8', 'pandas_type': 'categorical'}],
 'creator': {'library': 'pyarrow', 'version': '11.0.0'},
 'index_columns': [{'kind': 'range', 'name': None, 'start': 0, 'step': 1, 'stop': 3}],
 'pandas_version': '2.1.4'}

So I can only suppose arrow doesn't trust categories not made by arrow - it's their fault?
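For anyone wanting to check what the stored pandas metadata declares, here is a rough sketch that lists the columns marked as categorical in a metadata dict of the shape shown above. It assumes the dict has already been loaded, for example from `pyarrow.parquet.read_schema(fn).pandas_metadata`; the helper name `categorical_columns` is made up for illustration.

```python
def categorical_columns(pandas_meta):
    """List column names whose pandas_type is 'categorical'.

    pandas_meta is assumed to be the decoded 'pandas' metadata dict
    stored in the Parquet file (hypothetical helper, for illustration).
    """
    return [
        col["name"]
        for col in pandas_meta.get("columns", [])
        if col.get("pandas_type") == "categorical"
    ]


# Column entry copied from the metadata dumps in this thread
meta = {
    "columns": [
        {
            "field_name": "cat",
            "metadata": {"num_categories": 3, "ordered": False},
            "name": "cat",
            "numpy_type": "int8",
            "pandas_type": "categorical",
        }
    ],
}
print(categorical_columns(meta))
```

If a column shows up here but still comes back as object dtype, the reader is not acting on this metadata alone, which would be consistent with the guess above.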
