AttributeError: 'ParquetFile' object has no attribute '_columns_dtype' when reading files without pandas metadata #869

Open
piotrb5e3 opened this issue Jun 26, 2023 · 1 comment

Describe the issue:
I get the following crash when I try to read a Parquet file that was created with either pyspark or pyarrow.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/piotr.bakalarski/repos/neural-search-commons/venv/lib/python3.10/site-packages/fastparquet/api.py", line 407, in iter_row_groups
    df = self[i].to_pandas(filters=filters, **kwargs)
  File "/Users/piotr.bakalarski/repos/neural-search-commons/venv/lib/python3.10/site-packages/fastparquet/api.py", line 765, in to_pandas
    df, views = self.pre_allocate(size, columns, categories, index, dtypes=dtypes)
  File "/Users/piotr.bakalarski/repos/neural-search-commons/venv/lib/python3.10/site-packages/fastparquet/api.py", line 796, in pre_allocate
    dtypes, self.tz, columns_dtype=self._columns_dtype)
AttributeError: 'ParquetFile' object has no attribute '_columns_dtype'

Minimal Complete Verifiable Example:

Using fastparquet==2023.4.0 and pyarrow==10.0.1:

import pyarrow.parquet as pq
import pyarrow as pa
from fastparquet import ParquetFile

pq.write_table(pa.table({"a": "abcd", "n": [1,2,3,4]}), "repro.parquet")
pf = ParquetFile("repro.parquet")

pf.head(1)  # <-- This line raises the AttributeError

Anything else we need to know?:
From what I can see, _columns_dtype is never set on the object returned by __getitem__, because the __setstate__ call in __getitem__ [HERE] does not set it. Overall, this __getitem__ behavior seems unintuitive and may lead to more issues with attributes that are not copied.
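
To illustrate the failure mode described above, here is a minimal sketch of an object that clones itself on indexing and restores only part of its state. The Container class, its fields, and its __setstate__ body are hypothetical stand-ins chosen for illustration; this is not fastparquet's actual code, only the general pattern that seems to be at play.

```python
class Container:
    """Hypothetical stand-in for an object that clones itself on indexing."""

    def __init__(self, row_groups, columns_dtype=None):
        self.row_groups = row_groups
        # This attribute is assigned only here, in __init__.
        self._columns_dtype = columns_dtype

    def __setstate__(self, state):
        # Restores only the fields listed in `state`;
        # _columns_dtype is not among them.
        self.row_groups = state["row_groups"]

    def __getitem__(self, i):
        # Build the clone without running __init__, then restore partial state.
        new = object.__new__(Container)
        new.__setstate__({"row_groups": [self.row_groups[i]]})
        return new


full = Container(row_groups=["rg0", "rg1"])
part = full[0]

print(hasattr(full, "_columns_dtype"))  # True
print(hasattr(part, "_columns_dtype"))  # False

try:
    part._columns_dtype
except AttributeError as exc:
    print(exc)  # 'Container' object has no attribute '_columns_dtype'
```

Any attribute assigned in __init__ but left out of the state handed to __setstate__ goes missing on the clone in the same way, which is why other not-copied attributes could cause similar errors.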

Environment:

  • Dask version: None
  • Fastparquet version: 2023.4.0
  • Python version: 3.10.11
  • Operating System: macOS Ventura 13.4.1
  • Install method (conda, pip, source): pip

@piotrb5e3 (Author)

I can see that this was already fixed in Extra field when cloning ParquetFile (#866). Could you release a new version of the library?
