Describe the issue:
I'm getting the following crash when I try to open a parquet file that was created with either pyspark or pyarrow.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/piotr.bakalarski/repos/neural-search-commons/venv/lib/python3.10/site-packages/fastparquet/api.py", line 407, in iter_row_groups
df = self[i].to_pandas(filters=filters, **kwargs)
File "/Users/piotr.bakalarski/repos/neural-search-commons/venv/lib/python3.10/site-packages/fastparquet/api.py", line 765, in to_pandas
df, views = self.pre_allocate(size, columns, categories, index, dtypes=dtypes)
File "/Users/piotr.bakalarski/repos/neural-search-commons/venv/lib/python3.10/site-packages/fastparquet/api.py", line 796, in pre_allocate
dtypes, self.tz, columns_dtype=self._columns_dtype)
AttributeError: 'ParquetFile' object has no attribute '_columns_dtype'
Minimal Complete Verifiable Example:
Using fastparquet==2023.4.0 and pyarrow==10.0.1:
import pyarrow.parquet as pq
import pyarrow as pa
from fastparquet import ParquetFile

pq.write_table(pa.table({"a": list("abcd"), "n": [1, 2, 3, 4]}), "repro.parquet")
pf = ParquetFile("repro.parquet")
pf.head(1)  # <-- this line raises the error
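For context, the traceback above shows that head() goes through iter_row_groups(), which calls self[i].to_pandas() on a row-group slice of the ParquetFile. If that reading is right, the same error should also be reproducible by iterating row groups directly; this is only a sketch based on the traceback, not something I have separately verified:

from fastparquet import ParquetFile

pf = ParquetFile("repro.parquet")
# iter_row_groups() internally does self[i].to_pandas() (api.py line 407 in the
# traceback), which is the code path that hits the missing _columns_dtype attribute.
for df in pf.iter_row_groups():  # expected to raise the same AttributeError
    pass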
Anything else we need to know?:
From what I can see, this is caused by _columns_dtype never being set on the sliced object: the __setstate__ call made inside __getitem__ [HERE] does not copy _columns_dtype across. More generally, this __getitem__ behaviour seems unintuitive and may lead to further issues with attributes that are not copied over.
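As a possible stop-gap, copying the attribute from the top-level ParquetFile onto the row-group slice before reading appears to avoid the crash. This is only a sketch based on my reading of __getitem__/__setstate__; it assumes _columns_dtype is present on the parent object, and it pokes at a private attribute, so it may not hold for other versions:

from fastparquet import ParquetFile

pf = ParquetFile("repro.parquet")
sub = pf[0]  # __getitem__ builds a new ParquetFile via __setstate__
# Manually carry over the attribute that __setstate__ does not copy
# (assumes the parent ParquetFile has _columns_dtype set).
sub._columns_dtype = pf._columns_dtype
df = sub.to_pandas()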
Environment:
Dask version: None
Fastparquet version: 2023.4.0
Python version: 3.10.11
Operating System: macOS Ventura 13.4.1
Install method (conda, pip, source): pip