
to_parquet not able to write DateRange with period[D] #797

Open

pyrito opened this issue Aug 11, 2022 · 1 comment

pyrito commented Aug 11, 2022

What happened: A ValueError is thrown when trying to write a DataFrame containing a period[D] column to a parquet file.

What you expected to happen: to_parquet completes without any errors.

Minimal Complete Verifiable Example:

import pandas

pandas_df = pandas.DataFrame(
    {
        "idx_categorical": pandas.Categorical(["y", "z"] * 1000),
        "idx_datetime": pandas.date_range(start="1/1/2018", periods=2000, freq='D'),
        "idx_periodrange": pandas.period_range(
            start="2017-01-01", periods=2000
        ),
        "B": ["a", "b"] * 1000,
        "C": ["c"] * 2000,
    }
)

# This fails
pandas_df.set_index("idx_datetime").to_parquet("testing/test.parquet", engine='fastparquet')

Anything else we need to know?:

Environment:

  • fastparquet version: 0.8.1 (latest SHA: 34069fe)
  • Python version: 3.9.12
  • Operating System: macOS 12.2.1
  • Install method (conda, pip, source): source
Traceback:
ValueError                                Traceback (most recent call last)
Input In [2], in <cell line: 17>()
      4 pandas_df = pandas.DataFrame(
      5     {
      6         "idx_categorical": pandas.Categorical(["y", "z"] * 1000),
   (...)
     13     }
     14 )
     16 # This works correctly
---> 17 pandas_df.set_index("idx_datetime").to_parquet("testing/test.parquet", engine="fastparquet")
     19 # This fails sometimes
     20 pdf = pandas.read_parquet("testing/test.parquet", engine="fastparquet")

File ~/opt/anaconda3/envs/test/lib/python3.9/site-packages/pandas/util/_decorators.py:207, in deprecate_kwarg.<locals>._deprecate_kwarg.<locals>.wrapper(*args, **kwargs)
    205     else:
    206         kwargs[new_arg_name] = new_arg_value
--> 207 return func(*args, **kwargs)

File ~/opt/anaconda3/envs/test/lib/python3.9/site-packages/pandas/core/frame.py:2835, in DataFrame.to_parquet(self, path, engine, compression, index, partition_cols, storage_options, **kwargs)
   2749 """
   2750 Write a DataFrame to the binary parquet format.
   2751 
   (...)
   2831 >>> content = f.read()
   2832 """
   2833 from pandas.io.parquet import to_parquet
-> 2835 return to_parquet(
   2836     self,
   2837     path,
   2838     engine,
   2839     compression=compression,
   2840     index=index,
   2841     partition_cols=partition_cols,
   2842     storage_options=storage_options,
   2843     **kwargs,
   2844 )

File ~/opt/anaconda3/envs/test/lib/python3.9/site-packages/pandas/io/parquet.py:420, in to_parquet(df, path, engine, compression, index, storage_options, partition_cols, **kwargs)
    416 impl = get_engine(engine)
    418 path_or_buf: FilePath | WriteBuffer[bytes] = io.BytesIO() if path is None else path
--> 420 impl.write(
    421     df,
    422     path_or_buf,
    423     compression=compression,
    424     index=index,
    425     partition_cols=partition_cols,
    426     storage_options=storage_options,
    427     **kwargs,
    428 )
    430 if path is None:
    431     assert isinstance(path_or_buf, io.BytesIO)

File ~/opt/anaconda3/envs/test/lib/python3.9/site-packages/pandas/io/parquet.py:301, in FastParquetImpl.write(self, df, path, compression, index, partition_cols, storage_options, **kwargs)
    296     raise ValueError(
    297         "storage_options passed with file object or non-fsspec file path"
    298     )
    300 with catch_warnings(record=True):
--> 301     self.api.write(
    302         path,
    303         df,
    304         compression=compression,
    305         write_index=index,
    306         partition_on=partition_cols,
    307         **kwargs,
    308     )

File ~/Documents/fastparquet/fastparquet/writer.py:1214, in write(filename, data, row_group_offsets, compression, file_scheme, open_with, mkdirs, has_nulls, write_index, partition_on, fixed_text, append, object_encoding, times, custom_metadata, stats)
   1211 check_column_names(data.columns, partition_on, fixed_text,
   1212                    object_encoding, has_nulls)
   1213 ignore = partition_on if file_scheme != 'simple' else []
-> 1214 fmd = make_metadata(data, has_nulls=has_nulls, ignore_columns=ignore,
   1215                     fixed_text=fixed_text,
   1216                     object_encoding=object_encoding,
   1217                     times=times, index_cols=index_cols,
   1218                     partition_cols=partition_on)
   1219 if custom_metadata is not None:
   1220     kvm = fmd.key_value_metadata or []

File ~/Documents/fastparquet/fastparquet/writer.py:824, in make_metadata(data, has_nulls, ignore_columns, fixed_text, object_encoding, times, index_cols, partition_cols)
    822     se.name = column
    823 else:
--> 824     se, type = find_type(data[column], fixed_text=fixed,
    825                          object_encoding=oencoding, times=times,
    826                          is_index=is_index)
    827 col_has_nulls = has_nulls
    828 if has_nulls is None:

File ~/Documents/fastparquet/fastparquet/writer.py:225, in find_type(data, fixed_text, object_encoding, times, is_index)
    221     type, converted_type, width = (parquet_thrift.Type.BYTE_ARRAY,
    222                                    parquet_thrift.ConvertedType.UTF8,
    223                                    None)
    224 else:
--> 225     raise ValueError("Don't know how to convert data type: %s" % dtype)
    226 se = parquet_thrift.SchemaElement(
    227     name=norm_col_name(data.name, is_index), type_length=width,
    228     converted_type=converted_type, type=type,
   (...)
    231     i32=True
    232 )
    233 return se, type

ValueError: Don't know how to convert data type: period[D]
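The rejected dtype can be confirmed without any parquet engine. A minimal sketch (pure pandas) showing that period_range yields period[D], the dtype find_type raises on, while converting to timestamps gives a datetime64[ns] column that fastparquet can write:

```python
import pandas

# period_range with a daily start and no explicit freq yields dtype
# period[D] -- the dtype fastparquet's find_type does not recognise
periods = pandas.Series(pandas.period_range(start="2017-01-01", periods=4))
print(periods.dtype)  # period[D]

# materialising the periods as timestamps gives datetime64[ns],
# which fastparquet handles fine
converted = periods.dt.to_timestamp()
print(converted.dtype)  # datetime64[ns]
```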

martindurant (Member) commented
Ah, so a range but for time type. We should find out how pyarrow encodes this in metadata. Alternatively, we could just materialise into a datetime column/index before writing.
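Pending proper period support in the metadata, the materialise-before-writing workaround can be sketched as a small helper (the name materialise_periods is hypothetical, not part of fastparquet's API), assuming it is acceptable to lose the period dtype on round-trip:

```python
import pandas

def materialise_periods(df: pandas.DataFrame) -> pandas.DataFrame:
    """Return a copy with period columns and index converted to datetimes,
    which fastparquet knows how to write.

    Hypothetical helper illustrating the suggested workaround; not part
    of fastparquet's API.
    """
    out = df.copy()
    for col in out.columns:
        if isinstance(out[col].dtype, pandas.PeriodDtype):
            # Period -> Timestamp at the start of each period
            out[col] = out[col].dt.to_timestamp()
    if isinstance(out.index.dtype, pandas.PeriodDtype):
        out.index = out.index.to_timestamp()
    return out

# usage, against the MCVE above:
# materialise_periods(pandas_df).to_parquet("test.parquet", engine="fastparquet")
```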
