
to_parquet not able to write DateRange with period[D] #797

Open

pyrito opened this issue Aug 11, 2022 · 1 comment

pyrito commented Aug 11, 2022

What happened: A ValueError is thrown when trying to write a DataFrame containing a period[D] column to a parquet file.

What you expected to happen: to_parquet completes without any errors.

Minimal Complete Verifiable Example:

import pandas

pandas_df = pandas.DataFrame(
    {
        "idx_categorical": pandas.Categorical(["y", "z"] * 1000),
        "idx_datetime": pandas.date_range(start="1/1/2018", periods=2000, freq='D'),
        "idx_periodrange": pandas.period_range(
            start="2017-01-01", periods=2000
        ),
        "B": ["a", "b"] * 1000,
        "C": ["c"] * 2000,
    }
)

# This fails
pandas_df.set_index("idx_datetime").to_parquet("testing/test.parquet", engine='fastparquet')

Anything else we need to know?:

Environment:

  • fastparquet version: 0.8.1 (latest SHA: 34069fe)
  • Python version: 3.9.12
  • Operating System: macOS 12.2.1
  • Install method (conda, pip, source): source
Traceback:
ValueError                                Traceback (most recent call last)
Input In [2], in <cell line: 17>()
      4 pandas_df = pandas.DataFrame(
      5     {
      6         "idx_categorical": pandas.Categorical(["y", "z"] * 1000),
   (...)
     13     }
     14 )
     16 # This works correctly
---> 17 pandas_df.set_index("idx_datetime").to_parquet("testing/test.parquet", engine="fastparquet")
     19 # This fails sometimes
     20 pdf = pandas.read_parquet("testing/test.parquet", engine="fastparquet")

File ~/opt/anaconda3/envs/test/lib/python3.9/site-packages/pandas/util/_decorators.py:207, in deprecate_kwarg.<locals>._deprecate_kwarg.<locals>.wrapper(*args, **kwargs)
    205     else:
    206         kwargs[new_arg_name] = new_arg_value
--> 207 return func(*args, **kwargs)

File ~/opt/anaconda3/envs/test/lib/python3.9/site-packages/pandas/core/frame.py:2835, in DataFrame.to_parquet(self, path, engine, compression, index, partition_cols, storage_options, **kwargs)
   2749 """
   2750 Write a DataFrame to the binary parquet format.
   2751 
   (...)
   2831 >>> content = f.read()
   2832 """
   2833 from pandas.io.parquet import to_parquet
-> 2835 return to_parquet(
   2836     self,
   2837     path,
   2838     engine,
   2839     compression=compression,
   2840     index=index,
   2841     partition_cols=partition_cols,
   2842     storage_options=storage_options,
   2843     **kwargs,
   2844 )

File ~/opt/anaconda3/envs/test/lib/python3.9/site-packages/pandas/io/parquet.py:420, in to_parquet(df, path, engine, compression, index, storage_options, partition_cols, **kwargs)
    416 impl = get_engine(engine)
    418 path_or_buf: FilePath | WriteBuffer[bytes] = io.BytesIO() if path is None else path
--> 420 impl.write(
    421     df,
    422     path_or_buf,
    423     compression=compression,
    424     index=index,
    425     partition_cols=partition_cols,
    426     storage_options=storage_options,
    427     **kwargs,
    428 )
    430 if path is None:
    431     assert isinstance(path_or_buf, io.BytesIO)

File ~/opt/anaconda3/envs/test/lib/python3.9/site-packages/pandas/io/parquet.py:301, in FastParquetImpl.write(self, df, path, compression, index, partition_cols, storage_options, **kwargs)
    296     raise ValueError(
    297         "storage_options passed with file object or non-fsspec file path"
    298     )
    300 with catch_warnings(record=True):
--> 301     self.api.write(
    302         path,
    303         df,
    304         compression=compression,
    305         write_index=index,
    306         partition_on=partition_cols,
    307         **kwargs,
    308     )

File ~/Documents/fastparquet/fastparquet/writer.py:1214, in write(filename, data, row_group_offsets, compression, file_scheme, open_with, mkdirs, has_nulls, write_index, partition_on, fixed_text, append, object_encoding, times, custom_metadata, stats)
   1211 check_column_names(data.columns, partition_on, fixed_text,
   1212                    object_encoding, has_nulls)
   1213 ignore = partition_on if file_scheme != 'simple' else []
-> 1214 fmd = make_metadata(data, has_nulls=has_nulls, ignore_columns=ignore,
   1215                     fixed_text=fixed_text,
   1216                     object_encoding=object_encoding,
   1217                     times=times, index_cols=index_cols,
   1218                     partition_cols=partition_on)
   1219 if custom_metadata is not None:
   1220     kvm = fmd.key_value_metadata or []

File ~/Documents/fastparquet/fastparquet/writer.py:824, in make_metadata(data, has_nulls, ignore_columns, fixed_text, object_encoding, times, index_cols, partition_cols)
    822     se.name = column
    823 else:
--> 824     se, type = find_type(data[column], fixed_text=fixed,
    825                          object_encoding=oencoding, times=times,
    826                          is_index=is_index)
    827 col_has_nulls = has_nulls
    828 if has_nulls is None:

File ~/Documents/fastparquet/fastparquet/writer.py:225, in find_type(data, fixed_text, object_encoding, times, is_index)
    221     type, converted_type, width = (parquet_thrift.Type.BYTE_ARRAY,
    222                                    parquet_thrift.ConvertedType.UTF8,
    223                                    None)
    224 else:
--> 225     raise ValueError("Don't know how to convert data type: %s" % dtype)
    226 se = parquet_thrift.SchemaElement(
    227     name=norm_col_name(data.name, is_index), type_length=width,
    228     converted_type=converted_type, type=type,
   (...)
    231     i32=True
    232 )
    233 return se, type

ValueError: Don't know how to convert data type: period[D]
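The rejected dtype can be confirmed without any parquet engine. A minimal sketch (pure pandas) showing that period_range yields period[D], the dtype find_type raises on, while converting to timestamps gives a datetime64[ns] column that fastparquet can write:

```python
import pandas

# period_range with a daily start and no explicit freq yields dtype
# period[D] -- the dtype fastparquet's find_type does not recognise
periods = pandas.Series(pandas.period_range(start="2017-01-01", periods=4))
print(periods.dtype)  # period[D]

# materialising the periods as timestamps gives datetime64[ns],
# which fastparquet handles fine
converted = periods.dt.to_timestamp()
print(converted.dtype)  # datetime64[ns]
```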

martindurant (Member) commented
Ah, so a range but for time type. We should find out how pyarrow encodes this in metadata. Alternatively, we could just materialise into a datetime column/index before writing.
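Pending proper period support in the metadata, the materialise-before-writing workaround can be sketched as a small helper (the name materialise_periods is hypothetical, not part of fastparquet's API), assuming it is acceptable to lose the period dtype on round-trip:

```python
import pandas

def materialise_periods(df: pandas.DataFrame) -> pandas.DataFrame:
    """Return a copy with period columns and index converted to datetimes,
    which fastparquet knows how to write.

    Hypothetical helper illustrating the suggested workaround; not part
    of fastparquet's API.
    """
    out = df.copy()
    for col in out.columns:
        if isinstance(out[col].dtype, pandas.PeriodDtype):
            # Period -> Timestamp at the start of each period
            out[col] = out[col].dt.to_timestamp()
    if isinstance(out.index.dtype, pandas.PeriodDtype):
        out.index = out.index.to_timestamp()
    return out

# usage, against the MCVE above:
# materialise_periods(pandas_df).to_parquet("test.parquet", engine="fastparquet")
```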
