Cannot filter tz-aware datetime64 column with tz-aware predicate #744

erinov1 · 2022-01-28T21:04:11Z

I have a partitioned parquet dataset containing a datetime64[ns, UTC] column ts (i.e., it is timezone-aware, with tz=UTC). The following pandas invocation does not work with engine=fastparquet:

import pandas as pd

pd.read_parquet(
    'my_dataset',
    engine='fastparquet',
    filters=[('ts', '>=', pd.Timestamp('2021-01-01', tz='UTC'))]
)

Tail of traceback:

File ~/Library/Caches/pypoetry/virtualenvs/project-GfuZs_x0-py3.8/lib/python3.8/site-packages/fastparquet/api.py:1090, in filter_out_stats(rg, filters, schema)
   1088                     s["converted_min"] = vmin
   1089                 vmin = s["converted_min"]
-> 1090             if filter_val(op, val, vmin, vmax):
   1091                 return True
   1092 return False

File ~/Library/Caches/pypoetry/virtualenvs/project-GfuZs_x0-py3.8/lib/python3.8/site-packages/fastparquet/api.py:1334, in filter_val(op, val, vmin, vmax)
   1332     return filter_not_in(val, vmin, vmax)
   1333 if vmax is not None:
-> 1334     if op in ['==', '>=', '='] and val > vmax:
   1335         return True
   1336     if op == '>' and val >= vmax:

File ~/Library/Caches/pypoetry/virtualenvs/project-GfuZs_x0-py3.8/lib/python3.8/site-packages/pandas/_libs/tslibs/timestamps.pyx:253, in pandas._libs.tslibs.timestamps._Timestamp.__richcmp__()

TypeError: Cannot compare tz-naive and tz-aware timestamps
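
The last frame of the traceback can be reproduced without any parquet file. This is a minimal sketch, assuming that the row-group statistic (vmax) reaches filter_val as a tz-naive Timestamp, which is what the error message suggests; the value used for vmax here is made up for illustration.

```python
import pandas as pd

# tz-aware predicate value, as passed in the filter above
val = pd.Timestamp("2021-01-01", tz="UTC")

# Illustrative tz-naive statistic (hypothetical value, not read from a real file)
vmax = pd.Timestamp("2021-06-30")

try:
    # Same kind of ordering comparison as `val > vmax` in filter_val
    val > vmax
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```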

The same invocation works fine with engine=pyarrow. On the other hand, fastparquet is able to do the filtering if the timezone is omitted from the predicate (in which case pyarrow, as expected, fails):

pd.read_parquet(
    'my_dataset',
    engine='fastparquet',
    filters=[('ts', '>=', pd.Timestamp('2021-01-01'))]
)

I suspect that pyarrow has the right idea here?

Environment:
fastparquet==0.8.0

martindurant · 2022-01-31T17:59:37Z

This does not surprise me: the "statistics" in the parquet file are stored without any timezone, and the time zone for the column is only applied after the complete load. It would be reasonable to apply time zones to the statistics, but I suspect it would be annoying to implement.
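
Until the statistics carry a time zone, one possible workaround is to strip the time zone from the predicate before passing it to fastparquet, then re-apply the tz-aware comparison on the loaded frame. The sketch below is not part of fastparquet: to_naive_utc is a hypothetical helper, and it assumes the stored statistics represent UTC instants. The small in-memory frame stands in for the result of read_parquet.

```python
import pandas as pd


def to_naive_utc(ts: pd.Timestamp) -> pd.Timestamp:
    """Hypothetical helper: express ts in UTC and drop the timezone info."""
    if ts.tzinfo is None:
        return ts  # already naive; assumed to be in UTC
    return ts.tz_convert("UTC").tz_localize(None)


aware = pd.Timestamp("2021-01-01", tz="UTC")
naive = to_naive_utc(aware)

# tz-naive predicate, the form that the issue reports as working with fastparquet
filters = [("ts", ">=", naive)]
# df = pd.read_parquet("my_dataset", engine="fastparquet", filters=filters)

# Stand-in for the loaded data: a tz-aware column as described in the issue
df = pd.DataFrame({"ts": pd.to_datetime(["2020-12-31", "2021-01-02"], utc=True)})

# Re-apply the predicate with the tz-aware value on the loaded, tz-aware column
exact = df[df["ts"] >= aware]
print(exact)
```

The second comparison is included because statistics-based filtering works on whole row groups, so rows below the threshold may still be present after loading.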
