Cannot filter tz-aware datetime64 column with tz-aware predicate #744

Open
erinov1 opened this issue Jan 28, 2022 · 1 comment
erinov1 commented Jan 28, 2022

I have a partitioned parquet dataset containing a datetime64[ns, UTC] column ts (i.e., it is timezone-aware, with tz=UTC). The following pandas invocation fails with engine=fastparquet:

import pandas as pd

pd.read_parquet(
    'my_dataset',
    engine='fastparquet',
    filters=[('ts', '>=', pd.Timestamp('2021-01-01', tz='UTC'))]
)

Tail of traceback:

File ~/Library/Caches/pypoetry/virtualenvs/project-GfuZs_x0-py3.8/lib/python3.8/site-packages/fastparquet/api.py:1090, in filter_out_stats(rg, filters, schema)
   1088                     s["converted_min"] = vmin
   1089                 vmin = s["converted_min"]
-> 1090             if filter_val(op, val, vmin, vmax):
   1091                 return True
   1092 return False

File ~/Library/Caches/pypoetry/virtualenvs/project-GfuZs_x0-py3.8/lib/python3.8/site-packages/fastparquet/api.py:1334, in filter_val(op, val, vmin, vmax)
   1332     return filter_not_in(val, vmin, vmax)
   1333 if vmax is not None:
-> 1334     if op in ['==', '>=', '='] and val > vmax:
   1335         return True
   1336     if op == '>' and val >= vmax:

File ~/Library/Caches/pypoetry/virtualenvs/project-GfuZs_x0-py3.8/lib/python3.8/site-packages/pandas/_libs/tslibs/timestamps.pyx:253, in pandas._libs.tslibs.timestamps._Timestamp.__richcmp__()

TypeError: Cannot compare tz-naive and tz-aware timestamps

The same invocation works fine with engine=pyarrow. Conversely, fastparquet can do the filtering if the timezone is omitted from the predicate (in which case pyarrow fails, as expected):

pd.read_parquet(
    'my_dataset',
    engine='fastparquet',
    filters=[('ts', '>=', pd.Timestamp('2021-01-01'))]
)

I suspect that pyarrow has the right idea here?

Environment:
fastparquet==0.8.0
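
A possible stopgap, given that fastparquet accepts the tz-naive predicate above: strip the timezone from filter values before handing them to the fastparquet engine. The sketch below is only an illustration; `to_naive_utc` and `naive_filters` are hypothetical helper names, not part of pandas or fastparquet, and it assumes the column is stored as UTC. It uses the standard-library `datetime` so it is self-contained; since `pd.Timestamp` subclasses `datetime`, the same helper should accept it as well.

```python
from datetime import datetime, timezone

def to_naive_utc(value):
    """Convert a tz-aware datetime to tz-naive UTC; return anything else unchanged."""
    if isinstance(value, datetime) and value.tzinfo is not None:
        # Assumption: the column's stored values are UTC instants.
        return value.astimezone(timezone.utc).replace(tzinfo=None)
    return value

def naive_filters(filters):
    """Rewrite the value of each (column, op, value) filter tuple with to_naive_utc."""
    return [(col, op, to_naive_utc(val)) for col, op, val in filters]

aware = datetime(2021, 1, 1, tzinfo=timezone.utc)
filters = naive_filters([("ts", ">=", aware)])
```

The resulting `filters` list could then be passed as `filters=` together with engine='fastparquet'. This has not been checked against fastparquet itself, and it would have to be skipped for engine=pyarrow, which expects the tz-aware value.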

@martindurant
Member

This does not surprise me: the "statistics" in the parquet file are stored without any timezone, and the timezone is applied to the column's values only after the data has been fully loaded. It would be reasonable to apply time zones to the statistics as well, but I suspect it would be annoying to implement.
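
To make the idea of applying time zones to the statistics concrete, here is a minimal sketch of the comparison for the `>=` case. It is not fastparquet's implementation: `align_stat` and `can_skip_ge` are hypothetical names, the tz-naive statistic is assumed to represent a UTC instant, and standard-library `datetime` objects stand in for the converted min/max values.

```python
from datetime import datetime, timezone

def align_stat(stat, val):
    """Attach UTC to a tz-naive statistic when the predicate value is tz-aware."""
    if isinstance(val, datetime) and val.tzinfo is not None and stat.tzinfo is None:
        # Assumption: tz-naive statistics are UTC instants.
        return stat.replace(tzinfo=timezone.utc)
    return stat

def can_skip_ge(val, vmax):
    """True if a row group whose maximum is vmax cannot satisfy `column >= val`."""
    return val > align_stat(vmax, val)

predicate = datetime(2021, 1, 1, tzinfo=timezone.utc)
skip_old = can_skip_ge(predicate, datetime(2020, 12, 31))  # row group ends before the cutoff
skip_new = can_skip_ge(predicate, datetime(2021, 6, 1))    # row group may contain matches
```

Without the alignment step, `val > vmax` mixes a tz-aware and a tz-naive value, which is the comparison that raises the TypeError in the traceback. A real fix would also need to cover the other operators and the minimum statistic.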
