
Relax nanosecond datetime restriction in CF time decoding #9618

Open · wants to merge 61 commits into base: main
Conversation

@kmuehlbauer (Contributor) commented Oct 13, 2024

This is another attempt to resolve #7493. This goes a step further than #9580.

The idea of this PR is to automatically infer the resolution needed for decoding/encoding and keep only the constraints pandas imposes ("s" as the lowest resolution, "ns" as the highest). There is still the idea of a default resolution, but it should only take precedence if it doesn't clash with the automatic inference. This can be discussed, though. Update: as a first try, I've implemented a time-unit kwarg to set the default resolution on decode, which overrides the currently inferred resolution only towards higher resolution (e.g. "s" -> "ns").
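For context, the pandas constraint mentioned above ("s" lowest, "ns" highest) can be seen directly on pd.Timestamp. A small illustrative sketch, assuming pandas >= 2.0, not code from this PR:

```python
import pandas as pd

# pandas >= 2.0 supports four datetime resolutions: "s", "ms", "us", "ns"
ts = pd.Timestamp("2000-01-01")

# "s" is the coarsest resolution pandas allows ...
ts_s = ts.as_unit("s")
assert ts_s.unit == "s"

# ... and "ns" the finest; xarray previously always coerced to "ns"
ts_ns = ts.as_unit("ns")
assert ts_ns.unit == "ns"
```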

For sanity checking, and also for my own good, I've created a documentation page on time-coding in the internal dev section. Any suggestions (especially grammar) or ideas for enhancements are much appreciated.

There still might be room for consolidation of functions/methods (mostly in coding/times.py), but I have to leave it alone for some days. I went down that rabbit hole and need to relax, too 😬.

Looking forward to getting your insights here, @spencerkclark, @ChrisBarker-NOAA, @pydata/xarray.

Todo:

  • floating point handling
  • Handling in Variable constructor
  • update decoding tests to iterate over time_units (where appropriate)
  • ...

@kmuehlbauer (Contributor, Author)

Nice, mypy 1.12 is out and breaks our typing, 😭.

@TomNicholas (Member)

Nice, mypy 1.12 is out and breaks our typing, 😭

Can we pin it in the CI temporarily?

@TomNicholas mentioned this pull request Oct 14, 2024
@kmuehlbauer (Contributor, Author)

Can we pin it in the CI temporarily?

Yes, 1.11.2 was the last version.

@kmuehlbauer kmuehlbauer marked this pull request as ready for review October 14, 2024 18:05
@kmuehlbauer (Contributor, Author)

This is now ready for a first round of review. I think this is already in a quite usable state.

But no rush, this should be thoroughly tested.

@spencerkclark (Member)

Sounds good @kmuehlbauer! I’ll try and take an initial look this weekend.

@ChrisBarker-NOAA

Not to throw too much of a wrench in the works here (so feel free to disregard), but there's an issue I've faced with single-precision float time encoding:

Folks (carelessly :-( ) sometimes encode times as "days since ..." using a single-precision float. The problem here is not unnecessary precision, as you get with a double, but too little: if you go more than a few years out, you lose second precision. (The key problem with float time is that its precision is a function of its magnitude, which is not good for this use case.)

The end result is that I get things like model timesteps that are supposed to be hourly reporting as, e.g., 12:00:18 rather than 12:00:00.

One way I've dealt with this is rounding to the minute, or even to hours (if I know the output is hourly), or perhaps to 10 minutes.

Could/should xarray provide a facility for doing this? Maybe?

I guess what I'm proposing is that there be some way to tell xarray to store/save a time variable with, e.g., second precision, but to round it to something coarser when decoding.

Maybe this could even be automatic/inferred:

If a time is in float "days since" units, it almost certainly is NOT millisecond precision, or even second precision; you could even look at the values (the first one?) and see what the minimum precision is for that timespan.
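That inference could be sketched roughly like this (float_time_resolution_seconds is a hypothetical helper illustrating the idea for "days since" units; nothing like it exists in xarray):

```python
import numpy as np

def float_time_resolution_seconds(values_days):
    """Coarsest step (in seconds) that float32 'days since' values can
    resolve, determined by the float spacing at the largest magnitude."""
    vmax = np.max(np.abs(np.asarray(values_days, dtype=np.float32)))
    # np.spacing returns the gap to the next representable float32 value
    return float(np.spacing(vmax)) * 86400.0

# values around 2024 relative to a 1970 epoch (~19700 days):
# the resolvable step is on the order of minutes, not seconds
print(float_time_resolution_seconds([19723.0]))  # 168.75
```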

If I've done my math right, a single-precision float in "days since" units can only hold second precision out to a hundred days or so. So if the values span years, you don't have second precision.

Anyway, maybe way too much magic, but it would be nice for my use cases :-)

Example:

# 15 min timestep
In [57]: dates
Out[57]: 
[datetime.datetime(2024, 1, 1, 0, 0),
 datetime.datetime(2024, 1, 1, 0, 15),
 datetime.datetime(2024, 1, 1, 0, 30),
 datetime.datetime(2024, 1, 1, 0, 45)]

# common choice of units (though a bad one :-( )
In [58]: units
Out[58]: 'days since 1970-01-01T00:00:00'

# convert to numbers; nc4.date2num uses float64 by default
In [59]: nums_double = nc4.date2num(dates, units)

# truncate to float32
In [60]: nums_float = nums_double.astype(np.float32)

# convert back to datetimes:
In [61]: dates_float = nc4.num2date(nums_float, units)

In [62]: dates_float
Out[62]: 
array([cftime.DatetimeGregorian(2024, 1, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2024, 1, 1, 0, 14, 3, 750000, has_year_zero=False),
       cftime.DatetimeGregorian(2024, 1, 1, 0, 30, 56, 250000, has_year_zero=False),
       cftime.DatetimeGregorian(2024, 1, 1, 0, 45, 0, 0, has_year_zero=False)],
      dtype=object)
In [67]: [str(dt) for dt in dates_float]
Out[67]: 
['2024-01-01 00:00:00',
 '2024-01-01 00:14:03.750000',
 '2024-01-01 00:30:56.250000',
 '2024-01-01 00:45:00']

Ouch! So what were 15-minute timesteps are now off by about one minute. What's too bad is that rounding to the minute wouldn't be right either; you'd need to round to maybe 5 minutes?

Anyway, maybe this simply isn't xarray's problem to solve; data providers shouldn't make such mistakes :-(

@spencerkclark (Member)

@ChrisBarker-NOAA yeah, I agree this kind of situation is annoying, but my feeling is that trying to fix this automatically would be too much magic. Xarray has convenient functionality for rounding times, which can be used to correct this explicitly—that would be my preference. E.g. for your example it would look like:

>>> decoded
<xarray.DataArray 'time' (time: 4)> Size: 32B
array([cftime.DatetimeGregorian(2024, 1, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2024, 1, 1, 0, 14, 3, 750000, has_year_zero=False),
       cftime.DatetimeGregorian(2024, 1, 1, 0, 30, 56, 250000, has_year_zero=False),
       cftime.DatetimeGregorian(2024, 1, 1, 0, 45, 0, 0, has_year_zero=False)],
      dtype=object)
Dimensions without coordinates: time
>>> decoded.dt.round("5min")
<xarray.DataArray 'round' (time: 4)> Size: 32B
array([cftime.DatetimeGregorian(2024, 1, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2024, 1, 1, 0, 15, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2024, 1, 1, 0, 30, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2024, 1, 1, 0, 45, 0, 0, has_year_zero=False)],
      dtype=object)
Dimensions without coordinates: time

@ChrisBarker-NOAA

Xarray has convenient functionality for rounding times

Oh, nice, I had missed that! You're probably right; it's too much magic to do for people.

@kmuehlbauer (Contributor, Author)

@spencerkclark @ChrisBarker-NOAA I've implemented automated decoding of floating-point data to the needed resolution, even when the requested resolution does not apply.

Unfortunately, the behaviour outlined above is too involved to be built into the decoder. Nevertheless, maybe we can distill some best practices from your vast experience with data, @ChrisBarker-NOAA, and create a nice example of how to handle these difficulties?

@ChrisBarker-NOAA

create a nice example of how to handle these difficulties?

Sure -- where would be a good home for that?

@kmuehlbauer (Contributor, Author)

Not sure, but https://docs.xarray.dev/en/stable/user-guide/time-series.html could have a dedicated floating point date section.

@kmuehlbauer kmuehlbauer changed the title Relax nanosecond datetime restriction in CF time coding Relax nanosecond datetime restriction in CF time decoding Nov 21, 2024
@kmuehlbauer (Contributor, Author)

I've added a time_unit kwarg to decode_cf and the downstream functionality.

But instead of adding that kwarg, we could slightly overload decode_times to take one of "s", "ms", "us", "ns", with "ns" as the default.

This would have the positive effect that we wouldn't need the additional kwarg and wouldn't have to distribute it through the backends.

  • decode_times=None - behaves like decode_times=True
  • decode_times=False - no decoding
  • decode_times=True - decode times with default value ("ns")
  • decode_times="s" - decode times to at least "s"
  • decode_times="ms" - decode times to at least "ms"
  • decode_times="us" - decode times to at least "us"
  • decode_times="ns" - decode times to "ns"

We could guard decode_times=None and decode_times=True with a DeprecationWarning and announce our new default in the warning message (e.g. "us").

This approach would be fully backwards compatible. It advertises the change via DeprecationWarning in normal operation and also when issues appear during decoding.

If this is something which makes sense @shoyer, @dcherian, @spencerkclark, I'd add the needed changes to this PR.

@dcherian (Contributor)

Alternatively, we could make small progress on #4490 and have

from xarray.coding import DatetimeCoder

ds = xr.open_mfdataset(..., decode_times=DatetimeCoder(units="ms"))

In the long term, it seems nice to have the default use the "natural" units, i.e. "h" for units="hours since ..." and apparently even "M" for units="months since ..." (!!)

https://numpy.org/doc/stable/reference/arrays.datetime.html#basic-datetimes
The date units are years (‘Y’), months (‘M’), weeks (‘W’), and days (‘D’), while the time units are hours (‘h’), minutes (‘m’), seconds (‘s’), milliseconds (‘ms’),
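For illustration, numpy's datetime64 does accept those coarse units (unlike pandas); a quick sketch:

```python
import numpy as np

# numpy datetime64 supports coarse units such as hours and months
t_h = np.datetime64("2024-01-01T06", "h")
t_M = np.datetime64("2024-03", "M")

assert t_h.dtype == np.dtype("datetime64[h]")
assert t_M.dtype == np.dtype("datetime64[M]")
```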

@kmuehlbauer (Contributor, Author)

Alternatively, we could make small progress on #4490 and have

from xarray.coding import DatetimeCoder

ds = xr.open_mfdataset(..., decode_times=DatetimeCoder(units="ms"))

This took a while to sink in 😉 Yes, that's a neat move. I'll incorporate this suggestion.

In the long term, it seems nice to have the default use the "natural" units, i.e. "h" for units="hours since ..." and apparently even "M" for units="months since ..." (!!)

As long as we use pd.Timestamp for parsing the time unit specification (e.g. seconds since 1992-10-8 15:15:42.5 -6:00), we can only do this for units pd.Timestamp supports (‘s’, ‘ms’, ‘us’, and ‘ns’). We could add some code that checks whether the data can be represented in "days" or "hours" (as given in the time unit specification) and convert after parsing. I'm not sure how much is involved. And this won't work for indexes, as those are restricted to ‘s’, ‘ms’, ‘us’, and ‘ns’.
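A quick sketch of that restriction, assuming pandas >= 2.0 (the timestamp string is the CF example from above; the exact exception type for unsupported units is an assumption):

```python
import pandas as pd

ts = pd.Timestamp("1992-10-08 15:15:42.5-06:00")

# only "s", "ms", "us" and "ns" are accepted as Timestamp resolutions
assert ts.as_unit("us").unit == "us"

# coarser units such as "h" are rejected by pandas
try:
    ts.as_unit("h")
except (ValueError, NotImplementedError):
    print("'h' is not a supported Timestamp resolution")
```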

@spencerkclark (Member)

+1 for fewer arguments to open_dataset, however that is achieved!

Indeed for those practical reasons I do not think it is worth trying to match the on-disk units of integer data any more closely. Second precision already allows for a time span of roughly +/- 290 billion years (many times older than the Earth), which I think is plenty for most applications :).
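As a back-of-envelope check on that figure:

```python
# span of int64 seconds relative to an epoch, expressed in years
span_years = 2**63 / (86400 * 365.25)
print(f"+/- {span_years:.3g} years")  # roughly 2.92e+11, i.e. ~292 billion years
```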

Monthly or yearly units are also somewhat awkward to deal with due to their different (albeit often violated) definition in the CF conventions.
