Skip to content
This repository has been archived by the owner on Apr 30, 2021. It is now read-only.

Move general functionality upstream #157

Open
andersy005 opened this issue Oct 28, 2019 · 9 comments
Open

Move general functionality upstream #157

andersy005 opened this issue Oct 28, 2019 · 9 comments

Comments

@andersy005
Copy link
Contributor

While working on #156 and per discussion with @dcherian, I learned that esmlab's codebase is predominantly full of hacks.

I don't know the code that well at all (!) but this stuff looks sane.

I myself have started having a hard time understanding the existing codebase. It's been a while since I looked at esmlab's code base. So, I spent the last four days looking at the existing code base, and my take-away is that esmlab's codebase is full of hacks. And this makes it difficult to debug or even maintain :(

I am personally in favor of helping with pushing most of the general functionality into xarray and only keeping things that are domain-specific. I just found out that there has been recent progress in pydata/xarray#2922, and I am excited about this. Once pydata/xarray#2922 is merged, we can move most if not all existing functionality from https://github.com/NCAR/esmlab/blob/master/esmlab/statistics.py into xarray.

Originally posted by @andersy005 in #156 (comment)

@andersy005
Copy link
Contributor Author

Yeah +1 on moving things upstream.

From a quick look, it seems like esmlab basically adds time-aware weighted groupby/resampling; time bounds consistency; and time units/calendargpropagation (pydata/xarray#1614).

For the last one, it would be good to document how much of the table in pydata/xarray#1614 is implemented and what's left.

Originally posted by @dcherian in #156 (comment)

@andersy005
Copy link
Contributor Author

andersy005 commented Oct 28, 2019

@dcherian

From a quick look, it seems like esmlab basically adds time-aware weighted groupby/resampling; time bounds consistency; and time units/calendargpropagation

This is correct.

We are relying on the time bounds for all time-aware operations.

  1. Decoding Time: Internally, esmlab decodes time it by using the time bounds variable. Xarray doesn't seem to be aware of time-bounds variable when decoding time:
In [1]: import xarray as xr                                                                                                                        

In [2]: import numpy as np                                                                                                                         

   ...:     ds['variable_1'] = xr.DataArray( 
   ...:         np.append( 
   ...:             np.zeros([12, 2, 2], dtype='float32'), np.ones([12, 2, 2], dtype='float32'), axis=0 
   ...:         ), 
   ...:         dims=['time', 'lat', 'lon'], 
   ...:     ) 
   ...:     ds['variable_2'] = xr.DataArray( 
   ...:         np.append( 
   ...:             np.ones([12, 2, 2], dtype='float32'), np.zeros([12, 2, 2], dtype='float32'), axis=0 
   ...:         ), 
   ...:         dims=['time', 'lat', 'lon'], 
   ...:     ) 
   ...:     ds.time.attrs['units'] = 'days since 0001-01-01 00:00:00' 
   ...:     ds.time.attrs['calendar'] = 'noleap' 
   ...:     ds.time.attrs['bounds'] = 'time_bound' 
   ...:     return ds.copy(True) 
   ...:                                                                                                                                            

In [4]: ds = create_dataset()                                                                                                                      

In [5]: ds                                                                                                                                         
Out[5]: 
<xarray.Dataset>
Dimensions:     (d2: 2, lat: 2, lon: 2, time: 24)
Coordinates:
  * lat         (lat) int64 0 1
  * lon         (lon) int64 0 1
  * time        (time) float64 31.0 59.0 90.0 120.0 ... 638.0 669.0 699.0 730.0
  * d2          (d2) int64 0 1
Data variables:
    time_bound  (time, d2) float64 0.0 31.0 31.0 59.0 ... 699.0 699.0 730.0
    variable_1  (time, lat, lon) float32 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0
    variable_2  (time, lat, lon) float32 1.0 1.0 1.0 1.0 1.0 ... 0.0 0.0 0.0 0.0

In [6]: xr.decode_cf(ds)                                                                                                                           
Out[6]: 
<xarray.Dataset>
Dimensions:     (d2: 2, lat: 2, lon: 2, time: 24)
Coordinates:
  * lat         (lat) int64 0 1
  * lon         (lon) int64 0 1
  * time        (time) object 0001-02-01 00:00:00 ... 0003-01-01 00:00:00
  * d2          (d2) int64 0 1
Data variables:
    time_bound  (time, d2) object ...
    variable_1  (time, lat, lon) float32 ...
    variable_2  (time, lat, lon) float32 ...

Notice how the time axis is shifted to the right by one month.

Ideally, you would want the time to be:

In [10]: import cftime            
In [14]: cftime.num2date(ds.time_bound.mean(dim='d2'), units='days since 0001-01-01 00:00:00', calendar='noleap')                                  
Out[14]: 
array([cftime.DatetimeNoLeap(0001-01-16 12:00:00),
       cftime.DatetimeNoLeap(0001-02-15 00:00:00),
       cftime.DatetimeNoLeap(0001-03-16 12:00:00),
       cftime.DatetimeNoLeap(0001-04-16 00:00:00),
       cftime.DatetimeNoLeap(0001-05-16 12:00:00),
       cftime.DatetimeNoLeap(0001-06-16 00:00:00),
       cftime.DatetimeNoLeap(0001-07-16 12:00:00),
       cftime.DatetimeNoLeap(0001-08-16 12:00:00),
       cftime.DatetimeNoLeap(0001-09-16 00:00:00),
       cftime.DatetimeNoLeap(0001-10-16 12:00:00),
       cftime.DatetimeNoLeap(0001-11-16 00:00:00),
       cftime.DatetimeNoLeap(0001-12-16 12:00:00),
       cftime.DatetimeNoLeap(0002-01-16 12:00:00),
       cftime.DatetimeNoLeap(0002-02-15 00:00:00),
       cftime.DatetimeNoLeap(0002-03-16 12:00:00),
       cftime.DatetimeNoLeap(0002-04-16 00:00:00),
       cftime.DatetimeNoLeap(0002-05-16 12:00:00),
       cftime.DatetimeNoLeap(0002-06-16 00:00:00),
       cftime.DatetimeNoLeap(0002-07-16 12:00:00),
       cftime.DatetimeNoLeap(0002-08-16 12:00:00),
       cftime.DatetimeNoLeap(0002-09-16 00:00:00),
       cftime.DatetimeNoLeap(0002-10-16 12:00:00),
       cftime.DatetimeNoLeap(0002-11-16 00:00:00),
       cftime.DatetimeNoLeap(0002-12-16 12:00:00)], dtype=object)

Is this something that is reasonable to implement in xarray or is too domain-specific?

  1. Wrong results when applying ops such as resample() on the time axis. Because of the way xarray decodes the time, the user ends up getting wrong results when they try to do some operations along the time axis:

    • For instance, when applying annual resampling on the above dataset, we end up getting three years instead of two years:
In [19]: ds.resample(time='A').mean()                                                                                                              
Out[19]: 
<xarray.Dataset>
Dimensions:     (d2: 2, lat: 2, lon: 2, time: 3)
Coordinates:
  * time        (time) object 0001-12-31 00:00:00 ... 0003-12-31 00:00:00
  * lat         (lat) int64 0 1
  * d2          (d2) int64 0 1
  * lon         (lon) int64 0 1
Data variables:
    variable_1  (time, lat, lon) float32 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0
    variable_2  (time, lat, lon) float32 1.0 1.0 1.0 1.0 ... 0.0 0.0 0.0 0.0
  1. xarray discards the time_bounds variable. As a user, I would expect xarray to keep the time_bounds variable, and generate new values for it accordingly. For instance:
In [26]: esmlab.resample(create_dataset(), freq='ann')                                                                                             
Out[26]: 
<xarray.Dataset>
Dimensions:     (d2: 2, lat: 2, lon: 2, time: 2)
Coordinates:
  * lat         (lat) int64 0 1
  * lon         (lon) int64 0 1
  * time        (time) float64 181.7 546.7
  * d2          (d2) int64 0 1
Data variables:
    time_bound  (d2, time) float64 0.0 365.0 365.0 730.0
    variable_1  (time, lat, lon) float64 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
    variable_2  (time, lat, lon) float64 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0

Is it reasonable to have this in xarray or is domain specific (i.e. should we handle this in esmlab?) ?

@andersy005
Copy link
Contributor Author

xarray discards the time_bounds variable. As a user, I would expect xarray to keep the time_bounds variable, and generate new values for it accordingly.

#142 proposes a method to generate the new time-coordinate.

@dcherian
Copy link

  1. Is this something that is reasonable to implement in xarray or is too domain-specific?

    It is inconvenient but I don't think its wrong. The file is intentionally saved with time at the end of the month. Why wasn't it saved with the midpoint of time bounds in the first place? In any case, this won't fly (Time bounds returned after an operation with resample-method pydata/xarray#2231 (comment)). xarray basically does not deal with relationships between variables and that is unlikely to change. Our only exception (I think) is making sure bounds variables are encoded and decoded with the same units as the grid variable.

  2. As above.

  3. This looks like a bug to me. As a data variable, time_bounds should not be getting dropped. (coordinate variables get dropped groupby does not correctly handle non-dimensional coordinate pydata/xarray#2944 but I think that should change). Your code is incomplete (missing def create_dataset...). can you update that; i'll take a look.

Some of this might get solved eventually (pydata/xarray#1475) when xarray adds more fancy indexes but that's a longer term project.

I should be clear: esmlab adds very useful capabilities and a package like it is definitely required. It's just that general things like machinery for weighted operations should be moved upstream.

@andersy005
Copy link
Contributor Author

Your code is incomplete (missing def create_dataset...). can you update that; i'll take a look.

My bad. I partially pasted the code. Here's the complete version:

import xarray as xr
import numpy as np
def create_dataset():
    start_date = np.array([0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334], dtype=np.float64)
    start_date = np.append(start_date, start_date + 365)
    end_date = np.array([31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334, 365], dtype=np.float64)
    end_date = np.append(end_date, end_date + 365)
    ds = xr.Dataset(coords={'time': 24, 'lat': 2, 'lon': 2, 'd2': 2})
    ds['time'] = xr.DataArray(end_date, dims='time')
    ds['lat'] = xr.DataArray([0, 1], dims='lat')
    ds['lon'] = xr.DataArray([0, 1], dims='lon')
    ds['d2'] = xr.DataArray([0, 1], dims='d2')
    ds['time_bound'] = xr.DataArray(
        np.array([start_date, end_date]).transpose(), dims=['time', 'd2']
    )
    ds['variable_1'] = xr.DataArray(
        np.append(
            np.zeros([12, 2, 2], dtype='float32'), np.ones([12, 2, 2], dtype='float32'), axis=0
        ),
        dims=['time', 'lat', 'lon'],
    )
    ds['variable_2'] = xr.DataArray(
        np.append(
            np.ones([12, 2, 2], dtype='float32'), np.zeros([12, 2, 2], dtype='float32'), axis=0
        ),
        dims=['time', 'lat', 'lon'],
    )
    ds.time.encoding['units'] = 'days since 0001-01-01 00:00:00'
    ds.time.encoding['calendar'] = 'noleap'
    ds.time.encoding['bounds'] = 'time_bound'
    return ds

@andersy005
Copy link
Contributor Author

Why wasn't it saved with the midpoint of time bounds in the first place?

I agree with you. This is a CESM issue and not a general issue, and it makes more sense to address it at the source.

@andersy005
Copy link
Contributor Author

Or we could just have a separate minimal package to handle the strange behaviors/inconsistencies of CESM output that are problematic when using xarray.

@dcherian
Copy link

pydata/xarray#2922 could use some testing if someone's got the time

@andersy005
Copy link
Contributor Author

@dcherian, I took pydata/xarray#2922 for a test drive. From the results I am getting, it looks good to me:

Screen Shot 2019-12-11 at 11 49 29 AM

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants