
resample fails on HadCRUT dataset #126

Open
brianpm opened this issue May 1, 2019 · 8 comments

@brianpm

brianpm commented May 1, 2019

I'm trying to use resample to go from monthly to annual means in HadCRUT data. It seems to be having trouble with the time coordinate, and ultimately complains "TypeError: cannot perform reduce with flexible type".

import numpy as np
import xarray as xr
import esmlab
from pathlib import Path

troot = Path("/glade/work/brianpm/HadCRUT")
tanomfil = xr.open_dataset(troot / "HadCRUT.4.6.0.0.median.nc", decode_times=False)
esmlab.resample(tanomfil, 'ann', time_coord_name='time')

This was using esmlab '2019.4.27'
xarray 0.12.1
numpy 1.16.3
python 3.7.3

Forgot to mention: I tried with both decode_times=False and decode_times=True.

@andersy005
Contributor

@brianpm, I am going to look into this and will get back to you.

@andersy005
Contributor

andersy005 commented May 1, 2019

  • First, open the dataset with decode_times=True.

The issue is caused by the data type of the field_status variable, which is a string. Esmlab tries to apply resample to all data variables, including field_status, and fails because of that dtype:

In [14]: tanomfil
Out[14]:
<xarray.Dataset>
Dimensions:              (latitude: 36, longitude: 72, time: 2030)
Coordinates:
  * latitude             (latitude) float32 -87.5 -82.5 -77.5 ... 77.5 82.5 87.5
  * longitude            (longitude) float32 -177.5 -172.5 ... 172.5 177.5
  * time                 (time) datetime64[ns] 1850-01-16T12:00:00 ... 2019-02-15
Data variables:
    temperature_anomaly  (time, latitude, longitude) float32 ...
    field_status         (time) |S1 ...

One solution is to drop this variable:

In [16]: ds = tanomfil.drop('field_status')

In [17]: esmlab.resample(ds, 'ann',  time_coord_name='time')
/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/xarray/core/nanops.py:159: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis=axis, dtype=dtype)
Out[17]:
<xarray.Dataset>
Dimensions:              (latitude: 36, longitude: 72, time: 170)
Coordinates:
  * latitude             (latitude) float64 -87.5 -82.5 -77.5 ... 77.5 82.5 87.5
  * longitude            (longitude) float64 -177.5 -172.5 ... 172.5 177.5
  * time                 (time) datetime64[ns] 1850-07-01T17:00:00 ... 2019-01-31T06:00:00
Data variables:
    temperature_anomaly  (time, latitude, longitude) float64 nan nan ... nan nan
Attributes:
    history:  \n2019-05-01 13:14:37.171125 esmlab.resample(<DATASET>, freq="a...

Another solution is to set field_status as a coordinate variable and then call resample:

In [27]: tanomfil = tanomfil.set_coords(['field_status'])

In [28]: tanomfil
Out[28]:
<xarray.Dataset>
Dimensions:              (latitude: 36, longitude: 72, time: 2030)
Coordinates:
  * latitude             (latitude) float32 -87.5 -82.5 -77.5 ... 77.5 82.5 87.5
  * longitude            (longitude) float32 -177.5 -172.5 ... 172.5 177.5
  * time                 (time) datetime64[ns] 1850-01-16T12:00:00 ... 2019-02-15
    field_status         (time) |S1 b'f' b'f' b'f' b'f' ... b'p' b'p' b'p' b'p'
Data variables:
    temperature_anomaly  (time, latitude, longitude) float32 ...
Attributes:
    title:                  HadCRUT4 near-surface temperature ensemble data -...
    institution:            Met Office Hadley Centre / Climatic Research Unit...
    history:                Updated at 25/03/2019 16:03:56
    source:                 CRUTEM.4.6.0.0, HadSST.3.1.1.0
    comment:
    reference:              Morice, C. P., J. J. Kennedy, N. A. Rayner, and P...
    version:                HadCRUT.4.6.0.0
    Conventions:            CF-1.0
    ensemble_members:       100
    ensemble_member_index:  0

In [29]: esmlab.resample(tanomfil, 'ann',  time_coord_name='time')
/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/xarray/core/nanops.py:159: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis=axis, dtype=dtype)
Out[29]:
<xarray.Dataset>
Dimensions:              (latitude: 36, longitude: 72, time: 170)
Coordinates:
  * latitude             (latitude) float64 -87.5 -82.5 -77.5 ... 77.5 82.5 87.5
  * longitude            (longitude) float64 -177.5 -172.5 ... 172.5 177.5
  * time                 (time) datetime64[ns] 1850-07-01T17:00:00 ... 2019-01-31T06:00:00
Data variables:
    temperature_anomaly  (time, latitude, longitude) float64 nan nan ... nan nan
Attributes:
    history:  \n2019-05-01 13:20:25.645865 esmlab.resample(<DATASET>, freq="a...

@brianpm
Author

brianpm commented May 1, 2019

That makes sense.

A nice alternative would be the ability to apply resample to just a DataArray instead of the whole Dataset. Is that possible, or does time_bnds then become super problematic?
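
For illustration, one workaround today might be to promote the DataArray back to a one-variable Dataset (a sketch, assuming esmlab.resample is otherwise happy with the result):

# Sketch: wrap a single named DataArray in a one-variable Dataset
da = tanomfil['temperature_anomaly']
esmlab.resample(da.to_dataset(), 'ann', time_coord_name='time')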

@andersy005
Contributor

A nice alternative would be to be able to apply resample to just a DataArray instead of the whole dataset.

One easy solution would be to allow the user to pass a list of the variables they would like resample applied to. For example:

esmlab.resample(tanomfil, 'ann',  time_coord_name='time', data_vars=['temperature_anomaly'])
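
Internally, that might amount to subsetting the Dataset before the resample machinery runs (a rough sketch; data_vars here is a hypothetical keyword and _select_data_vars a hypothetical helper):

def _select_data_vars(dset, data_vars=None):
    # Hypothetical helper: keep only the requested data variables and drop
    # the rest before resampling.
    if data_vars is None:
        return dset
    to_drop = [v for v in dset.data_vars if v not in data_vars]
    return dset.drop(to_drop)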

@matt-long, any thoughts?

@bradyrx
Contributor

bradyrx commented May 1, 2019

I imagine you could also do a quick check of the variable type when you loop through data_vars. If you encounter a str, datetime, etc., you don't apply the resample there.
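
Something along those lines, for example (a sketch; np.issubdtype treats all numeric dtypes, int or float, uniformly):

import numpy as np

def is_resamplable(darr):
    # Only numeric variables can be meaningfully averaged; strings,
    # datetimes, etc. are skipped.
    return np.issubdtype(darr.dtype, np.number)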

@bradyrx
Contributor

bradyrx commented May 1, 2019

Note that native xarray just drops variables that can't be resampled in the requested way. It would be a nice fix to avoid that behavior here, although you might want to throw a warning that certain variables weren't resampled as requested due to their data type.

import numpy as np
import pandas as pd
import xarray as xr
A = xr.DataArray(np.random.rand(100,), dims='time')
A.name = 'ints'
B = xr.DataArray(['str']*100, dims='time')
B.name = 'strs'
ds = xr.merge([A, B])
ds['time'] = pd.date_range('1900', '2000', freq='A')
print(ds)

<xarray.Dataset>
Dimensions:  (time: 100)
Coordinates:
  * time     (time) datetime64[ns] 1900-12-31 1901-12-31 ... 1999-12-31
Data variables:
    ints     (time) float64 0.8798 0.4835 0.6354 0.354 ... 0.1584 0.7635 0.8525
    strs     (time) <U3 'str' 'str' 'str' 'str' ... 'str' 'str' 'str' 'str'

ds.resample(time='A').mean()

<xarray.Dataset>
Dimensions:  (time: 100)
Coordinates:
  * time     (time) datetime64[ns] 1900-12-31 1901-12-31 ... 1999-12-31
Data variables:
    ints     (time) float64 0.8798 0.4835 0.6354 0.354 ... 0.1584 0.7635 0.8525
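
The warning part could be as simple as diffing the variable lists before and after the operation (a sketch using the stdlib warnings module, reusing ds from the snippet above):

import warnings

resampled = ds.resample(time='A').mean()
dropped = set(ds.data_vars) - set(resampled.data_vars)
if dropped:
    warnings.warn(f"variables not resampled due to their dtype: {sorted(dropped)}")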

@andersy005
Contributor

andersy005 commented May 1, 2019

@bradyrx, we are using dset.apply() internally, and I believe this is where the problem is coming from:

esmlab/esmlab/core.py

Lines 488 to 505 in 8ced4d6

def weighted_mean_arr(darr, wgts=None):
    # if NaN are present, we need to use individual weights
    cond = darr.isnull()
    ones = xr.where(cond, 0.0, 1.0)
    mask = (
        darr.resample({self.time_coord_name: 'A'}).mean(dim=self.time_coord_name).notnull()
    )
    da_sum = (
        (darr * wgts).resample({self.time_coord_name: 'A'}).sum(dim=self.time_coord_name)
    )
    ones_out = (
        (ones * wgts).resample({self.time_coord_name: 'A'}).sum(dim=self.time_coord_name)
    )
    ones_out = ones_out.where(ones_out > 0.0)
    da_weighted_mean = da_sum / ones_out
    return da_weighted_mean.where(mask)

ds_resample_mean = dset.apply(weighted_mean_arr, wgts=wgts)

I am curious to know how one could add a check before calling .apply().

@bradyrx
Contributor

bradyrx commented May 2, 2019

Here's a naive solution off the top of my head (using a slightly modified version of the weighted mean function here):

def weighted_mean_arr(darr, wgts=1, time_coord_name='time'): 
    # if NaN are present, we need to use individual weights 
    cond = darr.isnull() 
    ones = xr.where(cond, 0.0, 1.0) 
    mask = ( 
        darr.resample({time_coord_name: 'A'}).mean(dim=time_coord_name).notnull() 
    ) 
    da_sum = ( 
        (darr * wgts).resample({time_coord_name: 'A'}).sum(dim=time_coord_name) 
    ) 
    ones_out = ( 
        (ones * wgts).resample({time_coord_name: 'A'}).sum(dim=time_coord_name) 
    ) 
    ones_out = ones_out.where(ones_out > 0.0) 
    da_weighted_mean = da_sum / ones_out 
    return da_weighted_mean.where(mask)

import numpy as np
import pandas as pd
import xarray as xr

# Generate monthly dummy data with two string variables
A = xr.DataArray(np.random.rand(120,), dims='time')
A.name = 'ints'
B = xr.DataArray(['str']*120, dims='time')
B.name = 'strs'
C = xr.DataArray(['foo']*120, dims='time')
C.name = 'strs2'
ds = xr.merge([A, B, C])
ds['time'] = pd.date_range('1900-01', '1910-01', freq='M')
print(ds)

<xarray.Dataset>
Dimensions:  (time: 120)
Coordinates:
  * time     (time) datetime64[ns] 1900-01-31 1900-02-28 ... 1909-12-31
Data variables:
    ints     (time) float64 0.3267 0.4233 0.4777 0.6476 ... 0.3189 0.9488 0.9251
    strs     (time) <U3 'str' 'str' 'str' 'str' ... 'str' 'str' 'str' 'str'
    strs2    (time) <U3 'foo' 'foo' 'foo' 'foo' ... 'foo' 'foo' 'foo' 'foo'

# this list probably isn't exhaustive
acceptable_types = [np.dtype('float'), np.dtype('int'), np.dtype('double')]
drop_vars = []
for var in ds.data_vars:
    if np.asarray(ds[var]).dtype not in acceptable_types:
        drop_vars.append(var)

# extract variables that can't be resampled into a separate dataset
no_apply = ds[drop_vars]
# apply function only to non-string types
resampled = ds.drop(drop_vars).apply(weighted_mean_arr)

print(resampled)

<xarray.Dataset>
Dimensions:  (time: 10)
Coordinates:
  * time     (time) datetime64[ns] 1900-12-31 1901-12-31 ... 1909-12-31
Data variables:
    ints     (time) float64 0.4236 0.5849 0.5722 0.4475 ... 0.4421 0.5669 0.5151

print(no_apply)

<xarray.Dataset>
Dimensions:  (time: 120)
Coordinates:
  * time     (time) datetime64[ns] 1900-01-31 1900-02-28 ... 1909-12-31
Data variables:
    strs     (time) <U3 'str' 'str' 'str' 'str' ... 'str' 'str' 'str' 'str'
    strs2    (time) <U3 'foo' 'foo' 'foo' 'foo' ... 'foo' 'foo' 'foo' 'foo'

As in, you split the Dataset into two separate Datasets, apply the function to one, and then merge them at the end. The issue is that you now have two time dimensions with different coordinates. At that point, you'd have to rename one of them to something like TIME (which LANL does for some of their netCDF files with differing temporal resolutions). I don't really like this solution, though, as it gets confusing.
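
For completeness, the rename-and-merge step would look something like this (TIME is just an illustrative name, following the LANL convention mentioned above):

# Keep the original monthly axis under a different name so it doesn't
# collide with the resampled annual 'time' coordinate.
no_apply = no_apply.rename({'time': 'TIME'})
combined = xr.merge([resampled, no_apply])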

In the end, should you really be maintaining a Dataset with different resolutions on the time axis? Maybe xarray has it right in just dropping any non-quantitative variables (i.e., just ditch no_apply in this case).
