
resample fails on HadCRUT dataset #126

Open
brianpm opened this issue May 1, 2019 · 8 comments

@brianpm

brianpm commented May 1, 2019

I'm trying to use resample to go from monthly to annual means in HadCRUT data. It seems to be having trouble with the time coordinate, and ultimately complains "TypeError: cannot perform reduce with flexible type".

import numpy as np
import xarray as xr
import esmlab
from pathlib import Path

troot = Path("/glade/work/brianpm/HadCRUT")
tanomfil = xr.open_dataset(troot / "HadCRUT.4.6.0.0.median.nc", decode_times=False)
esmlab.resample(tanomfil, 'ann', time_coord_name='time')

This was using esmlab '2019.4.27'
xarray 0.12.1
numpy 1.16.3
python 3.7.3

Forgot to mention: I tried with both decode_times=False and decode_times=True.

@andersy005
Contributor

@brianpm, I am going to look into this and will get back to you.

@andersy005
Contributor

andersy005 commented May 1, 2019

  • First, open the dataset with decode_times=True.

The issue is caused by the data type of the field_status variable, which is a string. Esmlab tries to apply resample to all data variables, including field_status, and fails because of that dtype:

In [14]: tanomfil
Out[14]:
<xarray.Dataset>
Dimensions:              (latitude: 36, longitude: 72, time: 2030)
Coordinates:
  * latitude             (latitude) float32 -87.5 -82.5 -77.5 ... 77.5 82.5 87.5
  * longitude            (longitude) float32 -177.5 -172.5 ... 172.5 177.5
  * time                 (time) datetime64[ns] 1850-01-16T12:00:00 ... 2019-02-15
Data variables:
    temperature_anomaly  (time, latitude, longitude) float32 ...
    field_status         (time) |S1 ...

One solution is to drop this variable:

In [16]: ds = tanomfil.drop('field_status')

In [17]: esmlab.resample(ds, 'ann',  time_coord_name='time')
/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/xarray/core/nanops.py:159: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis=axis, dtype=dtype)
Out[17]:
<xarray.Dataset>
Dimensions:              (latitude: 36, longitude: 72, time: 170)
Coordinates:
  * latitude             (latitude) float64 -87.5 -82.5 -77.5 ... 77.5 82.5 87.5
  * longitude            (longitude) float64 -177.5 -172.5 ... 172.5 177.5
  * time                 (time) datetime64[ns] 1850-07-01T17:00:00 ... 2019-01-31T06:00:00
Data variables:
    temperature_anomaly  (time, latitude, longitude) float64 nan nan ... nan nan
Attributes:
    history:  \n2019-05-01 13:14:37.171125 esmlab.resample(<DATASET>, freq="a...

Another solution is to set field_status as a coordinate variable and then call resample:

In [27]: tanomfil = tanomfil.set_coords(['field_status'])

In [28]: tanomfil
Out[28]:
<xarray.Dataset>
Dimensions:              (latitude: 36, longitude: 72, time: 2030)
Coordinates:
  * latitude             (latitude) float32 -87.5 -82.5 -77.5 ... 77.5 82.5 87.5
  * longitude            (longitude) float32 -177.5 -172.5 ... 172.5 177.5
  * time                 (time) datetime64[ns] 1850-01-16T12:00:00 ... 2019-02-15
    field_status         (time) |S1 b'f' b'f' b'f' b'f' ... b'p' b'p' b'p' b'p'
Data variables:
    temperature_anomaly  (time, latitude, longitude) float32 ...
Attributes:
    title:                  HadCRUT4 near-surface temperature ensemble data -...
    institution:            Met Office Hadley Centre / Climatic Research Unit...
    history:                Updated at 25/03/2019 16:03:56
    source:                 CRUTEM.4.6.0.0, HadSST.3.1.1.0
    comment:
    reference:              Morice, C. P., J. J. Kennedy, N. A. Rayner, and P...
    version:                HadCRUT.4.6.0.0
    Conventions:            CF-1.0
    ensemble_members:       100
    ensemble_member_index:  0

In [29]: esmlab.resample(tanomfil, 'ann',  time_coord_name='time')
/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/xarray/core/nanops.py:159: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis=axis, dtype=dtype)
Out[29]:
<xarray.Dataset>
Dimensions:              (latitude: 36, longitude: 72, time: 170)
Coordinates:
  * latitude             (latitude) float64 -87.5 -82.5 -77.5 ... 77.5 82.5 87.5
  * longitude            (longitude) float64 -177.5 -172.5 ... 172.5 177.5
  * time                 (time) datetime64[ns] 1850-07-01T17:00:00 ... 2019-01-31T06:00:00
Data variables:
    temperature_anomaly  (time, latitude, longitude) float64 nan nan ... nan nan
Attributes:
    history:  \n2019-05-01 13:20:25.645865 esmlab.resample(<DATASET>, freq="a...

@brianpm
Author

brianpm commented May 1, 2019

That makes sense.

A nice alternative would be the ability to apply resample to just a DataArray instead of the whole Dataset. Is that possible, or does time_bnds then become super problematic?
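
For illustration, one workaround today might be to promote the DataArray back to a one-variable Dataset (a sketch, assuming esmlab.resample is otherwise happy with the result):

# Sketch: wrap a single named DataArray in a one-variable Dataset
da = tanomfil['temperature_anomaly']
esmlab.resample(da.to_dataset(), 'ann', time_coord_name='time')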

@andersy005
Contributor

A nice alternative would be to be able to apply resample to just a DataArray instead of the whole dataset.

One easy solution would be to allow the user to pass a list of the variables they would like resample applied to. For example:

esmlab.resample(tanomfil, 'ann',  time_coord_name='time', data_vars=['temperature_anomaly'])
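
Internally, that might amount to subsetting the Dataset before the resample machinery runs (a rough sketch; data_vars here is a hypothetical keyword and _select_data_vars a hypothetical helper):

def _select_data_vars(dset, data_vars=None):
    # Hypothetical helper: keep only the requested data variables and drop
    # the rest before resampling.
    if data_vars is None:
        return dset
    to_drop = [v for v in dset.data_vars if v not in data_vars]
    return dset.drop(to_drop)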

@matt-long, any thoughts?

@bradyrx
Contributor

bradyrx commented May 1, 2019

I imagine you could also do a quick check of the variable type when you loop through data_vars. If you encounter a str, datetime, etc., you don't apply the resample there.
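
Something along those lines, for example (a sketch; np.issubdtype treats all numeric dtypes, int or float, uniformly):

import numpy as np

def is_resamplable(darr):
    # Only numeric variables can be meaningfully averaged; strings,
    # datetimes, etc. are skipped.
    return np.issubdtype(darr.dtype, np.number)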

@bradyrx
Contributor

bradyrx commented May 1, 2019

Note that native xarray just drops variables that can't be resampled in the requested way. It would be a nice fix to avoid that behavior here, although you might want to throw a warning that certain variables weren't resampled as requested due to their data type.

import numpy as np
import pandas as pd
import xarray as xr
A = xr.DataArray(np.random.rand(100,), dims='time')
A.name = 'ints'
B = xr.DataArray(['str']*100, dims='time')
B.name = 'strs'
ds = xr.merge([A, B])
ds['time'] = pd.date_range('1900', '2000', freq='A')
print(ds)

<xarray.Dataset>
Dimensions:  (time: 100)
Coordinates:
  * time     (time) datetime64[ns] 1900-12-31 1901-12-31 ... 1999-12-31
Data variables:
    ints     (time) float64 0.8798 0.4835 0.6354 0.354 ... 0.1584 0.7635 0.8525
    strs     (time) <U3 'str' 'str' 'str' 'str' ... 'str' 'str' 'str' 'str'

ds.resample(time='A').mean()

<xarray.Dataset>
Dimensions:  (time: 100)
Coordinates:
  * time     (time) datetime64[ns] 1900-12-31 1901-12-31 ... 1999-12-31
Data variables:
    ints     (time) float64 0.8798 0.4835 0.6354 0.354 ... 0.1584 0.7635 0.8525
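
The warning part could be as simple as diffing the variable lists before and after the operation (a sketch using the stdlib warnings module, reusing ds from the snippet above):

import warnings

resampled = ds.resample(time='A').mean()
dropped = set(ds.data_vars) - set(resampled.data_vars)
if dropped:
    warnings.warn(f"variables not resampled due to their dtype: {sorted(dropped)}")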

@andersy005
Contributor

andersy005 commented May 1, 2019

@bradyrx, we are using dset.apply() internally, and I believe this is where the problem is coming from:

esmlab/esmlab/core.py

Lines 488 to 505 in 8ced4d6

def weighted_mean_arr(darr, wgts=None):
    # if NaN are present, we need to use individual weights
    cond = darr.isnull()
    ones = xr.where(cond, 0.0, 1.0)
    mask = (
        darr.resample({self.time_coord_name: 'A'}).mean(dim=self.time_coord_name).notnull()
    )
    da_sum = (
        (darr * wgts).resample({self.time_coord_name: 'A'}).sum(dim=self.time_coord_name)
    )
    ones_out = (
        (ones * wgts).resample({self.time_coord_name: 'A'}).sum(dim=self.time_coord_name)
    )
    ones_out = ones_out.where(ones_out > 0.0)
    da_weighted_mean = da_sum / ones_out
    return da_weighted_mean.where(mask)

ds_resample_mean = dset.apply(weighted_mean_arr, wgts=wgts)

I am curious to know how one could add a check before calling .apply().

@bradyrx
Contributor

bradyrx commented May 2, 2019

Here's a naive solution off the top of my head (using a slightly modified version of the weighted mean function here):

def weighted_mean_arr(darr, wgts=1, time_coord_name='time'): 
    # if NaN are present, we need to use individual weights 
    cond = darr.isnull() 
    ones = xr.where(cond, 0.0, 1.0) 
    mask = ( 
        darr.resample({time_coord_name: 'A'}).mean(dim=time_coord_name).notnull() 
    ) 
    da_sum = ( 
        (darr * wgts).resample({time_coord_name: 'A'}).sum(dim=time_coord_name) 
    ) 
    ones_out = ( 
        (ones * wgts).resample({time_coord_name: 'A'}).sum(dim=time_coord_name) 
    ) 
    ones_out = ones_out.where(ones_out > 0.0) 
    da_weighted_mean = da_sum / ones_out 
    return da_weighted_mean.where(mask)

import numpy as np
import pandas as pd
import xarray as xr

# Generate monthly dummy data with two string variables
A = xr.DataArray(np.random.rand(120,), dims='time')
A.name = 'ints'
B = xr.DataArray(['str']*120, dims='time')
B.name = 'strs'
C = xr.DataArray(['foo']*120, dims='time')
C.name = 'strs2'
ds = xr.merge([A, B, C])
ds['time'] = pd.date_range('1900-01', '1910-01', freq='M')
print(ds)

<xarray.Dataset>
Dimensions:  (time: 120)
Coordinates:
  * time     (time) datetime64[ns] 1900-01-31 1900-02-28 ... 1909-12-31
Data variables:
    ints     (time) float64 0.3267 0.4233 0.4777 0.6476 ... 0.3189 0.9488 0.9251
    strs     (time) <U3 'str' 'str' 'str' 'str' ... 'str' 'str' 'str' 'str'
    strs2    (time) <U3 'foo' 'foo' 'foo' 'foo' ... 'foo' 'foo' 'foo' 'foo'

# this list probably isn't exhaustive
acceptable_types = [np.dtype('float'), np.dtype('int'), np.dtype('double')]
drop_vars = []
for var in ds.data_vars:
    if np.asarray(ds[var]).dtype not in acceptable_types:
        drop_vars.append(var)

# extract variables that can't be resampled into a separate dataset
no_apply = ds[drop_vars]
# apply function only to non-string types
resampled = ds.drop(drop_vars).apply(weighted_mean_arr)

print(resampled)

<xarray.Dataset>
Dimensions:  (time: 10)
Coordinates:
  * time     (time) datetime64[ns] 1900-12-31 1901-12-31 ... 1909-12-31
Data variables:
    ints     (time) float64 0.4236 0.5849 0.5722 0.4475 ... 0.4421 0.5669 0.5151

print(no_apply)

<xarray.Dataset>
Dimensions:  (time: 120)
Coordinates:
  * time     (time) datetime64[ns] 1900-01-31 1900-02-28 ... 1909-12-31
Data variables:
    strs     (time) <U3 'str' 'str' 'str' 'str' ... 'str' 'str' 'str' 'str'
    strs2    (time) <U3 'foo' 'foo' 'foo' 'foo' ... 'foo' 'foo' 'foo' 'foo'

As in, you split the Dataset into two separate Datasets, apply the function to one, and then merge them at the end. The issue is that you now have two time dimensions with different coordinates. At that point, you'd have to rename one of them to something like TIME (which LANL does for some of their netCDF files with differing temporal resolutions). I don't really like this solution, though, as it gets confusing.
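
For completeness, the rename-and-merge step would look something like this (TIME is just an illustrative name, following the LANL convention mentioned above):

# Keep the original monthly axis under a different name so it doesn't
# collide with the resampled annual 'time' coordinate.
no_apply = no_apply.rename({'time': 'TIME'})
combined = xr.merge([resampled, no_apply])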

In the end, should you really be maintaining a Dataset with different resolutions on the time axis? Maybe xarray has it right in just dropping any non-quantitative variables (i.e., just ditch no_apply in this case).
