resample fails on HadCRUT dataset #126
@brianpm, I am going to look into this and will get back to you.
The issue is caused by the field_status variable, which has a byte-string dtype (|S1) that cannot be averaged:

In [14]: tanomfil
Out[14]:
<xarray.Dataset>
Dimensions: (latitude: 36, longitude: 72, time: 2030)
Coordinates:
* latitude (latitude) float32 -87.5 -82.5 -77.5 ... 77.5 82.5 87.5
* longitude (longitude) float32 -177.5 -172.5 ... 172.5 177.5
* time (time) datetime64[ns] 1850-01-16T12:00:00 ... 2019-02-15
Data variables:
temperature_anomaly (time, latitude, longitude) float32 ...
field_status (time) |S1 ...

One solution is to drop this variable:

In [16]: ds = tanomfil.drop('field_status')
In [17]: esmlab.resample(ds, 'ann', time_coord_name='time')
/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/xarray/core/nanops.py:159: RuntimeWarning: Mean of empty slice
return np.nanmean(a, axis=axis, dtype=dtype)
Out[17]:
<xarray.Dataset>
Dimensions: (latitude: 36, longitude: 72, time: 170)
Coordinates:
* latitude (latitude) float64 -87.5 -82.5 -77.5 ... 77.5 82.5 87.5
* longitude (longitude) float64 -177.5 -172.5 ... 172.5 177.5
* time (time) datetime64[ns] 1850-07-01T17:00:00 ... 2019-01-31T06:00:00
Data variables:
temperature_anomaly (time, latitude, longitude) float64 nan nan ... nan nan
Attributes:
history: \n2019-05-01 13:14:37.171125 esmlab.resample(<DATASET>, freq="a...

Another solution is to set field_status as a coordinate:

In [27]: tanomfil = tanomfil.set_coords(['field_status'])
In [28]: tanomfil
Out[28]:
<xarray.Dataset>
Dimensions: (latitude: 36, longitude: 72, time: 2030)
Coordinates:
* latitude (latitude) float32 -87.5 -82.5 -77.5 ... 77.5 82.5 87.5
* longitude (longitude) float32 -177.5 -172.5 ... 172.5 177.5
* time (time) datetime64[ns] 1850-01-16T12:00:00 ... 2019-02-15
field_status (time) |S1 b'f' b'f' b'f' b'f' ... b'p' b'p' b'p' b'p'
Data variables:
temperature_anomaly (time, latitude, longitude) float32 ...
Attributes:
title: HadCRUT4 near-surface temperature ensemble data -...
institution: Met Office Hadley Centre / Climatic Research Unit...
history: Updated at 25/03/2019 16:03:56
source: CRUTEM.4.6.0.0, HadSST.3.1.1.0
comment:
reference: Morice, C. P., J. J. Kennedy, N. A. Rayner, and P...
version: HadCRUT.4.6.0.0
Conventions: CF-1.0
ensemble_members: 100
ensemble_member_index: 0
In [29]: esmlab.resample(tanomfil, 'ann', time_coord_name='time')
/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/xarray/core/nanops.py:159: RuntimeWarning: Mean of empty slice
return np.nanmean(a, axis=axis, dtype=dtype)
Out[29]:
<xarray.Dataset>
Dimensions: (latitude: 36, longitude: 72, time: 170)
Coordinates:
* latitude (latitude) float64 -87.5 -82.5 -77.5 ... 77.5 82.5 87.5
* longitude (longitude) float64 -177.5 -172.5 ... 172.5 177.5
* time (time) datetime64[ns] 1850-07-01T17:00:00 ... 2019-01-31T06:00:00
Data variables:
temperature_anomaly (time, latitude, longitude) float64 nan nan ... nan nan
Attributes:
history: \n2019-05-01 13:20:25.645865 esmlab.resample(<DATASET>, freq="a...
That makes sense. A nice alternative would be to be able to apply resample to just a DataArray instead of the whole dataset. Is that possible?
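For what it's worth, plain xarray does let you resample a single DataArray directly; a minimal sketch with dummy monthly data (not the actual HadCRUT file):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Dummy monthly series standing in for the HadCRUT anomaly field
da = xr.DataArray(
    np.random.rand(24),
    dims='time',
    coords={'time': pd.date_range('2000-01', periods=24, freq='MS')},
    name='temperature_anomaly',
)

# Annual mean on the DataArray alone -- no string variables in the way
ann = da.resample(time='YS').mean()
print(ann.sizes['time'])  # 2
```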
One easy solution would be to allow the user to pass a list of variables to apply the resample to. For example:

esmlab.resample(tanomfil, 'ann', time_coord_name='time', data_vars=['temperature_anomaly'])

@matt-long, any thoughts?
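A rough sketch of what that signature could do internally -- `resample_subset` and its arguments are hypothetical here, not esmlab's actual API:

```python
import numpy as np
import pandas as pd
import xarray as xr

def resample_subset(ds, freq='YS', time_coord_name='time', data_vars=None):
    # Hypothetical helper: restrict the resample to the requested
    # data variables; data_vars=None means "all of them".
    if data_vars is not None:
        ds = ds[list(data_vars)]
    return ds.resample({time_coord_name: freq}).mean()

# Dummy dataset mixing a numeric and a string variable
ds = xr.Dataset(
    {
        'temperature_anomaly': ('time', np.random.rand(24)),
        'field_status': ('time', ['f'] * 24),
    },
    coords={'time': pd.date_range('1900-01', periods=24, freq='MS')},
)

out = resample_subset(ds, data_vars=['temperature_anomaly'])
print(list(out.data_vars))  # ['temperature_anomaly']
```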
I imagine you could also do a quick check for variable type when you loop through the data variables.
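Such a dtype check could look something like this -- a sketch using `np.issubdtype` on dummy data; the variable names mimic the HadCRUT file but are made up here:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Dummy dataset; 'field_status' stands in for HadCRUT's byte-string variable
ds = xr.Dataset(
    {
        'temperature_anomaly': ('time', np.random.rand(12)),
        'field_status': ('time', np.array(['f'] * 12, dtype='S1')),
    },
    coords={'time': pd.date_range('1900-01', periods=12, freq='MS')},
)

# Keep only variables with a numeric dtype; np.issubdtype covers
# float, int, and complex without listing every dtype by hand
numeric_vars = [
    name for name, da in ds.data_vars.items()
    if np.issubdtype(da.dtype, np.number)
]
print(numeric_vars)  # ['temperature_anomaly']
```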
Note that native xarray resample silently drops the string variables:

import numpy as np
import pandas as pd
import xarray as xr
A = xr.DataArray(np.random.rand(100,), dims='time')
A.name = 'ints'
B = xr.DataArray(['str']*100, dims='time')
B.name = 'strs'
ds = xr.merge([A, B])
ds['time'] = pd.date_range('1900', '2000', freq='A')
print(ds)
<xarray.Dataset>
Dimensions: (time: 100)
Coordinates:
* time (time) datetime64[ns] 1900-12-31 1901-12-31 ... 1999-12-31
Data variables:
ints (time) float64 0.8798 0.4835 0.6354 0.354 ... 0.1584 0.7635 0.8525
strs (time) <U3 'str' 'str' 'str' 'str' ... 'str' 'str' 'str' 'str'
ds.resample(time='A').mean()
<xarray.Dataset>
Dimensions: (time: 100)
Coordinates:
* time (time) datetime64[ns] 1900-12-31 1901-12-31 ... 1999-12-31
Data variables:
ints (time) float64 0.8798 0.4835 0.6354 0.354 ... 0.1584 0.7635 0.8525
@bradyrx, we are using the code at lines 488 to 505 in 8ced4d6.
I am curious to know how one could add a check before calling it.
Here's a naive solution off the top of my head (using a slightly modified version of the weighted mean function from here):

def weighted_mean_arr(darr, wgts=1, time_coord_name='time'):
# if NaN are present, we need to use individual weights
cond = darr.isnull()
ones = xr.where(cond, 0.0, 1.0)
mask = (
darr.resample({time_coord_name: 'A'}).mean(dim=time_coord_name).notnull()
)
da_sum = (
(darr * wgts).resample({time_coord_name: 'A'}).sum(dim=time_coord_name)
)
ones_out = (
(ones * wgts).resample({time_coord_name: 'A'}).sum(dim=time_coord_name)
)
ones_out = ones_out.where(ones_out > 0.0)
da_weighted_mean = da_sum / ones_out
return da_weighted_mean.where(mask)

import numpy as np
import pandas as pd
import xarray as xr
# Generate monthly dummy data with two strings variables
A = xr.DataArray(np.random.rand(120,), dims='time')
A.name = 'ints'
B = xr.DataArray(['str']*120, dims='time')
B.name = 'strs'
C = xr.DataArray(['foo']*120, dims='time')
C.name = 'strs2'
ds = xr.merge([A, B, C])
ds['time'] = pd.date_range('1900-01', '1910-01', freq='M')
print(ds)
<xarray.Dataset>
Dimensions: (time: 120)
Coordinates:
* time (time) datetime64[ns] 1900-01-31 1900-02-28 ... 1909-12-31
Data variables:
ints (time) float64 0.3267 0.4233 0.4777 0.6476 ... 0.3189 0.9488 0.9251
strs (time) <U3 'str' 'str' 'str' 'str' ... 'str' 'str' 'str' 'str'
strs2 (time) <U3 'foo' 'foo' 'foo' 'foo' ... 'foo' 'foo' 'foo' 'foo'
# this list probably isn't exhaustive
acceptable_types = [np.dtype('float'), np.dtype('int'), np.dtype('double')]
drop_vars = []
for var in ds.data_vars:
if np.asarray(ds[var]).dtype not in acceptable_types:
drop_vars.append(var)
# split off variables that can't be resampled into a separate dataset
no_apply = ds[drop_vars]
# apply function only to non-string types
resampled = ds.drop(drop_vars).apply(weighted_mean_arr)
print(resampled)
<xarray.Dataset>
Dimensions: (time: 10)
Coordinates:
* time (time) datetime64[ns] 1900-12-31 1901-12-31 ... 1909-12-31
Data variables:
ints (time) float64 0.4236 0.5849 0.5722 0.4475 ... 0.4421 0.5669 0.5151
print(no_apply)
<xarray.Dataset>
Dimensions: (time: 120)
Coordinates:
* time (time) datetime64[ns] 1900-01-31 1900-02-28 ... 1909-12-31
Data variables:
strs (time) <U3 'str' 'str' 'str' 'str' ... 'str' 'str' 'str' 'str'
strs2 (time) <U3 'foo' 'foo' 'foo' 'foo' ... 'foo' 'foo' 'foo' 'foo'

As in, you split the Dataset into two separate Datasets, apply the function, and then merge them at the end. The issue is that you now have two time coordinates of different lengths. In the end, should you really be maintaining a string variable through a resample at all?
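A quick sketch of why merging the two pieces back together is awkward: with an outer join, the resampled variable gets padded with NaN at every original timestamp (dummy data, hypothetical variable names):

```python
import numpy as np
import pandas as pd
import xarray as xr

monthly = pd.date_range('1900-01', periods=24, freq='MS')
numeric = xr.Dataset({'ints': ('time', np.random.rand(24))},
                     coords={'time': monthly})
strings = xr.Dataset({'strs': ('time', ['str'] * 24)},
                     coords={'time': monthly})

# Resample only the numeric piece, then try to put things back together
resampled = numeric.resample(time='YS').mean()   # time: 2
merged = xr.merge([resampled, strings])          # outer join on time

# 'ints' now has NaN at every monthly timestamp except the two year starts
print(merged.sizes['time'])  # 24
```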
I'm trying to use resample to go from monthly to annual means in HadCRUT data. It seems to be having trouble with the time coordinate, and ultimately complains "TypeError: cannot perform reduce with flexible type".

This was using:
esmlab 2019.4.27
xarray 0.12.1
numpy 1.16.3
python 3.7.3

Forgot to mention: tried with both decode_times=False and decode_times=True.
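For reference, the reported "cannot perform reduce with flexible type" error can be reproduced directly by averaging a byte-string array like field_status; a minimal sketch:

```python
import numpy as np

# A byte-string array like HadCRUT's field_status variable
status = np.array([b'f', b'p', b'f'], dtype='|S1')

try:
    np.nanmean(status)
except TypeError as err:
    # numpy cannot reduce (sum/mean) over a flexible (string) dtype
    print(err)
```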