Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interpolate_na: Rework 'limit' argument documentation/implementation #7665

Open
Ockenfuss opened this issue Mar 23, 2023 · 6 comments · May be fixed by #9402
Open

Interpolate_na: Rework 'limit' argument documentation/implementation #7665

Ockenfuss opened this issue Mar 23, 2023 · 6 comments · May be fixed by #9402
Labels
needs triage Issue that has not been reviewed by xarray team member

Comments

@Ockenfuss
Copy link
Contributor

What is your issue?

Currently, the 'limit' argument of interpolate_na shows some counterintuitive/undocumented behaviour.
Take the following example:

import xarray  as xr
import numpy as np
n=np.nan
da=xr.DataArray([n, n, n, 4, 5, n ,n ,n], dims=["y"])
da.interpolate_na('y', limit=1, fill_value='extrapolate')

This will produce the following result:

array([ 1., nan, nan,  4.,  5.,  6., nan, nan])

Two things are surprising, in my opinion:

  1. The interpolated value 1 at the beginning is far from any of the given values
  2. The filling is done only towards the 'right'. This asymmetric behaviour is not mentioned in the documentation.

Comparison to pandas

Similar behaviour can be created using pandas with the following arguments:

da=xr.DataArray([n, n, n, 4, 5, n ,n ,n], dims=["y"])
dap=da.to_pandas()
dap.interpolate(method='slinear', limit=1, limit_direction='forward', fill_value='extrapolate')
Output
y
0    NaN
1    NaN
2    NaN
3    4.0
4    5.0
5    6.0
6    NaN
7    NaN
dtype: float64

This is equivalent to the current xarray behaviour, except there is no 1 at the beginning.

Cause

Currently, the fill mask in xarray is implemented using a rolling window operation, where values outside the array are assumed to be valid (therefore the 1). See xarray.core.missing._get_valid_fill_mask

Possible Solutions

Boundary Issue

Concerning the 1 at the beginning: I think this should be considered a bug. It is likely not what you would expect if you specify a limit. As stated, pandas does not create it as well.

Asymmetric Filling

Concerning the asymmetric filling, I see two options:

  1. No changes to the code, but mention in the documentation that (effectively), a forward-fill is done.
  2. Make something similar to what pandas is doing. In pandas, there are two additional arguments controlling the limit behaviour: limit_direction is controlling the fill direction (left, right or both). limit_area effectively controls if we only do interpolation or allow for extrapolation as well.

What do you think?

@Ockenfuss Ockenfuss added the needs triage Issue that has not been reviewed by xarray team member label Mar 23, 2023
@Ockenfuss
Copy link
Contributor Author

Two comments if the pandas arguments are introduced:

  • If the pandas arguments are introduced, one might raise the question, whether the max_gap argument of xarray is still necessary, since pandas does not have such an argument. However, if the user does not want any interpolation if a gap is bigger than a specified length, this is currently not possible with pandas (actually, the max_gap feature is requested by the pandas community since a long time: Pandas interpolation enhancement request : specifying the maximum gap to interpolate. pandas-dev/pandas#12187 )
  • Coordinate handling: Currently, limit is operating on indices, max_gap is operating on coordinates. Is there a deeper reason for this separation? Both operations limit the interpolation 'distance'. If not, similar to use_coordinates, one could introduce something like limit_use_coordinates. If True, limit as well as max_gap refer to numeric coordinates, otherwise to a linearly increasing index ([0,1,2,...]).

@Ockenfuss
Copy link
Contributor Author

Do you have any thoughts on this?

I think with the following signature, the function will be backward compatible:

interpolate_na(dim, method='linear', use_coordinate=True, limit=None, limit_direction='forward', limit_area=None, limit_use_coordinate=False, max_gap=None)

@scottyhq
Copy link
Contributor

Thanks for raising this issue @Ockenfuss and thoroughly documenting the behavior.

The interpolated value 1 at the beginning is far from any of the given values

Agreed that is surprising and seems like a bug!

The filling is done only towards the 'right'. This asymmetric behaviour is not mentioned in the documentation

A pull request would be most welcome @Ockenfuss - scoped to either just improve the documentation or with your preferred API change. We can discuss the preferred syntax in the pull request.

@Ockenfuss
Copy link
Contributor Author

Thanks for the feedback @scottyhq! I used the time over christmas and tried to come up with an implementation, see #8577! As you suggested, I pointed out specific details to discuss in the PR.

@shoyer
Copy link
Member

shoyer commented Mar 13, 2024

After the Xarray meeting today (with @keewis and @dcherian) we discussed the preferred option for moving forward with cleaning up the limit argument. The concensus was to introduce a new keyword argument (e.g., cutoff) with the preferred default handling of limit, and to soft deprecate the old limit argument.

@Ockenfuss
Copy link
Contributor Author

Some more results from the discussion:

  • This affects all filling methods: interpolate_na, fillna, ffill, bfill, ...
  • We want two functionalities: limit (or cutoff) to limit filling distance, max_gap (or ?) to avoid any filling
  • Both should work on coordinates or index
  • The mask is a valuable product by itself
  • We prefer a new API to avoid breaking changes

Here are my proposals:

Masking function Signature:

namask('t', limit=3, limit_coords=3h, max_gap=3, max_gap_coords=3h) #in line with pandas/current xarray
namask('t', limit=3, cutoff=3h, gap_limit=3, gap_cutoff=3h) #cutoff for coordinates, limit for index

For direction control:

namask(limit_direction=forward/backward/both, limit_area=outside/inside/None) #regardless of index or coordinate

Possible API:

Making it an object:

da.namask(...) #masking object
da.namask(...).fillna(123) #return dataarray
da.namask(...).mask #get mask only

Making it a method:

mask=da.namask(...)
da.fillna(123).where(~mask) #Invalid=True to match numpy behaviour?

How to deal with existing methods:

  • da.fillna(123) shall still work
  • If any exsiting limit, max_gap is used: Internally forward to new API, raise Deprecation Warning (but possibly support for very long/forever)

Anything else:
This almost feels like introducing numpy.ma in xarray...
This is beyond the scope of this Issue! But we may copy some of the behaviour from numpy, e.g. invalid values are True there.

@Ockenfuss Ockenfuss linked a pull request Aug 23, 2024 that will close this issue
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs triage Issue that has not been reviewed by xarray team member
Projects
None yet
3 participants