-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fill gaps limited 7665 #9402
base: main
Are you sure you want to change the base?
Fill gaps limited 7665 #9402
Conversation
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems extremely good!!
I left some refine-y comments, nothing preclusive.
What do others think? I don't use interpolate_na
myself much so other may have views...
limit_direction: {"forward", "backward", "both"}, default: "forward" | ||
Consecutive NaNs will be filled in this direction. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we feel about this param vs. the ffill
/ bfill
on the returned object? Do we need both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is actually useful to have this parameter and allow to set it independently of the filing function. Then, for example, you can do things like:
n = np.nan
test=xr.DataArray([1,1, n, n, n, 2, 2], dims=["x"])
test.fill_gaps('x', limit=1, limit_direction='backward').ffill('x')
which effectively pulls the gap left edge towards the gap right edge:
<xarray.DataArray (x: 7)> Size: 56B
array([ 1., 1., nan, nan, 1., 2., 2.])
Dimensions without coordinates: x
I think there might be use cases for this, like constructing masks or advanced indexing arrays maybe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, in my (probably naive) mental model of filling, this wouldn't work like this. How would we explain in words why we get the additional 1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point... My view on this:
fill_gaps()
decides which values to fill (here 1 step backward, i.e. the right edge of every gap)- The following function decides how they are filled (
ffill
here)
Both are two independent steps.
Maybe it is clearer to use 'left'/'right' instead of 'forward'/'backward' terminology?
limit_side: {"left", "right", "both"}, default: "both"
Whether to limit the filling to the left, right, or both sides of the gap
- "left": Fill only the left (lower index) side of each gap.
- "right": Fill only the right (higher index) side of each gap.
- "both": Fill from both sides of each gap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could also join this Wednesday developer meeting to get feedback on naming/API things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you mean. I personally think that's quite confusing, but 50% chance it's just me.
Do others have a view?
I'm thinking that filling is simpler than this: it starts at the edge of a gap — and then fills the gap from the edge, which is by definition backwards or forwards (because it's already at an edge so can only go one direction)...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(yes re Wednesday calls! Unfortunately it clashes with something I have so I find it tough to go, but others will be there!)
| datetime.timedelta | ||
) = None, | ||
limit_direction: LimitDirectionOptions = "both", | ||
limit_area: LimitAreaOptions | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refine-y but:
- are
interpolate
andextrapolate
clearer options? - is there a better name for the param? I'm not sure. "area" sounds 2D to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. I adopted this nomenclature from pandas. What about limit_region
, does this sound less 2D?
Personally, I like the 'inside'/'outside' options, 'interpolate' sounds close to linear interpolation to me...
@@ -3766,6 +3769,160 @@ def bfill(self, dim: Hashable, limit: int | None = None) -> Self: | |||
|
|||
return bfill(self, dim, limit=limit) | |||
|
|||
def fill_gaps( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any thoughts from others on the naming? Would .fill
be insufficiently specific that it's filling na
? Would fill_missing
be clearer than fill_gaps
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open to any of those. .fill
sounds very concise, but maybe this is easily confused with .ffill
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think .fill
could be quite nice, do others have a view?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe gap_filler
instead, since this method does not actually fill the gaps.
I'm also wondering if its better to have a method that constructs the appropriate mask that can be used later
mask = ds.get_gap_mask(max_gap=...)
ds.ffill(...).where(~mask)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good points! Just to explain:
gap_filler
emphasizes the returned object type nicely! However, I choose fill_gaps
because it fits the naming scheme of other object-returning functions better (e.g. rolling
and coarsen
are not called roller
and coarser
in xarray, even though the operation is not perfomed immediately and an object is returned).
Ultimately (I am a non-native english speaker) I am happy for any recommendations regarding nomenclature.
If you prefer gap_filler
, I will change accordingly.
The function API is also presented as an alternative in the initial proposal. I decided to go for the object way because it is shorter (one line) and less error prone (you might easily forget the ~
). If the mask is required, you can easily get it from the object:
mask=ds.gap_filler(...).mask
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically all the new methods are missing return types, please consider adding them.
The new class should be generic, I have added a few review comments here and there but more adoptions are needed.
Thanks a lot for the review of the typing! I`ll work through them asap and try to make all adoptions (probabily still need some help afterwards, though ...) |
- Allows for more flexibility, e.g. optional dim arguments - Better static typing
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
@headtr1ck I did my best to add type hints everywhere. Not sure how to resolve the last four remaining mypy errors... |
I'm currently on holiday and it's difficult to debug these errors on the phone. |
self.content = content | ||
self.mask = mask | ||
self.dim = dim | ||
self.dim = dim |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.dim = dim |
# Calculate individual masks | ||
masks = [] | ||
if need_limit_mask: | ||
masks.append( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is difficult to tell mypy that limit cannot be a Hashable here do to the dynamic way of how it is handled.
Maybe the best course of action is to add an assertion (not sure which exactly, maybe assert not utils.is_hashable(limit)
, but maybe this is too restrictive?)
Easier might be to add a #type: ignore[arg-type]
and a little comment as to why.
There are a couple of outstanding questions about the interfaces; in particular:
Re typing — are the interfaces typed correctly? I think they are. Assuming those are correct, I would vote that it's fine to add ignores internally (ofc the internal checks are useful tests for the external interfaces, so ideally we'd fix those) What's the best way of resolving those outstanding questions @pydata/xarray ? I know I've been terrible on attending calls now I'm on PT, but maybe we could do it on one of those? (would be great to minimize the delay on these sorts of PRs — I think they're really valuable and would love to engender more...) |
One more naming decision:
|
Did you manage to discuss this this morning? |
@max-sixty Yes, @Ockenfuss explained what decisions are to be made and encouraged to have a look here. |
Improve gap filling
fill_gaps()
to control gap length for all filling functions (ffill/bfill/interpolate_na/fillna)This PR involves a full implementation, documentation and corresponding tests. As discussed in #7665, a new API function
fill_gaps
was introduced, to avoid breaking changes of existing code and keep the argument inflation of the filling functions to a minimum.TBD if accepted:
whats-new.rst
api.rst
Deprecations
limit
in interpolate_na now requires numbagg or bottleneck. So far, limit worked without bottleneck, max_gap required it (not documented). Reason: max_gap and limit now share the same codebase using ffill. Since no one ever complained about the missing documentation and the error message is pretty clear and easy to resolve ("No module named 'bottleneck"), I hope this might be acceptable.