Update GroupBy constructor for grouping by multiple variables, dask arrays #6610
Comments
In xarray-contrib/flox#191 @keewis proposes a much nicer API for multiple variables:
data.groupby(
    xr.Grouper(by="x", bins=pd.IntervalIndex.from_breaks(coords["x_vertices"])),  # binning
    xr.Grouper(by=data.y, labels=["a", "b", "c"]),  # categorical, data.y is dask-backed
    xr.Grouper(by="time", freq="MS"),  # resample
)
Note |
Using |
I also like the idea of creating specific Grouper objects for different types of selection, e.g., |
Here's a question. In #7561, I implement
data.groupby(
    {
        "x0": xr.BinGrouper(bins=pd.IntervalIndex.from_breaks(coords["x_vertices"])),  # binning
        "y": xr.UniqueGrouper(labels=["a", "b", "c"]),  # categorical, data.y is dask-backed
        "time": xr.TimeResampleGrouper(freq="MS"),
    },
)
Does this look OK or do we want to support passing the DataArray or variable name as a
xr.BinGrouper(by="x0", bins=pd.IntervalIndex.from_breaks(coords["x_vertices"]))
This syntax would support passing
PS: Pandas has a
data.groupby(
    [
        xr.BinGrouper("x0", bins=pd.IntervalIndex.from_breaks(coords["x_vertices"])),  # binning
        xr.UniqueGrouper("y", labels=["a", "b", "c"]),  # categorical, data.y is dask-backed
        xr.TimeResampleGrouper("time", freq="MS"),
    ],
)
|
We voted to move forward with this API:
data.groupby(
    {
        "x0": xr.BinGrouper(bins=pd.IntervalIndex.from_breaks(coords["x_vertices"])),  # binning
        "y": xr.UniqueGrouper(labels=["a", "b", "c"]),  # categorical, data.y is dask-backed
        "time": xr.TimeResampleGrouper(freq="MS"),
    },
)
We won't break backwards-compatibility for |
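To make the shape of the dictionary-based API concrete, here is a minimal pure-Python sketch of how per-variable grouper objects could be combined into multi-variable groups. It is not xarray's implementation: the `BinGrouper` and `UniqueGrouper` classes below are hypothetical stand-ins that only borrow the names from the proposal, the `label` method and the `groupby` helper are invented for illustration, and the bins are assumed to be right-closed (as with the default of `pandas.cut`).

```python
from bisect import bisect_left
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class BinGrouper:
    """Hypothetical stand-in: assigns values to right-closed bins (lo, hi]."""

    bins: Sequence[float]

    def label(self, value):
        i = bisect_left(self.bins, value) - 1
        if 0 <= i < len(self.bins) - 1:
            return (self.bins[i], self.bins[i + 1])
        return None  # value falls outside every bin


@dataclass
class UniqueGrouper:
    """Hypothetical stand-in: groups by value, optionally limited to known labels."""

    labels: Optional[Sequence] = None

    def label(self, value):
        if self.labels is None or value in self.labels:
            return value
        return None  # value is not one of the expected labels


def groupby(columns, groupers):
    """Map each tuple of group labels to the row indices that fall in it."""
    n_rows = len(next(iter(columns.values())))
    groups = defaultdict(list)
    for row in range(n_rows):
        # The dict key names the variable; the grouper itself knows nothing about it.
        key = tuple(g.label(columns[name][row]) for name, g in groupers.items())
        if None not in key:  # drop rows that miss any grouper
            groups[key].append(row)
    return dict(groups)


columns = {"x0": [1.0, 4.0, 6.0, 9.0], "y": ["a", "b", "a", "z"]}
result = groupby(
    columns,
    {"x0": BinGrouper(bins=[0, 5, 10]), "y": UniqueGrouper(labels=["a", "b", "c"])},
)
```

Because the variable name lives in the dictionary key, each grouper only describes how to group, which is the design choice the dictionary form makes relative to passing `by` to the grouper.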
Does anyone have opinions on using |
What is your issue?
flox supports grouping by multiple variables (would fix #324, #1056) and grouping by dask variables (would fix #2852).
To enable this in GroupBy we need to update the constructor's signature to
bins for groupby_bins which already supports grouping by dask variables). This lets us construct the output coordinate without evaluating the dask variable.
The signature in flox is (may be errors!)
You would calculate that last example using flox as
The use of expected_groups and isbin seems ugly to me (the names could also be better!)
I propose we update groupby's signature to
1. Change group: DataArray | str to group: DataArray | str | Iterable[str] | Iterable[DataArray].
2. Add an xr.Bins object that wraps bin edges + any kwargs to be passed to pandas.cut. Note our current groupby_bins signature has a bunch of kwargs passed directly to pandas.cut.
3. Add groups: None | ArrayLike | xarray.Bins | Iterable[None | ArrayLike | xarray.Bins] to pass the "expected group labels".
   - If None, then groups will be auto-detected from non-dask group arrays (if None for a dask group, then raise an error).
   - xarray.Bins indicates binning by the appropriate variables.
   - ArrayLike is treated as categorical.
   - groups is a little too similar to group, so we should choose a better name.
   - ArrayLike would let us fix Ordered Groupby Keys #757 (pass the seasons in the order you want them in the output).
So then that example becomes
Thoughts?
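The rules proposed for the groups argument can be sketched in plain Python, without xarray or dask. This is only an illustration of the proposal: `Bins` is a hypothetical stand-in for the suggested xr.Bins, `resolve_groups` is an invented helper, and the `lazy` flag is an assumption that stands in for "the group array is dask-backed".

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Bins:
    """Hypothetical stand-in for the proposed xr.Bins: holds bin edges only."""

    edges: Sequence[float]


def resolve_groups(values, groups=None, lazy=False):
    """Return the output group labels without inspecting lazy values."""
    if groups is None:
        if lazy:
            # Auto-detection would force evaluation of the lazy array.
            raise ValueError("expected group labels are required for a lazy group")
        return sorted(set(values))  # auto-detect from in-memory values
    if isinstance(groups, Bins):
        edges = list(groups.edges)
        return list(zip(edges[:-1], edges[1:]))  # one label per bin
    return list(groups)  # categorical, kept in the caller's order


seasons = ["DJF", "MAM", "JJA", "SON", "DJF", "JJA"]
auto = resolve_groups(seasons)
ordered = resolve_groups(seasons, groups=["DJF", "MAM", "JJA", "SON"])
binned = resolve_groups(None, groups=Bins([21, 25, 29]), lazy=True)
```

In this sketch `auto` comes back in sorted order, while `ordered` keeps the seasons in the order they were passed, which is the behaviour #757 asks for. `binned` is computed from the edges alone, so the lazy values are never touched.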