Update GroupBy constructor for grouping by multiple variables, dask arrays #6610

dcherian · 2022-05-15T03:17:54Z

What is your issue?

flox supports grouping by multiple variables (would fix #324, #1056) and grouping by dask variables (would fix #2852).

To enable this in GroupBy we need to update the constructor's signature to:

1. Accept multiple "by" variables.
2. Accept "expected group labels" for grouping by dask variables (like bins for groupby_bins, which already supports grouping by dask variables). This lets us construct the output coordinate without evaluating the dask variable.
3. We may also want to simultaneously group by a categorical variable (season) and bin by a continuous variable (air temperature). So we also need a way to indicate whether the "expected group labels" are "bin edges" or categories.
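To make the role of "expected group labels" concrete, here is a minimal pure-Python sketch (not flox's or xarray's implementation). The helper names `factorize_categorical` and `factorize_bins` are hypothetical, and the right-closed bin convention is an assumption chosen for the illustration. The point is that the output coordinate is just the expected labels (or the bin edges), so it is known before any values are inspected, and each chunk of a lazy array can be converted to integer codes independently.

```python
from bisect import bisect_left


def factorize_categorical(values, expected):
    """Map each value to its position in `expected`; -1 marks values not in it."""
    lookup = {label: code for code, label in enumerate(expected)}
    return [lookup.get(value, -1) for value in values]


def factorize_bins(values, edges):
    """Map each value to the index of the right-closed bin (edges[i], edges[i+1]].

    Values outside every bin get the code -1. `edges` must be sorted.
    """
    codes = []
    for value in values:
        i = bisect_left(edges, value) - 1
        codes.append(i if 0 <= i < len(edges) - 1 else -1)
    return codes


# The output coordinate is known up front: it is simply `seasons` / `edges`.
seasons = ["DJF", "MAM", "JJA", "SON"]
edges = list(range(21, 30))

season_codes = factorize_categorical(["JJA", "DJF", "???"], seasons)
temperature_codes = factorize_bins([21, 21.5, 22, 29, 30], edges)
```

Without the expected labels, the set of groups (and therefore the shape of the result) could only be discovered by computing the whole array.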

The signature in flox is (there may be errors!):

xarray_reduce(
    obj: Dataset | DataArray,
    *by: DataArray | str,
    func: str | Aggregation,
    expected_groups: Sequence | np.ndarray | None = None,
    isbin: bool | Sequence[bool] = False,
    ...
)

You would calculate that last example using flox as

xarray_reduce(
    ds,
    "season", "air_temperature",
    expected_groups=[None, np.arange(21, 30, 1)],
    isbin=[False, True],
    ...
)

Using expected_groups and isbin seems ugly to me (the names could also be better!).

I propose we update groupby's signature as follows:

1. Change group: DataArray | str to group: DataArray | str | Iterable[str] | Iterable[DataArray].
2. We could add a top-level xr.Bins object that wraps bin edges plus any kwargs to be passed to pandas.cut. Note that our current groupby_bins signature has a bunch of kwargs passed directly to pandas.cut.
3. Finally, add groups: None | ArrayLike | xarray.Bins | Iterable[None | ArrayLike | xarray.Bins] to pass the "expected group labels".
   1. If None, groups are auto-detected from non-dask group arrays (if None is passed for a dask group, raise an error).
   2. If xarray.Bins, bin by the corresponding variable.
   3. If ArrayLike, treat as categorical.
   4. groups is a little too similar to group, so we should choose a better name.
   5. The ordering of ArrayLike would let us fix #757, "Ordered Groupby Keys" (pass the seasons in the order you want them in the output).
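The dispatch rules proposed above could be sketched roughly as follows. This is only an illustration in plain Python: the `Bins` class is a hypothetical stand-in for the proposed xr.Bins, and `resolve_groups` with its `is_lazy` argument is invented here to show the decision logic. None of these names exist in xarray.

```python
class Bins:
    """Hypothetical stand-in for the proposed xr.Bins: bin edges plus cut kwargs."""

    def __init__(self, edges, **cut_kwargs):
        self.edges = list(edges)
        self.cut_kwargs = cut_kwargs


def resolve_groups(names, groups, is_lazy):
    """Return one (name, kind, labels) entry per grouping variable.

    names   : names of the variables to group by
    groups  : None, or one spec per name (None | Bins | sequence of labels)
    is_lazy : dict mapping name -> True if the variable is dask-backed
    """
    if groups is None:
        groups = [None] * len(names)
    if len(groups) != len(names):
        raise ValueError("need exactly one `groups` entry per grouping variable")

    plan = []
    for name, spec in zip(names, groups):
        if spec is None:
            if is_lazy[name]:
                # Auto-detection would require computing the lazy array.
                raise ValueError(f"expected labels are required for lazy variable {name!r}")
            plan.append((name, "unique", None))
        elif isinstance(spec, Bins):
            plan.append((name, "bins", spec.edges))
        else:
            # Any other sequence is treated as ordered categorical labels.
            plan.append((name, "categorical", list(spec)))
    return plan


plan = resolve_groups(
    ["season", "air_temperature"],
    [None, Bins(range(21, 30), right=True)],
    is_lazy={"season": False, "air_temperature": True},
)
```

Because the categorical branch keeps the labels in the order given, the same mechanism would also control the order of groups in the output.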

So then that example becomes

ds.groupby(
    ["season", "air_temperature"], # season is numpy, air_temperature is dask
    groups=[None, xr.Bins(np.arange(21, 30, 1), closed="right")],
)

Thoughts?

dcherian · 2022-11-28T19:58:29Z

In xarray-contrib/flox#191 @keewis proposes a much nicer API for multiple variables:

data.groupby(
    xr.Grouper(by="x", bins=pd.IntervalIndex.from_breaks(coords["x_vertices"])),  # binning
    xr.Grouper(by=data.y, labels=["a", "b", "c"]),  # categorical, data.y is dask-backed
    xr.Grouper(by="time", freq="MS"),  # resample
)

Note pd.Grouper uses key instead of by so that's a possibility too.

TomNicholas · 2022-12-07T17:07:08Z

Using xr.Grouper has the advantage that you don't have to start guessing about whether or not the user wanted some complicated behaviour (especially if their input is slightly wrong somehow and you have to raise an informative error). Simple defaults would get left as is and complex use cases can be explicit and opt-in.

shoyer · 2022-12-07T17:12:05Z

I also like the idea of creating specific Grouper objects for different types of selection, e.g., UniqueGrouper (the default), BinGrouper, TimeResampleGrouper, etc.

dcherian · 2023-04-06T04:07:05Z

Here's a question.

In #7561, I implement Grouper objects that don't have any information about the variable we're grouping by. So the future API would be:

data.groupby({
    "x0": xr.BinGrouper(bins=pd.IntervalIndex.from_breaks(coords["x_vertices"])),  # binning
    "y": xr.UniqueGrouper(labels=["a", "b", "c"]),  # categorical, data.y is dask-backed
    "time": xr.TimeResampleGrouper(freq="MS")
    },
)

Does this look OK or do we want to support passing the DataArray or variable name as a by kwarg:

xr.BinGrouper(by="x0", bins=pd.IntervalIndex.from_breaks(coords["x_vertices"]))

This syntax would support passing a DataArray as by, for example xr.UniqueGrouper(by=data.y). Is that an important use case to support? In #7561, I create new ResolvedGrouper objects that always contain by as a DataArray, so it's really a question of exposing that to the user.

PS: Pandas has a key kwarg for a column name. So following that would mean

data.groupby([
    xr.BinGrouper("x0", bins=pd.IntervalIndex.from_breaks(coords["x_vertices"])),  # binning
    xr.UniqueGrouper("y", labels=["a", "b", "c"]),  # categorical, data.y is dask-backed
    xr.TimeResampleGrouper("time", freq="MS")
    ],
)

dcherian · 2023-04-26T15:59:06Z

We voted to move forward with this API:

data.groupby({
    "x0": xr.BinGrouper(bins=pd.IntervalIndex.from_breaks(coords["x_vertices"])),  # binning
    "y": xr.UniqueGrouper(labels=["a", "b", "c"]),  # categorical, data.y is dask-backed
    "time": xr.TimeResampleGrouper(freq="MS")
    },
)

We won't break backwards compatibility for da.groupby(other_data_array), but for any complicated use cases with Grouper the user must add the by variable to the xarray object and refer to it by name in the dictionary, as above.
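As a rough illustration of the dictionary-based design (a toy model, not xarray's implementation), the sketch below uses plain Python lists as "variables". The classes reuse the names UniqueGrouper and BinGrouper from this thread, but their `label_of` method, the `groupby` helper, and the right-closed bin convention are assumptions made for the example. Note that the grouper objects know nothing about which variable they apply to; the dictionary key supplies that.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class UniqueGrouper:
    """Toy categorical grouper; `labels` optionally restricts the allowed groups."""
    labels: Optional[Sequence] = None

    def label_of(self, value):
        if self.labels is not None and value not in self.labels:
            return None
        return value


@dataclass
class BinGrouper:
    """Toy binning grouper over sorted edges, using right-closed bins."""
    bins: Sequence[float]

    def label_of(self, value):
        i = bisect_left(self.bins, value) - 1
        if 0 <= i < len(self.bins) - 1:
            return (self.bins[i], self.bins[i + 1])
        return None


def groupby(columns, groupers):
    """Map each tuple of group labels to the row indices that fall in it."""
    n_rows = len(next(iter(columns.values())))
    groups = {}
    for row in range(n_rows):
        key = tuple(g.label_of(columns[name][row]) for name, g in groupers.items())
        if None not in key:  # drop rows that fall outside any group
            groups.setdefault(key, []).append(row)
    return groups


columns = {
    "season": ["DJF", "JJA", "DJF", "JJA"],
    "air_temperature": [21.5, 25.0, 21.9, 40.0],
}
result = groupby(
    columns,
    {"season": UniqueGrouper(), "air_temperature": BinGrouper([21, 22, 23, 24, 25, 26])},
)
# result == {("DJF", (21, 22)): [0, 2], ("JJA", (24, 25)): [1]}
```

Keeping the variable name out of the grouper means the same grouper instance can be reused for several variables, at the cost of requiring the variable to live on the object being grouped.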

dcherian · 2024-07-17T03:59:38Z

Does anyone have opinions on using UniqueGrouper vs CategoricalGrouper or CategoryGrouper?

dcherian added API design topic-groupby labels May 15, 2022

dcherian mentioned this issue Jun 12, 2022

Refactor GroupBy init to avoid factorization #6689

Closed

dcherian mentioned this issue Oct 24, 2022

Save groupby codes after factorizing, pass to flox #7206

Merged

keewis mentioned this issue Nov 23, 2022

improving the API for binned groupby xarray-contrib/flox#191

Open

spencerkclark mentioned this issue Jan 16, 2023

Preserve base and loffset arguments in resample #7444

Merged

dcherian mentioned this issue Feb 27, 2023

Introduce Grouper objects internally #7561

Merged

tomvothecoder mentioned this issue Apr 14, 2023

[Refactor]: Consider using flox and xr.resample() to improve temporal averaging grouping logic xCDAT/xcdat#217

Open

dcherian mentioned this issue Jul 27, 2023

Support xarray grouper objects in xarray interface xarray-contrib/flox#256

Open

dcherian mentioned this issue Dec 2, 2023

Grouper object design doc #8510

Merged

dcherian mentioned this issue Jun 14, 2024

Grouper, Resampler as public api #8840

Merged

dcherian closed this as completed in #8840 Jul 18, 2024

keewis mentioned this issue Aug 12, 2024

Enable multi-coord grouping from xarray #9332

Closed
