Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update GroupBy constructor for grouping by multiple variables, dask arrays #6610

Closed
dcherian opened this issue May 15, 2022 · 7 comments · Fixed by #8840
Closed

Update GroupBy constructor for grouping by multiple variables, dask arrays #6610

dcherian opened this issue May 15, 2022 · 7 comments · Fixed by #8840

Comments

@dcherian
Copy link
Contributor

dcherian commented May 15, 2022

What is your issue?

flox supports grouping by multiple variables (would fix #324, #1056) and grouping by dask variables (would fix #2852).

To enable this in GroupBy we need to update the constructor's signature to

  1. Accept multiple "by" variables.
  2. Accept "expected group labels" for grouping by dask variables (like bins for groupby_bins which already supports grouping by dask variables). This lets us construct the output coordinate without evaluating the dask variable.
  3. We may also want to simultaneously group by a categorical variable (season) and bin by a continuous variable (air temperature). So we also need a way to indicate whether the "expected group labels" are "bin edges" or categories.

The signature in flox is (may be errors!)

xarray_reduce(
    obj: Dataset | DataArray,
    *by: DataArray | str,
    func: str | Aggregation,
    expected_groups: Sequence | np.ndarray | None = None,
    isbin: bool | Sequence[bool] = False,
    ...
)

You would calculate that last example using flox as

xarray_reduce(
   ds,
    "season", "air_temperature", 
    expected_groups=[None, np.arange(21, 30, 1)],
    isbin=[False, True],
    ...
)

The use of expected_groups and isbin seems ugly to me (the names could also be better!)


I propose we update groupby's signature to

  1. change group: DataArray | str to group: DataArray | str | Iterable[str] | Iterable[DataArray]
  2. We could add a top-level xr.Bins object that wraps bin edges + any kwargs to be passed to pandas.cut. Note our current groupby_bins signature has a bunch of kwargs passed directly to pandas.cut.
  3. Finally add groups: None | ArrayLike | xarray.Bins | Iterable[None | ArrayLike | xarray.Bins] to pass the "expected group labels".
    1. If None, then groups will be auto-detected from non-dask group arrays (if None for a dask group, then raise error).
    2. If xarray.Bins indicates binning by the appropriate variables
    3. If ArrayLike treat as categorical.
    4. groups is a little too similar to group so we should choose a better name.
    5. The ordering of ArrayLike would let us fix Ordered Groupby Keys #757 (pass the seasons in the order you want them in the output)

So then that example becomes

ds.groupby(
    ["season", "air_temperature"], # season is numpy, air_temperature is dask
    groups=[None, xr.Bins(np.arange(21, 30, 1), closed="right")],
)

Thoughts?

@malmans2

This comment was marked as off-topic.

@dcherian
Copy link
Contributor Author

dcherian commented Nov 28, 2022

In xarray-contrib/flox#191 @keewis proposes a much nicer API for multiple variables:

data.groupby(
    xr.Grouper(by="x", bins=pd.IntervalIndex.from_breaks(coords["x_vertices"])),  # binning
    xr.Grouper(by=data.y, labels=["a", "b", "c"]),  # categorical, data.y is dask-backed
    xr.Grouper(by="time", freq="MS"),  # resample
)

Note pd.Grouper uses key instead of by so that's a possibility too.

@TomNicholas
Copy link
Member

Using xr.Grouper has the advantage that you don't have to start guessing about whether or not the user wanted some complicated behaviour (especially if their input is slightly wrong somehow and you have to raise an informative error). Simple defaults would get left as is and complex use cases can be explicit and opt-in.

@shoyer
Copy link
Member

shoyer commented Dec 7, 2022

I also like the idea of creating specific Grouper objects for different types of selection, e.g., UniqueGrouper (the default), BinGrouper, TimeResampleGrouper, etc.

@dcherian
Copy link
Contributor Author

dcherian commented Apr 6, 2023

Here's a question.

In #7561, I implement Grouper objects that don't have any information of the variable we're grouping by. So the future API would be:

data.groupby({
	"x0": xr.BinGrouper(bins=pd.IntervalIndex.from_breaks(coords["x_vertices"])),  # binning
    "y": xr.UniqueGrouper(labels=["a", "b", "c"]),  # categorical, data.y is dask-backed
    "time": xr.TimeResampleGrouper(freq="MS")
	},
)

Does this look OK or do we want to support passing the DataArray or variable name as a by kwarg:

xr.BinGrouper(by="x0", bins=pd.IntervalIndex.from_breaks(coords["x_vertices"]))

This syntax would support passing DataArray in by so xr.UniqueGrouper(by=data.y) for example. Is that an important usecase to support? In #7561, I create new ResolvedGrouper objects that do contain by as a DataArray always, so it's really a question of exposing that to the user.

PS: Pandas has a key kwarg for a column name. So following that would mean

data.groupby([
	xr.BinGrouper("x0", bins=pd.IntervalIndex.from_breaks(coords["x_vertices"])),  # binning
    xr.UniqueGrouper("y", labels=["a", "b", "c"]),  # categorical, data.y is dask-backed
    xr.TimeResampleGrouper("time", freq="MS")
	],
)

@dcherian
Copy link
Contributor Author

dcherian commented Apr 26, 2023

We voted to move forward with this API:

data.groupby({
	"x0": xr.BinGrouper(bins=pd.IntervalIndex.from_breaks(coords["x_vertices"])),  # binning
    "y": xr.UniqueGrouper(labels=["a", "b", "c"]),  # categorical, data.y is dask-backed
    "time": xr.TimeResampleGrouper(freq="MS")
	},
)

We won't break backwards-compatibility for da.groupby(other_data_array) but for any complicated use-cases with Grouper the user must add the by variable to the xarray object, and refer to it by name in the dictionary as above,

@dcherian
Copy link
Contributor Author

Does anyone have opinions on using UniqueGrouper vs CategoricalGrouper or CategoryGrouper?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants