GroupBy(chunked-array) #9522

dcherian · 2024-09-19T16:42:22Z

Closes Ordered Groupby Keys #757
Closes Allow grouping by dask variables #2852
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst

This came together quickly last night ;)

TODO:

Decide on backwards compatibility: grouping by a dask array used to compute it eagerly; it now raises an error.

dcherian · 2024-09-20T02:57:06Z

This is ready for review. It is backwards-incompatible. Previously, when grouping by a dask array, we would just compute it eagerly. It has been like that for a very long time, so perhaps a deprecation cycle is needed. Thoughts?

dcherian · 2024-09-20T03:32:30Z

xarray/core/groupby.py

@@ -190,8 +192,8 @@ def values(self) -> range:
        return range(self.size)

    @property
-    def data(self) -> range:
-        return range(self.size)
+    def data(self) -> np.ndarray:



dcherian · 2024-10-29T23:27:19Z

This should be backwards compatible now, and raise nice warnings. I'd like to merge this soon; it's been around for a while...
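For readers unfamiliar with the pattern, here is a minimal sketch of what a "keep the old eager behaviour but warn" shim can look like. It is not the code from this PR: `LazyArray` and `resolve_labels` are hypothetical names, and the idea of passing `labels` up front to avoid the compute is an assumption made for illustration.

```python
import warnings


class LazyArray:
    """Hypothetical stand-in for a chunked (e.g. dask) array."""

    def __init__(self, values):
        self._values = list(values)

    def compute(self):
        # Materialize the values; for a real chunked array this may be expensive.
        return list(self._values)


def resolve_labels(group, labels=None):
    """Return sorted group labels, warning before any eager compute.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if labels is not None:
        # Labels are known up front, so nothing needs to be computed.
        return list(labels)
    if isinstance(group, LazyArray):
        # Old behaviour is preserved for now, but announced as going away.
        warnings.warn(
            "Eagerly computing the array being grouped by is deprecated; "
            "pass the expected labels explicitly instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        group = group.compute()
    # Sorted unique values of an in-memory sequence.
    return sorted(set(group))
```

Calling `resolve_labels(LazyArray([2, 1, 2]))` still returns the labels but emits a `DeprecationWarning`; supplying `labels=[1, 2, 3]` skips both the warning and the compute.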

Illviljan · 2024-10-30T06:47:56Z

xarray/core/groupby.py

+        if not is_chunked_array(_flatcodes):
+            # Constructing an index from the product is wrong when there are missing groups
+            # (e.g. binning, resampling). Account for that now.
+            midx = full_index[np.sort(pd.unique(_flatcodes[~mask]))]
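The comment in this hunk can be illustrated with a small standalone example. This is a sketch with made-up data, not xarray internals: the two groupers, their values, and the use of `-1` as the "no group" code are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Two hypothetical groupers: 2 x 3 = 6 possible combinations.
full_index = pd.MultiIndex.from_product(
    [["a", "b"], [10, 20, 30]], names=["letter", "number"]
)

# Flattened codes: the position of each element's group in the product.
# -1 marks elements that belong to no group (assumed convention).
flatcodes = np.array([4, 0, 4, -1, 2, 0])
mask = flatcodes == -1

# Keep only the combinations that actually occur, in sorted order.
observed = np.sort(pd.unique(flatcodes[~mask]))
midx = full_index[observed]
```

Here `full_index` has six entries but only three combinations are present in the data, so `midx` has three entries. Using the full product as the result index would advertise groups that do not exist.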


Why not np.unique? You'll get the results sorted then.

dcherian · 2024-11-01

np.unique sorts first. This can be quite slow if _flatcodes is large, which it can be.
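The trade-off in this exchange can be checked with a short sketch. The array size and number of distinct codes below are arbitrary assumptions; the point is only that both routes give the same answer, while `pd.unique` leaves a much smaller array to sort.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Many elements, few distinct codes: a typical shape for group codes.
codes = rng.integers(0, 12, size=1_000_000)

# np.unique returns sorted unique values; per the reply above, it gets
# there by sorting the input first.
via_numpy = np.unique(codes)

# pd.unique is hash-based and returns values in order of appearance,
# so only the (at most 12) unique values are left to sort.
via_pandas = np.sort(pd.unique(codes))
```

Both arrays hold the same sorted codes, so `np.sort(pd.unique(...))` is a drop-in replacement for `np.unique(...)` when only the unique values are needed.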

xarray/tests/test_groupby.py


dcherian added 5 commits September 19, 2024 07:46

GroupBy(chunked-array)

95f4802

Closes pydata#757 Closes pydata#2852

Optimizations

e022231

Optimize multi-index construction

d5d8ef2

Add tests

a1e0d6f

Add whats-new

adf2943

dcherian marked this pull request as draft September 19, 2024 16:44

dcherian added 6 commits September 19, 2024 13:30

Raise errors

f56dc85

Add docstring

17b7f2f

preserve attrs

339ed3a

Add test for pydata#757

93e786b

Typing fixes

dfdc96a

Handle multiple groupers

a15b04d

dcherian marked this pull request as ready for review September 20, 2024 02:57

dcherian mentioned this pull request Sep 20, 2024

Allow grouping by dask variables #2852

Open

dcherian added the needs review label Sep 20, 2024

dcherian commented Sep 20, 2024


TomNicholas added topic-groupby topic-dask labels Sep 20, 2024

dcherian mentioned this pull request Sep 27, 2024

Add histogram method #4610

Open

dcherian added 2 commits October 21, 2024 16:51

Backcompat

b295193

dcherian force-pushed the groupby-dask branch from 8537741 to b295193 Compare October 22, 2024 15:10

dcherian marked this pull request as draft October 22, 2024 15:12

dcherian added 2 commits October 29, 2024 16:24

better backcompat

f826b65

dcherian marked this pull request as ready for review October 29, 2024 23:26

fix

aada75d

Illviljan reviewed Oct 30, 2024


dcherian added 4 commits October 31, 2024 17:12

Merge branch 'main' into groupby-dask

3e65c0c

* main: Refactor out utility functions from to_zarr (pydata#9695) Use the same function to floatize coords in polyfit and polyval (pydata#9691)

Handle edge case

3e40605

comment

295d6dd

type: ignore

a4fed4d
