GroupBy(chunked-array) #9522

Open
wants to merge 20 commits into
base: main

@dcherian dcherian commented Sep 19, 2024

This came together quickly last night ;)

TODO:

  • Decide on backwards compatibility: grouping by a dask array used to compute it eagerly; now it raises an error.

cc @bradyrx

@dcherian dcherian marked this pull request as draft September 19, 2024 16:44

dcherian commented Sep 20, 2024

This is ready for review. It is backwards-incompatible. Previously when grouping by a dask array we would just compute it eagerly. It has been like that for a very long time, so perhaps a deprecation cycle is needed. Thoughts?

@dcherian dcherian marked this pull request as ready for review September 20, 2024 02:57
@@ -190,8 +192,8 @@ def values(self) -> range:
         return range(self.size)
 
     @property
-    def data(self) -> range:
-        return range(self.size)
+    def data(self) -> np.ndarray:
Contributor Author

for typing

* main: (63 commits)
  Add close() method to DataTree and use it to clean-up open files in tests (pydata#9651)
  Change URL for pydap test (pydata#9655)
  Fix multiple grouping with missing groups (pydata#9650)
  flox: Properly propagate multiindex (pydata#9649)
  Update Datatree html repr to indicate inheritance (pydata#9633)
  Re-implement map_over_datasets using group_subtrees (pydata#9636)
  fix zarr intersphinx (pydata#9652)
  Replace black and blackdoc with ruff-format (pydata#9506)
  Fix error and missing code cell in io.rst (pydata#9641)
  Support alternative names for the root node in DataTree.from_dict (pydata#9638)
  Updates to DataTree.equals and DataTree.identical (pydata#9627)
  DOC: Clarify error message in open_dataarray (pydata#9637)
  Add zip_subtrees for paired iteration over DataTrees (pydata#9623)
  Type check datatree tests (pydata#9632)
  Add missing `memo` argument to DataTree.__deepcopy__ (pydata#9631)
  Bug fixes for DataTree indexing and aggregation (pydata#9626)
  Add inherit=False option to DataTree.copy() (pydata#9628)
  docs(groupby): mention deprecation of `squeeze` kwarg (pydata#9625)
  Migration guide for users of old datatree repo (pydata#9598)
  Reimplement Datatree typed ops (pydata#9619)
  ...
* main:
  Add `DataTree.persist` (pydata#9682)
  Typing annotations for arithmetic overrides (e.g., DataArray + Dataset) (pydata#9688)
  Raise `ValueError` for unmatching chunks length in `DataArray.chunk()` (pydata#9689)
  Fix inadvertent deep-copying of child data in DataTree (pydata#9684)
  new blank whatsnew (pydata#9679)
  v2024.10.0 release summary (pydata#9678)
  drop the length from `numpy`'s fixed-width string dtypes (pydata#9586)
  fixing behaviour for group parameter in `open_datatree` (pydata#9666)
  Use zarr v3 dimension_names (pydata#9669)
  fix(zarr): use inplace array.resize for zarr 2 and 3 (pydata#9673)
  implement `dask` methods on `DataTree` (pydata#9670)
  support `chunks` in `open_groups` and `open_datatree` (pydata#9660)
  Compatibility for zarr-python 3.x (pydata#9552)
  Update to_dataframe doc to match current behavior (pydata#9662)
  Reduce graph size through writing indexes directly into graph for ``map_blocks`` (pydata#9658)
@dcherian dcherian marked this pull request as ready for review October 29, 2024 23:26

dcherian commented Oct 29, 2024

This should be backwards compatible now, and it raises nice warnings. I'd like to merge this soon; it's been around for a while...
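
For readers following the backwards-compatibility discussion, here is a minimal sketch of the general "warn, then fall back to the old eager behaviour" pattern. It is not xarray's implementation: `LazyCodes`, `resolve_labels`, the `labels` argument, the warning text, and the choice of `DeprecationWarning` are all assumptions made up for illustration.

```python
import warnings


class LazyCodes:
    """Hypothetical stand-in for a chunked (e.g. dask) array of group codes."""

    def __init__(self, values):
        self._values = list(values)

    def compute(self):
        # A real chunked array would load data and run a task graph here.
        return list(self._values)


def resolve_labels(group, labels=None):
    """Return sorted unique group labels (hypothetical helper, not xarray API)."""
    if isinstance(group, LazyCodes):
        if labels is not None:
            # Labels are known up front, so the lazy data is never touched.
            return sorted(labels)
        # Old behaviour is kept for now, but the caller is told about it.
        warnings.warn(
            "Grouping by a chunked array without explicit labels computes it "
            "eagerly; this fallback may be removed in the future.",
            DeprecationWarning,
            stacklevel=2,
        )
        group = group.compute()
    return sorted(set(group))
```

Calling `resolve_labels(LazyCodes([2, 0, 2, 1]))` in this sketch still returns the labels but emits the warning, while passing `labels=[0, 1, 2]` avoids both the warning and the eager compute.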

if not is_chunked_array(_flatcodes):
    # Constructing an index from the product is wrong when there are missing groups
    # (e.g. binning, resampling). Account for that now.
    midx = full_index[np.sort(pd.unique(_flatcodes[~mask]))]
Contributor


Why not np.unique? You'll get the results sorted then.

Contributor Author


np.unique sorts the full array first. This can be quite slow if _flatcodes is large, which it can be.
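
A rough sketch of the trade-off discussed in this thread, assuming NumPy and pandas are installed. The names `full_index`, `flatcodes`, and `mask` only mirror the diff; the data is made up, and the description of `np.unique` as sort-based reflects its traditional default behaviour rather than a guarantee for every NumPy version.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the names in the diff: six possible groups,
# but only codes 0, 1, 3 and 4 ever occur (as can happen with binning).
full_index = pd.Index(["a", "b", "c", "d", "e", "f"])
flatcodes = rng.choice(np.array([0, 1, 3, 4]), size=100_000)
flatcodes[::7] = -1  # -1 marks elements that belong to no group
mask = flatcodes == -1

# np.unique returns sorted output, traditionally by sorting all ~86k valid codes.
via_np = np.unique(flatcodes[~mask])

# pd.unique deduplicates with a hash table (order of first appearance),
# so np.sort only has to order the four distinct codes.
via_pd = np.sort(pd.unique(flatcodes[~mask]))

# Index only the groups that are actually present.
midx = full_index[via_pd]
print(list(midx))
```

Both routes give the same codes; the difference is only in how much data gets sorted, which matters when the code array is large and the number of groups is small.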

* main:
  Refactor out utility functions from to_zarr (pydata#9695)
  Use the same function to floatize coords in polyfit and polyval (pydata#9691)
Successfully merging this pull request may close these issues.

  • Allow grouping by dask variables
  • Ordered Groupby Keys
3 participants