-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using the shuffle primitive in Xarray #9546
Comments
I think (1,2,3) sound good to me. I'm not sure if I would go with (4), since that would either introduce a new kwarg everywhere, or a new API that would simply be a combination of |
the equivalent would be a little involved:
instead of
Though now I am realizing that we can detect if chunk boundaries line up with group boundaries and use
However, I am not sure if I like that kind of implicit behaviour. We still have the issue of passing |
but doesn't that mean that we'd really want to make that two separate operations? One for shuffling the data such that groups are within a single (possibly shared) chunk, and one where we apply a function to those shuffled chunks? Or am I missing anything? shuffled = ds.groupby(grouper).shuffle()
xr.map_blocks(udf, shuffled, template=...) (not sure if I got the syntax of Edit: otherwise we might also have a |
Is your feature request related to a problem?
dask recently added
dask.array.shuffle
to help with some classic GroupBy.map problems.shuffle
reorders the array so that all members of a single group are in a single chunk, with the possibility of multiple groups in a single chunk. I see a few ways to use this in Xarray:GroupBy.shuffle()
This shuffles and returns a new GroupBy object with which to do further operations (e.g.map
).Dataset.shuffle_by(Grouper)
This shuffles, and returns a new dataset (or dataarray), so that the shuffled data can be persisted to disk or you can do other things later (xref Saving the groups generated from groupby operation #5674)GroupBy.shuffle
under the hood inDatasetGroupBy.quantile
andDatasetGroupBy.median
, so that the exact quantile always works regardless of chunking (right now we raise and error), this seems like a no-brainer.shuffle
kwarg toGroupBy.map
and/orGroupBy.reduce
or a new API (e.g.GroupBy.transform
orGroupBy.map_shuffled
) that will shuffle, thenxarray.map_blocks
a wrapper function that applies theGroupby
on each block. This is how dask dataframe implementsGroupby.apply
#9320 implements (1,2). (1) is mostly for convenience, I could easily see us recommending using (2) before calling the GroupBy.
Thoughts?
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: