Roundoff error between flox and brute force aggregations #398
Comments
I can't reproduce on numpy 2 even with
Also note that you can now use Xarray directly :) https://xarray.dev/blog/multiple-groupers
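For reference, a minimal sketch of the "use Xarray directly" path with the new groupers API; the toy data and the `UniqueGrouper` usage below are illustrative, assuming a recent xarray release:

```python
import numpy as np
import xarray as xr
from xarray.groupers import UniqueGrouper

# Small stand-in for the real data: a grid with an integer label coordinate.
rng = np.random.default_rng(0)
data = xr.DataArray(
    rng.random((10, 10)),
    dims=["lat", "lon"],
    coords={"label1": (("lat", "lon"), rng.integers(1, 4, size=(10, 10)))},
)

# Group directly on the coordinate; xarray can dispatch to flox when it is installed.
res = data.groupby(label1=UniqueGrouper()).mean()
```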
Isn't this reproducing it since you have roundoff error there rather than all zeroes?
Incredible, thank you! EDIT: Some quick feedback. I'm finding with
OK, I was worried by your statement "Differences are on the order of 1E-5," which is too large. This level of floating point inaccuracy is hard to fix, but I can take a look. numpy
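To make that concrete, a small plain-numpy illustration (not flox code) of why chunked or grouped reductions can disagree with a single-pass reduction in the last few digits:

```python
import numpy as np

# Float addition is not associative, so the same values summed in one pass vs.
# in chunks can differ slightly.
rng = np.random.default_rng(0)
x = rng.random(1_000_000, dtype=np.float32)

one_pass = x.sum()
chunked = np.sum([chunk.sum() for chunk in np.array_split(x, 50)])
print(one_pass, chunked, one_pass - chunked)  # typically a small nonzero difference
```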
How many unique groups do you have? Can you share a dummy example?
Yes, it is. I've been meaning to port
That makes sense. Just tried at scale with my full problem size (1km global grid with 1,134 unique groups across three levels of aggregation) and am seeing maximum deviation of 1E-4%.
As in above... a ton. I can't even run the `groupby` version. Here's an example with just the single mask being applied, with 18 unique groups for aggregation. I have a 50-worker dask cluster with 8 CPU and 32 GB RAM per worker.

```python
import flox.xarray
import xarray as xr
import numpy as np
import dask.array as da

np.random.seed(123)

# Simulating 1km global grid
lat = np.linspace(-89.1, 89.1, 21384)
lon = np.linspace(-180, 180, 43200)

# Simulating data we'll be aggregating
data = da.random.random((lat.size, lon.size), chunks=(3600, 3600))
data = xr.DataArray(data, dims=['lat', 'lon'], coords={'lat': lat, 'lon': lon})

# Simulating 18 unique groups on the grid to aggregate over
integer_mask = da.random.choice(np.arange(1, 19), size=(lat.size, lon.size), chunks=(3600, 3600))
integer_mask = xr.DataArray(integer_mask, dims=['lat', 'lon'], coords={'lat': lat, 'lon': lon})

# Add as coordinate
data = data.assign_coords(dict(label1=integer_mask))

# Try with groupby (usually will spike scheduler memory, crash cluster, etc.). Haven't done a lot
# of looking at what's going on to wreck the cluster, just get impatient and give up.
# gb = data.groupby("label1")

# Versus, with expected groups. Runs extremely quickly to set up graph + execute.
res = flox.xarray.xarray_reduce(data, "label1", func="mean", skipna=True, expected_groups=np.arange(1, 19))
```
Re your issue: yes, Xarray doesn't support grouping by a dask array yet; it will eagerly compute it and then find uniques. I need to port over that part of flox too :) BUT good to see that such features are useful!
Can you add your nice example to pydata/xarray#2852 please?
Wild, can you confirm that this is unchanged on numpy 2 please?
Thanks for the tip on
Done!
Just ran this with numpy 2. We can actually reproduce it with my MVE above:

```python
results = []
for int_mask in np.arange(1, 19):
    masked_data = data.where(integer_mask == int_mask)
    res = masked_data.mean()
    results.append(res)
results = xr.concat(results, dim='int_val')
results = results.assign_coords(dict(int_val=np.arange(1, 19)))

# Compare to flox
b = flox.xarray.xarray_reduce(data, "label1", func="mean", skipna=True, expected_groups=np.arange(1, 19))
b = b.compute()
```

My "brute force" solution looks like:

```
array([0.49997059, 0.50000048, 0.50000736, 0.50009579, 0.49996396,
       0.50001181, 0.49997802, 0.49999938, 0.49992536, 0.49997482,
       0.49998581, 0.50000708, 0.50005436, 0.50001996, 0.50004166,
       0.5000194 , 0.50000429, 0.50000408])
```

The flox solution looks like:

```
array([0.49997059, 0.50000048, 0.50000736, 0.50009579, 0.49996396,
       0.50001181, 0.49997802, 0.49999938, 0.49992536, 0.49997482,
       0.49998581, 0.50000708, 0.50005436, 0.50001996, 0.50004166,
       0.5000194 , 0.50000429, 0.50000408])
```

For a % error of:

```
array([ 7.77201837e-14,  1.99839954e-13, -1.33224803e-13,  2.88602698e-13,
       -3.33090918e-13, -1.55427554e-13,  5.55135915e-14, -4.99600978e-13,
        1.88766094e-13,  4.10803207e-13, -3.88589086e-13,  3.77470481e-13,
       -1.11010234e-13, -2.22035741e-13,  1.11013052e-13, -6.66107975e-14,
        4.21881132e-13,  3.55268471e-13])
```

EDIT: Original answer had an error in my code, so here we're at very reasonable roundoff error.
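For anyone reproducing this, a small sketch of how the comparison above can be quantified; it assumes `results` (brute force) and `b` (flox) from the snippets in this thread, and the tolerance is illustrative:

```python
import numpy as np

# Percent difference between the brute-force and flox group means.
pct_error = 100 * (results.values - b.values) / results.values
print(pct_error)

# Agreement up to accumulated floating point roundoff.
np.testing.assert_allclose(results.values, b.values, rtol=1e-12)
```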
This probably won't fit well. It may be better to just do
OK great! I'm not sure we can do much better.
Wondering if you can leverage the
We try not to run implicit computes; these can be really slow if a cluster isn't set up yet.
Hi guys, love this package. I'm migrating some old aggregation code over to `flox` and noticed that there is some roundoff inconsistency between brute-force aggregations and using `flox`.

Problem:
I am using `flox` to take a raster of integers as a mask of a variety of different groups I want to aggregate over. When I brute force it (just loop through the unique mask integers, create a `.where()` mask, and aggregate over the pixels), I get slightly different numbers than when I do this with `flox`.

Package Versions:

Code:

Execution:
Differences are on the order of 1E-5.

Thoughts:
- I tried casting between `float32` and `float64` precision, assuming `numpy` internals might be converting precision. This reduced error but didn't eliminate it.
- I also disabled `bottleneck` through `xr.set_options(...)`, which has caused precision issues before, but this doesn't eliminate the issue either.
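For completeness, a sketch of the two mitigations described under Thoughts; the exact cast and option name below are assumptions about what was tried, not code from the issue:

```python
import xarray as xr

# Assumption: `data` is the DataArray being aggregated, as in the example earlier in the thread.

# 1. Upcast to float64 before reducing, so accumulation happens at higher precision.
data64 = data.astype("float64")

# 2. Disable bottleneck so xarray falls back to numpy's reductions.
xr.set_options(use_bottleneck=False)
```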