Integrating the proxy into the data viewer - progress update, performance observations, and other issues #6
I wanted to make a note that the timings and screenshots above were obtained while running the zarr-proxy via AWS Lambda.
The problem is that a single chunk shape header is being applied to the entire group. I see two high-level ways of resolving this:

**Only Proxy Arrays**

If we just never attempt to open groups, we don't have this problem. The sequence would look like this:

This is not compatible with how we tend to use Xarray, Zarr, and fsspec from Python: there we tend to open the group, and thus can't specialize the headers to differ between arrays. But it would work fine in plain Zarr, and it may be feasible from JavaScript land. Is Xarray support required here?

**Scope the header to specific arrays**

We could scope the header to specify different chunks for different objects. Instead of `{"chunks": "10,10"}`, what about

```json
{
  "chunks": {
    "storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr/bed": "10,10"
  }
}
```

The steps to set up reading would be as follows; the first two are the same as above.

The tricky bit here is aligning the paths specified in the header with the paths specified in the URL. But this method should also work with Xarray.
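To make the proposal concrete, here is a minimal sketch of how a server might resolve such a scoped chunks header against the array path from the request URL. This is purely illustrative; the `resolve_chunks` helper and its exact behavior are my assumptions, not zarr-proxy's implementation.

```python
import json

def resolve_chunks(chunks_header: str, array_path: str):
    """Pick the chunk spec for one array out of a scoped chunks header.

    chunks_header: JSON like
        {"chunks": {"storage.googleapis.com/.../bm.zarr/bed": "10,10"}}
    array_path: the store path of the array being requested.

    Returns a tuple of ints, or None if no override applies.
    """
    mapping = json.loads(chunks_header).get("chunks", {})
    spec = mapping.get(array_path)
    if spec is None:
        return None  # no override for this array: keep its native chunking
    return tuple(int(c) for c in spec.split(","))

# hypothetical usage
header = '{"chunks": {"storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr/bed": "10,10"}}'
print(resolve_chunks(header, "storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr/bed"))
# (10, 10)
```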
Thank you for chiming in, @rabernat. I've implemented a more complex chunks header in #7, and @katamartin and I are wondering if we need the full path in the header key, or if the keys can be relative to the path:

```json
{
  "chunks": {
    "bed": "10,10",
    "x": 5
  }
}
```
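Either way, relative keys are only useful if the proxy can expand them deterministically. As a minimal sketch (purely illustrative; `normalize_chunk_keys` and the idea of joining against the request path are my assumptions, not code from #7), relative keys could be resolved against the store prefix taken from the request URL:

```python
import posixpath

def normalize_chunk_keys(chunks: dict, store_prefix: str) -> dict:
    """Expand keys relative to the requested store path into full paths.

    chunks: e.g. {"bed": "10,10", "x": "5"}
    store_prefix: e.g. "storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr"
    """
    return {posixpath.join(store_prefix, key): spec for key, spec in chunks.items()}

print(normalize_chunk_keys(
    {"bed": "10,10", "x": "5"},
    "storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr",
))
```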
After tinkering with the new approach for specifying chunks headers in #7, I'm happy to report that everything seems to be working with both Xarray and Zarr. The key piece here is that we are now accepting chunks headers along the lines of `bed=10,10,mask=20,20` (i.e. per-variable chunk specs):

```python
In [5]: import xarray as xr, zarr

In [6]: chunks = 'bed=10,10,mask=20,20'

In [7]: url = 'http://127.0.0.1:8000/storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr'

In [8]: store = zarr.storage.FSStore(url, client_kwargs={'headers': {"chunks": chunks}})

In [9]: ds = xr.open_dataset(store, engine='zarr', chunks={})

In [10]: ds
Out[10]:
<xarray.Dataset>
Dimensions:    (y: 13333, x: 13333)
Coordinates:
  * x          (x) int32 -3333000 -3332500 -3332000 ... 3332000 3332500 3333000
  * y          (y) int32 3333000 3332500 3332000 ... -3332000 -3332500 -3333000
Data variables:
    bed        (y, x) float32 dask.array<chunksize=(10, 10), meta=np.ndarray>
    errbed     (y, x) float32 dask.array<chunksize=(3000, 3000), meta=np.ndarray>
    firn       (y, x) float32 dask.array<chunksize=(3000, 3000), meta=np.ndarray>
    geoid      (y, x) int16 dask.array<chunksize=(3000, 3000), meta=np.ndarray>
    mask       (y, x) int8 dask.array<chunksize=(20, 20), meta=np.ndarray>
    source     (y, x) int8 dask.array<chunksize=(3000, 3000), meta=np.ndarray>
    surface    (y, x) float32 dask.array<chunksize=(3000, 3000), meta=np.ndarray>
    thickness  (y, x) float32 dask.array<chunksize=(3000, 3000), meta=np.ndarray>
Attributes: (12/25)
    Author:                                 Mathieu Morlighem
    Conventions:                            CF-1.7
    Data_citation:                          Morlighem M. et al., (2019), Deep...
    Notes:                                  Data processed at the Department ...
    Projection:                             Polar Stereographic South (71S,0E)
    Title:                                  BedMachine Antarctica
    ...                                     ...
    spacing:                                [500]
    standard_parallel:                      [-71.0]
    straight_vertical_longitude_from_pole:  [0.0]
    version:                                05-Nov-2019 (v1.38)
    xmin:                                   [-3333000]
    ymax:                                   [3333000]

In [12]: ds.isel(x=range(2), y=range(2)).bed.compute()
Out[12]:
<xarray.DataArray 'bed' (y: 2, x: 2)>
array([[-5914.538 , -5919.3955],
       [-5910.384 , -5915.8296]], dtype=float32)
Coordinates:
  * x        (x) int32 -3333000 -3332500
  * y        (y) int32 3333000 3332500
Attributes:
    grid_mapping:   mapping
    long_name:      bed topography
    source:         IBCSO and Mathieu Morlighem
    standard_name:  bedrock_altitude
    units:          meters
```
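For readers wondering what this takes server-side, below is a rough sketch of the metadata rewrite involved. This is my own reconstruction under assumptions, not zarr-proxy's actual code: when a `.zarray` document passes through the proxy, the declared chunk shape is replaced with the one requested in the header, so clients see the virtual chunking (serving the re-sliced chunk bytes is omitted here).

```python
import json

def rewrite_zarray(zarray_bytes: bytes, requested_chunks: tuple) -> bytes:
    """Return .zarray metadata with its chunk shape replaced.

    The proxy would then answer chunk requests by reading the relevant
    slices of the real (larger) chunks and re-encoding them.
    """
    meta = json.loads(zarray_bytes)
    if len(requested_chunks) != len(meta["shape"]):
        raise ValueError("chunks rank does not match array rank")
    meta["chunks"] = list(requested_chunks)
    return json.dumps(meta).encode()

# hypothetical usage with minimal 2D array metadata
original = json.dumps({
    "shape": [13333, 13333], "chunks": [3000, 3000],
    "dtype": "<f4", "compressor": None, "fill_value": None,
    "filters": None, "order": "C", "zarr_format": 2,
}).encode()
print(json.loads(rewrite_zarray(original, (10, 10)))["chunks"])  # [10, 10]
```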
Is there any live demo I could peek at?
@rabernat yeah, you should be able to play around with this: |
I guess I meant an actual map. 😉
Aha yeah, the link for the map is https://ncview-js.staging.carbonplan.org/, but the app is definitely not stable 😅. We're currently troubleshooting the integration with the newly added validations.
@katamartin and I have been making progress in integrating the proxy into the data viewer. Our intention is to use the proxy for on-the-fly rechunking of datasets for visualization purposes. The results are looking promising, and the performance is satisfactory (for small datasets and for datasets hosted in AWS S3) even without caching on the backend:

- https://storage.googleapis.com/carbonplan-maps/ncview/demo/single_timestep/air_temperature.zarr
- s3://carbonplan-data-viewer/demo/MURSST.zarr (the original chunk size is roughly 1.21 GB)

Retrieving data from stores hosted outside of S3 takes a long time (as expected). The following are timings for gs://ldeo-glaciology/bedmachine/bm.zarr (the original chunk size is roughly 35 MB).

There's still more work to do to ensure seamless interoperability with existing zarr clients. To illustrate this, below is a code snippet that demonstrates how the proxy can be used via the zarr Python library.
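The snippet itself did not survive extraction; what follows is a hedged reconstruction assembled from pieces quoted later in this thread (the local proxy URL and the `FSStore` call with a group-wide `chunks: 10,10` header), not necessarily the exact code from the original comment:

```python
import zarr

# proxied store: the zarr-proxy endpoint followed by the target bucket path
url = "http://127.0.0.1:8000/storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr"

# request 10x10 virtual chunks for every array in the group
store = zarr.storage.FSStore(url, client_kwargs={"headers": {"chunks": "10,10"}})

group = zarr.open_group(store, mode="r")
print(group["bed"].chunks)  # expected: (10, 10)
```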
If we attempt to access a variable whose dimensionality does not match the chunks specified in the HTTP headers, it causes issues or outright failure. For instance, in our store, `x` is 1D, while the chunks we specified earlier are `10,10`, as defined in `zarr.storage.FSStore(url, client_kwargs={'headers': {"chunks": "10,10"}})`. It would be nice if there were a way to override the headers via fsspec.
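One way the proxy could sidestep this failure mode, sketched under assumptions (the `safe_chunks` helper is hypothetical, not part of zarr-proxy): validate the rank of the requested chunks against each array and fall back to the native chunking on mismatch.

```python
def safe_chunks(requested: str, array_shape: tuple,
                native_chunks: tuple) -> tuple:
    """Apply a 'chunks' header only when its rank matches the array.

    requested: header value such as "10,10"
    array_shape: shape of the array being opened, e.g. (13333,) for `x`
    native_chunks: chunking stored in the array's .zarray metadata
    """
    parsed = tuple(int(c) for c in requested.split(","))
    if len(parsed) != len(array_shape):
        # e.g. "10,10" against the 1D coordinate `x`: ignore the override
        return native_chunks
    return parsed

print(safe_chunks("10,10", (13333,), (13333,)))            # (13333,) -- fallback
print(safe_chunks("10,10", (13333, 13333), (3000, 3000)))  # (10, 10)
```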
I am also CC-ing some folks (@freeman-lab, @norlandrhagen, @jhamman, @rabernat) who might be interested in this, to keep them in the loop on our progress.