Consider better packaging of h5 files from exported ROIs #209

Open
emlynjdavies opened this issue Aug 26, 2024 · 6 comments
Labels
major new feature Changes in high-level pipeline use and/or data output that are not backwards-compatible

Comments

@emlynjdavies
Collaborator

Is your feature request related to a problem? Please describe.
Exported ROIs are written to a folder that eventually becomes a very long list of h5 files.

Describe the solution you'd like
Package the ROIs into a single file that is easier to transport between discs.

Describe alternatives you've considered
Maybe tools like xarray.open_mfdataset could be helpful here?

@emlynjdavies emlynjdavies added the patch / enhancement improved functionality or patch intended for changes that require bumping only the PATCH number label Aug 26, 2024
@emlynjdavies
Collaborator Author

Is this related to #207, @nepstad?

@nepstad
Collaborator

nepstad commented Aug 26, 2024

> is this related to #207 @nepstad ?

Not directly, no. #207 concerns the stats netcdf file.

@nepstad
Collaborator

nepstad commented Sep 17, 2024

We could supply a "merge ROI files" option to the PyOPIA command line interface. Since the ROIs have different shapes, xarray/netcdf is not ideal, but we could simply drop them into one big hdf5 file.

@nepstad
Collaborator

nepstad commented Sep 17, 2024

Something like this could be a solution, producing one merged hdf5 file with one group for each processed image, containing all the ROIs as datasets under those groups:

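import os
from glob import glob

import h5py

h5files = sorted(glob('...'))
with h5py.File('silcam_rois_combined.h5', 'w') as f:
    for h5file in h5files:
        with h5py.File(h5file, 'r') as file_in:
            # One group per processed image, named after the source file (extension stripped)
            g = f.create_group(os.path.splitext(os.path.basename(h5file))[0])
            for name in file_in:
                # Copy only the datasets whose name contains 'PN'
                if 'PN' in name:
                    g.create_dataset(name, data=file_in[name][:])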
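
For completeness, reading the ROIs back out of such a merged file could look roughly like the sketch below. It assumes the layout described above (every top-level entry is a group for one image, and every entry in a group is an ROI dataset). The helper name `iter_rois` is made up for illustration and is not part of PyOPIA.

```python
import h5py


def iter_rois(merged_path):
    """Yield (image_name, roi_name, array) for every ROI in a merged file.

    Hypothetical helper: assumes each top-level entry is a group (one per
    processed image) holding only ROI datasets.
    """
    with h5py.File(merged_path, "r") as f:
        for image_name, group in f.items():
            for roi_name, dataset in group.items():
                # [()] reads the whole dataset into memory as a NumPy array
                yield image_name, roi_name, dataset[()]
```

Usage would be along the lines of `for image_name, roi_name, roi in iter_rois('silcam_rois_combined.h5'): ...`. Since each ROI is its own dataset, the differing shapes are not a problem here.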

@nepstad
Collaborator

nepstad commented Sep 18, 2024

Another option is to look into a more cloud-friendly storage solution, such as zarr: https://zarr.dev/

@emlynjdavies
Collaborator Author

emlynjdavies commented Sep 19, 2024

If we switched to zarr, we could put the ROIs as a subgroup within the main STATS output - then we could have all output files in the same place.

See zarr groups here

@emlynjdavies emlynjdavies added major new feature Changes in high-level pipeline use and/or data output that are not backwards-compatible and removed patch / enhancement improved functionality or patch intended for changes that require bumping only the PATCH number labels Sep 19, 2024