Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature]: Add read_nwb to simplify reading nwbfiles #1974

Open
3 tasks done
h-mayorquin opened this issue Oct 23, 2024 · 23 comments
Open
3 tasks done

[Feature]: Add read_nwb to simplify reading nwbfiles #1974

h-mayorquin opened this issue Oct 23, 2024 · 23 comments
Assignees
Labels
category: proposal proposed enhancements or new features priority: low alternative solution already working and/or relevant to only specific user(s)
Milestone

Comments

@h-mayorquin
Copy link
Contributor

h-mayorquin commented Oct 23, 2024

What would you like to see added to PyNWB?

This is a comment that I have gotten from various users: reading and nwbfile is not as easy as it could be.

At the moment, when I google "how to read and nwbfile", the results show a suggestion to use NWBHDF5IO. This is also what we have on the tutorial.

While the IO approach is flexible, powerful, and follows good practices (i.e., it can be used as a context manager) it also suffers from what I think are several problems that detract from usability:

  1. It has too many options: path, file, region, mode, manager, etc.
  2. It is hard to remember the name to import.
  3. It does not read zarr.
  4. It requires two method calls, first instantiating the IO and then using read on the IO.

What solution would you like?

I think that we could provide users with a simple entry point that covers most cases and leave the IO approach as the power tool behind the scenes:

from pynwb import read_nwb

nwbfile = read_nwb(file_path)

I think the name is easier to remember and intuitive about what it does.

For simplicity, this function should:

  • Work only paths
  • Always on read mode
  • Open both zarr and hdf5.
  • load_namespaces=True
  • If we are ambitious we can have a default streaming choice if a link to dandi s3 asset is passed.

Downsides?

Do you have any interest in helping implement the feature?

Yes.

Code of Conduct

@h-mayorquin h-mayorquin changed the title [Feature]: Add read_nwb as a simple and memorable method to open an nwbfile [Feature]: Add read_nwb to simplify reading nwbfiles Oct 23, 2024
@stephprince
Copy link
Contributor

I like this idea! For me the main downside would be any potential confusion for users about having multiple methods to read a file and when to use read_nwb vs. NWBHDF5IO, but I think this could be clarified in the docs.

@oruebel @rly @bendichter thoughts?

@stephprince stephprince added category: proposal proposed enhancements or new features priority: low alternative solution already working and/or relevant to only specific user(s) labels Oct 24, 2024
@oruebel
Copy link
Contributor

oruebel commented Oct 24, 2024

potential confusion for users about having multiple methods to read a file and when to use read_nwb vs. NWBHDF5IO

I would make this part of io/utils.py to also make it clear on import that this is a convenience function. In addition, I think we want to be clear about the scoping of the function . Ie., folks will ask for the function to do everything, i.e, be able to open any NWB file and be able to configurabl all possible options. I think we want to limit scope the function to be: 1) read only and 2) have only the filename as a parameter and not support additional configurations. I.e., have it as a convenience function for opening a file for just read. As such NWBHDF5IO would still be used in most of our tutorials, since many tutorials show how to write data and read_io would be used mainly in read tutorials. Also, just scoping for read will be challenging enough since the function will need to work with:

  • both HDF5 and Zarr and return either NWBHDF5IO or NWBZarrIO an NWBFile object, which uses either NWBHDF5IO or NWBZarrIO for read (i.e., NWBFile.get_read_io() may be either NWBHDF5IO or NWBZarrIO)
  • S3 paths to open a files on DANDI
  • know when to load namespaces (or default to always load namespaces?)
  • ...

@stephprince
Copy link
Contributor

stephprince commented Oct 24, 2024

Here an example implementation of a read_nwb method that Cody made for NWBInspector that might be useful as a reference. And if we do incorporate streaming, here is a related proposal on a default function for remote read: #1739

Also, just scoping for read will be challenging enough since the function will need to work with:
both HDF5 and Zarr and return either NWBHDF5IO or NWBZarrIO

If I’m understanding the initial proposal, with this convenience method we would not return an IO object but just the NWBFile object?

@oruebel
Copy link
Contributor

oruebel commented Oct 24, 2024

If I’m understanding the initial proposal, with this convenience method we would not return an IO object but just the NWBFile object?

Sorry, you are correct, read_nwb should always return the NWBFile (not the IO object). I amended my comment above accordingly. However, the I/O object is still accessible via the NWBFile object via NWBFile.get_read_io(), i.e.,:

nwbfile = read_nwb('myfile.nwb')   # read HDF5
nwbfile.get_read_io()   # this returns NWBHDF5IO

nwbfile = read_nwb('myfile.zarr.nwb')  # read Zarr.
nwbfile.get_read_io()   # this returns NWBZarrIO

@rly rly assigned rly and stephprince and unassigned rly Oct 31, 2024
@rly rly added this to the Future milestone Oct 31, 2024
@magland
Copy link
Contributor

magland commented Nov 1, 2024

I like this idea, but I would hope that it would also support passing in an h5py object instead of just a file path or an s3 path. That would allow streaming. For what I do, I am almost always streaming from a remote location. Passing in the h5py object provides a lot of flexibility on how this would be done (streaming method, authentication, etc).

@h-mayorquin
Copy link
Contributor Author

I like this idea, but I would hope that it would also support passing in an h5py object instead of just a file path or an s3 path. That would allow streaming. For what I do, I am almost always streaming from a remote location. Passing in the h5py object provides a lot of flexibility on how this would be done (streaming method, authentication, etc).

My view is that the intended audience of this function is users without a lot of experience. I would prefer to keep the signature and the input types very simple and defer more complex use cases like yours to their respective backend IOs.

I am coming at this from the angle that if the format is successful reading will be way more common than writing, so I just want to have a top level method that is easier to remember, use and covers 95 % of the cases for reading files.

@h-mayorquin
Copy link
Contributor Author

h-mayorquin commented Nov 1, 2024

To make this more concrete, this is what I am envisioning as the signature, input types and docstring:

from pathlib import Path
from pynwb import NWBFile

def read_nwb(path: str | Path) -> NWBFile:
    """Read an NWB file from a local path or remote URL.

    Provides a simple, high-level interface for reading NWB files in the most 
    common use cases. Automatically handles both HDF5 and Zarr formats.
    For advanced use cases (parallel I/O, custom namespaces), use NWBHDF5IO or NWBZarrIO.

    Parameters
    ----------
    path : str or pathlib.Path
        Path to the NWB file. Can be either a local filesystem path to an HDF5 (.nwb) 
        or Zarr (.zarr) file, or a remote URL (e.g., DANDI S3 asset URL).

    Returns
    -------
    pynwb.NWBFile
        The loaded NWB file object containing all datasets and metadata.

    See Also
    --------
    pynwb.NWBHDF5IO : Core I/O class for HDF5 files with advanced options.
    hdmf_zarr.nwb.NWBZarrIO : Core I/O class for Zarr files with advanced options.

    Notes
    -----
    This function uses the following defaults:
    * Always opens in read-only mode
    * Automatically loads namespaces
    * Detects file format based on extension
    * Automatically handles local and remote paths

    Advanced features requiring direct use of IO classes include:
    * Custom namespace extensions
    * Parallel I/O with MPI
    * Custom build managers
    * Write or append modes
    * Pre-opened HDF5 file objects or Zarr stores
    * Remote file access configuration

    Examples
    --------
    Read a local NWB file:

    >>> from pynwb import read_nwb
    >>> nwbfile = read_nwb("path/to/file.nwb")

    Read from a remote DANDI asset:

    >>> nwbfile = read_nwb("s3://dandiarchive/.../file.nwb")
    """
    pass  # Implementation details would go here

@magland
Copy link
Contributor

magland commented Nov 1, 2024

I see your point, but I feel like allowing the argument to be an h5py object as well wouldn't complicate things very much. This would really simplify all my neurosift scripts, so I hope this could be considered.

Regarding s3 url, I think it's important to also support https://api.dandiarchive.org/... URLs Would be nice to allow any https://... so it's not limited to dandi.

@oruebel
Copy link
Contributor

oruebel commented Nov 1, 2024

This would really simplify all my neurosift scripts, so I hope this could be considered.

@magland could you add a brief example to illustrate how read_nwb would simplify your scripts vs. using NWBHDF5IO? If you are passing in an open h5py object, then I think either should look very similar, but I'm probably missing something.

@magland
Copy link
Contributor

magland commented Nov 1, 2024

@oruebel

f = lindi.LindiH5pyFile.from_lindi_file(url)
io = pynwb.NWBHDF5IO(file=f, mode='r')
nwbfile = io.read()

would become

f = lindi.LindiH5pyFile.from_lindi_file(url)
nwbfile = read_nwb(f)

So it saves one line... but it becomes a lot simpler to read.

@oruebel
Copy link
Contributor

oruebel commented Nov 2, 2024

So it saves one line... but it becomes a lot simpler to read.

Thanks @magland for the clarification.

I think allowing a file as input instead of path would be OK, since it doesn't add any parameters to the interface. We may want to restrict it to h5py.File objects that are in read-only mode (or at least warn if they are not), since otherwise you might get a write-able file instead of a read-only file.

Just a thought, now that the io object is being stored on the NWBFile object. We could add static read_nwb methods on each backend, i.e., nwbfile = NWBHDF5IO.read_nwb(). This would help simplify the logic of the utility method read_nwb, since it would just call NWBHDF5IO.read_nwb() or NWBZarrIO.read_nwb() and then each backend could add additional logic. E.g., NWBHDF5IO.read_nwb() could accept and h5py.File instead of a path while keeping the interface of read_nwb simple to allow only path. I.e., in this case the lindi example would change to:

f = lindi.LindiH5pyFile.from_lindi_file(url)
nwbfile = NWBHDF5IO.read_nwb(f)

@h-mayorquin @magland what do you think? Would adding NWBHDF5IO.read_nwb() and NWBZarrIO.read_nwb() in addition to read_nwb make things simpler or worse?

@magland
Copy link
Contributor

magland commented Nov 2, 2024

@oruebel That makes sense to me. I think if we had NWBHDF5IO.read_nwb(f) that supported h5py objects, then I wouldn't need the read_nwb utility, which could be reserved for the simplest case.

@h-mayorquin
Copy link
Contributor Author

This would help simplify the logic of the utility method read_nwb, since it would just call NWBHDF5IO.read_nwb() or NWBZarrIO.read_nwb() and then each backend could add additional logic. E.g., NWBHDF5IO.read_nwb() could accept and h5py.File instead of a path while keeping the interface of read_nwb simple to allow only path.

This is great. Another advantage is that the reading code is in its "natural" place: the backend IO object. read_nwb is then just a router over those new methods.

@oruebel
Copy link
Contributor

oruebel commented Nov 4, 2024

Sounds like we have a plan:

PyNWB

HDMF_ZARR

  • Add NWBZarrIO.read_nwb() which should accept: 1) local path, 2) S3 path, or 3) an open Zarr store.

HDMF / HDMF_ZARR / PyNWB

  • We may also need to refine the logic in HDF5IO.can_read() and ZarrIO.can_read() (or overwrite the methods in NWBHDF5IO and NWBZarrIO) to handle S3 paths correctly.

@oruebel
Copy link
Contributor

oruebel commented Nov 4, 2024

I created hdmf-dev/hdmf-zarr#225 for this and can take a stab at the changes needed in HDFM_ZARR.

@h-mayorquin would you be interested in taking a stab at the necessary changes in PyNWB?

@h-mayorquin
Copy link
Contributor Author

Yes, I can take a look at it!

@oruebel
Copy link
Contributor

oruebel commented Nov 7, 2024

@h-mayorquin the PR for adding NWBZarrIO.read_nwb for the Zarr backend is here hdmf-dev/hdmf-zarr#226

@h-mayorquin
Copy link
Contributor Author

Here is what I think is the analogous PR on pynwb (first part):
#1979

@oruebel
Copy link
Contributor

oruebel commented Nov 7, 2024

@h-mayorquin one item to add to the ToDo list is that we should also update the tutorial on reading NWB files https://pynwb.readthedocs.io/en/latest/tutorials/general/plot_read_basics.html#sphx-glr-tutorials-general-plot-read-basics-py to describe read_nwb for opening a file for reading.

@h-mayorquin
Copy link
Contributor Author

Sounds good, I can take care of that in the PR for pynwb.read_nwb().

@oruebel
Copy link
Contributor

oruebel commented Nov 12, 2024

@h-mayorquin hdmf-dev/hdmf-zarr#226 for adding NWBZarrIO.read_nwb has been merged

@h-mayorquin
Copy link
Contributor Author

OK, now that the PyNWB part has been merged #1979 I think I can work on the second part:

Add pynwb.read_nwb() which accepts a file path and calls the appropriate backend-specific read_nwb method

And updating the tutorial.

@h-mayorquin
Copy link
Contributor Author

h-mayorquin commented Nov 16, 2024

I added pynwb.read_nwb for local files:

#1994

I would wait for one of our meetings to discuss support for remote_paths in pynwb.read_nwb.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: proposal proposed enhancements or new features priority: low alternative solution already working and/or relevant to only specific user(s)
Projects
None yet
Development

No branches or pull requests

5 participants