Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature]: make pandas optional / installation more lightweight #2010

Open
3 tasks done
yarikoptic opened this issue Dec 14, 2024 · 7 comments
Open
3 tasks done

[Feature]: make pandas optional / installation more lightweight #2010

yarikoptic opened this issue Dec 14, 2024 · 7 comments
Labels
category: enhancement improvements of code or code behavior priority: low alternative solution already working and/or relevant to only specific user(s)

Comments

@yarikoptic
Copy link
Contributor

What would you like to see added to PyNWB?

pandas is great but quite a heavy dependency. ATM it even simply fails to install for me "easily" in a test windows env.

image

But it seems to be used only "optionally" (clearly should not mandate solely for conversion of age into isoformat)

❯ git grep -e pandas -e 'pd\.' -- src
src/pynwb/file.py:import pandas as pd
src/pynwb/file.py:            args_to_set["age"] = pd.Timedelta(args_to_set["age"]).isoformat()
src/pynwb/file.py:             'type': ('scalar_data', np.ndarray, list, tuple, pd.DataFrame, DynamicTable, NWBContainer, ScratchData),
src/pynwb/file.py:             'doc': ('Description for the internal DynamicTable used to store a pandas.DataFrame. This '
src/pynwb/file.py:                     'list, tuple, or pandas.DataFrame. Ignored when passing in an NWBContainer, '
src/pynwb/file.py:        if isinstance(data, (str, int, float, bytes, np.ndarray, list, tuple, pd.DataFrame)):
src/pynwb/file.py:                       'list, tuple, or pandas.DataFrame as scratch data.')
src/pynwb/file.py:            if isinstance(data, pd.DataFrame):
src/pynwb/file.py:                           'list, tuple, or pandas.DataFrame as scratch data.')
src/pynwb/file.py:                           'list, tuple, or pandas.DataFrame as scratch data.')
src/pynwb/icephys.py:        """Convert the collection of tables to a single pandas DataFrame"""

Is your feature request related to a problem?

No response

What solution would you like?

so I wonder if it could be moved into some extra depends, and some checks adjusted depending on pandas availability (if not available -- can't be of that data time)?

Do you have any interest in helping implement the feature?

No.

Code of Conduct

@oruebel
Copy link
Contributor

oruebel commented Dec 16, 2024

Pandas is a dependency of hdmf and many of the pandas functionality sits there. In particular functionality related to DyanmicTable lives there, e.g., to convert whole or partial tables to Pandas DataFrames. I.e., we may not need to specify it as a dependency in PyNWB directly, but it would still be a dependency via HDMF.

I'm not sure how hard it would be to modify HDMF to make Pandas optional, but I suspect this may not be trivial since pandas DataFrame is used as the main format for returning results from tables.

@oruebel oruebel added category: enhancement improvements of code or code behavior priority: low alternative solution already working and/or relevant to only specific user(s) labels Dec 16, 2024
@oruebel
Copy link
Contributor

oruebel commented Dec 16, 2024

@rly can you comment

@rly
Copy link
Contributor

rly commented Dec 17, 2024

Related: #1282

pandas DataFrame is used as the main format for returning results from tables

Yes, that would make it difficult. If you slice a set of rows from a DynamicTable, a Pandas DataFrame is returned. This is a fairly common use case if you are reusing data, but users could use the pattern of accessing a single column and slicing the column, which does not involve pandas DataFrames. So, we would have to change how slices of DynamicTable are returned and raise a notice when a user tries to use functionality requiring pandas.

I understand that pandas is a heavy requirement, but manipulating tables is pretty common and IMO most users are more comfortable doing that in pandas than in our custom HDMF table syntax.

@yarikoptic
Copy link
Contributor Author

Well, I was not suggesting to change those aspects. It should be indeed pandas where efficient manipulation of tables is needed. I just want to point that I do not think ( but I could be wrong) that e.g. dandi client doesn't need any of that functionality which relies on pandas and data manipulation. And likely others too. What I was suggesting is that to condition such dependencies to be optionally installed , and otherwise to allow pynwb to be installed/used without dragging the full kitchen sink behind. Interfaces which spit out Pandas could raise exceptions suggesting to install pandas and declaring that e.g. for that purpose pynwb should be installed as pynwb[full] or pynwb[pandas] or alike. For extra user friendliness could even offer to run pip install pandas on user's behalf if detected installation through pip (somehow).

@rly
Copy link
Contributor

rly commented Dec 17, 2024

I see. I am curious - what does the dandi client need from pynwb? Just validation?

@yarikoptic
Copy link
Contributor Author

validation and access to metadata to extract/harmonize while uploading files to DANDI. We also collect statistics over data types , see e.g. "Assets summary" at https://dandiarchive.org/dandiset/000003 .

@rly
Copy link
Contributor

rly commented Jan 9, 2025

Thanks. We discussed this at our developer meeting today and all lean toward keeping pandas as part of the default installation because it is required for accessing and manipulating table data within NWB. This is a tension between pynwb being both a low-level API and providing higher-level data access functions.

An alternate approach is for us to release a "pynwb-minimal" package that does not include pandas (and scipy). pyproject.toml does not support extras removing requirements, so we cannot do something like pip install "pynwb[minimal]".

Let's keep the discussion open for now and revisit this later as use cases evolve.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: enhancement improvements of code or code behavior priority: low alternative solution already working and/or relevant to only specific user(s)
Projects
None yet
Development

No branches or pull requests

3 participants