Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Adding Narwhals as compatibility layer between libraries #339

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

DeaMariaLeon
Copy link
Contributor

Related to issue #325

If we add the proposed code, nothing changes when using a pandas dataframe. And the rest of the libraries can be handled with Narwhals. That's needed because Narwhals and Polars don't allow duplicated column names, so the pandas path has to be kept.

We also have to do this because on iTables (currently), the returned results from the function datatables_row are different when using Polars vs. pandas in some cases.

So, with the proposed changes, this function would now work for:

import pyarrow as pa 
df = pd.DataFrame({"a": [2, 1]})
datatables_rows(pa.table(df))

out:
'[[2], [1]]'

Or this:

import modin.pandas as mpd
df = pd.DataFrame({"a": [2, 1]})

data = {"a": [1, 2, 3]}
datatables_rows(mpd.DataFrame(pd.DataFrame(data))

out:
'[[1], [2], [3]]'

I could eliminate from lines 88 to 114 from file datatables_format.py and handle all with Narwhals. But the results would be different for some cases in test_databases_format.py. I can elaborate...

I would like to have some feedback before going further please. Thanks.
Friendly ping @MarcoGorelli

Copy link

Thank you for making this pull request.

Did you know? You can try it on Binder: Binder:lab.

Also, the version of ITables developed in this PR can be installed with pip:

pip install git+https://github.com/DeaMariaLeon/itables.git@Narwhalify2

(this requires nodejs, see more at Developing ITables)

Copy link

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, thanks for starting this 🙏 !

I could eliminate from lines 88 to 114 from file datatables_format.py and handle all with Narwhals. But the results would be different for some cases in test_databases_format.py. I can elaborate...

sure, please do!

# Polars DataFrame
df = nw.from_native(df)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably use eager_only=True if we're calling rows

# Polars DataFrame
df = nw.from_native(df)
data = list(df.iter_rows())

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drive-by - this would be more efficient as data = df.rows() (it produces the same thing, just faster)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did the PR for this. Thanks for the feedback.

@codecov-commenter
Copy link

codecov-commenter commented Nov 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.20%. Comparing base (038f597) to head (14b8d22).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #339      +/-   ##
==========================================
- Coverage   94.76%   93.20%   -1.56%     
==========================================
  Files          24       24              
  Lines        1280     1281       +1     
==========================================
- Hits         1213     1194      -19     
- Misses         67       87      +20     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mwouts
Copy link
Owner

mwouts commented Nov 30, 2024

Thank you @DeaMariaLeon for you PR, and thanks @MarcoGorelli for reviewing it. I am eager to test it! (today or tomorrow hopefully)

A few comments:

  1. It's great to see that (at least in this first stage) the PR is very well contained
  2. I am OK with adjusting the tests if in exchange we can simplify the code
  3. Oh yes sorry for having tests on all these corner cases e.g. the duplicated columns etc, but users sometime generate that kind of tables 😉
  4. I am fine with adding narwhals as a dependency (I plan to add anywidget too) but maybe I would like to ensure that ITables still works when narwhals is not installed (you can leave that with me)
  5. I plan to add more tests / more pages in the documentation to ensure/demo that all the sample tables work well for all the dataframe frameworks supported by Narwhals.
  6. No worries re the failing tests on Python 3.7, I can change the minimum Python version to 3.8

Thanks again for working on this!

@DeaMariaLeon
Copy link
Contributor Author

@mwouts, thanks for the feedback.

I am OK with adjusting the tests if in exchange we can simplify the code

Ok. I'll share how it would look like if a path for pandas is not needed.

Oh yes sorry for having tests ...

You don't have to be sorry. 🙂


In case the "elaboration" is still useful, (#339 (review)), here it is:

When running test_datatables_rows (in test_datatables_format.py), the differences of using pandas vs. Polars dfs are found when testing float, list, dict, and df_with_named_column_axis.
Plus Narwhals doesn't have transpose. Since it's only used in the tests, there may be a work-around. (Per private conversation with Marco). And we'll still have the issue of not allowing duplicated column names.

So before adding Narwhals:

  1. The float test uses round for a pandas df. If the input is a Polars df, this will fail. Polars returns more than 6 digits.

import pandas as pd
import numpy as np
import polars as pl
import math

# with pandas df
datatables_rows(pd.DataFrame({"x": [0.2, math.pi, np.nan, -np.inf]}))
output: '[[0.2], [3.141593], [NaN], [-Infinity]]'

# with Polars
datatables_rows(pl.DataFrame({"x": [0.2, math.pi, np.nan, -np.inf]}))
output: '[[0.2], [3.141592653589793], [NaN], [-Infinity]]'
  1. Using list in the test:
# with pandas
datatables_rows(pd.DataFrame({"list": [[1], [2, 3]]}))
output: '[["[1]"], ["[2, 3]"]]'

# with polars
datatables_rows(pl.DataFrame({"list": [[1], [2, 3]]}))
output: '[[[1]], [[2, 3]]]'

  1. With dict:
# with pandas
datatables_rows(pd.DataFrame({"dict": [{"a": 1}, {"b": [1, 2]}]}))
output: '[["{\'a\': 1}"], ["{\'b\': [1, 2]}"]]'

# with polars
datatables_rows(pl.DataFrame({"dict": [{"a": 1}, {"b": [1, 2]}]}))
output: '[[{"a": 1}], [{"a": null}]]'
  1. When testing df_with_named_column_axis, the Polars test needs to be modified but results can be the same.

  2. To test a transposed_df, as I said earlier, Narwhals doesn't have transpose, but there may be a worke-around.

@DeaMariaLeon
Copy link
Contributor Author

Just to be clear @mwouts: if we add Narwhals to handle all the libraries (without keeping the pandas' path separately), all the test results will be like Polars'.

@mwouts
Copy link
Owner

mwouts commented Dec 2, 2024

Just to be clear @mwouts: if we add Narwhals to handle all the libraries (without keeping the pandas' path separately), all the test results will be like Polars'.

Yes, that would make sense. The reason why you see only 6 digits for Pandas DataFrames is that ITables does use the pandas formatters. That's an attempt to give a rendering that is similar to what the users get when they don't use ITables. Of course I could think of other ways, e.g. apply the formatting on the Javascript side directly (but then passing numbers with a higher precision would also make the notebooks heavier).

Maybe it's less impactful for now to leave the Pandas part as is? We might have enough to do for now with the estimated size and the downsampling function, plus testing all the various table types. Are you able to run the documentation notebooks? See this paragraph and the next one.

@DeaMariaLeon
Copy link
Contributor Author

It seems that all the Jupyter notebook examples run correctly, @mwouts. I run the ones in the link "paragraphs" you mentioned, and the ones in the Advance parameters. But I don't know how to share them here - export to pdf didn't work and Jupiter files can't be attached.

@mwouts
Copy link
Owner

mwouts commented Dec 7, 2024

It seems that all the Jupyter notebook examples run correctly, @mwouts. I run the ones in the link "paragraphs" you mentioned, and the ones in the Advance parameters. But I don't know how to share them here - export to pdf didn't work and Jupiter files can't be attached.

Hi @DeaMariaLeon , yes sure, it's more for you to see how changes you make might change how the dataframes are rendered - no need to share the notebooks, I can checkout your branch and rerun them locally.

Re your PR, I would love to see it work for actual modin dataframes, see my PR #341 . It currently works for the dataframes that don't need to be downsampled.

The downsampling is very important for ITables because the table data gets embedded in the notebook (I don't have anything like a server mode at the moment), and if we are not careful about this we can make the notebook huge and crash the web browser.

To make the downsampling work (https://github.com/mwouts/itables/blob/main/src/itables/downsample.py) I would need to figure out to do this with narwhals:

  1. how to estimate the size of a dataframe
  2. how to keep only the first and last lines of a dataframe (and first and last columns too)

Let me know if you have any hints re the above! Thanks!

@DeaMariaLeon
Copy link
Contributor Author

Hi @mwouts, we'll be adding "estimated_size" to Narwhals in order to implement downsampling.
Besides that, I think that we have everything we need (in Narwhals) already.

@DeaMariaLeon
Copy link
Contributor Author

Hi @mwouts, we have now estimated_size in Narwhals. You don't need nbytes() any more. And you can select the units you need, so as_nbytes doesn't need to convert.

I just saw on the PR you experimented with Modin (the one on top of this one), you use series. I wonder if we should add that too. Our new function works for dataframes only.

Should I open a separate PR to (try to) modify downsample.py.. unless you want to do it yourself. If it's the later, I'll share how to select the last and first rows and columns.

@mwouts
Copy link
Owner

mwouts commented Dec 15, 2024

Hi @mwouts, we have now estimated_size in Narwhals. You don't need nbytes() any more. And you can select the units you need, so as_nbytes doesn't need to convert.

Thank you @DeaMariaLeon , that's awesome! That's in the last version of narwhals==1.18.3 then, correct?

I just saw on the PR you experimented with Modin (the one on top of this one), you use series. I wonder if we should add that too. Our new function works for dataframes only.

Yes ITables user can pass a Series to show, but when that happens I convert it to a dataframe with a single column. I will need to figure out how to do this in all generality (possibly with .to_frame()...), but to answer your question about estimated_size I won't need it for series.

Should I open a separate PR to (try to) modify downsample.py.. unless you want to do it yourself. If it's the later, I'll share how to select the last and first rows and columns.

Yes thanks for offering I think I will do try on my end. I am looking at the narwhals documentation and I have found already head, tail and concat for the row selection, and select for the column selection, let me see how it goes.

@DeaMariaLeon
Copy link
Contributor Author

That's in the last version of narwhals==1.18.3 then, correct?

Yes

thanks for offering I think I will do try on my end

Very good - you'll be much faster!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants