WIP: Adding Narwhals as compatibility layer between libraries #339

DeaMariaLeon · 2024-11-29T19:10:51Z

Related to issue #325

If we add the proposed code, nothing changes when using a pandas dataframe. And the rest of the libraries can be handled with Narwhals. That's needed because Narwhals and Polars don't allow duplicated column names, so the pandas path has to be kept.

We also have to do this because on iTables (currently), the returned results from the function datatables_row are different when using Polars vs. pandas in some cases.

So, with the proposed changes, this function would now work for:

import pyarrow as pa 
df = pd.DataFrame({"a": [2, 1]})
datatables_rows(pa.table(df))

out:
'[[2], [1]]'

Or this:

import modin.pandas as mpd
df = pd.DataFrame({"a": [2, 1]})

data = {"a": [1, 2, 3]}
datatables_rows(mpd.DataFrame(pd.DataFrame(data))

out:
'[[1], [2], [3]]'

I could eliminate from lines 88 to 114 from file datatables_format.py and handle all with Narwhals. But the results would be different for some cases in test_databases_format.py. I can elaborate...

I would like to have some feedback before going further please. Thanks.
Friendly ping @MarcoGorelli

github-actions · 2024-11-29T19:11:06Z

Thank you for making this pull request.

Did you know? You can try it on Binder: .

Also, the version of ITables developed in this PR can be installed with pip:

pip install git+https://github.com/DeaMariaLeon/itables.git@Narwhalify2

(this requires nodejs, see more at Developing ITables)

MarcoGorelli

nice, thanks for starting this 🙏 !

I could eliminate from lines 88 to 114 from file datatables_format.py and handle all with Narwhals. But the results would be different for some cases in test_databases_format.py. I can elaborate...

sure, please do!

MarcoGorelli · 2024-11-29T19:54:26Z

src/itables/datatables_format.py

        # Polars DataFrame
+        df = nw.from_native(df)


we should probably use eager_only=True if we're calling rows

MarcoGorelli · 2024-11-29T19:54:52Z

src/itables/datatables_format.py

        # Polars DataFrame
+        df = nw.from_native(df)
        data = list(df.iter_rows())


drive-by - this would be more efficient as data = df.rows() (it produces the same thing, just faster)

I did the PR for this. Thanks for the feedback.

codecov-commenter · 2024-11-30T13:32:28Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.20%. Comparing base (038f597) to head (14b8d22).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #339      +/-   ##
==========================================
- Coverage   94.76%   93.20%   -1.56%     
==========================================
  Files          24       24              
  Lines        1280     1281       +1     
==========================================
- Hits         1213     1194      -19     
- Misses         67       87      +20

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

mwouts · 2024-11-30T13:58:19Z

Thank you @DeaMariaLeon for you PR, and thanks @MarcoGorelli for reviewing it. I am eager to test it! (today or tomorrow hopefully)

A few comments:

It's great to see that (at least in this first stage) the PR is very well contained
I am OK with adjusting the tests if in exchange we can simplify the code
Oh yes sorry for having tests on all these corner cases e.g. the duplicated columns etc, but users sometime generate that kind of tables 😉
I am fine with adding narwhals as a dependency (I plan to add anywidget too) but maybe I would like to ensure that ITables still works when narwhals is not installed (you can leave that with me)
I plan to add more tests / more pages in the documentation to ensure/demo that all the sample tables work well for all the dataframe frameworks supported by Narwhals.
No worries re the failing tests on Python 3.7, I can change the minimum Python version to 3.8

Thanks again for working on this!

DeaMariaLeon · 2024-12-01T11:45:10Z

@mwouts, thanks for the feedback.

I am OK with adjusting the tests if in exchange we can simplify the code

Ok. I'll share how it would look like if a path for pandas is not needed.

Oh yes sorry for having tests ...

You don't have to be sorry. 🙂

In case the "elaboration" is still useful, (#339 (review)), here it is:

When running test_datatables_rows (in test_datatables_format.py), the differences of using pandas vs. Polars dfs are found when testing float, list, dict, and df_with_named_column_axis.
Plus Narwhals doesn't have transpose. Since it's only used in the tests, there may be a work-around. (Per private conversation with Marco). And we'll still have the issue of not allowing duplicated column names.

So before adding Narwhals:

The float test uses round for a pandas df. If the input is a Polars df, this will fail. Polars returns more than 6 digits.


import pandas as pd
import numpy as np
import polars as pl
import math

# with pandas df
datatables_rows(pd.DataFrame({"x": [0.2, math.pi, np.nan, -np.inf]}))
output: '[[0.2], [3.141593], [NaN], [-Infinity]]'

# with Polars
datatables_rows(pl.DataFrame({"x": [0.2, math.pi, np.nan, -np.inf]}))
output: '[[0.2], [3.141592653589793], [NaN], [-Infinity]]'

Using list in the test:

# with pandas
datatables_rows(pd.DataFrame({"list": [[1], [2, 3]]}))
output: '[["[1]"], ["[2, 3]"]]'

# with polars
datatables_rows(pl.DataFrame({"list": [[1], [2, 3]]}))
output: '[[[1]], [[2, 3]]]'

With dict:

# with pandas
datatables_rows(pd.DataFrame({"dict": [{"a": 1}, {"b": [1, 2]}]}))
output: '[["{\'a\': 1}"], ["{\'b\': [1, 2]}"]]'

# with polars
datatables_rows(pl.DataFrame({"dict": [{"a": 1}, {"b": [1, 2]}]}))
output: '[[{"a": 1}], [{"a": null}]]'

When testing df_with_named_column_axis, the Polars test needs to be modified but results can be the same.
To test a transposed_df, as I said earlier, Narwhals doesn't have transpose, but there may be a worke-around.

DeaMariaLeon · 2024-12-02T08:01:55Z

Just to be clear @mwouts: if we add Narwhals to handle all the libraries (without keeping the pandas' path separately), all the test results will be like Polars'.

mwouts · 2024-12-02T22:40:43Z

Just to be clear @mwouts: if we add Narwhals to handle all the libraries (without keeping the pandas' path separately), all the test results will be like Polars'.

Yes, that would make sense. The reason why you see only 6 digits for Pandas DataFrames is that ITables does use the pandas formatters. That's an attempt to give a rendering that is similar to what the users get when they don't use ITables. Of course I could think of other ways, e.g. apply the formatting on the Javascript side directly (but then passing numbers with a higher precision would also make the notebooks heavier).

Maybe it's less impactful for now to leave the Pandas part as is? We might have enough to do for now with the estimated size and the downsampling function, plus testing all the various table types. Are you able to run the documentation notebooks? See this paragraph and the next one.

DeaMariaLeon · 2024-12-04T11:17:10Z

It seems that all the Jupyter notebook examples run correctly, @mwouts. I run the ones in the link "paragraphs" you mentioned, and the ones in the Advance parameters. But I don't know how to share them here - export to pdf didn't work and Jupiter files can't be attached.

mwouts · 2024-12-07T12:49:39Z

It seems that all the Jupyter notebook examples run correctly, @mwouts. I run the ones in the link "paragraphs" you mentioned, and the ones in the Advance parameters. But I don't know how to share them here - export to pdf didn't work and Jupiter files can't be attached.

Hi @DeaMariaLeon , yes sure, it's more for you to see how changes you make might change how the dataframes are rendered - no need to share the notebooks, I can checkout your branch and rerun them locally.

Re your PR, I would love to see it work for actual modin dataframes, see my PR #341 . It currently works for the dataframes that don't need to be downsampled.

The downsampling is very important for ITables because the table data gets embedded in the notebook (I don't have anything like a server mode at the moment), and if we are not careful about this we can make the notebook huge and crash the web browser.

To make the downsampling work (https://github.com/mwouts/itables/blob/main/src/itables/downsample.py) I would need to figure out to do this with narwhals:

how to estimate the size of a dataframe
how to keep only the first and last lines of a dataframe (and first and last columns too)

Let me know if you have any hints re the above! Thanks!

DeaMariaLeon · 2024-12-07T17:59:39Z

Hi @mwouts, we'll be adding "estimated_size" to Narwhals in order to implement downsampling.
Besides that, I think that we have everything we need (in Narwhals) already.

DeaMariaLeon · 2024-12-14T17:22:46Z

Hi @mwouts, we have now estimated_size in Narwhals. You don't need nbytes() any more. And you can select the units you need, so as_nbytes doesn't need to convert.

I just saw on the PR you experimented with Modin (the one on top of this one), you use series. I wonder if we should add that too. Our new function works for dataframes only.

Should I open a separate PR to (try to) modify downsample.py.. unless you want to do it yourself. If it's the later, I'll share how to select the last and first rows and columns.

mwouts · 2024-12-15T13:35:01Z

Hi @mwouts, we have now estimated_size in Narwhals. You don't need nbytes() any more. And you can select the units you need, so as_nbytes doesn't need to convert.

Thank you @DeaMariaLeon , that's awesome! That's in the last version of narwhals==1.18.3 then, correct?

I just saw on the PR you experimented with Modin (the one on top of this one), you use series. I wonder if we should add that too. Our new function works for dataframes only.

Yes ITables user can pass a Series to show, but when that happens I convert it to a dataframe with a single column. I will need to figure out how to do this in all generality (possibly with .to_frame()...), but to answer your question about estimated_size I won't need it for series.

Should I open a separate PR to (try to) modify downsample.py.. unless you want to do it yourself. If it's the later, I'll share how to select the last and first rows and columns.

Yes thanks for offering I think I will do try on my end. I am looking at the narwhals documentation and I have found already head, tail and concat for the row selection, and select for the column selection, let me see how it goes.

DeaMariaLeon · 2024-12-15T14:16:48Z

That's in the last version of narwhals==1.18.3 then, correct?

Yes

thanks for offering I think I will do try on my end

Very good - you'll be much faster!

DeaMariaLeon added 2 commits November 29, 2024 15:33

wip

d9d03d2

Adding Narwhals

14b8d22

MarcoGorelli reviewed Nov 29, 2024

View reviewed changes

DeaMariaLeon mentioned this pull request Dec 1, 2024

Perf improvement on Polars #340

Merged

mwouts mentioned this pull request Dec 1, 2024

Use Narwhals to support Polars and Modin DataFrames #341

Open

added eager_only=True

761ab12

DeaMariaLeon mentioned this pull request Dec 7, 2024

[Enh]: Add estimated_size narwhals-dev/narwhals#1535

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: Adding Narwhals as compatibility layer between libraries #339

WIP: Adding Narwhals as compatibility layer between libraries #339

DeaMariaLeon commented Nov 29, 2024

github-actions bot commented Nov 29, 2024

MarcoGorelli left a comment

MarcoGorelli Nov 29, 2024

MarcoGorelli Nov 29, 2024

DeaMariaLeon Dec 1, 2024

codecov-commenter commented Nov 30, 2024 •

edited

Loading

mwouts commented Nov 30, 2024 •

edited

Loading

DeaMariaLeon commented Dec 1, 2024

DeaMariaLeon commented Dec 2, 2024

mwouts commented Dec 2, 2024

DeaMariaLeon commented Dec 4, 2024

mwouts commented Dec 7, 2024

DeaMariaLeon commented Dec 7, 2024

DeaMariaLeon commented Dec 14, 2024

mwouts commented Dec 15, 2024

DeaMariaLeon commented Dec 15, 2024

WIP: Adding Narwhals as compatibility layer between libraries #339

Are you sure you want to change the base?

WIP: Adding Narwhals as compatibility layer between libraries #339

Conversation

DeaMariaLeon commented Nov 29, 2024

github-actions bot commented Nov 29, 2024

MarcoGorelli left a comment

Choose a reason for hiding this comment

MarcoGorelli Nov 29, 2024

Choose a reason for hiding this comment

MarcoGorelli Nov 29, 2024

Choose a reason for hiding this comment

DeaMariaLeon Dec 1, 2024

Choose a reason for hiding this comment

codecov-commenter commented Nov 30, 2024 • edited Loading

Codecov Report

mwouts commented Nov 30, 2024 • edited Loading

DeaMariaLeon commented Dec 1, 2024

DeaMariaLeon commented Dec 2, 2024

mwouts commented Dec 2, 2024

DeaMariaLeon commented Dec 4, 2024

mwouts commented Dec 7, 2024

DeaMariaLeon commented Dec 7, 2024

DeaMariaLeon commented Dec 14, 2024

mwouts commented Dec 15, 2024

DeaMariaLeon commented Dec 15, 2024

codecov-commenter commented Nov 30, 2024 •

edited

Loading

mwouts commented Nov 30, 2024 •

edited

Loading