-
-
Notifications
You must be signed in to change notification settings - Fork 178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
iter_row_groups raises KeyError when dots in column names and data is not primitive #935
Comments
Does the same data load via to_pandas, versus iterating? |
Hi @martindurant , thank you for the very fast reply ⚡ Same error with file.to_pandas() yes def test_load_problematic_data_with_to_pandas(self, tmp_path: Path) -> None:
"""Test that column names with dots are causing issues when cell contents are not primitives."""
df = polars.DataFrame(
{
"foo.with.strings": ["hey", "there", None],
"foo.with.ints": [1, 2, 3],
"foo.with.lists": [[], [], []],
}
)
df.write_parquet(tmp_path / "output_table.parquet")
file = fastparquet.ParquetFile(tmp_path / "output_table.parquet")
with pytest.raises(KeyError):
pandas_df = file.to_pandas() # this raises
assert pandas_df.shape == (3, 6) the error (when removing the pytest.raises)
|
After some debugging, it seems that the argument passed to |
I'll look into it. Certainly, it is a sloppy convention that "columns" are specified both by ["path", "subpath"] and "path.subpath", where the components could in theory contain "." characters. Interesting that you are using polars to generate the data. Can I ask what use fastparquet is for you? |
good point ^^ im using and the frictionless-py plugin for parquet uses fastparquet. and, the dots in column names are not in my control atm, defined upstream. as a workaround, i might override the parquet plugin using polars since it is in my dependencies. |
This actually does load on |
you mean this works? df = polars.DataFrame(
{
"foo.with.strings": ["hey", "there", None],
"foo.with.ints": [1, 2, 3],
"foo.with.lists": [[], [], []],
}
)
df.write_parquet(tmp_path / "output_table.parquet")
file = fastparquet.ParquetFile(tmp_path / "output_table.parquet")
# no error here 👇
pandas_df = file.to_pandas() with the unit test i could also reproduce with non empty lists "foo.with.lists": [["a"], ["b"], ["c"]], |
Correct, if you include data in foo.with.lists (I used [[0], [0], [0]]), it loads. I don't immediately see a PR that might be responsible for that, but many deps have also updated a lot in the meantime, particularly pandas. |
ok good news then it probably means the next patch release will fix this? |
Maybe indeed, although I think the lists came out wrong anyway :| In fact, fastparquet will stop supporting pandas at all, in favour of pure numpy. Output will be an iterator over row-groups, and dictionaries giving the positions in the schema or light structured wrapper, something like:
This will provide FAST operations for people who don't need pandas and are happy to unwrap the arrays themselves. Release will coincide with pandas making pyarrow a hard dependency. |
👌 sounds nice since it will mean less dependencies :) thank you for testing and making sure there was indeed sthg to fix 🙏 please let me know if I could help in any way. im very new to this repo so i dont know the code but still |
TL;DR
Column names with dots are badly handled when the column data is not a primitive, e.g.
List[str]
.-> It wrongly tries to find a key named
foo
if the column is"foo.with.lists": [[], [], []],
Describe the issue:
Here is a unit test that shows the difference of behaviour and the reproduced bug
Anything else we need to know?:
This is heavily used by frictionless-py here
Environment:
The text was updated successfully, but these errors were encountered: