Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

iter_row_groups raises KeyError when dots in column names and data is not primitive #935

Open
adrienDog opened this issue Sep 10, 2024 · 11 comments

Comments

@adrienDog
Copy link

TL;DR
Column names with dots are badly handled when the column data is not a primitive, e.g. List[str].

-> It wrongly tries to find a key named foo if the column is "foo.with.lists": [[], [], []],

Describe the issue:
Here is a unit test that shows the difference of behaviour and the reproduced bug

import fastparquet
import polars

import pytest


class TestReproduceGroupKeyErrorForNamesWithDots:
    def test_valid_data(self, tmp_path: Path) -> None:
        """Test that column names with dots are not causing issues when cell contents are primitives."""
        df = polars.DataFrame(
            {
                "foo.with.strings": ["hey", "there", None],
                "foo.with.ints": [1, 2, 3],
            }
        )
        df.write_parquet(tmp_path / "output.parquet")

        file = fastparquet.ParquetFile(tmp_path / "output.parquet")

        d = dict(enumerate(file.iter_row_groups(), start=1))
        assert len(d) == 1

    def test_reproduce_error(self, tmp_path: Path) -> None:
        """Test that column names with dots are causing issues when cell contents are not primitives."""
        df = polars.DataFrame(
            {
                "foo.with.strings": ["hey", "there", None],
                "foo.with.ints": [1, 2, 3],
                "foo.with.lists": [[], [], []],
            }
        )
        df.write_parquet(tmp_path / "output.parquet")

        file = fastparquet.ParquetFile(tmp_path / "output.parquet")

        with pytest.raises(KeyError): # for key 'foo'
            d = dict(enumerate(file.iter_row_groups(), start=1))
            assert len(d) == 1

Anything else we need to know?:
This is heavily used by frictionless-py here

Environment:

  • Python version: Python 3.9.18
  • Operating System: macos
  • Install method (conda, pip, source): rye (pip)
@martindurant
Copy link
Member

Does the same data load via to_pandas, versus iterating?

@adrien-owkin
Copy link

adrien-owkin commented Sep 10, 2024

Hi @martindurant , thank you for the very fast reply ⚡

Same error with file.to_pandas() yes

    def test_load_problematic_data_with_to_pandas(self, tmp_path: Path) -> None:
        """Test that column names with dots are causing issues when cell contents are not primitives."""
        df = polars.DataFrame(
            {
                "foo.with.strings": ["hey", "there", None],
                "foo.with.ints": [1, 2, 3],
                "foo.with.lists": [[], [], []],
            }
        )
        df.write_parquet(tmp_path / "output_table.parquet")

        file = fastparquet.ParquetFile(tmp_path / "output_table.parquet")

        with pytest.raises(KeyError):
            pandas_df  = file.to_pandas() # this raises
            assert pandas_df.shape == (3, 6)

the error (when removing the pytest.raises)


    def schema_element(self, name):
        """Get the schema element with the given name or path"""
        root = self.root
        if isinstance(name, str):
            name = name.split('.')
        for part in name:
>           root = root["children"][part]
E           KeyError: 'foo'

@adrien-owkin
Copy link

After some debugging, it seems that the argument passed to schema_element is always a list e.g. ['foo.with.strings'] except for columns with non primitive types where the name is simply a string hence is split by .

@martindurant
Copy link
Member

I'll look into it. Certainly, it is a sloppy convention that "columns" are specified both by ["path", "subpath"] and "path.subpath", where the components could in theory contain "." characters.

Interesting that you are using polars to generate the data. Can I ask what use fastparquet is for you?

@adrien-owkin
Copy link

good point ^^

im using frictionless-py to validate a datapackage (see datapackage.org specs).

and the frictionless-py plugin for parquet uses fastparquet. and, the dots in column names are not in my control atm, defined upstream.

as a workaround, i might override the parquet plugin using polars since it is in my dependencies.

@martindurant
Copy link
Member

This actually does load on main, almost IF the array-of-lists column contains some data. Otherwise it hits a point of trying to decompress a zero-length buffer.

@adrien-owkin
Copy link

you mean this works?

        df = polars.DataFrame(
            {
                "foo.with.strings": ["hey", "there", None],
                "foo.with.ints": [1, 2, 3],
                "foo.with.lists": [[], [], []],
            }
        )
        df.write_parquet(tmp_path / "output_table.parquet")

        file = fastparquet.ParquetFile(tmp_path / "output_table.parquet")

        # no error here 👇 
        pandas_df  = file.to_pandas()

with the unit test i could also reproduce with non empty lists

"foo.with.lists": [["a"], ["b"], ["c"]],

@martindurant
Copy link
Member

Correct, if you include data in foo.with.lists (I used [[0], [0], [0]]), it loads. I don't immediately see a PR that might be responsible for that, but many deps have also updated a lot in the meantime, particularly pandas.

@adrienDog
Copy link
Author

adrienDog commented Sep 10, 2024

ok good news then it probably means the next patch release will fix this?

@martindurant
Copy link
Member

martindurant commented Sep 10, 2024

Maybe indeed, although I think the lists came out wrong anyway :|

In fact, fastparquet will stop supporting pandas at all, in favour of pure numpy. Output will be an iterator over row-groups, and dictionaries giving the positions in the schema or light structured wrapper, something like:

{0: {
  'foo.with.strings-data': array([0, 1, -1], dtype=int8),
  'foo.with.strings-cats': ["hey", "there"],
  'foo.with.ints-data': array([1, 2, 3], dtype=uint8),
  'foo.with.lists.list-offsets': array([0, 1, 2, 3]),
  'foo.with.lists.list.element-data': array([0, 0, 0], dtype=uint8),
  'foo.with.lists.list.element-cats': [0]}
}

This will provide FAST operations for people who don't need pandas and are happy to unwrap the arrays themselves. Release will coincide with pandas making pyarrow a hard dependency.

@adrienDog
Copy link
Author

👌 sounds nice since it will mean less dependencies :)

thank you for testing and making sure there was indeed sthg to fix 🙏 please let me know if I could help in any way. im very new to this repo so i dont know the code but still

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants