iter_row_groups raises KeyError when dots in column names and data is not primitive #935

adrienDog · 2024-09-10T12:00:15Z

TL;DR
Column names with dots are badly handled when the column data is not a primitive, e.g. List[str].

-> It wrongly tries to find a key named foo if the column is "foo.with.lists": [[], [], []],

Describe the issue:
Here is a unit test that shows the difference of behaviour and the reproduced bug

import fastparquet
import polars

import pytest


class TestReproduceGroupKeyErrorForNamesWithDots:
    def test_valid_data(self, tmp_path: Path) -> None:
        """Test that column names with dots are not causing issues when cell contents are primitives."""
        df = polars.DataFrame(
            {
                "foo.with.strings": ["hey", "there", None],
                "foo.with.ints": [1, 2, 3],
            }
        )
        df.write_parquet(tmp_path / "output.parquet")

        file = fastparquet.ParquetFile(tmp_path / "output.parquet")

        d = dict(enumerate(file.iter_row_groups(), start=1))
        assert len(d) == 1

    def test_reproduce_error(self, tmp_path: Path) -> None:
        """Test that column names with dots are causing issues when cell contents are not primitives."""
        df = polars.DataFrame(
            {
                "foo.with.strings": ["hey", "there", None],
                "foo.with.ints": [1, 2, 3],
                "foo.with.lists": [[], [], []],
            }
        )
        df.write_parquet(tmp_path / "output.parquet")

        file = fastparquet.ParquetFile(tmp_path / "output.parquet")

        with pytest.raises(KeyError): # for key 'foo'
            d = dict(enumerate(file.iter_row_groups(), start=1))
            assert len(d) == 1

Anything else we need to know?:
This is heavily used by frictionless-py here

Environment:

Python version: Python 3.9.18
Operating System: macos
Install method (conda, pip, source): rye (pip)

The text was updated successfully, but these errors were encountered:

martindurant · 2024-09-10T13:17:00Z

Does the same data load via to_pandas, versus iterating?

adrien-owkin · 2024-09-10T13:34:24Z

Hi @martindurant , thank you for the very fast reply ⚡

Same error with file.to_pandas() yes

    def test_load_problematic_data_with_to_pandas(self, tmp_path: Path) -> None:
        """Test that column names with dots are causing issues when cell contents are not primitives."""
        df = polars.DataFrame(
            {
                "foo.with.strings": ["hey", "there", None],
                "foo.with.ints": [1, 2, 3],
                "foo.with.lists": [[], [], []],
            }
        )
        df.write_parquet(tmp_path / "output_table.parquet")

        file = fastparquet.ParquetFile(tmp_path / "output_table.parquet")

        with pytest.raises(KeyError):
            pandas_df  = file.to_pandas() # this raises
            assert pandas_df.shape == (3, 6)

the error (when removing the pytest.raises)


    def schema_element(self, name):
        """Get the schema element with the given name or path"""
        root = self.root
        if isinstance(name, str):
            name = name.split('.')
        for part in name:
>           root = root["children"][part]
E           KeyError: 'foo'

adrien-owkin · 2024-09-10T13:36:09Z

After some debugging, it seems that the argument passed to schema_element is always a list e.g. ['foo.with.strings'] except for columns with non primitive types where the name is simply a string hence is split by .

martindurant · 2024-09-10T13:47:10Z

I'll look into it. Certainly, it is a sloppy convention that "columns" are specified both by ["path", "subpath"] and "path.subpath", where the components could in theory contain "." characters.

Interesting that you are using polars to generate the data. Can I ask what use fastparquet is for you?

adrien-owkin · 2024-09-10T13:56:26Z

good point ^^

im using frictionless-py to validate a datapackage (see datapackage.org specs).

and the frictionless-py plugin for parquet uses fastparquet. and, the dots in column names are not in my control atm, defined upstream.

as a workaround, i might override the parquet plugin using polars since it is in my dependencies.

martindurant · 2024-09-10T13:58:43Z

This actually does load on main, almost IF the array-of-lists column contains some data. Otherwise it hits a point of trying to decompress a zero-length buffer.

adrien-owkin · 2024-09-10T14:08:24Z

you mean this works?

        df = polars.DataFrame(
            {
                "foo.with.strings": ["hey", "there", None],
                "foo.with.ints": [1, 2, 3],
                "foo.with.lists": [[], [], []],
            }
        )
        df.write_parquet(tmp_path / "output_table.parquet")

        file = fastparquet.ParquetFile(tmp_path / "output_table.parquet")

        # no error here 👇 
        pandas_df  = file.to_pandas()

with the unit test i could also reproduce with non empty lists

"foo.with.lists": [["a"], ["b"], ["c"]],

martindurant · 2024-09-10T14:13:23Z

Correct, if you include data in foo.with.lists (I used [[0], [0], [0]]), it loads. I don't immediately see a PR that might be responsible for that, but many deps have also updated a lot in the meantime, particularly pandas.

adrienDog · 2024-09-10T14:44:09Z

ok good news then it probably means the next patch release will fix this?

martindurant · 2024-09-10T14:58:24Z

Maybe indeed, although I think the lists came out wrong anyway :|

In fact, fastparquet will stop supporting pandas at all, in favour of pure numpy. Output will be an iterator over row-groups, and dictionaries giving the positions in the schema or light structured wrapper, something like:

{0: {
  'foo.with.strings-data': array([0, 1, -1], dtype=int8),
  'foo.with.strings-cats': ["hey", "there"],
  'foo.with.ints-data': array([1, 2, 3], dtype=uint8),
  'foo.with.lists.list-offsets': array([0, 1, 2, 3]),
  'foo.with.lists.list.element-data': array([0, 0, 0], dtype=uint8),
  'foo.with.lists.list.element-cats': [0]}
}

This will provide FAST operations for people who don't need pandas and are happy to unwrap the arrays themselves. Release will coincide with pandas making pyarrow a hard dependency.

adrienDog · 2024-09-10T15:08:22Z

👌 sounds nice since it will mean less dependencies :)

thank you for testing and making sure there was indeed sthg to fix 🙏 please let me know if I could help in any way. im very new to this repo so i dont know the code but still

yohplala mentioned this issue Sep 11, 2024

pure-numpy interface to parquet #931

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

iter_row_groups raises KeyError when dots in column names and data is not primitive #935

iter_row_groups raises KeyError when dots in column names and data is not primitive #935

adrienDog commented Sep 10, 2024

martindurant commented Sep 10, 2024

adrien-owkin commented Sep 10, 2024 •

edited

Loading

adrien-owkin commented Sep 10, 2024

martindurant commented Sep 10, 2024

adrien-owkin commented Sep 10, 2024

martindurant commented Sep 10, 2024

adrien-owkin commented Sep 10, 2024

martindurant commented Sep 10, 2024

adrienDog commented Sep 10, 2024 •

edited

Loading

martindurant commented Sep 10, 2024 •

edited

Loading

adrienDog commented Sep 10, 2024

iter_row_groups raises KeyError when dots in column names and data is not primitive #935

iter_row_groups raises KeyError when dots in column names and data is not primitive #935

Comments

adrienDog commented Sep 10, 2024

martindurant commented Sep 10, 2024

adrien-owkin commented Sep 10, 2024 • edited Loading

adrien-owkin commented Sep 10, 2024

martindurant commented Sep 10, 2024

adrien-owkin commented Sep 10, 2024

martindurant commented Sep 10, 2024

adrien-owkin commented Sep 10, 2024

martindurant commented Sep 10, 2024

adrienDog commented Sep 10, 2024 • edited Loading

martindurant commented Sep 10, 2024 • edited Loading

adrienDog commented Sep 10, 2024

adrien-owkin commented Sep 10, 2024 •

edited

Loading

adrienDog commented Sep 10, 2024 •

edited

Loading

martindurant commented Sep 10, 2024 •

edited

Loading