ValueError when loading partitioned dataset with None values #803

Open
Andreas5739738 opened this issue Aug 26, 2022 · 2 comments

@Andreas5739738

What happened: Fastparquet raises a ValueError when attempting to load a partitioned Parquet dataset whose partition column contains None values:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_9855/558527708.py in <cell line: 2>()
      1 from fastparquet import ParquetFile
----> 2 ParquetFile('people-partitioned.parquet').to_pandas()

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/fastparquet/api.py in to_pandas(self, columns, categories, filters, index, row_filter)
    768                             else v[start:start + thislen])
    769                      for (name, v) in views.items()}
--> 770             self.read_row_group_file(rg, columns, categories, index,
    771                                      assign=parts, partition_meta=self.partition_meta,
    772                                      row_filter=sel, infile=infile)

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/fastparquet/api.py in read_row_group_file(self, rg, columns, categories, index, assign, partition_meta, row_filter, infile)
    372         f = infile or self.open(fn, mode='rb')
    373 
--> 374         core.read_row_group(
    375             f, rg, columns, categories, self.schema, self.cats,
    376             selfmade=self.selfmade, index=index,

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/fastparquet/core.py in read_row_group(file, rg, columns, categories, schema_helper, cats, selfmade, index, assign, scheme, partition_meta, row_filter)
    620         key, val = [p for p in partitions if p[0] == cat][0]
    621         val = val_to_num(val, meta=partition_meta.get(key))
--> 622         assign[cat][:] = cats[cat].index(val)

ValueError: 25 is not in list
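
For context, the failing line converts the partition value from its directory-name string via val_to_num and then looks it up in cats[cat], the list of known values for that partition column; the ValueError means the converted value is missing from that list. A purely hypothetical reconstruction of such a mismatch (the real contents of cats are not visible in the traceback):

# Hypothetical: if the known values were kept as strings while val_to_num
# returned an int, the lookup would fail with exactly this message.
cats = {'Age': ['20', '25']}
val = 25                 # val_to_num('25') -> 25
cats['Age'].index(val)   # ValueError: 25 is not in list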

What you expected to happen: Fastparquet loads the partitioned dataset with the same result as the unpartitioned dataset:

   Age  Name
0   20  John
1   25   Joe
2  NaN  Jane

Minimal Complete Verifiable Example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(20, 'John'), (25, 'Joe'), (None, 'Jane')], ['Age', 'Name'])
df.write.parquet('people.parquet')                                  # unpartitioned: loads as expected
df.write.parquet('people-partitioned.parquet', partitionBy='Age')   # partitioned on Age

from fastparquet import ParquetFile
ParquetFile('people-partitioned.parquet').to_pandas()  # raises the ValueError above
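
One quick check (a small sketch using only the standard library; the path is the one from the example above) is to list the partition directories Spark actually wrote, since the failure may depend on how the None row was named on disk:

import os

# Show the Hive-style partition directories written by Spark; the Age=None
# row should land in a specially named folder.
print(sorted(d for d in os.listdir('people-partitioned.parquet') if '=' in d))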

Anything else we need to know?:

Environment:

  • Dask version: 2022.8.1
  • Fastparquet version: 0.8.2
  • Python version: 3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:59:51) [GCC 9.4.0]
  • Operating System: Linux
  • Install method (conda, pip, source): conda
@martindurant
Member

Unfortunately, this loads fine for me:

   Name                         Age
0  John                          20
1   Joe                          25
2  Jane  __HIVE_DEFAULT_PARTITION__

except that we do not recognise the special sentinel value that Spark has used to represent None.

What version of Spark do you have, and did it make a folder structure with names different from:

Age=20/                         Age=25/                         Age=__HIVE_DEFAULT_PARTITION__/
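
Until fastparquet translates this sentinel back to None itself, a possible post-processing sketch (assuming the load succeeds and yields the Age column shown above):

import pandas as pd
from fastparquet import ParquetFile

df = ParquetFile('people-partitioned.parquet').to_pandas()
# Coerce the partition column: numeric values pass through and the
# '__HIVE_DEFAULT_PARTITION__' sentinel becomes NaN.
df['Age'] = pd.to_numeric(df['Age'].astype(object), errors='coerce')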

@martindurant
Member

(I should have said, my successful run was with pyspark 3.1.2)
