What happened: Fastparquet raises a ValueError when attempting to load a partitioned Parquet dataset whose partition column contains None values:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_9855/558527708.py in <cell line: 2>()
      1 from fastparquet import ParquetFile
----> 2 ParquetFile('people-partitioned.parquet').to_pandas()

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/fastparquet/api.py in to_pandas(self, columns, categories, filters, index, row_filter)
    768                          else v[start:start + thislen])
    769                      for (name, v) in views.items()}
--> 770             self.read_row_group_file(rg, columns, categories, index,
    771                                      assign=parts, partition_meta=self.partition_meta,
    772                                      row_filter=sel, infile=infile)

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/fastparquet/api.py in read_row_group_file(self, rg, columns, categories, index, assign, partition_meta, row_filter, infile)
    372         f = infile or self.open(fn, mode='rb')
    373
--> 374         core.read_row_group(
    375             f, rg, columns, categories, self.schema, self.cats,
    376             selfmade=self.selfmade, index=index,

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/fastparquet/core.py in read_row_group(file, rg, columns, categories, schema_helper, cats, selfmade, index, assign, scheme, partition_meta, row_filter)
    620         key, val = [p for p in partitions if p[0] == cat][0]
    621         val = val_to_num(val, meta=partition_meta.get(key))
--> 622         assign[cat][:] = cats[cat].index(val)

ValueError: 25 is not in list
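The last frame is a plain list lookup: the value decoded from the partition directory name is searched for in the list of known category values, and list.index raises when it is absent. A minimal standalone illustration of the same failure (hypothetical values, not fastparquet's actual data structures):

# A hypothetical reduction of the failing line
#     assign[cat][:] = cats[cat].index(val)
# list.index raises ValueError when the decoded partition value is not
# among the category values gathered from the directory names.
cats = {'Age': [20, '__HIVE_DEFAULT_PARTITION__']}  # assumed contents
val = 25
cats['Age'].index(val)  # ValueError: 25 is not in list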
What you expected to happen: Fastparquet loads the partitioned dataset with the same result as the unpartitioned dataset:
   Age  Name
0   20  John
1   25   Joe
2        Jane
Minimal Complete Verifiable Example:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(20, 'John'), (25, 'Joe'), (None, 'Jane')], ['Age', 'Name'])
df.write.parquet('people.parquet')
df.write.parquet('people-partitioned.parquet', partitionBy='Age')

from fastparquet import ParquetFile
ParquetFile('people-partitioned.parquet').to_pandas()
Anything else we need to know?:
Environment:
Unfortunately, this loads fine for me:
   Name                         Age
0  John                          20
1   Joe                          25
2  Jane  __HIVE_DEFAULT_PARTITION__
except that we do not recognise the very special value that Spark has used to represent None.
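If the dataset does load as above, one possible post-processing step (a sketch, not a fastparquet feature; it assumes the sentinel string survives into the loaded frame) is to coerce the partition column back to numeric so that the sentinel becomes NaN:

import pandas as pd
from fastparquet import ParquetFile

pdf = ParquetFile('people-partitioned.parquet').to_pandas()
# Partition columns come back as categoricals; going via str lets
# pd.to_numeric keep 20 and 25 while mapping
# '__HIVE_DEFAULT_PARTITION__' to NaN.
pdf['Age'] = pd.to_numeric(pdf['Age'].astype(str), errors='coerce')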
What version of Spark do you have, and did it make a folder structure with names different from the following?

Age=20/
Age=25/
Age=__HIVE_DEFAULT_PARTITION__/
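A quick way to check (a sketch, using the paths from the example above):

import os

# Print the partition directories Spark wrote; the expectation is
# Age=20, Age=25 and Age=__HIVE_DEFAULT_PARTITION__.
print(sorted(d for d in os.listdir('people-partitioned.parquet')
             if d.startswith('Age=')))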
(I should have said, my successful run was with pyspark 3.1.2)