Potential Parquet File Metadata Corruption After Process Timeout #879
Comments
Indeed, the metadata should have more than just b"PAR1", which is just the magic marker saying this is a parquet file. The utility function |
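To illustrate why a file holding only the magic marker cannot be usable metadata: a Parquet file starts with PAR1 and ends with a 4-byte little-endian footer length followed by PAR1 again, with the serialized metadata sitting just before that length. The sketch below only checks this outer layout. The function name is hypothetical, it is not part of fastparquet, and it does not try to parse the footer itself.

```python
import struct

MAGIC = b"PAR1"


def footer_length(data: bytes):
    """Return the declared footer length, or None if the layout is implausible.

    Hypothetical helper: it checks only the outer framing of the file
    (magic markers and length field), not the thrift-encoded footer.
    """
    # Smallest conceivable file: leading magic + 4-byte length + trailing magic.
    if len(data) < 12:
        return None
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        return None
    # The 4 bytes before the trailing magic hold the footer size (little-endian).
    (length,) = struct.unpack("<I", data[-8:-4])
    # The footer has to fit between the leading magic and the length field.
    if length == 0 or length > len(data) - 12:
        return None
    return length
```

Run against the raw bytes of a _metadata file, a result of None would suggest the write was cut short; a number only means the framing is intact, not that the footer parses.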
I'm going to use fastparquet.writer.merge just to increase my own understanding of how things work for now.
If I have an S3 structure like
transformed/flattened_rates/_common_metadata
transformed/flattened_rates/_metadata
transformed/flattened_rates/year=2023/month=8/part.0.parquet
transformed/flattened_rates/year=2023/month=8/part.1.parquet
transformed/flattened_rates/year=2023/month=8/part.10.parquet
transformed/flattened_rates/year=2023/month=8/part.100.parquet
transformed/flattened_rates/year=2023/month=8/part.101.parquet
transformed/flattened_rates/year=2023/month=8/part.102.parquet
Does that mean I should call
fastparquet.writer.merge(s3_paths, open_with=s3.open, root="transformed/flattened_rates/")
in this particular case? |
Yes, that looks right.
|
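A rough sketch of how the merge call discussed above might be wired up, assuming s3fs provides the filesystem. The helper names, the bucket argument, and the choice to include the bucket in the root are all assumptions for illustration; whether the bucket belongs in the paths depends on the open function being used. The one firm requirement is that the data file paths and the root are written the same way.

```python
def select_data_files(keys, root):
    """Keep only data files under root, skipping the two metadata files.

    Hypothetical helper, not part of fastparquet.
    """
    prefix = root.rstrip("/") + "/"
    skip = {"_metadata", "_common_metadata"}
    return sorted(
        k for k in keys
        if k.startswith(prefix) and k.rsplit("/", 1)[-1] not in skip
    )


def rebuild_metadata(bucket, root):
    """Regenerate _metadata and _common_metadata from the existing data files."""
    # Imported lazily so select_data_files can be used without these packages.
    import s3fs
    from fastparquet.writer import merge

    s3 = s3fs.S3FileSystem()
    # Assumption: paths listed by s3fs look like "bucket/key", so the root
    # passed to merge carries the bucket as well.
    dataset = f"{bucket}/{root.rstrip('/')}"
    keys = s3.find(dataset)
    return merge(select_data_files(keys, dataset), open_with=s3.open, root=dataset)
```

Leaving the old _metadata and _common_metadata out of the file list matters here, since a truncated _metadata is exactly the file that fails to parse.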
I originally was hitting
Traceback (most recent call last):
File "/Users/alexlordthorsen/git/rates/flatten-rates-etl/scripts/recover_corrupted_metadata_file.py", line 47, in <module>
app()
File "/Users/alexlordthorsen/.venvs/data_platform_39/lib/python3.9/site-packages/typer/main.py", line 328, in __call__
raise e
File "/Users/alexlordthorsen/.venvs/data_platform_39/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/Users/alexlordthorsen/.venvs/data_platform_39/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/Users/alexlordthorsen/.venvs/data_platform_39/lib/python3.9/site-packages/typer/core.py", line 716, in main
return _main(
File "/Users/alexlordthorsen/.venvs/data_platform_39/lib/python3.9/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/Users/alexlordthorsen/.venvs/data_platform_39/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/alexlordthorsen/.venvs/data_platform_39/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/Users/alexlordthorsen/.venvs/data_platform_39/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/Users/alexlordthorsen/git/rates/flatten-rates-etl/scripts/recover_corrupted_metadata_file.py", line 39, in main
attempt_recovery(parquet_file_keys, metadata_root=s3_key)
File "/Users/alexlordthorsen/git/rates/flatten-rates-etl/scripts/recover_corrupted_metadata_file.py", line 23, in attempt_recovery
merge(object_paths, open_with=myopen, root=metadata_root)
File "/Users/alexlordthorsen/.venvs/data_platform_39/lib/python3.9/site-packages/fastparquet/writer.py", line 1465, in merge
out = ParquetFile(file_list, verify_schema, open_with, root)
File "/Users/alexlordthorsen/.venvs/data_platform_39/lib/python3.9/site-packages/fastparquet/api.py", line 124, in __init__
basepath, fmd = metadata_from_many(fn, verify_schema=verify,
File "/Users/alexlordthorsen/.venvs/data_platform_39/lib/python3.9/site-packages/fastparquet/util.py", line 217, in metadata_from_many
basepath, file_list = analyse_paths(file_list, root=root)
File "/Users/alexlordthorsen/.venvs/data_platform_39/lib/python3.9/site-packages/fastparquet/util.py", line 368, in analyse_paths
assert all(p[:l] == basepath for p in path_parts_list
AssertionError: All paths must begin with the given root
with a
I think I have this working
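For anyone hitting the same assertion, one way to find the offending paths before calling merge is to compare path components against the root. This is a hypothetical pre-check written for illustration; it approximates the idea behind the assertion and is not fastparquet's own code.

```python
def paths_outside_root(paths, root):
    """Return the paths whose leading components do not match root.

    Hypothetical pre-check: splits on "/" and ignores empty components,
    so a trailing slash on root makes no difference here.
    """
    root_parts = [part for part in root.split("/") if part]
    n = len(root_parts)
    return [
        p for p in paths
        if [part for part in p.split("/") if part][:n] != root_parts
    ]
```

A typical mismatch is a list of paths that carry a bucket name in front while the root does not, or the other way round.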
@martindurant would you be open to a PR that changes these TypeErrors into specific error types with more specific messages and a doc change to writer.merge to explain that |
Yes, catching that error would be fine. It essentially means parsing failed. |
Hmmmm, I'm still hitting this error even after a rebuild. I'm now re-examining this error in my code
Looking at the timing of when the error is restarting and when I'm seeing these logs, I'm pretty sure this is my core issue. I originally grabbed this idea from here |
I guess the solution is to write the metadata file only once, when everything else is done, as dask does. Or maybe remove it... |
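The ordering suggested above can be sketched as follows. The function and its two callbacks are hypothetical placeholders for whatever actually writes the data files and the metadata; the only point is that the metadata step runs last, so a timeout or error during the data writes leaves any existing metadata file untouched instead of half written.

```python
def write_dataset(parts, write_part, write_metadata):
    """Write every data part first, then the metadata exactly once.

    parts: iterable of (name, payload) pairs.
    write_part, write_metadata: caller-supplied callables (hypothetical).
    """
    written = []
    for name, payload in parts:
        # If this raises, or the process dies here, write_metadata never runs.
        write_part(name, payload)
        written.append(name)
    # Only reached after all parts were written successfully.
    write_metadata(written)
    return written
```

This does not protect against a timeout during the metadata write itself, which is where a recovery step that rebuilds the metadata from the data files would still be needed.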
Describe the issue:
After hitting a timeout in an AWS Lambda I'm no longer able to read from a parquet file. I'm hitting this stack trace
and I believe my _metadata file is corrupted. I suspect we hit the timeout in the middle of the write command and potentially in the metadata write.
I attempted to upload the files for this report but GitHub won't allow binary file uploads.
The contents of _metadata are just
PAR1%
which feels incorrect, but I don't know enough about parquet to know without digging deeper into the standard.
The _common_metadata file looks more correct to me (I have a couple hundred columns in my case so I'm not going to post it here unless that's required, but a sample from the end of the file looks like
Minimal Complete Verifiable Example:
Anything else we need to know?:
I think I'm looking for some way to recover this metadata file, either on the fly as part of my write process or as a script I can run manually.
I plan to avoid timeouts like this in the future but I'm wondering if there's anything I can do to help ensure I don't hit this issue again.
Environment: