This repository has been archived by the owner on Feb 18, 2024. It is now read-only.
I am running into an issue where reading Parquet int96 timestamps into arrow2 timestamp[ns] arrays can potentially overflow silently, providing wrong results.
This issue was also noted in pyarrow/arrow-cpp ARROW-12096.
Here is a quick example:
First, write a Parquet file with int96 timestamps, where some timestamps are out of range for the timestamp[ns] type:
import pyarrow as pa
import pyarrow.parquet as papq
import datetime

# Use PyArrow to write Parquet files with int96 timestamps
table = pa.Table.from_pydict({
    "timestamps": pa.array([
        datetime.datetime(1000, 1, 1),
        datetime.datetime(2000, 1, 1),
        datetime.datetime(3000, 1, 1),
    ], pa.timestamp("ms"))
})
papq.write_table(table, "timestamps.parquet", use_deprecated_int96_timestamps=True, store_schema=False)
Reading this file in a unit test results in an overflow panic:
---- io::parquet::read::read_int96_timestamps stdout ----
thread 'io::parquet::read::read_int96_timestamps' panicked at 'attempt to multiply with overflow', /Users/jaychia/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet2-0.17.2/src/types.rs:112:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
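The arithmetic behind the overflow can be sketched in a few lines of Python. This is an illustration only, not the parquet2 source: it assumes the usual int96 layout (a Julian day number plus nanoseconds within that day), and the helper names `int96_to_ns`, `fits_in_i64`, `wrap_to_i64`, and `to_julian_day` are made up for this example.

```python
import datetime

JULIAN_DAY_OF_EPOCH = 2_440_588          # Julian day number of 1970-01-01
NANOS_PER_DAY = 86_400 * 1_000_000_000
I64_MIN, I64_MAX = -(2**63), 2**63 - 1


def to_julian_day(dt: datetime.datetime) -> int:
    """Julian day number for the calendar date of `dt` (hypothetical helper)."""
    return (dt.date() - datetime.date(1970, 1, 1)).days + JULIAN_DAY_OF_EPOCH


def int96_to_ns(julian_day: int, nanos_of_day: int) -> int:
    """Nanoseconds since the Unix epoch, as an unbounded Python int."""
    return (julian_day - JULIAN_DAY_OF_EPOCH) * NANOS_PER_DAY + nanos_of_day


def fits_in_i64(value: int) -> bool:
    """True if `value` is representable as a signed 64-bit integer."""
    return I64_MIN <= value <= I64_MAX


def wrap_to_i64(value: int) -> int:
    """Two's-complement wraparound, i.e. what unchecked i64 arithmetic yields."""
    return (value + 2**63) % 2**64 - 2**63


for year in (1000, 2000, 3000):
    ns = int96_to_ns(to_julian_day(datetime.datetime(year, 1, 1)), 0)
    status = "ok" if fits_in_i64(ns) else "overflows i64"
    print(year, status, "wrapped value:", wrap_to_i64(ns))
```

Only the year-2000 value fits in an i64 nanosecond count. With overflow checks enabled the multiplication panics as shown above; with wrapping arithmetic the other two values would come back as plausible-looking but wrong timestamps.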
Solution
I would like to propose a two-part solution:
Parquet SchemaInferenceOptions, which will allow users to specify how they want arrow2 to infer the Arrow types for Parquet Int96 types (see PR: Add SchemaInferenceOptions options to infer_schema and option to configure int96 inference #1533).
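To illustrate why letting the user pick the inferred time unit helps, here is a small Python sketch. It is not arrow2's API: the function `int96_to_unit`, the helper `julian_day_of`, and the `UNITS_PER_SECOND` table are hypothetical, and the int96 layout (Julian day plus nanoseconds within the day) is assumed.

```python
import datetime

JULIAN_DAY_OF_EPOCH = 2_440_588          # Julian day number of 1970-01-01
SECONDS_PER_DAY = 86_400
NANOS_PER_SECOND = 1_000_000_000
I64_MIN, I64_MAX = -(2**63), 2**63 - 1
UNITS_PER_SECOND = {"s": 1, "ms": 1_000, "us": 1_000_000, "ns": 1_000_000_000}


def julian_day_of(d: datetime.date) -> int:
    """Julian day number for a calendar date (hypothetical helper)."""
    return (d - datetime.date(1970, 1, 1)).days + JULIAN_DAY_OF_EPOCH


def int96_to_unit(julian_day: int, nanos_of_day: int, unit: str) -> int:
    """Convert an int96 timestamp to a count of `unit` since the Unix epoch.

    Raises OverflowError instead of wrapping when the result does not
    fit in a signed 64-bit integer. Sub-unit precision is truncated.
    """
    per_second = UNITS_PER_SECOND[unit]
    days = julian_day - JULIAN_DAY_OF_EPOCH
    value = days * SECONDS_PER_DAY * per_second
    value += nanos_of_day // (NANOS_PER_SECOND // per_second)
    if not I64_MIN <= value <= I64_MAX:
        raise OverflowError(f"timestamp does not fit in i64 as {unit!r}")
    return value


day = julian_day_of(datetime.date(3000, 1, 1))
print(int96_to_unit(day, 0, "ms"))       # fits when read as milliseconds
try:
    int96_to_unit(day, 0, "ns")
except OverflowError as err:
    print("ns:", err)
```

Under these assumptions, the year-3000 value from the example file is representable as milliseconds but not as nanoseconds, which is the case a configurable inference option is meant to cover.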