fix: Align TimestampUnit and TimestampPrecision in unit tests #11698
Conversation
```diff
@@ -820,13 +832,15 @@ TEST_F(ParquetTableScanTest, timestampInt96Dictionary) {
   WriterOptions options;
   options.writeInt96AsTimestamp = true;
   options.enableDictionary = true;
   options.parquetWriteTimestampUnit = TimestampUnit::kMicro;
```
Int96 consists of days and nanos, so I wonder if it makes sense to write with the default nano precision. That means the precision of the Parquet data is nanoseconds, and Presto and Spark can then choose to read it as millis or micros, correspondingly. Thanks.
Actually, per apache/arrow#36005, we could NOT write a Timestamp in nanoseconds. In fact, if you look at the Timestamp data changes in `testInt96TimestampRead` in this PR, I only specify at most microsecond precision, because if we have `1999-12-08 13:39:26.123456789` while both `parquetWriteTimestampUnit` and the reading precision are in nanoseconds, we could only get `1999-12-08 13:39:26.123456000`, where the nanosecond part gets truncated.
Also, I'm not sure how the dictionary encoding works in Parquet. Even with `enableDictionary` set to `true` in `timestampInt96Dictionary`, it does not call `WriteArrowDictionary`. Is it related to the test data in `testInt96TimestampRead`, where the dictionary encoding does not reduce data size, so it falls back to plain encoding?
velox/velox/dwio/parquet/writer/arrow/ColumnWriter.cpp, lines 1631 to 1632 in b437950:

```cpp
if (leaf_array.type()->id() == ::arrow::Type::DICTIONARY) {
  return WriteArrowDictionary(
```
Thank you for sharing Arrow's limitation!

> Is it related to the test data in `testInt96TimestampRead` where the dictionary encoding does not reduce data size so it falls back to plain encoding?

I assume it does not fall back. I added some logging at the point below and verified that it entered the dictionary reading path.

```cpp
case thrift::Type::INT96: {
```

About how the dictionary encoding works, I notice `DictEncoderImpl` will be used if `enableDictionary` is true.
velox/velox/dwio/parquet/writer/arrow/Encoding.cpp, lines 4133 to 4139 in 7a02321:

```cpp
std::unique_ptr<Encoder> MakeEncoder(
    Type::type type_num,
    Encoding::type encoding,
    bool use_dictionary,
    const ColumnDescriptor* descr,
    MemoryPool* pool) {
  if (use_dictionary) {
```
Thank you a lot for pointing to PageReader. It saved me a day figuring out why a dictionary encoded Int64 Timestamp gets read as a vector of `int64` instead of the 128-bit Velox `Timestamp`!
FYI, I just found the place where an Arrow timestamp in nanoseconds gets truncated to microseconds.
velox/velox/dwio/parquet/writer/arrow/ColumnWriter.cpp, lines 2584 to 2595 in ac5c15e:

```cpp
} else if (
    (version == ParquetVersion::PARQUET_1_0 ||
     version == ParquetVersion::PARQUET_2_4) &&
    source_type.unit() == ::arrow::TimeUnit::NANO) {
  // Absent superseding user instructions, when writing Parquet version <= 2.4
  // files, timestamps in nanoseconds are coerced to microseconds
  std::shared_ptr<ArrowWriterProperties> properties =
      (ArrowWriterProperties::Builder())
          .coerce_timestamps(::arrow::TimeUnit::MICRO)
          ->disallow_truncated_timestamps()
          ->build();
  return WriteCoerce(properties.get());
```
Thank you for sharing the information! Is the truncation limited to Parquet versions 1.0 and 2.4? It appears that Velox uses Parquet 2.6:

```cpp
version_(ParquetVersion::PARQUET_2_6),
```

I also noticed the exception below in the Spark query runner without https://github.com/facebookincubator/velox/pull/11698/files#diff-04bd90dfc1fe1e50c098001f60ae4bdca4b61e662918661a141c7db948b28d47R54. It seems nano int64 timestamps can be generated.

```
Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,false)).
```
Thanks, @zuyu
```diff
@@ -899,16 +899,15 @@ core::TypedExprPtr extractFiltersFromRemainingFilter(
 namespace {

 #ifdef VELOX_ENABLE_PARQUET
```
I think we should be able to remove the `#ifdef VELOX_ENABLE_PARQUET` now. The Meta dev should be able to test this after import.
Let's do it in a separate PR, as the change would touch ~20 files unrelated to this PR.
Fixes #11607