fix: Align TimestampUnit and TimestampPrecision in unit tests #11698
Conversation
```diff
@@ -820,13 +832,15 @@ TEST_F(ParquetTableScanTest, timestampInt96Dictionary) {
   WriterOptions options;
   options.writeInt96AsTimestamp = true;
   options.enableDictionary = true;
   options.parquetWriteTimestampUnit = TimestampUnit::kMicro;
```
Int96 consists of days and nanos, so I wonder if it makes sense to write with the default nano precision. That means the precision of the Parquet data is nanoseconds, and Presto and Spark can then choose to read it as millis or micros, correspondingly. Thanks.
Actually, per apache/arrow#36005, we could NOT write a Timestamp in nanoseconds. In fact, if you look at the Timestamp data changes in `testInt96TimestampRead` in this PR, I only specify at most microsecond precision, because if we have `1999-12-08 13:39:26.123456789` while both `parquetWriteTimestampUnit` and the reading precision are in nanoseconds, we could only get `1999-12-08 13:39:26.123456000`, where the nanosecond part gets truncated.
Also, I'm not sure how the dictionary encoding works in Parquet. Even with `enableDictionary` set to `true` in `timestampInt96Dictionary`, it does not call `WriteArrowDictionary`. Is it related to the test data in `testInt96TimestampRead`, where the dictionary encoding does not reduce data size, so it falls back to plain encoding?
velox/velox/dwio/parquet/writer/arrow/ColumnWriter.cpp, lines 1631 to 1632 in b437950:

```cpp
if (leaf_array.type()->id() == ::arrow::Type::DICTIONARY) {
  return WriteArrowDictionary(
```
Thank you for sharing Arrow's limitation!

> Is it related to the test data in `testInt96TimestampRead` where the dictionary encoding does not reduce data size so it falls back to plain encoding?

I assume it does not fall back. I added some logging at the point below and verified that it entered the dictionary reading path.

```cpp
case thrift::Type::INT96: {
```

About how the dictionary encoding works, I notice `DictEncoderImpl` will be used if `enableDictionary` is true.
velox/velox/dwio/parquet/writer/arrow/Encoding.cpp, lines 4133 to 4139 in 7a02321:

```cpp
std::unique_ptr<Encoder> MakeEncoder(
    Type::type type_num,
    Encoding::type encoding,
    bool use_dictionary,
    const ColumnDescriptor* descr,
    MemoryPool* pool) {
  if (use_dictionary) {
```
Thank you a lot for pointing to PageReader. It saved me a day figuring out why a dictionary encoded Int64 Timestamp gets read as a vector of `int64` instead of the 128-bit Velox `Timestamp`!
FYI, I just found the place where an Arrow timestamp in nanoseconds gets truncated to microseconds.
velox/velox/dwio/parquet/writer/arrow/ColumnWriter.cpp, lines 2584 to 2595 in ac5c15e:

```cpp
} else if (
    (version == ParquetVersion::PARQUET_1_0 ||
     version == ParquetVersion::PARQUET_2_4) &&
    source_type.unit() == ::arrow::TimeUnit::NANO) {
  // Absent superseding user instructions, when writing Parquet version <= 2.4
  // files, timestamps in nanoseconds are coerced to microseconds
  std::shared_ptr<ArrowWriterProperties> properties =
      (ArrowWriterProperties::Builder())
          .coerce_timestamps(::arrow::TimeUnit::MICRO)
          ->disallow_truncated_timestamps()
          ->build();
  return WriteCoerce(properties.get());
```
Thank you for sharing the information! Is the truncation limited to Parquet versions 1.0 and 2.4? It appears that Velox uses Parquet 2.6:

```cpp
version_(ParquetVersion::PARQUET_2_6),
```

I also noticed the exception below in the Spark query runner without https://github.com/facebookincubator/velox/pull/11698/files#diff-04bd90dfc1fe1e50c098001f60ae4bdca4b61e662918661a141c7db948b28d47R54. It seems nano int64 timestamps can be generated.

```
Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,false)).
```
Thanks, @zuyu
```diff
@@ -899,16 +899,15 @@ core::TypedExprPtr extractFiltersFromRemainingFilter(
 namespace {

 #ifdef VELOX_ENABLE_PARQUET
```
I think we should be able to remove the `#ifdef VELOX_ENABLE_PARQUET` now. The Meta dev should be able to test this after import.
Let's do it in a separate PR, as the change would touch ~20 files unrelated to this PR.
Fixes #11607