Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Align TimestampUnit and TimestampPrecision in unit tests #11698

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

zuyu
Copy link
Contributor

@zuyu zuyu commented Nov 30, 2024

Fixes #11607

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 30, 2024
Copy link

netlify bot commented Nov 30, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 0dbad46
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/675a1f469d0f8000090e34ce

@zuyu zuyu force-pushed the 11607 branch 2 times, most recently from e1e50f9 to c707cfc Compare December 2, 2024 17:50
@zuyu zuyu marked this pull request as ready for review December 10, 2024 17:49
@zuyu zuyu requested a review from majetideepak as a code owner December 10, 2024 17:49
@@ -820,13 +832,15 @@ TEST_F(ParquetTableScanTest, timestampInt96Dictionary) {
WriterOptions options;
options.writeInt96AsTimestamp = true;
options.enableDictionary = true;
options.parquetWriteTimestampUnit = TimestampUnit::kMicro;
Copy link
Collaborator

@rui-mo rui-mo Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Int96 consists of days and nanos, so I wonder if it makes sense to write with the default nano precision.
That means the precision of parquet data is nanoseconds, and Presto and Spark can choose to read the parquet as millis or micros correspondingly. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, per apache/arrow#36005, we could NOT write a Timestamp in nanoseconds. In fact, if you look at the Timestamp data changes in testInt96TimestampRead in this PR, I only specify at most in the microseconds, because if we have 1999-12-08 13:39:26.123456789 while both parquetWriteTimestampUnit and the reading precision are in nanoseconds, we could only get 1999-12-08 13:39:26.123456000, where the nanosecond part gets truncated.

Also, I'm not sure how the dictionary encoding works in Parquet. Even w/ enableDictionary is true in timestampInt96Dictionary, it does not call WriteArrowDictionary. Is it related to the test data in testInt96TimestampRead where the dictionary encoding does not reduce data size so it falls back to plain encoding?

if (leaf_array.type()->id() == ::arrow::Type::DICTIONARY) {
return WriteArrowDictionary(

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for sharing the Arrow's limitation!

Is it related to the test data in testInt96TimestampRead where the dictionary encoding does not reduce data size so it falls back to plain encoding?

I assume it does not fall back, and I printed some log at the below point and verified it entered the dictionary reading.

case thrift::Type::INT96: {

About how the dictionary encoding works, I notice 'DictEncoderImpl' will be used if enableDictionary is true.

std::unique_ptr<Encoder> MakeEncoder(
Type::type type_num,
Encoding::type encoding,
bool use_dictionary,
const ColumnDescriptor* descr,
MemoryPool* pool) {
if (use_dictionary) {

Copy link
Contributor Author

@zuyu zuyu Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you a lot for pointing to PageReader. It saved me a day to figure out why dictionary encoded Int64 Timestamp gets read as a vector of int64, instead of Velox Timestamp in 128-bit!

Fyi, I just found the place where Arrow timestamp in nanoseconds gets truncated into microseconds.

} else if (
(version == ParquetVersion::PARQUET_1_0 ||
version == ParquetVersion::PARQUET_2_4) &&
source_type.unit() == ::arrow::TimeUnit::NANO) {
// Absent superseding user instructions, when writing Parquet version <= 2.4
// files, timestamps in nanoseconds are coerced to microseconds
std::shared_ptr<ArrowWriterProperties> properties =
(ArrowWriterProperties::Builder())
.coerce_timestamps(::arrow::TimeUnit::MICRO)
->disallow_truncated_timestamps()
->build();
return WriteCoerce(properties.get());

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for sharing the information! Is truncating limited to versions of Parquet 1.0 and 2.4? It appears that Velox uses the Parquet-2.6 version.

version_(ParquetVersion::PARQUET_2_6),

I also notice the below exception in the Spark query runner if without https://github.com/facebookincubator/velox/pull/11698/files#diff-04bd90dfc1fe1e50c098001f60ae4bdca4b61e662918661a141c7db948b28d47R54. It seems nano int64 timestamps can be generated.

Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,false)).

Copy link
Collaborator

@majetideepak majetideepak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @zuyu

@majetideepak majetideepak added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Dec 11, 2024
@@ -899,16 +899,15 @@ core::TypedExprPtr extractFiltersFromRemainingFilter(
namespace {

#ifdef VELOX_ENABLE_PARQUET
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be able to remove the ifdef VELOX_ENABLE_PARQUET now.
The Meta dev should be able to test this after import.

Copy link
Contributor Author

@zuyu zuyu Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do it in a separate PR, as the change would touch ~20 files unrelated to this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Parquet] TimestampPrecision / TimestampUnit mismatch in read / write files, particular for unit tests
4 participants