Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support all patterns for Spark CAST(varchar as timestamp) #11114

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 4 additions & 4 deletions velox/expression/PrestoCastHooks.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -54,17 +54,17 @@ Expected<Timestamp> PrestoCastHooks::castStringToTimestamp(
// If the parsed string has timezone information, convert the timestamp at
// GMT at that time. For example, "1970-01-01 00:00:00 -00:01" is 60 seconds
// at GMT.
if (result.second != nullptr) {
result.first.toGMT(*result.second);
if (result.tz != nullptr) {
result.timestamp.toGMT(*result.tz);

}
// If no timezone information is available in the input string, check if we
// should understand it as being at the session timezone, and if so, convert
// to GMT.
else if (options_.timeZone != nullptr) {
result.first.toGMT(*options_.timeZone);
result.timestamp.toGMT(*options_.timeZone);
}
return result.first;
return result.timestamp;
}

Expected<Timestamp> PrestoCastHooks::castIntToTimestamp(int64_t seconds) const {
Expand Down
3 changes: 2 additions & 1 deletion velox/functions/prestosql/DateTimeFunctions.h
Original file line number Diff line number Diff line change
Expand Up @@ -1499,7 +1499,8 @@ struct FromIso8601Timestamp {
return castResult.error();
}

auto [ts, timeZone] = castResult.value();
auto ts = castResult.value().timestamp;
auto timeZone = castResult.value().tz;
// Input string may not contain a timezone - if so, it is interpreted in
// session timezone.
if (!timeZone) {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -88,7 +88,8 @@ void castFromString(
if (castResult.hasError()) {
context.setStatus(row, castResult.error());
} else {
auto [ts, timeZone] = castResult.value();
auto ts = castResult.value().timestamp;
auto timeZone = castResult.value().tz;
// Input string may not contain a timezone - if so, it is interpreted in
// session timezone.
if (timeZone == nullptr) {
Expand Down
8 changes: 4 additions & 4 deletions velox/functions/sparksql/specialforms/SparkCastExpr.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@ exec::ExprPtr SparkCastCallToSpecialForm::constructSpecialForm(
const TypePtr& type,
std::vector<exec::ExprPtr>&& compiledChildren,
bool trackCpuUsage,
const core::QueryConfig& /*config*/) {
const core::QueryConfig& config) {
VELOX_CHECK_EQ(
compiledChildren.size(),
1,
Expand All @@ -33,14 +33,14 @@ exec::ExprPtr SparkCastCallToSpecialForm::constructSpecialForm(
std::move(compiledChildren[0]),
trackCpuUsage,
false,
std::make_shared<SparkCastHooks>());
std::make_shared<SparkCastHooks>(config));
}

exec::ExprPtr SparkTryCastCallToSpecialForm::constructSpecialForm(
const TypePtr& type,
std::vector<exec::ExprPtr>&& compiledChildren,
bool trackCpuUsage,
const core::QueryConfig& /*config*/) {
const core::QueryConfig& config) {
VELOX_CHECK_EQ(
compiledChildren.size(),
1,
Expand All @@ -51,6 +51,6 @@ exec::ExprPtr SparkTryCastCallToSpecialForm::constructSpecialForm(
std::move(compiledChildren[0]),
trackCpuUsage,
true,
std::make_shared<SparkCastHooks>());
std::make_shared<SparkCastHooks>(config));
}
} // namespace facebook::velox::functions::sparksql
50 changes: 48 additions & 2 deletions velox/functions/sparksql/specialforms/SparkCastHooks.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -16,14 +16,60 @@

#include "velox/functions/sparksql/specialforms/SparkCastHooks.h"
#include "velox/functions/lib/string/StringImpl.h"
#include "velox/type/TimestampConversion.h"
#include "velox/type/tz/TimeZoneMap.h"

namespace facebook::velox::functions::sparksql {

SparkCastHooks::SparkCastHooks(const core::QueryConfig& config) : CastHooks() {
const auto sessionTzName = config.sessionTimezone();
if (!sessionTzName.empty()) {
options_.timeZone = tz::locateZone(sessionTzName);
}
}

Expected<Timestamp> SparkCastHooks::castStringToTimestamp(
const StringView& view) const {
return util::fromTimestampString(
// Allows all patterns supported by Spark:
// `[+-]yyyy*`
// `[+-]yyyy*-[m]m`
// `[+-]yyyy*-[m]m-[d]d`
// `[+-]yyyy*-[m]m-[d]d `
// `[+-]yyyy*-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
// `[+-]yyyy*-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
// `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
// `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
//
// where `zone_id` should have one of the forms:
// 1. Z - Zulu time zone UTC+0
// 2. +|-[h]h:[m]m
// 3. A short id, see
// https://docs.oracle.com/javase/8/docs/api/java/time/ZoneId.html#SHORT_IDS
// 4. An id with one of the prefixes UTC+, UTC-, GMT+, GMT-, UT+ or UT-,
// and a suffix in the following formats:
// a. +|-h[h]
// b. +|-hh[:]mm
// c. +|-hh:mm:ss
// d. +|-hhmmss
// 5. Region-based zone IDs in the form `area/city`, such as `Europe/Paris`
const auto conversionResult = util::fromTimestampWithTimezoneString(
view.data(), view.size(), util::TimestampParseMode::kSparkCast);
if (conversionResult.hasError()) {
return folly::makeUnexpected(conversionResult.error());
}

auto result = conversionResult.value();

if (result.tz != nullptr) {
// If the parsed string has timezone information, convert the timestamp at
// GMT at that time.
result.timestamp.toGMT(*result.tz, result.secondsOffset);
} else if (options_.timeZone != nullptr) {
// If the input string contains no timezone information, determine whether
// it should be interpreted as being in the session timezone and, if so,
// convert it to GMT.
result.timestamp.toGMT(*options_.timeZone);
}
return result.timestamp;
}

Expected<Timestamp> SparkCastHooks::castIntToTimestamp(int64_t seconds) const {
Expand Down
6 changes: 6 additions & 0 deletions velox/functions/sparksql/specialforms/SparkCastHooks.h
Original file line number Diff line number Diff line change
Expand Up @@ -17,12 +17,15 @@
#pragma once

#include "velox/expression/CastHooks.h"
#include "velox/expression/EvalCtx.h"

namespace facebook::velox::functions::sparksql {

// This class provides cast hooks following Spark semantics.
class SparkCastHooks : public exec::CastHooks {
public:
explicit SparkCastHooks(const velox::core::QueryConfig& config);

// TODO: Spark hook allows more string patterns than Presto.
Expected<Timestamp> castStringToTimestamp(
const StringView& view) const override;
Expand Down Expand Up @@ -59,5 +62,8 @@ class SparkCastHooks : public exec::CastHooks {
}

exec::PolicyType getPolicy() const override;

private:
TimestampToStringOptions options_ = {};
};
} // namespace facebook::velox::functions::sparksql
78 changes: 78 additions & 0 deletions velox/functions/sparksql/tests/SparkCastExprTest.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -233,6 +233,10 @@ TEST_F(SparkCastExprTest, invalidDate) {

TEST_F(SparkCastExprTest, stringToTimestamp) {
std::vector<std::optional<std::string>> input{
"2015",
"-2015",
"2015-03",
"-2015-03",
"1970-01-01",
"2000-01-01",
"1970-01-01 00:00:00",
Expand All @@ -241,12 +245,37 @@ TEST_F(SparkCastExprTest, stringToTimestamp) {
"2015-03-18T12:03:17",
"2015-03-18 12:03:17",
"2015-03-18T12:03:17",
"2015-03-18T12:03:17Z",
"2015-03-18 12:03:17Z",
"2015-03-18 12:03:17.123",
"2015-03-18T12:03:17.123",
"2015-03-18T12:03:17.456",
"2015-03-18 12:03:17.456",
"2015-03-18T12:03:17.456Z",
"2015-03-18 12:03:17.456Z",
"2015-03-18T12:03:17-1:0",
"2015-03-18T12:03:17-01:00",
"2015-03-18T12:03:17+07:30",
"2015-03-18T12:03:17+7:3",
"2015-03-18T12:03:17.123-1:0",
"2015-03-18T12:03:17.123-01:00",
"2015-03-18T12:03:17.123+07:30",
"2015-03-18T12:03:17.123+7:3",
"2015-03-18 12:03:17.123UTC+8",
"2015-03-18 12:03:17.123UTC+8:1",
"2015-03-18T12:03:17.123GMT+081010",
"2015-03-18T12:03:17.123GMT+8:10:10",
"2015-03-18T12:03:17.123GMT+8:10",
"2015-03-18T12:03:17.123UT+00:00:10",
"2015-03-18 12:03:17.123456789",
"2015-03-18 12:03:17.123Etc/GMT+1",
"2015-03-18 12:03:17.123CTT",
};
std::vector<std::optional<Timestamp>> expected{
Timestamp(1420070400, 0),
Timestamp(-125754422400, 0),
Timestamp(1425168000, 0),
Timestamp(-125749324800, 0),
Timestamp(0, 0),
Timestamp(946684800, 0),
Timestamp(0, 0),
Expand All @@ -255,14 +284,63 @@ TEST_F(SparkCastExprTest, stringToTimestamp) {
Timestamp(1426680197, 0),
Timestamp(1426680197, 0),
Timestamp(1426680197, 0),
Timestamp(1426680197, 0),
Timestamp(1426680197, 0),
Timestamp(1426680197, 123000000),
Timestamp(1426680197, 123000000),
Timestamp(1426680197, 456000000),
Timestamp(1426680197, 456000000),
Timestamp(1426680197, 456000000),
Timestamp(1426680197, 456000000),
Timestamp(1426680197 + 1 * 60 * 60, 0),
Timestamp(1426680197 + 1 * 60 * 60, 0),
Timestamp(1426680197 - (7 * 60 * 60 + 30 * 60), 0),
Timestamp(1426680197 - (7 * 60 * 60 + 3 * 60), 0),
Timestamp(1426680197 + 1 * 60 * 60, 123000000),
Timestamp(1426680197 + 1 * 60 * 60, 123000000),
Timestamp(1426680197 - (7 * 60 * 60 + 30 * 60), 123000000),
Timestamp(1426680197 - (7 * 60 * 60 + 3 * 60), 123000000),
Timestamp(1426680197 - 8 * 60 * 60, 123000000),
Timestamp(1426680197 - (8 * 60 * 60 + 1 * 60), 123000000),
Timestamp(1426680197 - (8 * 60 * 60 + 10 * 60 + 10), 123000000),
Timestamp(1426680197 - (8 * 60 * 60 + 10 * 60 + 10), 123000000),
Timestamp(1426680197 - (8 * 60 * 60 + 10 * 60), 123000000),
Timestamp(1426680197 - 10, 123000000),
Timestamp(1426680197, 123456000),
// Etc/GMT+1 and GMT-1 are equivalent.
Timestamp(1426680197 + 1 * 60 * 60, 123000000),
Timestamp(1426680197 - 8 * 60 * 60, 123000000),
};
testCast<std::string, Timestamp>("timestamp", input, expected);
}

TEST_F(SparkCastExprTest, invalidStringToTimestamp) {
testInvalidCast<StringView>(
"timestamp",
{"2015-"},
"Cannot cast VARCHAR '2015-' to TIMESTAMP. Unable to parse timestamp value: \"2015-\"");
testInvalidCast<StringView>(
"timestamp",
{"2015-13"},
"Cannot cast VARCHAR '2015-13' to TIMESTAMP. Unable to parse timestamp value: \"2015-13\"");
testInvalidCast<StringView>(
"timestamp",
{"2015-03-18 12:03:17.123utc+080000"},
"Unknown timezone value: \"utc+080000\"");
testInvalidCast<StringView>(
"timestamp",
{"2015-03-18 12:03:17.123UTC+08000012"},
"Failed to normalize spark timezone value: \"UTC+08000012\"");
testInvalidCast<StringView>(
"timestamp",
{"2015-03-18 12:03:17.123UTC+080061"},
"Failed to normalize spark timezone value: \"UTC+080061\"");
testInvalidCast<StringView>(
"timestamp",
{"2015-03-18 12:03:17.123UTC+8:6:10"},
"Failed to normalize spark timezone value: \"UTC+8:6:10\"");
}

TEST_F(SparkCastExprTest, intToTimestamp) {
// Cast bigint as timestamp.
testCast(
Expand Down
13 changes: 13 additions & 0 deletions velox/type/Timestamp.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -71,6 +71,19 @@ void Timestamp::toGMT(const tz::TimeZone& zone) {
seconds_ = sysSeconds.count();
}

void Timestamp::toGMT(const tz::TimeZone& zone, int32_t secondsOffset) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that if you want to do a timezone offset conversion, you should use this TimeZone constructor and let the TimeZone object handle the conversion internally:

https://github.com/facebookincubator/velox/blob/main/velox/type/tz/TimeZoneMap.h#L98

Looks like to support this case we should first change the TimeZone class to support std::chrono::second instead of std::chrono::minute, then do the appropriate plumbing. If you change it there, you probably won't need to make any changes to the Timestamp class.

Copy link
Contributor

@PHILO-HE PHILO-HE Nov 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pedroerp, yes, if we truly need to support seconds offset, it's better to change TimeZone class to use seconds for offset time representation. My question is, do we really need to support that kind of timezone? As far as I know, there is no timezone with seconds offset in the real world.

Copy link
Contributor Author

@liujiayi771 liujiayi771 Nov 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @pedroerp. Thank you for your suggestion. I also @ you in #11411 to ask about your thoughts on the timezone formats supported by Velox can be the union of Presto and Spark. What do you think? Velox supporting more timezone formats shouldn't affect users' usage, and I can move Spark's normalization code into the TimeZoneMap.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My question is, do we really need to support that kind of timezone? As far as I know, there is no timezone with seconds offset in the real world.

@PHILO-HE I asked the same question at some point, but it was pointed out that this is something (somehow) supported by Spark. So if we absolutely need to support it, it would probably better to do so within the TimeZone class, so at least it is a bit more encapsulated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liujiayi771 let me get a hold on @kevinwilfong who has recently changed that code and has more context how this code is organized now.

toGMT(zone);
if (seconds_ + secondsOffset < kMinSeconds ||
seconds_ + secondsOffset > kMaxSeconds) {
VELOX_USER_FAIL(
"The seconds offset in timezone will get invalid timestamp, "
"timestamp is {}, seconds offset is {}",
toString(),
secondsOffset);
}
seconds_ += secondsOffset;
}

std::chrono::time_point<std::chrono::system_clock, std::chrono::milliseconds>
Timestamp::toTimePointMs(bool allowOverflow) const {
using namespace std::chrono;
Expand Down
3 changes: 3 additions & 0 deletions velox/type/Timestamp.h
Original file line number Diff line number Diff line change
Expand Up @@ -348,6 +348,9 @@ struct Timestamp {
// ts.toString(); // returns January 1, 1970 08:00:00
void toGMT(const tz::TimeZone& zone);

// Converts the timestamp to the GMT time and add the seconds offset.
void toGMT(const tz::TimeZone& zone, int32_t secondsOffset);

/// Assuming the timestamp represents a GMT time, converts it to the time at
/// the same moment at zone. For example:
///
Expand Down
Loading
Loading