Support all patterns for Spark CAST(varchar as timestamp) #11114
base: main
Conversation
Previous PR: #9388.
Resolves #11028.
namespace facebook::velox::functions::sparksql {

SparkCastHooks::SparkCastHooks(const core::QueryConfig& config) : CastHooks() {
  const auto sessionTzName = config.sessionTimezone();
  if (config.adjustTimestampToTimezone() && !sessionTzName.empty()) {
Heads up that adjustTimestampToTimezone() was added to be compatible with Presto's legacy_timestamp mode, which basically means that if a timestamp has no associated time zone, it is assumed to be in the local user time zone. I'm not sure if this is the same behavior in Spark.
We always set adjustTimestampToTimezone = true in Gluten, and it is consistent with Spark's behavior.
Makes sense, thanks for clarifying.
@liujiayi771, maybe we can remove the check for config.adjustTimestampToTimezone(), since this implementation is only for Spark, without any compatibility consideration for Presto.
velox/type/TimestampConversion.cpp
Outdated
auto sign = timeZoneName[start];
auto offset = timeZoneName.substr(start + 1);
std::string hm, h, m, s;
RE2 hhPattern(R"((\d?\d))");
How hard would it be to implement the parsing logic without using re2, like we do for other timestamp conversions/parsing? I'm slightly worried about performance, but also about adding re2 as a dependency to basically everything else in the library.
The primary purpose here is to extract these four formats:
- h[h]
- hh[:]mm
- hh:mm:ss
- hhmmss

Using re2 makes the code more concise, but for performance considerations, I will try to implement the logic here without using re2.
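As a rough illustration of what a regex-free parser for these four suffix formats could look like (a hypothetical sketch, not the PR's actual code; the function name, structure, and the ±18-hour bound are assumptions):

```cpp
#include <cctype>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical sketch: parse a Spark offset suffix in one of the formats
// h[h], hh[:]mm, hh:mm:ss, or hhmmss into total seconds, without re2.
// Returns std::nullopt on malformed input.
std::optional<int64_t> parseOffsetSuffix(std::string_view s) {
  auto allDigits = [](std::string_view v) {
    for (char c : v) {
      if (!std::isdigit(static_cast<unsigned char>(c))) {
        return false;
      }
    }
    return !v.empty();
  };
  auto toInt = [](std::string_view v) {
    int64_t n = 0;
    for (char c : v) {
      n = n * 10 + (c - '0');
    }
    return n;
  };
  int64_t h = 0, m = 0, sec = 0;
  switch (s.size()) {
    case 1: // h
    case 2: // hh
      if (!allDigits(s)) return std::nullopt;
      h = toInt(s);
      break;
    case 4: // hhmm
      if (!allDigits(s)) return std::nullopt;
      h = toInt(s.substr(0, 2));
      m = toInt(s.substr(2, 2));
      break;
    case 5: // hh:mm
      if (s[2] != ':' || !allDigits(s.substr(0, 2)) ||
          !allDigits(s.substr(3))) {
        return std::nullopt;
      }
      h = toInt(s.substr(0, 2));
      m = toInt(s.substr(3, 2));
      break;
    case 6: // hhmmss
      if (!allDigits(s)) return std::nullopt;
      h = toInt(s.substr(0, 2));
      m = toInt(s.substr(2, 2));
      sec = toInt(s.substr(4, 2));
      break;
    case 8: // hh:mm:ss
      if (s[2] != ':' || s[5] != ':' || !allDigits(s.substr(0, 2)) ||
          !allDigits(s.substr(3, 2)) || !allDigits(s.substr(6, 2))) {
        return std::nullopt;
      }
      h = toInt(s.substr(0, 2));
      m = toInt(s.substr(3, 2));
      sec = toInt(s.substr(6, 2));
      break;
    default:
      return std::nullopt;
  }
  // Assumed bound: java.time.ZoneOffset limits offsets to +/-18 hours.
  if (h > 18 || m > 59 || sec > 59) {
    return std::nullopt;
  }
  return h * 3600 + m * 60 + sec;
}
```

A length-based switch like this avoids both the regex engine and any intermediate string allocations.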
velox/type/TimestampConversion.cpp
Outdated
bool normalizeSparkTimezone(std::string& timeZoneName, Timestamp& ts) {
  // Spark supports an id with one of the prefixes UTC+, UTC-, GMT+, GMT-,
  // UT+ or UT-, and a suffix in the formats: h[h], hh[:]mm, hh:mm:ss, hhmmss.
Does Spark accept timezone offsets containing seconds? AFAIK, IANA's definition is only (+/-)00:00 (hours and minutes). There are no use cases for timezone conversions with second granularity.
Yes, Spark's timezone supports a special format that includes a seconds offset, such as UTC+hh:mm:ss. Therefore, I need to remove the seconds offset from the timezone and add it to the Timestamp result.
That is odd. What does Spark do if you have seconds on that definition? Does it add the seconds to the timezone conversion process?
Yes, Spark will add the seconds to the timezone conversion process. For example:
spark-sql (default)> select cast('2015-03-07 19:33:33 UTC-00:00:01' as timestamp);
2015-03-08 03:33:34
Time taken: 0.044 seconds, Fetched 1 row(s)
velox/type/TimestampConversion.cpp
Outdated
    return false;
  }
}
RE2 singleHourTz(R"((\+|\-)(\d):)");
We have similar logic to parse timezone offsets (fully or partially defined). Could we reuse it here instead of relying on re2?
In this implementation, I followed the handling in Spark's source code, where Spark uses regex matching to pad the hours and minutes to two digits. That is why I used re2. I will try to reuse the existing code, and I will also attempt to modify other parts that use re2.
@pedroerp I checked the TimeZoneMap::normalizeTimeZoneOffset method, and its normalization rules are inconsistent with Spark's rules in this context. Spark corrects timezone offsets like +2:3, +02:3, or +2:03 to +02:03. I will add Spark's normalization rule so that re2 is not needed.
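The padding rule described in that comment could be sketched roughly like this (illustrative only; the function name is made up and the real code operates on Velox's internal representation with validation this sketch omits):

```cpp
#include <string>

// Illustrative sketch of Spark-style hour/minute padding: "+2:3", "+02:3",
// and "+2:03" all normalize to "+02:03". Assumes the input starts with '+'
// or '-' and contains exactly one ':' (no validation, for brevity).
std::string normalizeHourMinuteOffset(const std::string& in) {
  const auto colon = in.find(':');
  std::string hour = in.substr(1, colon - 1); // digits before ':'
  std::string minute = in.substr(colon + 1);  // digits after ':'
  if (hour.size() == 1) {
    hour.insert(0, "0"); // pad "2" -> "02"
  }
  if (minute.size() == 1) {
    minute.insert(0, "0"); // pad "3" -> "03"
  }
  return in.substr(0, 1) + hour + ":" + minute; // reattach the sign
}
```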
Thanks!
Expected<Timestamp> SparkCastHooks::castStringToTimestamp(
    const StringView& view) const {
  return util::fromTimestampString(
      // Allows all patterns supported by Spark:
      // `[+-]yyyy*` `[+-]yyyy*-[m]m` `[+-]yyyy*-[m]m-[d]d` `[+-]yyyy*-[m]m-[d]d `
nit: perhaps one pattern per line for readability.
Is there an extra white-space after the last pattern `[+-]yyyy*-[m]m-[d]d `?
> Is there an extra white-space after the last pattern `[+-]yyyy*-[m]m-[d]d `?

Yes, this is one of the patterns.
// `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
//
// where `zone_id` should have one of the forms:
// - Z - Zulu time zone UTC+0
Perhaps use 1) 2) ... 5) for the forms to distinguish them from those on L47-L50.
velox/type/TimestampConversion.cpp
Outdated
// Additionally, IANA does not support seconds in the offset, so this function
// will add the seconds offset to the input Timestamp.
bool parseSparkTzWithPrefix(
    std::string& timeZoneName,
Can we use std::string_view as the parameter type to avoid creating intermediate strings?
This function returns bool to indicate whether an exception has occurred. std::string& is used for the parameter because we also need to modify the content of timeZoneName.
velox/type/TimestampConversion.cpp
Outdated
    size_t start) {
  VELOX_DCHECK_GE(timeZoneName.length(), start + 2);
  auto sign = timeZoneName[start];
  auto offset = timeZoneName.substr(start + 1);
Ditto. Can we use 'std::string_view'?
The code below concatenates offset with other strings, which always requires copying.
I find the concat only happens on L846. Can we handle it specially there and use string_view elsewhere? I assume the substr could then be avoided.
@rui-mo I added an appendZeroMinutes variable to handle this special case in the if branch.
velox/type/TimestampConversion.cpp
Outdated
@@ -865,9 +995,15 @@ fromTimestampWithTimezoneString(
   timezonePos++;
 }

-std::string_view timeZoneName(str + pos, timezonePos - pos);
+std::string timeZoneName(str + pos, timezonePos - pos);
+std::string normalizeTzName = timeZoneName;
Ditto. Can we use std::string_view?
We can use std::string_view for timeZoneName, but normalizeTzName needs to be std::string, because we change its content in normalizeSparkTimezone.
-if ((timeZone = tz::locateZone(timeZoneName, false)) == nullptr) {
+if ((timeZone = tz::locateZone(normalizeTzName, false)) == nullptr) {
   return folly::makeUnexpected(
       Status::UserError("Unknown timezone value: \"{}\"", timeZoneName));
nit: use 'normalizeTzName' in the error msg.
I deliberately used timeZoneName in the message because it is the original timezone string, while normalizeTzName has been modified. I want the user to see in the error message that the original timezone is problematic.
velox/type/TimestampConversion.cpp
Outdated
@@ -117,6 +117,20 @@ inline bool characterIsDigit(char c) {
   return c >= '0' && c <= '9';
 }

+inline bool
+allCharactersIsDigit(std::string_view str, size_t start, size_t end) {
Perhaps add a bound check for start and end:

if (start > end || end >= str.size()) {
  return false;
}
A minor suggestion: allCharactersIsDigit(std::string_view str) seems to be enough. We can just pass str.substr() at the caller side.
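The single-argument variant suggested here could look roughly like this (a sketch; the name and the empty-input behavior are assumptions, not the PR's final code):

```cpp
#include <string_view>

// Sketch of the suggested single-argument helper: the caller slices with
// substr() instead of passing start/end indices, which also removes the need
// for explicit bound checks inside the helper.
inline bool allCharactersAreDigits(std::string_view str) {
  if (str.empty()) {
    return false; // an empty slice has no digits to validate
  }
  for (const char c : str) {
    if (c < '0' || c > '9') {
      return false;
    }
  }
  return true;
}
```

A call like `allCharactersAreDigits(tz.substr(3, 2))` then replaces the index-pair form.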
TJavaShortIdIndex lower;
lower.reserve(javaShortIdMap.size());
for (auto& it : javaShortIdMap) {
  lower[boost::algorithm::to_lower_copy(it.first)] =
There is a sanitizeName method to convert the name to lower case.
Including FunctionSignature.h here seems a bit odd; I think it's better to stick with Boost to maintain consistency with the handling logic of TimeZoneIndex.
I think it may be better to move the Spark timezone normalization implementation into some other place dedicated to handling timezones, e.g., TimeZoneMap.cpp, like https://github.com/facebookincubator/velox/blob/main/velox/type/tz/TimeZoneMap.cpp#L180. I am not sure whether most of the code can be reused by Presto. Anyway, we can call these util functions only for Spark if they are not suitable for Presto.
// Spark supports a timezone id with one of the prefixes UTC+, UTC-, GMT+,
// GMT-, UT+ or UT-, and a suffix in the following formats: h[h], hh[:]mm,
// hh:mm:ss, hhmmss.
if (startsWith(timeZoneName, "UTC") || startsWith(timeZoneName, "GMT")) {
So "Etc/GMT" will not be handled here and will be located directly by the external date lib, right?
"Etc/GMT" will be handled in TimeZoneMap::normalizeTimeZone. This code only handles tz ids like "GMT+8" or "UTC-07:06". As you mentioned earlier, Spark normalization could be added to TimeZoneMap, but these normalizations are specific to Spark, so it might be necessary to differentiate within TimeZoneMap whether it is for Spark.
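For reference, the prefix dispatch under discussion could be sketched as follows (illustrative; the actual code checks only the "UTC"/"GMT" prefixes and handles the sign separately, and this hypothetical helper folds the sign into the prefix):

```cpp
#include <string_view>

// Illustrative sketch: detect whether a zone id uses one of Spark's
// offset-style prefixes. "Etc/GMT" starts with none of these, so it falls
// through to the regular timezone lookup.
bool hasSparkOffsetPrefix(std::string_view tz) {
  for (std::string_view prefix :
       {"UTC+", "UTC-", "GMT+", "GMT-", "UT+", "UT-"}) {
    // substr clamps the count, so short inputs compare safely.
    if (tz.substr(0, prefix.size()) == prefix) {
      return true;
    }
  }
  return false;
}
```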
@PHILO-HE TimestampConversion is used for Timestamp conversion or normalization, and it can be used by both Presto and Spark. However, some common normalizations are already handled in
@liujiayi771, this PR only considers the normalization of the timezone embedded in the input timestamp string, right? I remember Spark prioritizes the embedded timezone if it is specified in the input string; if it is not, the timezone passed from the config is used. Does the configured timezone need to be normalized similarly in the conversion path for strings without an embedded timezone?
velox/type/TimestampConversion.cpp
Outdated
if (ts.getSeconds() - sec < Timestamp::kMinSeconds) {
  return false;
}
ts = Timestamp(ts.getSeconds() - sec, ts.getNanos());
There is a uniform adjustment by ts.toGMT(timezone) based on the timestamp and timezone returned from fromTimestampWithTimezoneString. Can we somehow let ts.toGMT(timezone) do your proposed adjustment based on sec?
@PHILO-HE The goal here is to obtain a timezone id string that can be handled by tz::locateZone, and that method does not support seconds. The offset with seconds is added to the input Timestamp in advance so that subsequent processing can ignore the seconds. Ultimately, in SparkCastHooks::castStringToTimestamp, it is converted to the correct timezone using result.first.toGMT(*result.second). The toGMT method cannot be used here because it cannot handle seconds in this way.
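A minimal sketch of the "fold the seconds into the timestamp up front" idea (the function name and the range constant are illustrative, not the Velox API):

```cpp
#include <cstdint>
#include <optional>

// Illustrative lower bound standing in for Timestamp::kMinSeconds.
constexpr int64_t kMinSeconds = -62167219200;

// Subtract the seconds part of the offset from the epoch seconds so the
// remaining conversion only deals with minute-granularity timezones.
// For "UTC-00:00:01", sec is -1, so the timestamp moves one second forward,
// matching the Spark example earlier in the thread. Returns nullopt on
// underflow.
std::optional<int64_t> applySecondsOffset(int64_t epochSeconds, int32_t sec) {
  if (epochSeconds - sec < kMinSeconds) {
    return std::nullopt;
  }
  return epochSeconds - sec;
}
```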
@liujiayi771, I get your point. I mean, can we extract the adjustment based on sec and somehow make ts.toGMT include it? Then here we would only do timezone normalization. This decoupling may help support normalizing the timezone from the config.

Another question: do we really need to support timezones with second-level offsets? Such timezones are very rare in the real world; almost all timezones have hour, half-hour, or quarter-hour offsets. If there is no real request, maybe we can just ignore such timezones in the implementation, even though Spark supports them.
@PHILO-HE I see, I can extract this logic into a method in Timestamp.

> do we really need to support timezone with second-level offset?

I'm not very clear on this myself; I just adapted it according to what Spark supports. The code for supporting seconds is not extensive, but it might be somewhat hard to understand.
@liujiayi771, then I recommend ignoring such timezones with second-level offsets. You can ask for others' feedback in the community. Thanks!
+|-h[h]:mm:ss
+|-hhmmss

@liujiayi771 I noticed you mentioned the above two kinds of suffix for the timezone id, and I would like to know where they come from. Does Spark mention such a requirement? Thanks.
@rui-mo They come from the comments in Spark's code: https://github.com/apache/spark/blob/d3d84e045cc484cf7b70d36410a554238d7aef0e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L288-L289. +|-h:mm:ss is not present in the comments, but I have tested that this pattern works.
Yes, you are right, the timezone id in the config may also need to be normalized. For example, a user may set the "GMT+8" timezone in the config, which is supported by Velox. I think this scenario might occur in other Timestamp-related functions and casts as well, so the normalization of the timezone in the config should be placed in a more centralized location that applies to all these scenarios. Perhaps the conversion can be done when handling the timezone config in the Gluten cpp? This could be managed within the Gluten code by utilizing the utility method added in this PR. What do you think? I can continue working on this modification.
@liujiayi771, I feel it may be better to put the normalization on the Velox side. Can we just extend the existing
I think we still need to differentiate between Presto and Spark. For example, we are currently unclear whether applying Spark's normalization to Presto would affect Presto. It seems a bit strange to place Spark-specific normalization into the
@@ -71,6 +71,19 @@ void Timestamp::toGMT(const tz::TimeZone& zone) {
   seconds_ = sysSeconds.count();
 }

+void Timestamp::toGMT(const tz::TimeZone& zone, int32_t secondsOffset) {
The idea is that if you want to do a timezone offset conversion, you should use this TimeZone constructor and let the TimeZone object handle the conversion internally: https://github.com/facebookincubator/velox/blob/main/velox/type/tz/TimeZoneMap.h#L98

It looks like, to support this case, we should first change the TimeZone class to support std::chrono::seconds instead of std::chrono::minutes, then do the appropriate plumbing. If you change it there, you probably won't need to make any changes to the Timestamp class.
@pedroerp, yes, if we truly need to support a seconds offset, it is better to change the TimeZone class to use seconds for the offset representation. My question is, do we really need to support that kind of timezone? As far as I know, there is no timezone with a seconds offset in the real world.
Hi @pedroerp, thank you for your suggestion. I also @'d you in #11411 to ask your thoughts on whether the timezone formats supported by Velox could be the union of Presto's and Spark's. What do you think? Velox supporting more timezone formats shouldn't affect users, and I can move Spark's normalization code into the TimeZoneMap.
> My question is, do we really need to support that kind of timezone? As far as I know, there is no timezone with seconds offset in the real world.

@PHILO-HE I asked the same question at some point, but it was pointed out that this is something (somehow) supported by Spark. So if we absolutely need to support it, it would probably be better to do so within the TimeZone class, so at least it is a bit more encapsulated.
@liujiayi771 let me get a hold of @kevinwilfong, who has recently changed that code and has more context on how it is organized now.
Velox supports casting varchar to timestamp, but Spark supports more varchar-to-timestamp conversion patterns. This PR completes the conversion patterns needed by Spark.

Allows all patterns supported by Spark:
- `[+-]yyyy*`
- `[+-]yyyy*-[m]m`
- `[+-]yyyy*-[m]m-[d]d`
- `[+-]yyyy*-[m]m-[d]d `
- `[+-]yyyy*-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
- `[+-]yyyy*-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
- `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
- `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
where `zone_id` should have one of the forms:
- Z - Zulu time zone UTC+0
- an id with one of the prefixes UTC+, UTC-, GMT+, GMT-, UT+ or UT-, and a suffix in the formats: h[h], hh[:]mm, hh:mm:ss, hhmmss
- a region-based id in the form `area/city`, such as `Europe/Paris`