Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add support for UUID comparison functions #10791

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

BryanCutler
Copy link
Contributor

@BryanCutler BryanCutler commented Aug 20, 2024

This adds binary comparison functions <, >, <=, >= to the UUID custom data type. Equality functions are already present. Added unit tests for testing a query with comparisons between UUID values.

The ordering is done lexicographically to conforms with IETF RFC 4122 https://datatracker.ietf.org/doc/html/rfc4122.html, and also matches Presto Java after prestodb/presto#23311.

Related UUID serialization fix at #11197
From #10584

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 20, 2024
Copy link

netlify bot commented Aug 20, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit fa1b203
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/675893b70bb3e500087d7bc9

Copy link
Collaborator

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @BryanCutler. One small comment.

Did you try running the queries in prestodb/presto#23301 with this logic ?

}
};

#define VELOX_GEN_BINARY_EXPR_UUID(Name, uuidCompExpr, TResult) \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TResult parameter is not used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, good catch! It's actually the same in VELOX_GEN_BINARY_EXPR_TIMESTAMP_WITH_TIME_ZONE too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@BryanCutler
Copy link
Contributor Author

Did you try running the queries in prestodb/presto#23301 with this logic ?

Yes, and the queries for <,>,<=,>= now pass. But it is a little strange, they only pass for me in debug mode, when I run in release mode I get this type of error:

not equal
Actual rows (1 of 1 extra rows shown, 6 rows in total):
    [f768f36d-4f09-4da7-a298-3564d8f3c986, 1fdff429-6f65-4370-bb2a-7ab660dbec92]
Expected rows (1 of 1 missing rows shown, 6 rows in total):
    [1fdff429-6f65-4370-bb2a-7ab660dbec92, f768f36d-4f09-4da7-a298-3564d8f3c986]

where it looks like one UUID is randomly generated, and I'm not sure where it's coming from. I'll look into that a bit more.

@BryanCutler
Copy link
Contributor Author

@aditi-pandit I didn't realize the temp table from the test in prestodb/presto#23301 included a randomly generated UUID with (cast(uuid() AS VARCHAR)). That was the reason for failing in release mode, without this random value all the comparison tests pass.

Copy link
Collaborator

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @BryanCutler. Had some minor comments.

velox/functions/prestosql/UuidFunctions.h Outdated Show resolved Hide resolved
template <typename T> \
struct Name##Uuid : public UuidCompareBase { \
VELOX_DEFINE_FUNCTION_TYPES(T); \
FOLLY_ALWAYS_INLINE void call( \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit : Add newline in between.

Copy link
Contributor Author

@BryanCutler BryanCutler Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aditi-pandit I don't see a consistent format for this, most of the time it looks how it is currently (when the arguments don't fit within a single line) but I do see some cases where the format is:

template <typename T>
struct MultiplyVoidOutputFunction {
  template <typename TInput>
  FOLLY_ALWAYS_INLINE void
  call(TInput& result, const TInput& a, const TInput& b) {
    result = functions::multiply(a, b);
  }
};

Do you mean the latter, with a newline after void?

Copy link
Collaborator

@aditi-pandit aditi-pandit Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BryanCutler : I meant newline in between

VELOX_DEFINE_FUNCTION_TYPES(T);      
                                    
FOLLY_ALWAYS_INLINE void call(      

You can stick with the formatting from make format-fix otherwise.

@@ -108,5 +108,45 @@ TEST_F(UuidFunctionsTest, unsupportedCast) {
evaluate("cast(123 as uuid())", input), "Cannot cast BIGINT to UUID.");
}

TEST_F(UuidFunctionsTest, comparisons) {
auto data0 = makeFlatVector<std::string>({
"33355449-2c7d-43d7-967a-f53cd23215ad",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good to add more tests where the lower 8 bytes are the same for better coverage of the comparison logic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, for the lower bits I have a check for <, = and I'll add one more for >

Copy link
Contributor Author

@BryanCutler BryanCutler Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some additional cases, one with equal lower bytes

@aditi-pandit aditi-pandit changed the title Add Support for UUID Comparison Functions Add support for UUID Comparison Functions Aug 26, 2024
@aditi-pandit aditi-pandit changed the title Add support for UUID Comparison Functions Add support for UUID Comparison functions Aug 26, 2024
Copy link
Collaborator

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @BryanCutler.

@aditi-pandit
Copy link
Collaborator

@Yuhta @kagamiori @mbasmanova : Please can you help with the review.

@BryanCutler
Copy link
Contributor Author

Gentle ping @Yuhta @kagamiori @mbasmanova to please review

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BryanCutler Thank you for adding this functionality. I wonder where is the semantics of comparing UUIDs come from? Do you have a pointer? Would be nice to update PR description and documentation (https://facebookincubator.github.io/velox/develop/types.html) to clarify.

#include "velox/functions/prestosql/types/UuidType.h"

namespace facebook::velox::functions {

struct UuidCompareBase {
FOLLY_ALWAYS_INLINE int compare64(uint64_t x, uint64_t y)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These can be static functions or free functions.

Let's also put this code into 'details' namespace.

}

int compare128(const int128_t& lhs, const int128_t& rhs) {
int comp = compare64(HugeInt::upper(lhs), HugeInt::upper(rhs));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comp

Do not abbreviate: comp -> result or similar.

"33355449-2c7d-43d7-967a-f53cd23215ad",
});

auto actual = evaluate("cast(c0 as uuid) < cast(c1 as uuid)", makeRowVector({data0, data1}));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason not to test one value at a time using evaluateOnce? This makes tests easier to read and debug.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, just not familiar with the test environment. I will try with evaluateOnce.

@mbasmanova mbasmanova changed the title Add support for UUID Comparison functions Add support for UUID comparison functions Sep 9, 2024
@BryanCutler
Copy link
Contributor Author

Thanks for reviewing @mbasmanova

I wonder where is the semantics of comparing UUIDs come from? Do you have a pointer? Would be nice to update PR description and documentation

I basically followed the comparison done in in Presto Java https://github.com/prestodb/presto/blob/97d88cfdebb553b5c7fb0e217b0211573df3d190/presto-common/src/main/java/com/facebook/presto/common/type/UuidType.java#L97

Since int128_t doesn't have support for atomic operations, we first check the higher bytes, then lower. Returning an int from the comparison allows it to be streamlined to do minimal checks. I will update the PR and documentation.

@mbasmanova
Copy link
Contributor

@BryanCutler I wonder whether it makes sense to compare UUIDs at all? What does it mean that one UUID is greater than the other? Perhaps, this type should be non-orderable.

@BryanCutler
Copy link
Contributor Author

@BryanCutler I wonder whether it makes sense to compare UUIDs at all? What does it mean that one UUID is greater than the other? Perhaps, this type should be non-orderable.

@mbasmanova I think it makes sense, just as you would compare values for std::byte, etc. Most UUID implementations support such operations, see https://www.boost.org/doc/libs/1_85_0/libs/uuid/doc/uuid.html#Operators

@aditi-pandit
Copy link
Collaborator

@BryanCutler I wonder whether it makes sense to compare UUIDs at all? What does it mean that one UUID is greater than the other? Perhaps, this type should be non-orderable.

@mbasmanova I think it makes sense, just as you would compare values for std::byte, etc. Most UUID implementations support such operations, see https://www.boost.org/doc/libs/1_85_0/libs/uuid/doc/uuid.html#Operators

@mbasmanova : I found UUID comparisons in the Java spec also https://docs.oracle.com/javase/8/docs/api/java/util/UUID.html

My understanding is that a source generating UUIDs uses a combination of timestamp with the DNS address of the generating device. So there is some notion of ordering of UUIDs as a result.

@BryanCutler
Copy link
Contributor Author

@mbasmanova just wanted to confirm if you are ok with adding this functionality? I still need to update the tests per your request and will get to that soon, thanks!

@mbasmanova
Copy link
Contributor

@BryanCutler Yes, please, move ahead with addressing the comments. Thank you for clarifying the use case.

@BryanCutler
Copy link
Contributor Author

@mbasmanova @aditi-pandit after looking into prestodb/presto#23311 there are a couple issues with UUIDs in Velox:

  1. UUID data is being stored as big endian in the int128_t value, which comes from using boost:uuid during casting to/from strings. This causes comparisons like 11000000-0000-0022-0000-000000000000 < 22000000-0000-0011-0000-000000000000 to fail.
  2. Serialization of the int128_t value to an int128 Array Block results in [LSW, MSW], however the Presto Java UUIDType expects long values from the int128 Array Block to be [MSW, LSW].

For (1), since Velox stores UUID values with an int128_t, unlike boost::uuid which is a byte array, I think the value should be in little endian. This means instead of directly using memcpy bytes from the boost::uuid during casting, the bytes would need to swapped to get int128 in LE order. wdyt?

To fix (2), the word order would need to be switched during serialization. Would you prefer I fix that in this PR or open another?

@aditi-pandit
Copy link
Collaborator

aditi-pandit commented Sep 25, 2024

I feel the first fix should be to address is the issue of the incorrect results in prestodb/presto#23311. Post that we can do this comparisons PR.

From your comments seems that requires 2 changes:
i) Fixing the seriailization order between the native and java sides (2) above
ii) Changing the uuid int128_t to be little endian in the copy from boost::uuid.

@mbasmanova : Using boost::uuid (instead of int128_t) to represent UUID in Velox could be cleaner given these issues. We could make UUID TypeKind::VARBINARY and use the boost::uuid bytes to represent it. wdyt ?

@BryanCutler : If we did that, would we need the serialization changes from Native to Java side as well ? My intuition is yes, but I could be wrong.

@Yuhta : Jimmy, would be great to hear your thoughts as well.

@BryanCutler
Copy link
Contributor Author

Thanks @aditi-pandit , if we did change the storage from int128_t to a byte array in boost::uuid, then we would still need to adjust the serialization to Java, since Presto expects an Int128ArrayBlock.

@Yuhta
Copy link
Contributor

Yuhta commented Oct 1, 2024

@aditi-pandit I don't think VARBINARY is the right physical type for UUID, it is meant for variable length data, while UUID is fixed length. I would say just keep using int128 and fix the endianness so the comparison between UUID is the same as comparison between int128.

@aditi-pandit
Copy link
Collaborator

Thanks @Yuhta for your response. I do see a pattern of using VARBINARY for fixed lengths (in IPPREFIX type) but that has length 17 bytes which can't be captured with any integral type. For UUID we do have the option of using int128_t.

@BryanCutler : Let's go with Jimmy's suggestion. Will make our implementation simpler as well.

@BryanCutler BryanCutler force-pushed the add-uuid-comp-functions-10584 branch from e4f984a to 5aa91d4 Compare October 7, 2024 18:44
@BryanCutler
Copy link
Contributor Author

Thanks @Yuhta and @aditi-pandit , I have fixed the serialization issue here and added a unit test, I can put that in a separate PR if preferred. I will also test these changes again with prestodb/presto#23311

@BryanCutler
Copy link
Contributor Author

BryanCutler commented Oct 8, 2024

After testing the comparisons with Presto Java in prestodb/presto#23311 there are some different results because Presto Java uses Slice.compareTo() which compares values by swapping bytes of each long value, see https://github.com/airlift/slice/blob/0.38/src/main/java/io/airlift/slice/Slice.java#L1336. So the values are ordered according to the LSB of each long value, which seems a little strange and I'm not sure if that should be changed.

** moved the serializer fix to #11197, will update this pr soon
** updated - the underlying issue in Presto Java is an additional byte swap, not an overflow from https://bugs.openjdk.org/browse/JDK-7025832

return x < y ? -1 : (x == y ? 0 : 1);
}

static int compare128(const int128_t& lhs, const int128_t& rhs) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just compare (uint128_t)lhs and (uint128_t)rhs

@BryanCutler
Copy link
Contributor Author

Just compare (uint128_t)lhs and (uint128_t)rhs

Thanks @Yuhta , I can look into that but there is a bigger concern that this type of ordering will not match Presto Java, see above #10791 (comment).

Presto Java UuidOperations use Slice.compareTo() which follows a strange ordering based on swapping bytes. As far as I can tell this ordering was changed in the Slice library 2.0+, which uses Arrays.compareUnsigned and I don't believe it would have the same problem, but not sure upgrading that is planned in the near-term.

I'm not sure if we should try to follow Presto Java with the comparisons or use the spec at RFC-4122 to order lexicographically, e.g. comparing uint128_t in little endian, and differ from Presto Java. Please share recommendations on this @Yuhta @aditi-pandit

@BryanCutler
Copy link
Contributor Author

There is a recent related PR cherry-picked from trindodb, this might address some part of the Presto Java issue prestodb/presto#17939.

@Yuhta
Copy link
Contributor

Yuhta commented Oct 16, 2024

@BryanCutler We should ask Presto Java community to decide the ordering of UUID and document them officially first.

@BryanCutler
Copy link
Contributor Author

@BryanCutler We should ask Presto Java community to decide the ordering of UUID and document them officially first.

There is a PR up at prestodb/presto#23847 that corrects Presto Java to conform to the IETF RFC 4122 definition of ordering UUIDs. When that is resolved, this PR can be updated to match.

@BryanCutler
Copy link
Contributor Author

Status: PR prestodb/presto#23847 has been merged, and I have updated the comparisons to use a cast to uint128_t, and this includes the serialization commit from #10791 which stores the UUID value as little endian. Once that PR is merged, it can be dropped from here.

I have also checked these changes again to ensure the Presto Java tests from prestodb/presto#23301

@BryanCutler BryanCutler changed the title Add support for UUID comparison functions feat: Add support for UUID comparison functions Nov 21, 2024
@BryanCutler BryanCutler force-pushed the add-uuid-comp-functions-10584 branch 2 times, most recently from a703f14 to 9d917dc Compare November 21, 2024 21:54
@BryanCutler
Copy link
Contributor Author

Finished rebasing after fixing serialzation issue at #10791

const auto uuidEval = [&](const std::optional<std::string>& lhs,
const std::string& operation,
const std::optional<std::string>& rhs) {
return evaluateOnce<bool>(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you try this version of evaluateOnce https://github.com/facebookincubator/velox/blob/main/velox/functions/prestosql/tests/utils/FunctionBaseTest.h#L241

that can override the logical type. It might reduce the casting in your expression then.

So the call would be something like
evaluateOnce<bool>(fmt::format("c0 {} c1", operation), {UUID(), UUID()}, lhs, rhs)

But of course that assumes that you are not relying on the casting to reshuffle bytes in a way that the uint128_t comparisons hold.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does work, but then I would need to construct the UUIDs with boost and then convert to an int128_t to pass to the eval. I think the end result is less clear, and this way also tests the UUID casting.

rhs);
};

ASSERT_EQ(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just confirming that you got the same results with Presto Java.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've copied these values to the Presto Java UUID tests to run and get the same results.

@BryanCutler BryanCutler force-pushed the add-uuid-comp-functions-10584 branch from 9d917dc to 86278a9 Compare December 2, 2024 22:20
Copy link
Collaborator

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @BryanCutler for working with the revisions.

@aditi-pandit
Copy link
Collaborator

@Yuhta : Please can you help further review and merge.

@Yuhta Yuhta added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Dec 3, 2024
@facebook-github-bot
Copy link
Contributor

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@Yuhta
Copy link
Contributor

Yuhta commented Dec 3, 2024

@BryanCutler BryanCutler force-pushed the add-uuid-comp-functions-10584 branch 2 times, most recently from 8d88f1e to 76dc4be Compare December 5, 2024 19:28
This adds UUID comparison functions that were previously unsupported.
Functions added are <, >, <=, >=. Equal was already supported. The
ordering is done lexicographically and conforms to IETF [RFC 4122].
The ordering also matches Presto Java after #[23847].

[RFC 4122]: https://datatracker.ietf.org/doc/html/rfc4122.html
[23847]: prestodb/presto#23847
@BryanCutler BryanCutler force-pushed the add-uuid-comp-functions-10584 branch from 76dc4be to fa1b203 Compare December 10, 2024 19:17
@BryanCutler
Copy link
Contributor Author

BryanCutler commented Dec 10, 2024

It seems like one of the problems is the biased expression fuzzer test tries to create a hive table with IPADDRESS or UUID as a type, and they are not supported so the reference query fails.

Locally, this is one of the errors I see

I20241210 12:16:31.433363 354597 PrestoQueryRunner.cpp:804] Execute presto sql: 
CREATE TABLE tmp(c0, c1, row_number) WITH (format = 'DWRF') AS SELECT cast(null as IPADDRESS), cast(null as BOOLEAN), cast(null as BIGINT)
E20241210 12:16:31.447225 354597 Exceptions.h:66] Line: /home/bryan/git/velox/velox/exec/fuzzer/PrestoQueryRunner.cpp:98, Function:throwIfFailed, Expression:  
Presto query failed: 65536 Unsupported type: ipaddress, Source: RUNTIME, ErrorCode: INVALID_STATE

The test might have to be adjusted for these types, I'll look into what we can do about it.

@BryanCutler
Copy link
Contributor Author

Another PR has failed biased expression test, not quite the same error since testing different functions https://github.com/facebookincubator/velox/actions/runs/12279548356/job/34266985156?pr=11612

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants