-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mango: add $beginsWith operator #4810
Conversation
85e8a9d
to
fb13ad9
Compare
Great work! I think it would be worth to cover the |
I miss the parts related to the documentation. |
Please add |
It would certainly be helpful to have Clouseau in the CI and, presumably, then in the dev container which uses the same CI images. |
The |
Okay. Right now there is a failing test case (for the ======================================================================
FAIL: test_beginswith (03-operator-test.OperatorTextTests.test_beginswith)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/gaborpali/projects/github/couchdb/src/mango/test/03-operator-test.py", line 146, in test_beginswith
self.assertEqual(len(docs), 2)
AssertionError: 0 != 2
----------------------------------------------------------------------
Ran 384 tests in 37.973s
FAILED (failures=1, skipped=1) |
Adds a `$beginsWith` operator to selectors, with json and text index support. This is a compliment / precursor to optimising `$regex` support as proposed in #4776. For `json` indexes, a $beginsWith operator translates into a key range query, as is common practice for _view queries. For example, to find all rows with a key beginning with "W", we can use a range `start_key="W", end_key="W\ufff0"`. Given Mango uses compound keys, this is slightly more complex in practice, but the idea is the same. As with other range operators (`$gt`, `$gte`, etc), `$beginsWith` can be used in combination with equality operators and result sorting but must result in a contiguous key range. That is, a range of `start_key=[10, "W"], end_key=[10, "W\ufff0", {}]` would be valid, but `start_key=["W", 10], end_key=["W\ufff0", 10, {}]` would not, because the second element of the key may result in a non-contiguous range. For text indexes, `$beginsWith` translates to a Lucene query on the specified field of `W*`. If a non-string operand is provided to `$beginsWith`, the request will fail with a 400 / `invalid_operator` error.
6f7f750
to
cdfb26d
Compare
cdfb26d
to
3a94e00
Compare
@pgj I think I've addressed all the comments, though will need to squash the commits |
In the documentation, 1.3.6.1.1. Selectors Basics should also talk about couchdb/src/docs/src/api/database/find.rst Lines 201 to 207 in a5af777
|
e0c1ab6
to
8e29e60
Compare
+1 |
😜 |
@big-r81 we have been in contact with @willholley about the PR out-of-band. On merging the PR he was aware that I did not have any more pending comments on the contents. I wanted to approve it formally but I was a bit late to the party hence I added the comment with the +1 to signal that I agree with the change. |
Hrm... Something is wrong with the merged version. CI says the following: [2023-10-30T18:57:57.544Z] FAIL: test_basic (25-beginswith-test.BeginsWithOperator)
[2023-10-30T18:57:57.544Z] ----------------------------------------------------------------------
[2023-10-30T18:57:57.544Z] Traceback (most recent call last):
[2023-10-30T18:57:57.544Z] File "/home/jenkins/workspace/kins-cm1_FullPlatformMatrix_main/centos7/build/src/mango/test/25-beginswith-test.py", line 49, in test_basic
[2023-10-30T18:57:57.544Z] self.assertEqual(len(docs), 2)
[2023-10-30T18:57:57.544Z] AssertionError: 0 != 2
[2023-10-30T18:57:57.544Z]
[2023-10-30T18:57:57.544Z] ======================================================================
[2023-10-30T18:57:57.544Z] FAIL: test_compound_key (25-beginswith-test.BeginsWithOperator)
[2023-10-30T18:57:57.544Z] ----------------------------------------------------------------------
[2023-10-30T18:57:57.544Z] Traceback (most recent call last):
[2023-10-30T18:57:57.544Z] File "/home/jenkins/workspace/kins-cm1_FullPlatformMatrix_main/centos7/build/src/mango/test/25-beginswith-test.py", line 68, in test_compound_key
[2023-10-30T18:57:57.544Z] self.assertEqual(len(docs), 1)
[2023-10-30T18:57:57.544Z] AssertionError: 0 != 1 |
Locally (with Dreyfus/Clouseau enabled) |
It probably a flaky test. We should investigate as we've been trying to squash some of those lately. Is the race condition fairly obvious? Usually we can add a retry loop. |
It is not flaky, these errors happen every time -- but only on CentOS 7. I was able to reproduce the issue locally with the help of a container created from the |
Oh very interesting, definitely worth investigating. One thing that immediately jumps to mind is that CentOS 7 has an older libicu library which may have different collation rules. Any end/start (min/max) marker perhaps might collate differently there. |
@big-r81 oops on the premature merge - I'm too used to working with repos with branch protection enabled in GH and didn't think to manually check for the +1. |
curious that the failure didn't show in the PR checks - it definitely smells like a collation issue and likely an issue with the special characters used in the tests rather than the implementation of |
@willholley np, only a reminder. We will add this review-branch-protection in the future. At the moment, we will get the "green" merge button, if the CI runs through... The failure didn't show up in the PR checks, because CentOS and other OSes are only called in the full CI run. For PRs, we only test a linux distro (debian) with different erlang version (24,25,26). |
In my understanding, the two failing test cases do not feature any special characters. Neither the selectors nor the documents. C.f. "begins with |
@pgj can you test on your local CentOS vm the state before the PR? |
It is the derived string upper limit that does not work. Consider the following: > Lower = <<"A">>.
<<"A">>.
> Upper = <<Lower/binary, 16#10FFFF>>.
<<"A\377">>. On CentOS 7: > couch_ejson_compare:get_icu_version().
{50,2,0,0}
> couch_ejson_compare:get_collator_version().
{58,0,6,50}
> mango_json:cmp(Lower, <<"AUS">>).
-1
> mango_json:cmp(<<"AUS">>, Upper).
1 On anywhere else (for example, locally): > couch_ejson_compare:get_icu_version().
{73,2,0,0}
> couch_ejson_compare:get_collator_version().
{153,120,0,0}
> mango_json:cmp(Lower, <<"AUS">>).
-1
> mango_json:cmp(<<"AUS">>, Upper).
-1 The |
Yeah, basically the lack of support for U+FFFF: This code point is tailored to have a primary weight higher than all other characters. This allows the reliable specification of a range, such as “Sch” ≤ X ≤ “Sch\uFFFF”, to include all strings starting with "sch" or equivalent. Given that the other issue was fixed by unconditionally applying a shim on top of the ICU string comparison call, we should do the same here? @nickva what do you think? |
@pgj excellent find!
It could work. The previous fix was easy enough as it was an isolated element -- a single marker. This one might involve doing some string wrangling in C, which is always fun, and by fun I mean dangerous, of course ;-) I guess we could extend the current check to first see if the string ends in |
I wonder whether there's a pragmatically high code point we could use as the high bound in |
I was thinking about the same. Unfortunately I am not that experienced with Unicode collation and ICU and I do not have any idea. I will keep studying the implementation for some inspiration and work on the shim-based approach so we could at least have something as a fix. |
Another sentinel value might work, good idea. CentOS is EOL next year so then we can deprecate support for it and remove a few hacks we have for it. |
Sure but what should be that value specifically? My impression is that 0xFFFF was created for that exact purpose because it was missing before. It would be good to know how others have implemented this feature in lack of this symbol. |
Looking over at the PR itself, one issue that jumps out is in the line
At some point the ICU library decided to treat that differently but they could have as well have thrown an error too. But then, if even if we did So then the fix might be to actually encode the
Even better, since unicode standard seems to define a max sortable code point
Give One interesting case to think about is that with |
Thanks @nickva! I will experiment with the options you described above and get back to you with the results. |
thanks @nickva - this solution looks good in my tests using the CentoOS container |
Erlang itself supports
It works nicely. Jenkins seems to be happy with that, see #4829 for the details. |
Overview
Adds a
$beginsWith
operator to selectors, with json and text index support. This is a compliment / precursor to optimising$regex
support as proposed in #4776.For
json
indexes, a $beginsWith operator translates into a key range query, as is common practice for _view queries. For example, to find all rows with a key beginning with "W", we can use a rangestart_key="W", end_key="W\ufff0"
. Given Mango uses compound keys, this is slightly more complex in practice, but the idea is the same. As with other range operators ($gt
,$gte
, etc),$beginsWith
can be used in combination with equality operators and result sorting but must result in a contiguous key range. That is, a range ofstart_key=[10, "W"], end_key=[10, "W\ufff0", {}]
would be valid, butstart_key=["W", 10], end_key=["W\ufff0", 10, {}]
would not, because the second element of the key may result in a non-contiguous range.For text indexes,
$beginsWith
translates to a Lucene query on the specified field ofW*
.If a non-string operand is provided to
$beginsWith
, the request will fail with a 400 /invalid_operator
error.Testing recommendations
I've only tested this with json indexes so far because the dev container doesn't have
text
index support.A basic eunit test is added for the new
selector
support. You can run these usingmake eunit apps=mango
.Integration tests are added to the Mango test suite, which can be run via
make mango-test
.Related Issues or Pull Requests
n/a
Checklist
rel/overlay/etc/default.ini
src/docs
folder