Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to compare to value? #363

Open
tbenst opened this issue Sep 2, 2021 · 19 comments
Open

How to compare to value? #363

tbenst opened this issue Sep 2, 2021 · 19 comments

Comments

@tbenst
Copy link

tbenst commented Sep 2, 2021

I'm not sure how to do x .> "Young" Is there a way to make Young a CategoricalValue?

julia> x = CategoricalArray(["Old", "Young", "Middle", "Young"], levels=["Young", "Middle", "Old"], ordered=true)
4-element CategoricalArray{String,1,UInt32}:
 "Old"
 "Young"
 "Middle"
 "Young"

julia> x[1] > x[2] 
true

julia> x .> x[2]
4-element BitVector:
 1
 0
 1
 0

julia> x .> "Young" # I'm not sure how to do this?
ERROR: ArgumentError: cannot compare a `CategoricalValue` to value `v` of type `CategoricalValue{String, UInt32}`
@tbenst
Copy link
Author

tbenst commented Sep 3, 2021

is there any reason not to implement:

function >(v::CategoricalValue{T,R},y::String) where {T,R}
    catv_y = CategoricalArrays.pool(v)[get(CategoricalArrays.pool(v), y)]
    v > catv_y
end

function <(v::CategoricalValue{T,R},y::String) where {T,R}
    catv_y = CategoricalArrays.pool(v)[get(CategoricalArrays.pool(v), y)]
    v < catv_y
end


function ==(v::CategoricalValue{T,R},y::String) where {T,R}
    catv_y = CategoricalArrays.pool(v)[get(CategoricalArrays.pool(v), y)]
    v < catv_y
end

@bkamins
Copy link
Member

bkamins commented Sep 3, 2021

See #346 and use:

julia>  x = CategoricalArray(["Old", "Young", "Middle", "Young"], levels=["Young", "Middle", "Old"], ordered=true)       
4-element CategoricalArray{String,1,UInt32}:
 "Old"
 "Young"
 "Middle"
 "Young"

julia> x .< CategoricalValue("Young", x)
4-element BitVector:
 0
 0
 0
 0

julia> x .< CategoricalValue("Middle", x)
4-element BitVector:
 0
 1
 0
 1

@tbenst
Copy link
Author

tbenst commented Sep 3, 2021

@bkamins thanks! That’s a bit more ergonomic but still fairly unwieldy—is there a reason to not do CategoricalValue("Middle", x) automatically on the users behalf? I’m a new user of this package, but the default behavior of R for factors/categoricals is to support such comparisons with strings, I think it’s quite ergonomic and follows principle of least surprise

@bkamins
Copy link
Member

bkamins commented Sep 3, 2021

As explained in linked #346 doing it automatically leads to non transitivity of <, which breaks a basic contract of such operation.

An example from R:

> x <- ordered("1", levels=c("1", "2"))
> y <- ordered("1", levels=c("2", "1"))
> x < "2" && "2" < y
[1] TRUE

so should imply that x < y but we get:

> x < y
Error in Ops.ordered(x, y) : level sets of factors are different

The problem is that, as opposed to R Julia exposes CategoricalValue type, which can be stored in any container.

@tbenst
Copy link
Author

tbenst commented Sep 4, 2021

Hm, I'm not sure I agree--I think R is principled for the example you give. There are two categories-- a) co-product of String and levels=c("1", "2"), and b) co-product of String and levels=c("2,"1"). Transitivity only holds within a category. x < y tries to compare objects from two different categories which is indeed an error.

Edit: to clarify, I think it is also principled to only allow comparison within the same categorical pool. I just wanted to point out that from a category theory standpoint, x < "2" and "2" < y are not the same function, they are two different morphisms. It would seem that CategoricalArrays supports total orders with CategoricalValue objects, while R allows for partial orders over the co-product / Union type including string. Both designs maintain transitivity, but there may be other reasons to prefer one over the other.

@bkamins
Copy link
Member

bkamins commented Sep 4, 2021

Consider the following in Julia (this is not doable in R):

julia> x = categorical([1], levels=[1,2], ordered=true)
1-element CategoricalArray{Int64,1,UInt32}:
 1

julia> y = categorical([1], levels=[2,1], ordered=true)
1-element CategoricalArray{Int64,1,UInt32}:
 1

julia> z = Any[x[1], 2, y[1]]
3-element Vector{Any}:
  CategoricalValue{Int64, UInt32} 1 (1/2)
 2
  CategoricalValue{Int64, UInt32} 1 (2/2)

And now if you called:

issorted(z)

under your rules you would get true, while currently you get:

julia> issorted(z)
ERROR: ArgumentError

Also if you called sort(z) under your rule the result be undefined (it would either error or could produce different results depending on the sorting algorithm).

Now you might ask if things like z variable happen in practice. The answer is AFAICT yes. If I remember correctly in MLJ.jl ecosystem it is possible to get vectors mixing CategoricalValues from different pools in a single container in some operations.


Having said that, I keep the issue open as when doing the current design we were aware that it is inconvenient. However - design wise - it is better to be restrictive (as we are now) and then consider loosening the rules as this approach is non breaking. Let us discuss here the pros and cons.

@tbenst
Copy link
Author

tbenst commented Sep 4, 2021

Thanks for the thoughtful discussion! I’ll reply to your comment here to consolidate.

I completely agree with you on isless, and on sorting. These need to be total orders.

However, note that < is explicitly for partial orders, and is not used for sorting. That same documentation goes on to say Types with a partial order should implement <. Here are the docs for <:

<(x, y)

Less-than comparison operator. Falls back to isless. Because of the behavior of
floating-point NaN values, this operator implements a partial order.

Implementation

New numeric types with a canonical partial order should implement this function
for two arguments of the new type. Types with a canonical total order should
implement isless instead. (x < y) | (x == y)

As such, I think the contract in Base explicitly endorses < for partial orders. So I think the case of comparing strings and CategoricalValue with < is a design decision that can be considered, but isless cannot be implemented in this fashion for the excellent points you made.

@bkamins
Copy link
Member

bkamins commented Sep 4, 2021

This is an excellent point. Indeed the benefit of having isless and < separated in Julia could be employed here.

#346 was implemented to support isless and I did not spend enough thought on < in this context. Indeed non-categorical value could be treated as kind of NaN for floats (with the difference that all thre values: true, false, or missing could be returned in this case).

@nalimilan - what do you think?

@nalimilan
Copy link
Member

I'm glad to hear that some people would want to use < with strings. I wasn't sure before removing it whether users really cared about it.

That said, AFAICT the docstring for < implies that transitivity is required, since that's one of the conditions for the existence of a partial order. (< on floating point is transitive, since comparison involving NaN always return false.) In practice, we could probably break this assumption without it being too much of an issue, but this deserves careful consideration. Originally, we allowed both isless and < mixing CategoricalValue and strings, and it took a few years to realize that the behavior of isless could be problematic in some contexts.

BTW, R is well-known for being full of dangerous corner cases (in particular in operations involving factors), so I wouldn't take it as a good reference for this kind of thing.

@bkamins
Copy link
Member

bkamins commented Sep 4, 2021

@ablaom - could you please cross check/comment how breaking transitivity of < with categorical values would play with MLJ.jl ecosystem? Thank you! (we would still keep isless transitive)

@tbenst
Copy link
Author

tbenst commented Sep 4, 2021

I’m still a bit puzzled by the comments that such a change could break transitivity. Transitivity is maintained, it’s just now a partial order: if x > “2” and y < “2”, then either x > y == true or x > y == missing [or errors or …]

edit: I suppose our confusion arises from how we differ in our consideration of categorical([1], levels=[1,2], ordered=true) and categorical([1], levels=[1,2], ordered=true). I think the fact that they are the same type in Julia is really a type system limitation, for performance reasons? I would argue that they are not the same type in a Type Theory sense, and hence x >1 “2” and y >2 “2” makes no prediction that x >1 y or x >2 y, since there are two different relations at play

oops and yes, meant to use levels=[2,1] for one of them

@nalimilan
Copy link
Member

The transitivity issue that I'm referring to is this:

julia> x = categorical(["a", "b", "c"], levels=["c", "b", "a"], ordered=true)
3-element CategoricalArray{String,1,UInt32}:
 "a"
 "b"
 "c"

julia> x[3] < "b" && "b" < x[1]
true

julia> x[3] == "c" && x[1] == "a"
true

julia> "c" < "a"
false

@tbenst
Copy link
Author

tbenst commented Sep 4, 2021

Ah I was editing my comment right as you posted that. Multiple dispatch is hiding that there are multiple methods at play there, transitivity is not broken for an individual method. So a question becomes, is transitivity supposed to be maintained at a method level or a function level? I think method level makes more sense—these are different relations represented by same symbol. We could rewrite to make this point more explicit that transitivity is not implied since these are different relations

julia> x[3] <1 "b" && "b" <1 x[1]
true

julia> x[3] ==1 "c" && x[1] ==1 "a"
true

julia> "c" <2 "a"
false

@nalimilan
Copy link
Member

edit: I suppose our confusion arises from how we differ in our consideration of categorical([1], levels=[1,2], ordered=true) and categorical([1], levels=[1,2], ordered=true). I think the fact that they are the same type in Julia is really a type system limitation, for performance reasons?

I guess you meant to use different levels between the two arrays? If so yes, the decision to avoid making levels part of the type is due to performance (compilation cost) considerations, but also on practical considerations (you wouldn't be able to add, remove or reorder levels without first creating a new array).

I would argue that they are not the same type in a Type Theory sense, and hence x >1 “2” and y >2 “2” makes no prediction that x >1 y or x >2 y, since there are two different relations at play

Type theory is nice and all, but in practical terms what matters whether we think that people might use < and get confusing results or not.

@nalimilan
Copy link
Member

So a question becomes, is transitivity supposed to be maintained at a method level or a function level? I think method level makes more sense—these are different relations represented by same symbol. We could rewrite to make this point more explicit that transitivity is not implied since these are different relations

In Julia, all methods defined for a function are supposed to follow the same API contract. Otherwise it's impossible to write generic code reliably. Of course we can make some minor exceptions to this rule, but that requires careful consideration. Otherwise, we could perfectly introduce a different function and/or operator.

@tbenst
Copy link
Author

tbenst commented Sep 4, 2021

Transitivity is broken on current implementation (although maybe a bug?)

julia> x = categorical(["a", "b", "c"], levels=["c", "b", "a"], ordered=true)

julia> y = categorical(["a", "b", "c"], levels=["a","b","c"], ordered=true);

julia> x[1] == y[1] # y1 >= x1
true

julia> y[2] > y[1] # y2 >= y1
true

julia> y[2] > x[1] # y2 >= x1
ERROR: ArgumentError: CategoricalValue objects from unequal pools cannot be tested for order
Stacktrace:
 [1] <(x::CategoricalValue{String, UInt32}, y::CategoricalValue{String, UInt32})
   @ CategoricalArrays ~/.julia/packages/CategoricalArrays/rDwMt/src/value.jl:172
 [2] >(x::CategoricalValue{String, UInt32}, y::CategoricalValue{String, UInt32})
   @ Base ./operators.jl:305
 [3] top-level scope
   @ REPL[6]:1

@nalimilan
Copy link
Member

Well yes, but at least it throws an error, so it's much less dangerous than silently returning a misleading result. There's no good solution here anyway: should y[2] > x[1] be equivalent to y[2] > y[1] == true or to x[2] > x[1] == false?

@tbenst
Copy link
Author

tbenst commented Sep 4, 2021

I agree that the current implementation is reasonable! I'm more just pointing out the folly of aiming for transitivity for the comparison functions, across all methods. It's not currently the case in Julia nor is it reasonable to expect, imho. here's a silent "error"

julia> x[1] == "a"
true

julia> x[2] == "b"
true

julia> x[2] < x[1]
true

julia> "b" < "a"
false

@nalimilan
Copy link
Member

Note that the full error message is relatively explicit after fixing the typo at #366:

julia> x .> "Young"
ERROR: ArgumentError: cannot compare a `CategoricalValue` to value `v` of type `String`: wrap `v` using `CategoricalValue(v, catvalue)` or `CategoricalValue(v, catarray)` first

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants