Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OPTIONAL MATCH issues + Continued Development and Support \ Community for Morpheus (Spark 3 support, bugfixes ) #947

Open
MarcianoAvihay opened this issue Jun 7, 2020 · 3 comments

Comments

@MarcianoAvihay
Copy link

MarcianoAvihay commented Jun 7, 2020

hi,
i have encountered what i believe is a bug in the optional match implementation , and hoping to get some guidance as to how i could help remediate it (and submit a pull request for all to have this fix) .

the issue is this ->

i have done multiple tests that all lead to the same result -> using OPTIONAL MATCH to match against a non-existent relationship expansion , and afterwards having another OPTIONAL MATCH from the same variable to something that does exist , would never return the latter , and will only return a NULL .

re- example:

test graph ->

val test = morpheus.cypher(
"""
CONSTRUCT
CREATE (p1:Person {name: "Alice"})
CREATE (p2:Person {name: "Bob"})
CREATE (p3:Person {name: "Eve"})
CREATE (p4:Person {name: "Paul"})
CREATE (p1)-[:KNOWS]->(p3)
CREATE (p1)-[:KNOWS2]->(p2)
CREATE (p1)-[:KNOWS3]->(p3)
CREATE (p1)-[:KNOWS4]->(p4)
CREATE (p1)-[:KNOWS5]->(p2)

return GRAPH
""".stripMargin
).graph

--- query -->

val testres = test.cypher(
"""
match (p1:Person )
optional match (p1)-[:KNOWS]->(p2)
optional match (p1)-[:KNOWS15]->(p3)
optional match (p1)-[:KNOWS4]->(p4)
return p1.name, p2.name,p3.name,p4.name
""".stripMargin)

--- yields the following result ->

+-------+-------+-------+-------+
|p1_name|p2_name|p3_name|p4_name|
+-------+-------+-------+-------+
| Bob| null| null| null|
| Eve| null| null| null|
| Paul| null| null| null|
| Alice| Eve| null| null|
+-------+-------+-------+-------+

*** But , if we would change the query order to (having the non existent expansion last):

val testres = test2.cypher(
"""
match (p1:Person )
optional match (p1)-[:KNOWS]->(p2)
optional match (p1)-[:KNOWS4]->(p4)
optional match (p1)-[:KNOWS15]->(p3)
return p1.name, p2.name,p3.name,p4.name
""".stripMargin)

-- we would then get the correct result :
+-------+-------+-------+-------+
|p1_name|p2_name|p3_name|p4_name|
+-------+-------+-------+-------+
| Alice| Eve| null| Paul|
| Bob| null| null| null|
| Eve| null| null| null|
| Paul| null| null| null|

-- which is not expected - since alice is also connected via KNOWS4 to paul.
same query and graph setup in neo4j yields (in any ordering of the query) :

p1.name p2.name p3.name p4.name
"Alice" "Eve" null "Paul"
"Bob" null null null
"Eve" null null null
"Paul" null null null

i found in the OptionalMatchTests.scala in morpheus the following test which doesn't cover the above, and nothing else that does (there is no test in morpheus \ tck that cover doesnt exist - exist :

val g = initGraph(
"""
|CREATE (:DoesExist {property: 42})
|CREATE (:DoesExist {property: 43})
|CREATE (:DoesExist {property: 44})
""".stripMargin)

  val res = g.cypher(
    """
      |OPTIONAL MATCH (f:DoesExist)
      |OPTIONAL MATCH (n:DoesNotExist)
      |RETURN collect(DISTINCT n.property) AS a, collect(DISTINCT f.property) AS b
    """.stripMargin)

can any of the dev's please share they're thoughts as to how hard would it be to try and get to the expected behaviour? and where should i start looking in the solution to try and fix ?

is it a matter of a complex modification to the relation+logical planner to get the required behaviour ? if there is some small tweak that comes to mind it would greatly help me ( i am trying to use morpheus to automate a combination of left outer joins and inner joins needed in a very large datasets

thanks very much !

@MarcianoAvihay
Copy link
Author

MarcianoAvihay commented Jun 18, 2020

So , an update -> i was able to resolve this bug with a strange workaround ->
main modifications are within the relationalplanner class (specifically , the planoptional method, and added a new operator to the sparkTable class which i use in the planner) .

the workaround is probably far from ideal, but it works -> basically this bug ( consecutive optional matches where the first doesn't match and the second does match but is not returned) is due to the fact that the join columns also join on the non matched label and properties (although they are null) which causes them not to return . I think solving it is a MUST in order to actually say optional match is supported in this project ( what's the point of having optional match support that works only if it matches ? )

therefore , my workaround was to drop all columns from the right hand side operator , which contain only null values -> and then when the common expressions are calculated , they do not include those columns and hence there is no join on a null value column .

the better way was to deal just with the join columns , but this creates other issues .

the main issue was that when joining , the auto calculation of the join expression is wrong , as it takes into account unrelated columns from the rhs and lhs and uses them as key , much more columns than really needed ( for example , having a schema (a:A), (b:B), (c:C) on the lhs , while there is an expansion on the rhs for (a)->(d:D) , instead of using just a and d as the join key , would also use b and c , which may appear null on the rhs and non null on the lhs ( and this is i believe due to the use of caching to the original + auto calculating the join expressions based on common headers)

Now , wondering if anyone else is interested in pushing this project forward as a joint effort .
seems like the current dev's \ admins are not really interested in (formally) keeping pushing this repository , so we can either -> keep developing on a fork - or submit pull requests here

currently i have 2 contributions :

  1. working spark-3 support ( seems stable on rc-1 and preview-2 , didn't check on the latest spark-3 release yet)
  2. solution \ workaround for the bug described above ( although i would love having someone to review it \ suggest other ways for dealing with it )

once i'm done testing those , will push them either to a fork or here ( depending if i see it interests more people in the community)

@MarcianoAvihay MarcianoAvihay changed the title OPTIONAL MATCH issues? OPTIONAL MATCH issues + Continued Development and Support \ Community for Morpheus (Spark 3 support, bugfixes ) Jun 18, 2020
@drew-moore
Copy link

@MarcianoAvihay do you still have your branch adding spark 3 support around, and if so would you mind pushing it? I'd like to pick up where you left off :)

@dtylor
Copy link

dtylor commented Mar 16, 2022

@MarcianoAvihay One option to consider... Apache Age also converts cypher to relational (postgres) sql statements. They implement the optional match using a LEFT LATERAL JOIN which I think would address the issue you identified. apache/age#175 . Spark v3.2 now supports LATERAL joins apache/age#175. There isn't a Spark dataframe api method to accomplish LATERAL joins, so a temporary view would have to be created to support the join.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants