Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix join with sort push down #13560

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

haohuaijin
Copy link
Contributor

@haohuaijin haohuaijin commented Nov 25, 2024

Which issue does this PR close?

Closes #13559

Rationale for this change

What changes are included in this PR?

Are these changes tested?

test by existed test and add a sqllogicaltest test

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Nov 25, 2024
@alamb
Copy link
Contributor

alamb commented Nov 25, 2024

Thank you @haohuaijin

This appears to be code that was introduced in #12559 from @berkaysynnada

@berkaysynnada is there any way you can help review this PR?

@berkaysynnada
Copy link
Contributor

Hi @haohuaijin, and sorry for the delayed response. I have been very busy over the past few days. I have reviewed your fix and have some comments about the problem and the solution.

The issue originates from handle_custom_pushdown(), which overfits to operators that simply concatenate the input children's schemas side by side. In your example, the join is a right-semi join, which exposed this bug.

Your fix refers to column names when propagating the order. However, if there are multiple columns in the child schema, this fix may still be insufficient. To address this, I suggest two potential solutions:

  1. You can write a specific handler for hash joins. This approach would not rely on a general mapping of input_fields to output_fields but would instead use a join type parameter to propagate the ordering. This solution should work for any type of join in our project and is my recommended approach. You could copy the current logic from this function and adapt it for different join types.
  2. You could generalize the logic further to include a mapping of input children fields to output children fields and propagate the ordering accordingly. However, implementing this solution would be non-trivial and would require extensive testing.
    If you go with the first solution, I suggest adding a note to the function’s header to clarify its purpose:
    "The function can be used for operators that simply concatenate the input children's schemas side by side."

By the way, @alamb, isn’t it risky to have an index_of() API in the Schema implementation? Doesn’t this create the potential for mismatches when columns are matched based solely on their names?

@alamb
Copy link
Contributor

alamb commented Nov 30, 2024

By the way, @alamb, isn’t it risky to have an index_of() API in the Schema implementation? Doesn’t this create the potential for mismatches when columns are matched based solely on their names?

Yes, I agree there is this danger

I think the core challenge is that arrow Schema can have duplicated field names, but there are various places in DataFusion that assume the field name is unique (to resolve column referenes)

Even more confusing is that the unique column name assumption is only in the LogicalPlan level, rather than the ExecutonPlan level (which uses offsets). 🤯

@findepi has been observing similar issues when thinking about supported repeated column names in the logical plan

@findepi
Copy link
Member

findepi commented Dec 2, 2024

@findepi has been observing similar issues when thinking about supported repeated column names in the logical plan

(see discussion under #13489 and in #13476, #6543)

@alamb
Copy link
Contributor

alamb commented Dec 3, 2024

So how shall we fix this bug? Is this PR ok to go (even if not perfect)? Or should we do something else?

@berkaysynnada
Copy link
Contributor

I can open a PR implementing my first suggestion as it does not require much effort. However, it will handle only hash join properly, and other operators will still go on the custom handler function, and it remains same, not being such custom actually. I'd like to hear from @haohuaijin before deciding.

@haohuaijin
Copy link
Contributor Author

haohuaijin commented Dec 4, 2024

Sorry for the delay @berkaysynnada and @alamb, and thanks for the suggestion for @berkaysynnada
I go for the first solution, I think for the generalize solution, we need to have the information that how to map the input schema to the output schema (for different join type, the map is different and for hash join, it have projection). And current in ExecutionPlan we don't have this information.

BTW, the error reported by CI seems not related to this PR.

Copy link
Contributor

@berkaysynnada berkaysynnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks you @haohuaijin, LGTM. I have just a few minor suggestions.

The plans in the tests look improved (BTW did the previous way of the fix change the plans like that? 🤔 )

datafusion/core/src/physical_optimizer/sort_pushdown.rs Outdated Show resolved Hide resolved
datafusion/core/src/physical_optimizer/sort_pushdown.rs Outdated Show resolved Hide resolved
datafusion/core/src/physical_optimizer/sort_pushdown.rs Outdated Show resolved Hide resolved
datafusion/core/src/physical_optimizer/sort_pushdown.rs Outdated Show resolved Hide resolved
@haohuaijin
Copy link
Contributor Author

haohuaijin commented Dec 4, 2024

Thanks for your reviews and suggestions @berkaysynnada

BTW did the previous way of the fix change the plans like that?

the previous fix didn't change the plan. Maybe because the current solutions take into account the join type(for right join, the left side don't output) and projection, so make the plan change

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

An error occurred when the sort push down rule pushed sort below join
4 participants