[BUG] Literals of nested type are ignored when estimating the output size in PreProjectSplitIterator #11903

Open
firestarman opened this issue Dec 24, 2024 · 0 comments
Assignees
Labels
bug Something isn't working

Collaborator

firestarman commented Dec 24, 2024

Describe the bug
When running the following toy query, the estimated output size (167.688 KB) returned by "PreProjectSplitIterator.calcMinOutputSize" is far smaller than the actual size (1269.768 KB) after the projection.

spark.range(1024 * 10).selectExpr("cast(id as long)").createOrReplaceTempView("tt")
// A triple-quoted string is needed because the SQL text spans several lines.
sql("""select
       distinct(id),
       array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23) as a24,
       array(1, 2, 3) as a3
     from tt"""
).collect

Projection list:

  ==>Expr 0: GpuBoundReference input[0, bigint, false](id#15)
  ==>Expr 1: GpuAlias [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23] AS [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]#24
     Children:
    ==>Expr 0: GpuLiteral [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]
  ==>Expr 2: GpuAlias [1,2,3] AS [1,2,3]#25
     Children:
    ==>Expr 0: GpuLiteral [1,2,3]

According to the projection list for the above query, the output batch is in theory about 12 times bigger than the input batch, not counting offset and validity buffers. We also saw a batch of ~4.4G in some custom queries on T4.
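The sizes in the toy query can be checked with some back-of-the-envelope arithmetic. The sketch below is plain Scala and does not use any plugin API. All names in it (`ProjectSizeEstimate`, `outputBytes`, and so on) are hypothetical. It assumes 8-byte longs, 4-byte ints, 4-byte list offsets with one extra offset per column, no validity buffers, and that all 10240 distinct rows land in a single batch.

```scala
// Hypothetical helper, for illustration only; not part of the plugin.
object ProjectSizeEstimate {
  val LongBytes = 8L   // bigint column "id"
  val IntBytes = 4L    // element type of the array literals
  val OffsetBytes = 4L // assumption: 32-bit list offsets

  // Bytes of the input batch: a single non-nullable bigint column.
  def inputBytes(rows: Long): Long = rows * LongBytes

  // Child data of an array-of-int literal repeated once per row.
  def arrayDataBytes(rows: Long, elems: Int): Long = rows * elems * IntBytes

  // Assumption: a list column stores rows + 1 offsets.
  def arrayOffsetBytes(rows: Long): Long = (rows + 1) * OffsetBytes

  // Output batch: the id column plus one list column per array literal.
  def outputBytes(rows: Long, arrayLens: Seq[Int]): Long =
    inputBytes(rows) +
      arrayLens.map(n => arrayDataBytes(rows, n) + arrayOffsetBytes(rows)).sum
}
```

Under these assumptions, `ProjectSizeEstimate.outputBytes(10240L, Seq(24, 3))` evaluates to 1269768 bytes, which lines up with the reported actual size of 1269.768 KB if KB is read as 1000 bytes. Almost all of that comes from the literal child data (1105920 bytes), which is the part the estimate misses.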

Looking at the code here, GPU literals of nested type (array of integer) are not included in the estimated output size. As a result, the estimate stays small enough that no splitting actually happens, even though splitting is expected, and a very big batch is produced in the production env.
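To illustrate why the missing literals matter for the split decision, here is a minimal, self-contained sketch. It is not the plugin's implementation: the `Expr` hierarchy, `estimateMinOutputSize`, and `numSplits` are hypothetical stand-ins, and the 512 KiB target is an arbitrary value chosen for the example. It assumes each array-of-int literal costs 4 bytes per element plus one 4-byte offset per row.

```scala
// Hypothetical model of a projection list, for illustration only.
object NestedLiteralEstimate {
  sealed trait Expr
  // A pass-through column with a known fixed width per row.
  final case class BoundRef(bytesPerRow: Long) extends Expr
  // An array-of-int literal that is materialized for every row.
  final case class IntArrayLiteral(values: Seq[Int]) extends Expr

  // Per-row cost; pass countNested = false to mimic ignoring nested literals.
  def perRowBytes(e: Expr, countNested: Boolean): Long = e match {
    case BoundRef(b) => b
    case IntArrayLiteral(v) =>
      if (countNested) v.length * 4L + 4L else 0L // child data + one offset
  }

  def estimateMinOutputSize(exprs: Seq[Expr], rows: Long,
      countNested: Boolean): Long =
    exprs.map(perRowBytes(_, countNested)).sum * rows

  // Number of pieces needed so that each stays at or under the target size.
  def numSplits(estimatedBytes: Long, targetBytes: Long): Int =
    math.max(1L, (estimatedBytes + targetBytes - 1) / targetBytes).toInt
}
```

For the toy query, a usage example could look like this:

```scala
import NestedLiteralEstimate._

val exprs = Seq(BoundRef(8L), IntArrayLiteral(0 until 24), IntArrayLiteral(Seq(1, 2, 3)))
val target = 512L * 1024 // arbitrary target batch size for the example

val ignored = estimateMinOutputSize(exprs, 10240L, countNested = false)
val counted = estimateMinOutputSize(exprs, 10240L, countNested = true)
println(s"ignored: $ignored bytes, splits = ${numSplits(ignored, target)}")
println(s"counted: $counted bytes, splits = ${numSplits(counted, target)}")
```

With the literals ignored, the estimate stays under the target and no split is requested. Once they are counted, the same batch would be split into several pieces.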

@firestarman firestarman added ? - Needs Triage Need team to review and classify bug Something isn't working labels Dec 24, 2024
@firestarman firestarman self-assigned this Dec 24, 2024
@firestarman firestarman reopened this Dec 24, 2024
@firestarman firestarman removed the ? - Needs Triage Need team to review and classify label Dec 25, 2024