
[SPARK-50767][SQL] Remove codegen of from_json #49411

Draft
wants to merge 4 commits into master
Conversation

@panbingkun (Contributor) commented Jan 8, 2025

What changes were proposed in this pull request?

This PR aims to remove the codegen implementation of from_json.

Why are the changes needed?

Based on the discussion and the SubExprEliminationBenchmark testing in #48466 (comment), after codegen was implemented for from_json there is a performance regression in the withFilter scenario with subExprElimination = true and codegen = true.
Let's remove it for now and resubmit it once the issue above is resolved.
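
For reference, the two flags mentioned above correspond to the following SQLConf settings. This is only an illustrative sketch of how the regressing combination is switched on (assuming an active SparkSession named spark), not a change made by this PR:

  // Illustrative only: the configuration combination that shows the regression.
  spark.conf.set("spark.sql.subexpressionElimination.enabled", "true") // subExprElimination = true
  spark.conf.set("spark.sql.codegen.wholeStage", "true")               // codegen = true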

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass GA and manual testing.

Was this patch authored or co-authored using generative AI tooling?

No.

@LuciferYang (Contributor):
Could you update the test results for SubExprEliminationBenchmark in this PR to ensure that the changes are as expected? @panbingkun

@panbingkun (Contributor, author) commented Jan 8, 2025

> Could you update the test results for SubExprEliminationBenchmark in this PR to ensure that the changes are as expected? @panbingkun

Well, okay.

JDK17: https://github.com/panbingkun/spark/actions/runs/12666974432
JDK21: https://github.com/panbingkun/spark/actions/runs/12666979089

@dongjoon-hyun (Member) commented Jan 8, 2025

+1 for including SubExprEliminationBenchmark here. All the others are covered in the following.

@cloud-fan (Contributor):
Does it mean that any time we add codegen support for a function, there is a risk of perf regression? Are we sure from_json is the only one, or is it simply because SubExprEliminationBenchmark tests from_json?

@panbingkun (Contributor, author):
> Does it mean that any time we add codegen support for a function, there is a risk of perf regression? Are we sure from_json is the only one, or is it simply because SubExprEliminationBenchmark tests from_json?

  • Of the two scenarios tested in SubExprEliminationBenchmark, only the filter scenario shows a perf regression.
  • I can test other functions, for example ones whose codegen support has existed for a long time.

@LuciferYang (Contributor):
@panbingkun After giving it some more thought, we could try enabling -Xlog:compilation to test the corresponding case. Is it possible that the filter in this test case generates an extremely large method once codegen is enabled, which in turn hurts JIT compilation?

If it weren't a || filter involving all fields, would the impact be less severe?

@LuciferYang (Contributor):
I found the following content in the log:

19:22:37.212 main INFO CodeGenerator: Generated method too long to be JIT compiled: org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext is 58762 bytes
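
This warning refers to HotSpot's JIT limit of 8000 bytecode bytes per method. As a side note (using an option that already exists, not something this PR changes), whole-stage codegen can be made to fall back to the interpreted path for such oversized methods by lowering spark.sql.codegen.hugeMethodLimit from its default of 65535 to that JIT limit, assuming an active SparkSession named spark:

  // Illustrative only: fall back to interpreted execution when the compiled
  // method would be too large for the JIT (> 8000 bytecode bytes).
  spark.conf.set("spark.sql.codegen.hugeMethodLimit", "8000")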

@panbingkun (Contributor, author) commented Jan 8, 2025

  • case 1 (3 filter conditions with a data size of 1,000,000)
object FromJsonBenchmark extends SqlBasedBenchmark {
  import spark.implicits._
  // needed for from_json and lit below
  import org.apache.spark.sql.functions.{from_json, lit}

  def withFilter(rowsNum: Int, numIters: Int): Unit = {
    val benchmark = new Benchmark("from_json in Filter", rowsNum, output = output)

    withTempPath { path =>
      prepareDataInfo(benchmark)
      val numCols = 500
      val schema = writeWideRow(path.getAbsolutePath, rowsNum, numCols)

      val jsonValue = from_json($"value", schema)
      val predicate = jsonValue.getField(s"col0") >= lit(100000) ||
        jsonValue.getField(s"col50") >= lit(100000) ||
        jsonValue.getField(s"col123") >= lit(100000)

      val caseName = s"from_object, codegen: no"
      benchmark.addCase(caseName, numIters) { _ =>
        val df = spark.read
          .text(path.getAbsolutePath)
          .where(predicate)
        df.write.mode("overwrite").format("noop").save()
      }
      benchmark.run()
    }
  }

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    val numIters = 3
    runBenchmark("Benchmark for performance of from_json codegen") {
      withFilter(1_000_000, numIters)
    }
  }
}
  • codegen for from_json
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: yes                          61029          62195        1781          0.0       61028.8       1.0X

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: yes                          61391          66201        7157          0.0       61391.2       1.0X

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: yes                          60653          61195         481          0.0       60652.7       1.0X
  • non-codegen for from_json
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: no                         61289          62508        1155          0.0       61288.6       1.0X

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: no                          61219          61663         386          0.0       61218.9       1.0X

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: no                          61056          61362         287          0.0       61055.9       1.0X

@panbingkun (Contributor, author) commented Jan 8, 2025

  • case 2 (1 filter condition with a data size of 100,000)
object FromJsonBenchmark extends SqlBasedBenchmark {
  import spark.implicits._
  // needed for from_json and lit below
  import org.apache.spark.sql.functions.{from_json, lit}

  def withFilter(rowsNum: Int, numIters: Int): Unit = {
    val benchmark = new Benchmark("from_json in Filter", rowsNum, output = output)

    withTempPath { path =>
      prepareDataInfo(benchmark)
      val numCols = 500
      val schema = writeWideRow(path.getAbsolutePath, rowsNum, numCols)

      val jsonValue = from_json($"value", schema)
      val predicate = jsonValue.getField(s"col0") >= lit(100000)

      val caseName = s"from_object, codegen: no"
      benchmark.addCase(caseName, numIters) { _ =>
        val df = spark.read
          .text(path.getAbsolutePath)
          .where(predicate)
        df.write.mode("overwrite").format("noop").save()
      }
      benchmark.run()
    }
  }

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    val numIters = 3
    runBenchmark("Benchmark for performance of from_json codegen") {
      withFilter(100_000, numIters)
    }
  }
}
  • codegen for from_json
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: yes                          2325           2341          26          0.0       23249.6       1.0X

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: yes                          2237           2266          36          0.0       22373.5       1.0X

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: yes                          2317           2403          74          0.0       23172.3       1.0X
  • non-codegen for from_json
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: no                           2264           2286          20          0.0       22639.3       1.0X


OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: no                           2475           3010         554          0.0       24752.2       1.0X


OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: no                           2315           2780         480          0.0       23150.6       1.0X

@panbingkun (Contributor, author):
From the scenarios above, it seems there is no performance regression here, so I am investigating other causes.

@LuciferYang (Contributor):
before: code-gen-before.txt
after: after-code-gen.txt

It can be seen that after enabling codegen for from_json, a huge processNext method is generated for the filter. I suspect this method is the reason the JIT cannot optimize it.
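
For reference, one way to dump the generated code and inspect that processNext method (an assumption about tooling, not necessarily how the attached files were produced) is Spark's debug helper, assuming the same path and predicate as in the benchmark code above:

  import org.apache.spark.sql.execution.debug._

  val df = spark.read.text(path.getAbsolutePath).where(predicate)
  // Prints the generated Java source of each WholeStageCodegen subtree,
  // including the filter's processNext method.
  df.debugCodegen()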

@LuciferYang (Contributor):
@cloud-fan @dongjoon-hyun @panbingkun How do we proceed with this issue?

I think the risk of generating huge filter methods has always existed, but it was hidden in this benchmark because from_json did not support code generation previously. So I believe codegen support in from_json is not the root cause.

As more functions gain code generation support, the probability of generating huge methods will increase. This should apply to more than just filters, right?

Perhaps we need to find a more general approach to splitting the generated methods in order to avoid this risk?

@panbingkun (Contributor, author):
Give me some time to look at the root cause.

@panbingkun (Contributor, author) commented Jan 9, 2025

In the withFilter scenario of SubExprEliminationBenchmark, the root cause is as follows:

  val df = spark.read
    .text(path.getAbsolutePath)
    .where(predicate)
  df.write.mode("overwrite").format("noop").save()

With the interpreted path, subexpression elimination ultimately reduces the 500 references to from_json to a single call.

With whole-stage codegen, however, FilterExec applies no subexpression-elimination optimization, so all 500 calls are ultimately applied to JsonToStructs.
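
To make the 500 references concrete, here is an illustrative sketch (the column names and sizes mirror the benchmark above, but the snippet itself is only an example) of a predicate that reuses the same from_json result many times. With subexpression elimination the JSON is parsed once per row; without it, JsonToStructs is evaluated once per reference:

  import org.apache.spark.sql.functions.{col, from_json, lit}
  import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

  val schema = StructType((0 until 500).map(i => StructField(s"col$i", IntegerType)))
  val parsed = from_json(col("value"), schema)
  // 500 comparisons, each re-referencing the same parsed struct.
  val predicate = (0 until 500)
    .map(i => parsed.getField(s"col$i") >= lit(100000))
    .reduce(_ || _)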

@panbingkun (Contributor, author) commented Jan 9, 2025

If we could implement subexpression-elimination optimization in FilterExec.doConsume, as ProjectExec.doConsume already does, that would be great:

override def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: ExprCode): String = {
  val exprs = bindReferences[Expression](projectList, child.output)
  val (subExprsCode, resultVars, localValInputs) = if (conf.subexpressionEliminationEnabled) {
    // subexpression elimination
    val subExprs = ctx.subexpressionEliminationForWholeStageCodegen(exprs)
    val genVars = ctx.withSubExprEliminationExprs(subExprs.states) {
      exprs.map(_.genCode(ctx))
    }
    (ctx.evaluateSubExprEliminationState(subExprs.states.values), genVars,
      subExprs.exprCodesNeedEvaluate)
  } else {
    ("", exprs.map(_.genCode(ctx)), Seq.empty)
  }
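
A rough sketch (hypothetical, not actual Spark code) of how the same pattern could be applied in FilterExec.doConsume, reusing only the helpers already shown in the excerpt above; splitting the generated predicates and null handling would still need to be worked out:

  // Hypothetical sketch: bind the filter condition and eliminate common
  // subexpressions before generating its evaluation code, as ProjectExec does.
  val boundCondition = bindReferences[Expression](Seq(condition), child.output)
  val (subExprsCode, conditionCode) = if (conf.subexpressionEliminationEnabled) {
    val subExprs = ctx.subexpressionEliminationForWholeStageCodegen(boundCondition)
    val gen = ctx.withSubExprEliminationExprs(subExprs.states) {
      boundCondition.map(_.genCode(ctx))
    }
    (ctx.evaluateSubExprEliminationState(subExprs.states.values), gen)
  } else {
    ("", boundCondition.map(_.genCode(ctx)))
  }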

cc @cloud-fan

@panbingkun (Contributor, author):
> Does it mean that any time we add codegen support for a function, there is a risk of perf regression? Are we sure from_json is the only one, or is it simply because SubExprEliminationBenchmark tests from_json?

So the answer to this question is:

In the SubExprEliminationBenchmark scenario, any expression that implements codegen will hit this issue, not just from_json, because FilterExec handles subexpression elimination inconsistently between its codegen and interpreted paths.

@cloud-fan (Contributor):
@panbingkun great investigation! +1 to implement subexpression elimination for FilterExec

@LuciferYang (Contributor):
> @panbingkun great investigation! +1 to implement subexpression elimination for FilterExec

So this feature doesn't need to be reverted, right?
