
[SPARK-50767][SQL] Remove codegen of from_json #49411

Draft
wants to merge 4 commits into master
Conversation

@panbingkun (Contributor) commented Jan 8, 2025

What changes were proposed in this pull request?

This PR aims to remove the codegen implementation of from_json.

Why are the changes needed?

Based on the discussion and the SubExprEliminationBenchmark testing in #48466 (comment), after codegen was implemented for from_json there is a performance regression in the withFilter scenario with subExprElimination = true and codegen = true.
Let's remove it for now and resubmit it once the issue above is resolved.
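
For reference, the two flags mentioned above correspond to the following SQLConf settings. This is only an illustrative sketch of how the regressing combination is switched on (assuming an active SparkSession named spark), not a change made by this PR:

  // Illustrative only: the configuration combination that shows the regression.
  spark.conf.set("spark.sql.subexpressionElimination.enabled", "true") // subExprElimination = true
  spark.conf.set("spark.sql.codegen.wholeStage", "true")               // codegen = true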

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass GA and manual testing.

Was this patch authored or co-authored using generative AI tooling?

No.

@LuciferYang (Contributor):
Could you update the test results for SubExprEliminationBenchmark in this PR to ensure that the changes are as expected? @panbingkun

@panbingkun (Contributor, author) commented Jan 8, 2025

> Could you update the test results for SubExprEliminationBenchmark in this PR to ensure that the changes are as expected? @panbingkun

Well, okay.

JDK17: https://github.com/panbingkun/spark/actions/runs/12666974432
JDK21: https://github.com/panbingkun/spark/actions/runs/12666979089

@dongjoon-hyun (Member) commented Jan 8, 2025

+1 for including SubExprEliminationBenchmark here. All the others are covered in the following.

@cloud-fan (Contributor):
Does it mean that any time we add codegen support for a function, there is a risk of perf regression? Are we sure from_json is the only one, or is it simply because SubExprEliminationBenchmark tests from_json?

@panbingkun (Contributor, author):
> Does it mean that any time we add codegen support for a function, there is a risk of perf regression? Are we sure from_json is the only one, or is it simply because SubExprEliminationBenchmark tests from_json?

  • Of the two scenarios tested in SubExprEliminationBenchmark, only the filter scenario shows a perf regression.
  • I can test other functions, for example ones whose codegen support has existed for a long time.

@LuciferYang (Contributor):
@panbingkun After giving it some more thought, we could try enabling -Xlog:compilation to test the corresponding case. Is it possible that the filter in this test case generates an extremely large method once codegen is enabled, which in turn hurts JIT compilation?

If it weren't a || filter involving all fields, would the impact be less severe?

@LuciferYang (Contributor):
I found the following content in the log:

19:22:37.212 main INFO CodeGenerator: Generated method too long to be JIT compiled: org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext is 58762 bytes
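
This warning refers to HotSpot's JIT limit of 8000 bytecode bytes per method. As a side note (using an option that already exists, not something this PR changes), whole-stage codegen can be made to fall back to the interpreted path for such oversized methods by lowering spark.sql.codegen.hugeMethodLimit from its default of 65535 to that JIT limit, assuming an active SparkSession named spark:

  // Illustrative only: fall back to interpreted execution when the compiled
  // method would be too large for the JIT (> 8000 bytecode bytes).
  spark.conf.set("spark.sql.codegen.hugeMethodLimit", "8000")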

@panbingkun (Contributor, author) commented Jan 8, 2025

  • case 1 (3 filter conditions with a data size of 1,000,000)
object FromJsonBenchmark extends SqlBasedBenchmark {
  import spark.implicits._
  // needed for from_json and lit below
  import org.apache.spark.sql.functions.{from_json, lit}

  def withFilter(rowsNum: Int, numIters: Int): Unit = {
    val benchmark = new Benchmark("from_json in Filter", rowsNum, output = output)

    withTempPath { path =>
      prepareDataInfo(benchmark)
      val numCols = 500
      val schema = writeWideRow(path.getAbsolutePath, rowsNum, numCols)

      val jsonValue = from_json($"value", schema)
      val predicate = jsonValue.getField(s"col0") >= lit(100000) ||
        jsonValue.getField(s"col50") >= lit(100000) ||
        jsonValue.getField(s"col123") >= lit(100000)

      val caseName = s"from_object, codegen: no"
      benchmark.addCase(caseName, numIters) { _ =>
        val df = spark.read
          .text(path.getAbsolutePath)
          .where(predicate)
        df.write.mode("overwrite").format("noop").save()
      }
      benchmark.run()
    }
  }

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    val numIters = 3
    runBenchmark("Benchmark for performance of from_json codegen") {
      withFilter(1_000_000, numIters)
    }
  }
}
  • codegen for from_json
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: yes                          61029          62195        1781          0.0       61028.8       1.0X

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: yes                          61391          66201        7157          0.0       61391.2       1.0X

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: yes                          60653          61195         481          0.0       60652.7       1.0X
  • non-codegen for from_json
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: no                         61289          62508        1155          0.0       61288.6       1.0X

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: no                          61219          61663         386          0.0       61218.9       1.0X

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: no                          61056          61362         287          0.0       61055.9       1.0X

@panbingkun (Contributor, author) commented Jan 8, 2025

  • case 2 (1 filter condition with a data size of 100,000)
object FromJsonBenchmark extends SqlBasedBenchmark {
  import spark.implicits._
  // needed for from_json and lit below
  import org.apache.spark.sql.functions.{from_json, lit}

  def withFilter(rowsNum: Int, numIters: Int): Unit = {
    val benchmark = new Benchmark("from_json in Filter", rowsNum, output = output)

    withTempPath { path =>
      prepareDataInfo(benchmark)
      val numCols = 500
      val schema = writeWideRow(path.getAbsolutePath, rowsNum, numCols)

      val jsonValue = from_json($"value", schema)
      val predicate = jsonValue.getField(s"col0") >= lit(100000)

      val caseName = s"from_object, codegen: no"
      benchmark.addCase(caseName, numIters) { _ =>
        val df = spark.read
          .text(path.getAbsolutePath)
          .where(predicate)
        df.write.mode("overwrite").format("noop").save()
      }
      benchmark.run()
    }
  }

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    val numIters = 3
    runBenchmark("Benchmark for performance of from_json codegen") {
      withFilter(100_000, numIters)
    }
  }
}
  • codegen for from_json
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: yes                          2325           2341          26          0.0       23249.6       1.0X

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: yes                          2237           2266          36          0.0       22373.5       1.0X

OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: yes                          2317           2403          74          0.0       23172.3       1.0X
  • non-codegen for from_json
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: no                           2264           2286          20          0.0       22639.3       1.0X


OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: no                           2475           3010         554          0.0       24752.2       1.0X


OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: no                           2315           2780         480          0.0       23150.6       1.0X

@panbingkun (Contributor, author):
From the scenarios above, it seems there is no performance regression here, so I am investigating other causes.

@LuciferYang (Contributor):
before: code-gen-before.txt
after: after-code-gen.txt

It can be seen that after enabling codegen for from_json, a huge processNext method is generated for the filter. I suspect this method is the reason the JIT cannot optimize it.
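
For reference, one way to dump the generated code and inspect that processNext method (an assumption about tooling, not necessarily how the attached files were produced) is Spark's debug helper, assuming the same path and predicate as in the benchmark code above:

  import org.apache.spark.sql.execution.debug._

  val df = spark.read.text(path.getAbsolutePath).where(predicate)
  // Prints the generated Java source of each WholeStageCodegen subtree,
  // including the filter's processNext method.
  df.debugCodegen()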

@LuciferYang (Contributor):
@cloud-fan @dongjoon-hyun @panbingkun How do we proceed with this issue?

I think the risk of generating huge filter methods has always existed, but it was hidden in this benchmark because from_json did not support code generation previously. So I believe codegen support in from_json is not the root cause.

As more functions gain code generation support, the probability of generating huge methods will increase. This should apply to more than just filters, right?

Perhaps we need to find a more general approach to splitting the generated methods in order to avoid this risk?

@panbingkun (Contributor, author):
Give me some time to look at the root cause.

@panbingkun (Contributor, author) commented Jan 9, 2025

In the withFilter scenario of SubExprEliminationBenchmark, the root cause is as follows:

  val df = spark.read
    .text(path.getAbsolutePath)
    .where(predicate)
  df.write.mode("overwrite").format("noop").save()

With the interpreted path, subexpression elimination ultimately reduces the 500 references to from_json to a single call.

With whole-stage codegen, however, FilterExec applies no subexpression-elimination optimization, so all 500 calls are ultimately applied to JsonToStructs.
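
To make the 500 references concrete, here is an illustrative sketch (the column names and sizes mirror the benchmark above, but the snippet itself is only an example) of a predicate that reuses the same from_json result many times. With subexpression elimination the JSON is parsed once per row; without it, JsonToStructs is evaluated once per reference:

  import org.apache.spark.sql.functions.{col, from_json, lit}
  import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

  val schema = StructType((0 until 500).map(i => StructField(s"col$i", IntegerType)))
  val parsed = from_json(col("value"), schema)
  // 500 comparisons, each re-referencing the same parsed struct.
  val predicate = (0 until 500)
    .map(i => parsed.getField(s"col$i") >= lit(100000))
    .reduce(_ || _)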

@panbingkun (Contributor, author) commented Jan 9, 2025

If we could implement subexpression-elimination optimization in FilterExec.doConsume, as ProjectExec.doConsume already does, that would be great:

override def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: ExprCode): String = {
  val exprs = bindReferences[Expression](projectList, child.output)
  val (subExprsCode, resultVars, localValInputs) = if (conf.subexpressionEliminationEnabled) {
    // subexpression elimination
    val subExprs = ctx.subexpressionEliminationForWholeStageCodegen(exprs)
    val genVars = ctx.withSubExprEliminationExprs(subExprs.states) {
      exprs.map(_.genCode(ctx))
    }
    (ctx.evaluateSubExprEliminationState(subExprs.states.values), genVars,
      subExprs.exprCodesNeedEvaluate)
  } else {
    ("", exprs.map(_.genCode(ctx)), Seq.empty)
  }
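
A rough sketch (hypothetical, not actual Spark code) of how the same pattern could be applied in FilterExec.doConsume, reusing only the helpers already shown in the excerpt above; splitting the generated predicates and null handling would still need to be worked out:

  // Hypothetical sketch: bind the filter condition and eliminate common
  // subexpressions before generating its evaluation code, as ProjectExec does.
  val boundCondition = bindReferences[Expression](Seq(condition), child.output)
  val (subExprsCode, conditionCode) = if (conf.subexpressionEliminationEnabled) {
    val subExprs = ctx.subexpressionEliminationForWholeStageCodegen(boundCondition)
    val gen = ctx.withSubExprEliminationExprs(subExprs.states) {
      boundCondition.map(_.genCode(ctx))
    }
    (ctx.evaluateSubExprEliminationState(subExprs.states.values), gen)
  } else {
    ("", boundCondition.map(_.genCode(ctx)))
  }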

cc @cloud-fan

@panbingkun (Contributor, author):
> Does it mean that any time we add codegen support for a function, there is a risk of perf regression? Are we sure from_json is the only one, or is it simply because SubExprEliminationBenchmark tests from_json?

So the answer to this question is:

In the SubExprEliminationBenchmark scenario, any expression that implements codegen will hit this issue, not just from_json, because FilterExec handles subexpression elimination inconsistently between its codegen and interpreted paths.

@cloud-fan (Contributor):
@panbingkun great investigation! +1 to implement subexpression elimination for FilterExec

@LuciferYang (Contributor):
> @panbingkun great investigation! +1 to implement subexpression elimination for FilterExec

So this feature doesn't need to be reverted, right?
