
[SPARK-50783] Canonicalize JVM profiler results file name and layout on DFS #49440

Closed

Conversation

@pan3793 pan3793 commented Jan 10, 2025

What changes were proposed in this pull request?

This PR canonicalizes the name and layout of the profiling result files that the JVM profiler (added in SPARK-46094) writes to DFS:

dfsDir/{{APP_ID}}/profile-exec-{{EXECUTOR_ID}}.jfr

which largely follows the event log file name pattern and layout.
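For illustration, a minimal sketch (not the PR's exact code) of assembling that path; `dfsDir`, `appId`, and `executorId` are assumed to be available from the executor plugin context:

import org.apache.hadoop.fs.Path

// Builds dfsDir/{{APP_ID}}/profile-exec-{{EXECUTOR_ID}}.jfr
def resultPath(dfsDir: String, appId: String, executorId: String): Path =
  new Path(new Path(dfsDir, appId), s"profile-exec-$executorId.jfr")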

Why are the changes needed?

According to #44021 (comment), we can integrate the profiling results with Spark UI (both live and history) in the future, so it's good to follow the event logs file name pattern and layout as much as possible.

Does this PR introduce any user-facing change?

No, it's an unreleased feature.

How was this patch tested?

$ bin/spark-submit run-example \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.plugins=org.apache.spark.executor.profiler.ExecutorProfilerPlugin \
  --conf spark.executor.profiling.enabled=true \
  --conf spark.executor.profiling.dfsDir=hdfs:///spark-profiling \
  --conf spark.executor.profiling.fraction=1 \
  SparkPi 100000
hadoop@spark-dev1:~/spark$ hadoop fs -ls /spark-profiling/
Found 1 items
drwxrwx---   - hadoop supergroup          0 2025-01-13 10:29 /spark-profiling/application_1736320707252_0023_1
hadoop@spark-dev1:~/spark$ hadoop fs -ls /spark-profiling/application_1736320707252_0023_1
Found 48 items
-rw-rw----   3 hadoop supergroup    5255028 2025-01-13 10:29 /spark-profiling/application_1736320707252_0023_1/profile-exec-1.jfr
-rw-rw----   3 hadoop supergroup    3840775 2025-01-13 10:29 /spark-profiling/application_1736320707252_0023_1/profile-exec-10.jfr
-rw-rw----   3 hadoop supergroup    3889002 2025-01-13 10:29 /spark-profiling/application_1736320707252_0023_1/profile-exec-11.jfr
-rw-rw----   3 hadoop supergroup    3570697 2025-01-13 10:29 /spark-profiling/application_1736320707252_0023_1/profile-exec-12.jfr
...

Was this patch authored or co-authored using generative AI tooling?

No.

private val appId = try {
  conf.getAppId
} catch {
  case _: NoSuchElementException => "local-" + System.currentTimeMillis
}
pan3793 (Member Author):

curiosity, is this possible?

parthchandra (Contributor):

I remember getting this error when developing this feature. The app id had not been generated yet when the profiler was being initialized. I don't know if we might still get this, but it's safer this way.


private val PROFILER_FOLDER_PERMISSIONS = new FsPermission(Integer.parseInt("770", 8).toShort)
private val PROFILER_FILE_PERMISSIONS = new FsPermission(Integer.parseInt("660", 8).toShort)
pan3793 (Member Author):

This follows the event log behavior, to allow the SHS process to read the files. We may do the following things in the future (see the sketch after this list):

  • support downloading profiling results from SHS, like we have done for event logs
  • support integration with History UI
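
For readers unfamiliar with the octal literals in the snippet above, a minimal decoding (same values, illustrative variable names):

import org.apache.hadoop.fs.permission.FsPermission

// 770 -> rwxrwx---: owner and group get full access to the per-app directory,
// so an SHS process running under the same group can list and delete it.
val folderPermissions = new FsPermission(Integer.parseInt("770", 8).toShort)
// 660 -> rw-rw----: owner and group can read/write each .jfr result file.
val filePermissions = new FsPermission(Integer.parseInt("660", 8).toShort)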

parthchandra (Contributor):

+1, thank you.
It would be really nice if you take up the integration with the History UI.

@@ -2954,6 +2954,15 @@ private[spark] object Utils
     str.replaceAll("[ :/]", "-").replaceAll("[.${}'\"]", "_").toLowerCase(Locale.ROOT)
   }

+  def nameForAppAndAttempt(appId: String, appAttemptId: Option[String]): String = {
pan3793 (Member Author):

this can be reused in several places, for example #42575 (comment)
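
For context, a sketch of what such a helper plausibly looks like, modeled on the event log naming convention and the `sanitizeDirName` shown in the hunk above (illustrative, not necessarily the exact merged code):

def nameForAppAndAttempt(appId: String, appAttemptId: Option[String]): String = {
  // Sanitize both ids the way event log file names do, joining the app id
  // and the attempt id (present in YARN cluster mode) with an underscore.
  val base = sanitizeDirName(appId)
  appAttemptId match {
    case Some(attemptId) => base + "_" + sanitizeDirName(attemptId)
    case None => base
  }
}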

@@ -54,7 +54,7 @@ Then enable the profiling in the configuration.
   <td><code>spark.executor.profiling.dfsDir</code></td>
   <td>(none)</td>
   <td>
-    An HDFS compatible path to which the profiler's output files are copied. The output files will be written as <i>dfsDir/application_id/profile-appname-exec-executor_id.jfr</i> <br/>
+    An HDFS compatible path to which the profiler's output files are copied. The output files will be written as <i>dfsDir/{{APP_ID}}/profile-{{APP_ID}}-exec-{{EXECUTOR_ID}}.jfr</i> <br/>
pan3793 (Member Author):

actually, {{APP_ID}} is nameForAppAndAttempt (see code) for YARN cluster mode, but writing behavior details in the configuration description is a little bit verbose ...

please let me know if you have a better suggestion to polish this sentence.

parthchandra (Contributor):

I think this is fine.

@@ -72,7 +72,7 @@ Then enable the profiling in the configuration.
   <td>event=wall,interval=10ms,alloc=2m,lock=10ms,chunktime=300s</td>
   <td>
     Options to pass to the profiler. Detailed options are documented in the comments here:
-    <a href="https://github.com/async-profiler/async-profiler/blob/32601bccd9e49adda9510a2ed79d142ac6ef0ff9/src/arguments.cpp#L52">Profiler arguments</a>.
+    <a href="https://github.com/async-profiler/async-profiler/blob/v3.0/src/arguments.cpp#L44">Profiler arguments</a>.
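
As an aside, the default option string in this row decodes roughly as follows (semantics paraphrased from async-profiler's documented arguments; treat this as a summary, not an authoritative reference):

// event=wall        sample wall-clock time instead of CPU time
// interval=10ms     sampling interval
// alloc=2m          also sample allocations, roughly every 2 MB allocated
// lock=10ms         also record locks contended for longer than 10 ms
// chunktime=300s    start a new JFR chunk every 300 seconds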
pan3793 (Member Author):

prefer using a tag (rather than a commit hash) for consistency

@parthchandra parthchandra (Contributor) left a comment:

lgtm

@dongjoon-hyun dongjoon-hyun (Member) left a comment:

If we don't use APP_NAME in the new file name, can we avoid the repetition like the following?

- dfsDir/{{APP_ID}}/profile-{{APP_ID}}-exec-{{EXECUTOR_ID}}.jfr
+ dfsDir/{{APP_ID}}/profile-exec-{{EXECUTOR_ID}}.jfr

pan3793 commented Jan 13, 2025

@dongjoon-hyun thanks for the suggestion, addressed in e5aa660; I re-verified and updated the PR description.

val profilerDirForAppPath = new Path(profilerDirForApp)
if (!fs.exists(profilerDirForAppPath)) {
  // SPARK-30860: use the class method to avoid the umask causing permission issues
  FileSystem.mkdirs(fs, profilerDirForAppPath, PROFILER_FOLDER_PERMISSIONS)
}
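
Background on the SPARK-30860 reference (a property of the Hadoop FileSystem API, stated here as a reader aid): the instance method `fs.mkdirs(path, perm)` filters the requested permission through the process umask, whereas the static helper sets the permission explicitly afterwards, roughly:

// What FileSystem.mkdirs(fs, dir, permission) does, in effect: create the
// directory first, then set the requested permission explicitly, so the
// umask cannot mask away the group bits that the SHS relies on.
val created = fs.mkdirs(dir)
fs.setPermission(dir, permission)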
dongjoon-hyun (Member):

Thank you. This is a new improvement, isn't it, @pan3793 ?

pan3793 (Member Author):

yes, the change grants the SHS permission to read/delete the folder and files

@dongjoon-hyun dongjoon-hyun (Member) left a comment:

Thank you, @pan3793 .

The PR looks good to me. As a final piece, let's spin off the changes to the following two files, because I agree with you that it's a useful utility function and it's irrelevant to the JVM Profiler stuff. So we should do that refactoring first, before this PR, @pan3793.

core/src/main/scala/org/apache/spark/deploy/history/EventLogFileWriters.scala
core/src/main/scala/org/apache/spark/util/Utils.scala

@dongjoon-hyun (Member):

Please ping me on your spin-off PR. I can merge your new PR swiftly.

dongjoon-hyun pushed a commit that referenced this pull request Jan 14, 2025
…ils`

### What changes were proposed in this pull request?

Pure refactor, move method `nameForAppAndAttempt` from `EventLogFileWriter` to `o.a.s.u.Utils`.

### Why are the changes needed?

The method could be reused in several other places, e.g. #49440

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GHA.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49476 from pan3793/SPARK-50805.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun (Member):

I merged the spun-off PR, @pan3793. Could you rebase this onto master?

@pan3793 pan3793 changed the title [SPARK-50783][CORE] Canonicalize JVM profiler results file name and layout on DFS [SPARK-50783] Canonicalize JVM profiler results file name and layout on DFS Jan 14, 2025

pan3793 commented Jan 14, 2025

@dongjoon-hyun thanks, rebased

@dongjoon-hyun (Member):

Thank you!

@dongjoon-hyun (Member):

Since this is a subset of the previously reviewed state, I manually tested the compilation.

Merged to master for Apache Spark 4.0.0.

Thank you, @pan3793 and @parthchandra .

SteNicholas added a commit to apache/celeborn that referenced this pull request Jan 21, 2025
### What changes were proposed in this pull request?

Bump ap-loader version from 3.0-8 to 3.0-9.

### Why are the changes needed?

ap-loader has already released v3.0-9, so the version used for `JVMProfiler` should be bumped from 3.0-8.

Backport:

1. apache/spark#46402
2. apache/spark#49440

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3072 from SteNicholas/CELEBORN-1842.

Authored-by: SteNicholas <[email protected]>
Signed-off-by: SteNicholas <[email protected]>