[BUG] Intermittent crash on NDS query 96 with grace hopper cluster #11854

Open
revans2 opened this issue Dec 10, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@revans2
Collaborator

revans2 commented Dec 10, 2024

Describe the bug
In CI we have been seeing occasional failures related to NDS scale factor 3k when running on a Grace Hopper cluster. It appears to only ever crash when we are running with Parquet data that uses decimals, rather than floats, for many of the numeric types.

We need someone to go through all of the historical runs and see if we can fully understand what is happening here before we dig into a single possible explanation.

One of the odd things is that for at least a few of the runs we see errors when trying to deserialize a task.

24/12/09 08:25:06 INFO Executor: Running task 140.0 in stage 34.0 (TID 12218)
24/12/09 08:25:06 INFO TorrentBroadcast: Started reading broadcast variable 41 with 1 pieces (estimated total size 4.0 MiB)
24/12/09 08:25:06 INFO MemoryStore: Block broadcast_41_piece0 stored as bytes in memory (estimated size 35.0 KiB, free 8.4 GiB)
24/12/09 08:25:06 INFO TorrentBroadcast: Reading broadcast variable 41 took 2 ms
24/12/09 08:25:06 INFO MemoryStore: Block broadcast_41 stored as values in memory (estimated size 81.8 KiB, free 8.4 GiB)
24/12/09 08:25:06 ERROR Executor: Exception in task 140.0 in stage 34.0 (TID 12218)
java.io.IOException: unexpected exception type
	at java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1750)
	at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1280)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2222)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
	at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)
--
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230)
--
	at java.lang.invoke.CallSite.makeSite(CallSite.java:341)
	at java.lang.invoke.MethodHandleNatives.linkCallSiteImpl(MethodHandleNatives.java:307)
	at java.lang.invoke.MethodHandleNatives.linkCallSite(MethodHandleNatives.java:297)
	at org.apache.spark.sql.catalyst.InternalRow$.$deserializeLambda$(InternalRow.scala)
	... 337 more
Caused by: java.lang.NullPointerException
	at java.lang.invoke.CallSite.makeSite(CallSite.java:325)
	... 340 more
24/12/09 08:25:06 INFO CoarseGrainedExecutorBackend: Got assigned task 12226
24/12/09 08:25:06 INFO Executor: Running task 158.0 in stage 34.0 (TID 12226)
24/12/09 08:25:06 INFO TorrentBroadcast: Started reading broadcast variable 40 with 1 pieces (estimated total size 4.0 MiB)

When we zoom in on the last part of the call stack, we see:

        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:87)
        at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:129)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:87)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
        at org.apache.spark.scheduler.Task.run(Task.scala:139)

This is on Spark 3.4.3, so ShuffleMapTask.scala:87 is just deserializing the broadcast task binary into an (RDD[_], ShuffleDependency[_, _, _]) pair.
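
For context, a minimal sketch of that deserialization step. It uses Spark's public JavaSerializer (which the stack trace shows is the serializer in play), but with a stand-in payload, since building a real RDD/ShuffleDependency pair outside a running job is not practical; the tuple contents here are hypothetical.

    import java.nio.ByteBuffer
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.JavaSerializer

    object TaskBinaryRoundTrip {
      def main(args: Array[String]): Unit = {
        // ShuffleMapTask decodes the broadcast task binary with the Java-based
        // closure serializer. The payload here is a stand-in tuple; in the real
        // task it is an (RDD[_], ShuffleDependency[_, _, _]) pair.
        val ser = new JavaSerializer(new SparkConf()).newInstance()

        val payload: (String, Int => Int) = ("stand-in for RDD + ShuffleDependency", _ + 1)
        val bytes: ByteBuffer = ser.serialize(payload)

        // Mirrors the deserialize call at ShuffleMapTask.scala:87, using the
        // task thread's context class loader.
        val back = ser.deserialize[(String, Int => Int)](
          bytes, Thread.currentThread.getContextClassLoader)
        println(back._2(41)) // 42
      }
    }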

The odd part is that the task appears to pass on retry. Currently I suspect some kind of memory/network corruption, because the Grace Hopper hardware we are running on is pre-production, but the failure is not specific to a single node while it is specific to a single query, so that makes it more fun to try to debug.
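
For anyone digging into the $deserializeLambda$ / CallSite.makeSite frames above: that is the standard Java-serialization path for Scala 2.12 lambdas (Spark 3.4.x is built against Scala 2.12). A self-contained sketch that exercises the same path, purely for illustration; it is not tied to this failure.

    import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

    object LambdaRoundTrip {
      def main(args: Array[String]): Unit = {
        // Scala 2.12 lambdas are serializable: on write they are replaced with a
        // java.lang.invoke.SerializedLambda.
        val f: Int => Int = _ * 2

        val bos = new ByteArrayOutputStream()
        val oos = new ObjectOutputStream(bos)
        oos.writeObject(f)
        oos.close()

        // On read, ObjectInputStream invokes SerializedLambda.readResolve, which
        // reflectively calls the capturing class's $deserializeLambda$ method; its
        // invokedynamic call site is linked via CallSite.makeSite, the frame that
        // throws the NullPointerException in the trace above.
        val ois = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
        val back = ois.readObject().asInstanceOf[Int => Int]
        println(back(21)) // 42
      }
    }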

@revans2 revans2 added ? - Needs Triage Need team to review and classify bug Something isn't working labels Dec 10, 2024
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Dec 17, 2024