
[core][experimental] Fix bug when propagating DAG application exceptions #45237

Merged
merged 6 commits into ray-project:master from dag-errors on May 14, 2024

Conversation

stephanie-wang (Contributor)

Why are these changes needed?

If a task in the DAG raised an application-level exception, we would re-raise correctly if it was read directly by the driver, but not if it was read by another actor in the DAG. This PR fixes the issue by writing the exception to the next actor.
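For context, here is a minimal, framework-agnostic sketch of the propagate-and-skip behavior described above. All names (WrappedTaskError, run_dag_node, the queue-based channels) are hypothetical stand-ins, not Ray's actual internals: each node forwards an already-wrapped upstream exception instead of running its own task, so the error eventually reaches the driver.

import queue


class WrappedTaskError(Exception):
    """Stand-in for Ray's RayTaskError wrapper around user exceptions."""


def run_dag_node(func, input_channel: queue.Queue, output_channel: queue.Queue) -> None:
    value = input_channel.get()
    if isinstance(value, WrappedTaskError):
        # Upstream task raised an application-level exception.
        # Propagate it downstream and skip executing this node's task.
        output_channel.put(value)
        return
    try:
        output_channel.put(func(value))
    except Exception as exc:
        # Wrap once, at the node that actually failed.
        output_channel.put(WrappedTaskError(repr(exc)))


# Two-node chain: node A fails, node B skips its task and forwards the error,
# so the driver still sees the original failure at the end of the DAG.
a_in, a_out, b_out = queue.Queue(), queue.Queue(), queue.Queue()
a_in.put(42)
run_dag_node(lambda x: 1 / 0, a_in, a_out)
run_dag_node(lambda x: x + 1, a_out, b_out)
assert isinstance(b_out.get(), WrappedTaskError)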

Signed-off-by: Stephanie Wang <[email protected]>
Signed-off-by: Stephanie Wang <[email protected]>
Signed-off-by: Stephanie Wang <[email protected]>
@@ -251,7 +251,7 @@ def get_actor_id(self) -> Optional[str]:
         """
         # only worker mode has actor_id
         if self.worker.mode != ray._private.worker.WORKER_MODE:
-            logger.warning(
+            logger.debug(
stephanie-wang (Contributor Author)
Also changed this to DEBUG because it's very spammy when the driver calls this method. It seems fine to me if the driver calls it, not sure why this should be a warning.

Contributor

Looks good, thanks!

Signed-off-by: Stephanie Wang <[email protected]>

            return True
        except Exception as exc:
            # Previous task raised an application-level exception.
            # Propagate it and skip the actual task.
            output_writer.write(exc)
Contributor

Why does this not need _wrap_exception()?

stephanie-wang (Contributor Author)

I believe this is because the exception has already been wrapped by the original task that errored. Will add a comment!

Contributor

Ah, I see. It's wrapped in a RayTaskError, which is itself an exception. So exc is an instance of RayTaskError here.
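A small illustration of that point, using a hypothetical wrapper class in place of RayTaskError: because the wrapper subclasses Exception, it is caught by `except Exception as exc`, and forwarding it as-is avoids nesting a wrapper inside a wrapper.

class RayTaskErrorLike(Exception):
    """Hypothetical stand-in for RayTaskError, which is itself an Exception."""

    def __init__(self, cause: BaseException):
        super().__init__(repr(cause))
        self.cause = cause


wrapped = RayTaskErrorLike(ValueError("bad input"))  # wrapped by the task that failed

try:
    raise wrapped
except Exception as exc:
    # exc is already the wrapper, so it can be written downstream directly;
    # calling a _wrap_exception()-style helper again would only nest wrappers.
    assert isinstance(exc, RayTaskErrorLike)
    assert isinstance(exc.cause, ValueError)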

        self.count = 0

    def _fail_if_needed(self):
        if self.fail_after and self.count > self.fail_after:
Contributor

minor preference: self.count >= self.fail_after (because self.count starts at 0)
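A standalone sketch of that off-by-one, with a hypothetical, simplified version of the test helper: with `>` the helper tolerates one more call than fail_after suggests, while `>=` would fail exactly on call number fail_after.

class FailingWorker:
    """Hypothetical, simplified version of the test helper discussed above."""

    def __init__(self, fail_after: int):
        self.fail_after = fail_after
        self.count = 0

    def step(self) -> None:
        self.count += 1
        # With `>`, the first failure happens on call fail_after + 1;
        # with `>=`, it would happen on call fail_after.
        if self.fail_after and self.count > self.fail_after:
            raise RuntimeError(f"failed on call {self.count}")


worker = FailingWorker(fail_after=3)
for _ in range(3):
    worker.step()  # calls 1-3 succeed with `>`
try:
    worker.step()  # call 4 raises
except RuntimeError as exc:
    print(exc)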

    output_channels.end_read()

    with pytest.raises(RuntimeError):
        for i in range(99):
Contributor

Why loop 99 times? Shouldn't this already throw an exception on the first iteration because fail_after is set to 100?

stephanie-wang (Contributor Author)

Ah, the failures are randomized.
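In other words, any single execution may succeed; the loop gives the randomized failure enough chances to occur. A hedged sketch of the pattern (names and probabilities are illustrative, not the actual test):

import random

import pytest


def execute_dag_once(fail_probability: float = 0.05) -> int:
    # Stand-in for one DAG execution with a randomly injected failure.
    if random.random() < fail_probability:
        raise RuntimeError("injected application-level failure")
    return 1


def test_failure_eventually_surfaces():
    # Any single iteration may succeed, so the whole loop sits inside one
    # pytest.raises block; the test passes once any iteration raises. In a
    # real test the injection rate or seed is chosen so a failure is
    # effectively guaranteed within the loop.
    with pytest.raises(RuntimeError):
        for _ in range(99):
            execute_dag_once()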

        except ValueError as exc:
            # ValueError is raised if a type hint was set and the returned
            # type did not match the hint.
        except IOError:
Contributor

QQ: should we also string-match the error message, or are we certain that this IOError means the channel is closed in this case? (It could be wrong if we have a different close API.)

stephanie-wang (Contributor Author)

Hmm true, we should probably introduce a different Ray system error instead of using IOError. I'll add an issue to track this.
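For reference, one shape such a follow-up could take. This is a sketch only, with hypothetical names; the actual error type and API would come out of the tracking issue:

class ChannelClosedError(IOError):
    """Hypothetical dedicated error for reads/writes on a closed channel."""


def read_next(channel):
    try:
        return channel.begin_read()
    except ChannelClosedError:
        # Unambiguous: the channel was torn down, so stop the reader loop
        # instead of guessing what a generic IOError meant.
        return None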

stephanie-wang merged commit a3439ad into ray-project:master on May 14, 2024
6 checks passed
stephanie-wang deleted the dag-errors branch on May 14, 2024 17:56
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 6, 2024
[core][experimental] Fix bug when propagating DAG application exceptions (ray-project#45237)

If a task in the DAG raised an application-level exception, we would
re-raise correctly if it was read directly by the driver, but not if it
was read by another actor in the DAG. This PR fixes the issue by writing
the exception to the next actor.

---------

Signed-off-by: Stephanie Wang <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>