Describe the bug
In GpuAggregateExec we can re-partition data if it is too large to fit on the GPU. But if we get unlucky and the hashes skew into too few buckets, we might need to partition the data again. Currently this is done by updating the hash seed and trying again.
Some recent changes (https://github.com/NVIDIA/spark-rapids/pull/11792/files) removed the limit on the number of repartitions that we can do. But the warning is printed out by some cryptic code: `if (hashSeed + 7 > 200)`.
We should have the hash seed only be a hash seed and not need to carry information about how many times a repartition has happened. We should also have a limit on the number of repartitions that we do, just so if something bad happens we don't get into a live-lock situation. That limit can be huge, like 20, and we can have a separate, lower limit for logging a warning, hopefully with more human-readable code.
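A minimal sketch of the kind of separation I mean, in plain Scala. Everything here (`RepartitionSketch`, `fitsOnGpu`, `repartition`, `logWarning`, the base seed of 42, and the specific limit values) is hypothetical and not the actual GpuAggregateExec code; it only illustrates an attempt counter that is separate from the hash seed, a large hard limit, and a lower warning limit:

```scala
// Hypothetical sketch, not the spark-rapids implementation.
object RepartitionSketch {
  // Hard stop so a pathological hash skew cannot live-lock the task.
  val HardLimit: Int = 20
  // Warn well before the hard limit so skew problems show up in the logs.
  val WarnLimit: Int = 5

  def repartitionUntilFits[T](
      batch: T,
      fitsOnGpu: T => Boolean,          // stand-in for the "too large for the GPU" check
      repartition: (T, Int) => T,       // stand-in for the repartition-by-seed call
      logWarning: String => Unit): T = {
    var current = batch
    var attempt = 0
    // The attempt counter, not the seed, carries the "how many times" information.
    while (!fitsOnGpu(current)) {
      attempt += 1
      if (attempt > HardLimit) {
        throw new IllegalStateException(
          s"Giving up after $attempt repartition attempts; the data is too skewed to fit on the GPU")
      }
      if (attempt == WarnLimit) {
        logWarning(s"Repartitioned $attempt times and the data still does not fit on the GPU; " +
          "the hash distribution may be heavily skewed")
      }
      // Derive a fresh seed per attempt; the seed itself is only a seed.
      val seed = 42 + attempt
      current = repartition(current, seed)
    }
    current
  }
}
```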