[Inductor max autotune] Make autotune_select_algorithm more robust #124928

kadeng · 2024-04-25T11:14:12Z

Stack from ghstack (oldest at bottom):

-> [Inductor max autotune] Make autotune_select_algorithm more robust #124928

This diff ensures that a custom exception is raised when no valid
choices remain during autotuning. This allows the caller to fall back
gracefully to a default choice, even if that default choice was not
passed to autotune_select_algorithm.

Additionally, this diff handles RuntimeErrors raised during autotuning gracefully: if a problematic choice is encountered, that choice is ignored and an error is logged, but compilation of the entire model does not fail.
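
The behavior described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in torch/_inductor/select_algorithm.py: the names NoValidChoicesError, pick_fastest, and select_with_fallback, as well as the benchmark callback, are hypothetical and chosen only for this example.

```python
import logging
import math
from typing import Callable, Dict, Sequence

log = logging.getLogger(__name__)


class NoValidChoicesError(RuntimeError):
    """Hypothetical name: raised when no candidate produced a usable timing."""


def pick_fastest(choices: Sequence[str], benchmark: Callable[[str], float]) -> str:
    """Benchmark every choice, skip the ones that fail, and return the fastest."""
    timings: Dict[str, float] = {}
    for choice in choices:
        try:
            elapsed = benchmark(choice)
        except RuntimeError:
            # A failing choice is logged and ignored; it does not abort the run.
            log.exception("Skipping choice %s: benchmarking failed", choice)
            continue
        # Keep only non-negative, finite float timings.
        if isinstance(elapsed, float) and math.isfinite(elapsed) and elapsed >= 0.0:
            timings[choice] = elapsed
    if not timings:
        raise NoValidChoicesError("no valid choices remain after autotuning")
    return min(timings, key=timings.__getitem__)


def select_with_fallback(
    choices: Sequence[str], benchmark: Callable[[str], float], default: str
) -> str:
    """Caller-side fallback: the default never has to be among the choices."""
    try:
        return pick_fastest(choices, benchmark)
    except NoValidChoicesError:
        return default
```

The point of a dedicated exception type in this sketch is that the caller can distinguish "nothing was usable" from other failures and decide on the fallback itself.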

Test Plan:
CI

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @muchulee8 @ColinPeppler @amjames @desertfire @chauhang


pytorch-bot · 2024-04-25T11:14:15Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124928

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 9464a45 with merge base 8a0529e:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

inductor / cuda12.1-py3.10-gcc9-sm86 / test (dynamic_inductor_timm, 2, 2, linux.g5.4xlarge.nvidia.gpu) (gh) (similar failure)
sebotnet33ts_256

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

inductor / rocm6.1-py3.8-inductor / test (inductor, 1, 1, linux.rocm.gpu.2) (gh) (trunk failure)
test/distributed/_composable/fsdp/test_fully_shard_training.py::TestFullyShard2DTraining::test_train_parity_2d_mlp

This comment was automatically generated by Dr. CI and updates every 15 minutes.


kadeng · 2024-05-02T10:12:12Z

@int3 I noticed you wrote a diff that does something similar. I'll take a look and make sure this one is still compatible with your changes.

int3 · 2024-05-02T14:38:15Z

torch/_inductor/select_algorithm.py

+ (not isinstance(selected_time, float))
+ or (selected_time < 0.0)
+ or (not math.isfinite(selected_time))
+ or math.isnan(selected_time)


feels a little overkill (pretty sure we only generate inf and regular floats) but not a big deal
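
For reference, the four conditions quoted above can be condensed. The sketch below is an illustration rather than the code in the PR, and the helper name is_valid_timing is hypothetical. It relies on the fact that math.isfinite already returns False for NaN as well as for positive and negative infinity, so a separate math.isnan check adds nothing.

```python
import math


def is_valid_timing(selected_time: object) -> bool:
    """Return True only for non-negative, finite float timings.

    This is the negation of the quoted condition: math.isfinite rejects
    NaN and both infinities, so no explicit math.isnan test is needed.
    """
    return (
        isinstance(selected_time, float)
        and math.isfinite(selected_time)
        and selected_time >= 0.0
    )
```

Note that the isinstance check comes first, so non-float values are rejected before any numeric comparison is attempted.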

int3 · 2024-05-02T14:38:48Z

Ah I didn't see you had this in the works. Yeah, this is more or less compatible with my changes. I can add my tests after you land this.


This diff ensures that a custom exception is raised when no valid choices remain during autotuning. This allows the caller to fall back gracefully to a default choice, even if that default choice was not passed to autotune_select_algorithm.

Additionally, this diff handles RuntimeErrors raised during autotuning gracefully: if a problematic choice is encountered, that choice is ignored and an error is logged, but compilation of the entire model does not fail.

TODO:
* Add unit test
* Add an assertion that we use autotune_in_subproc when CUTLASS backend is enabled

ghstack-source-id: b092fce1684a822311c5733c13c54e41b463c03e
Pull Request resolved: #124928


int3

lgtm

kadeng · 2024-05-05T15:17:53Z

@pytorchbot merge

pytorchmergebot · 2024-05-05T15:20:59Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2024-05-05T15:21:11Z

Merge failed

Reason: 2 mandatory check(s) failed. The first few are:

pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 1, 5, linux.4xlarge.nvidia.gpu)
pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 1, 3, linux.8xlarge.nvidia.gpu)

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers


kadeng · 2024-05-05T15:33:00Z

@pytorchbot merge

pytorchmergebot · 2024-05-05T15:35:05Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

Update

1f59ade

[ghstack-poisoned]

pytorch-bot bot added ciflow/inductor module: inductor labels Apr 25, 2024

Update

93343a0

[ghstack-poisoned]

kadeng mentioned this pull request Apr 25, 2024

[Inductor cutlass backend] Fix cutlass_utils.get_max_alignment() for strided layouts. #124930

Closed

Update

fc48617

[ghstack-poisoned]

kadeng added the topic: not user facing topic category label Apr 25, 2024

Update

a356bf5

[ghstack-poisoned]

kadeng requested a review from int3 May 2, 2024 10:10

int3 reviewed May 2, 2024


Update

ee0bf3c

[ghstack-poisoned]

This was referenced May 2, 2024

[Inductor cutlass backend] Enabled nonzero workspace and Cutlass StreamK #125406

Closed

[Inductor max autotune] Minor fix for AlgorithmSelectorCache verification. #125407

Closed

Update

10924a7

[ghstack-poisoned]

Update

8a438e8

[ghstack-poisoned]

Update

adef646

[ghstack-poisoned]

Update

43fb32a

[ghstack-poisoned]

Update

190124c

[ghstack-poisoned]

Update

1fad52d

[ghstack-poisoned]

Update

74c09c9

[ghstack-poisoned]

kadeng marked this pull request as ready for review May 4, 2024 05:12

kadeng requested review from eellison, shunting314, masnesral and int3 May 4, 2024 05:14

Update

4c8c49c

[ghstack-poisoned]

int3 approved these changes May 5, 2024


pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 5, 2024

pytorchmergebot added the merging label May 5, 2024

pytorchmergebot removed the merging label May 5, 2024

Update

9464a45

[ghstack-poisoned]

pytorchmergebot added the merging label May 5, 2024

pytorchmergebot added the Merged label May 5, 2024

pytorchmergebot closed this in 94c4855 May 5, 2024

pytorchmergebot removed the merging label May 5, 2024

github-actions bot deleted the gh/kadeng/57/head branch June 5, 2024 01:56
