Our current Hopper matmul scheduler ignores the `warp_tile` option and only splits the CTA tile by the instruction tile. As a result, we currently fail to execute a kernel compiled with this valid config. The error we see is:
RuntimeError: INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/runtime/executor_params.cpp":30, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Selected invalid number of threads for cuda: 4224
We should instead change our scheduling to first split by the warp tile and to parallelize TIDy on the cta/warp split rather than the cta/instr split. This way, we can still use large CTA tiles together with smaller instruction tiles without launching far too many warps.
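A rough sketch of the thread-count arithmetic behind this proposal. The tile sizes below are hypothetical (the failing config is not shown above), and the 128 threads per warp group reflects wgmma being issued per warp group; the point is only that parallelizing TIDy over the cta/instr split multiplies the warp-group count by the number of instruction tiles, while parallelizing over the cta/warp split keeps it bounded by the number of warp tiles, with the warp/instr split staying serial inside each warp group:

```cpp
// Sketch only: hypothetical tile sizes, not the actual failing config.
#include <cstdio>

struct Tile {
  int m, n;
};

int main() {
  constexpr int kThreadsPerWarpGroup = 128; // wgmma is issued per warp group
  constexpr int kMaxThreadsPerCta = 1024;   // CUDA block-size limit

  Tile cta_tile{256, 256};   // hypothetical CTA tile
  Tile warp_tile{128, 256};  // hypothetical warp tile
  Tile instr_tile{64, 32};   // hypothetical instruction tile

  // Current scheme: TIDy parallelizes the cta/instr split, so every
  // instruction tile gets its own warp group.
  int groups_by_instr =
      (cta_tile.m / instr_tile.m) * (cta_tile.n / instr_tile.n);
  int threads_by_instr = groups_by_instr * kThreadsPerWarpGroup;

  // Proposed scheme: TIDy parallelizes the cta/warp split; the warp/instr
  // split becomes a serial loop within each warp group.
  int groups_by_warp =
      (cta_tile.m / warp_tile.m) * (cta_tile.n / warp_tile.n);
  int threads_by_warp = groups_by_warp * kThreadsPerWarpGroup;

  std::printf("split by instr tile: %d warp groups -> %d threads (%s)\n",
              groups_by_instr, threads_by_instr,
              threads_by_instr > kMaxThreadsPerCta ? "over the 1024 limit"
                                                   : "ok");
  std::printf("split by warp tile:  %d warp groups -> %d threads (%s)\n",
              groups_by_warp, threads_by_warp,
              threads_by_warp > kMaxThreadsPerCta ? "over the 1024 limit"
                                                  : "ok");
  return 0;
}
```

With these example sizes, splitting by the instruction tile yields 32 warp groups (4096 threads, over the limit), while splitting by the warp tile yields 2 warp groups (256 threads).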