While training a Knet model, I was getting a "CUDNN_STATUS_EXECUTION_FAILED" error thrown by CUDA.jl. Further inspection revealed that CUDA.jl only attempts to reclaim memory when the error code is "CUDNN_STATUS_ALLOC_FAILED". This can be seen in the @check macro, which is responsible for retrying after memory reclamation and for throwing API errors (the relevant line is linked below).
Replacing that line with the following code ends up fixing my problem:
    res = @retry_reclaim err -> isequal(err, CUDNN_STATUS_ALLOC_FAILED) ||
                                isequal(err, CUDNN_STATUS_EXECUTION_FAILED) begin
        $(esc(ex))
    end
I believe ideally this should be fixed in the CUDA and CUDNN packages. They incorrectly assume that memory reclamation is only worth attempting when the error code is "CUDNN_STATUS_ALLOC_FAILED", but "CUDNN_STATUS_EXECUTION_FAILED" is also returned for failures that reclaiming memory can fix. Until the issue is fixed upstream, it also affects Knet functionality, so I think a temporary workaround could be beneficial.
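To make the retry logic concrete without needing a GPU, here is a minimal sketch of the pattern in plain Julia. It is not CUDA.jl's actual implementation: `retry_reclaim`, the `FakeStatus` enum, `retryable`, and `demo` are all hypothetical stand-ins I made up for illustration, and the real `@retry_reclaim` macro may behave differently in its details.

```julia
# Hypothetical status codes standing in for cuDNN's; not the real enum.
@enum FakeStatus ST_SUCCESS ST_ALLOC_FAILED ST_EXECUTION_FAILED

# Call `f`, and while `should_retry(status)` holds and fewer than
# `max_attempts` calls have been made, run `reclaim` and call `f` again.
function retry_reclaim(f, should_retry; reclaim = () -> nothing, max_attempts = 3)
    status = f()
    attempts = 1
    while should_retry(status) && attempts < max_attempts
        reclaim()            # stand-in for freeing cached GPU memory
        status = f()
        attempts += 1
    end
    return status
end

# Predicate mirroring the proposed change: retry on either failure code.
retryable(s) = s == ST_ALLOC_FAILED || s == ST_EXECUTION_FAILED

# Simulated call that reports ST_EXECUTION_FAILED until memory is reclaimed.
function demo()
    reclaimed = Ref(false)
    call() = reclaimed[] ? ST_SUCCESS : ST_EXECUTION_FAILED
    reclaim() = (reclaimed[] = true; nothing)
    # Narrow predicate (ALLOC_FAILED only): the failure is returned as-is.
    narrow = retry_reclaim(call, s -> s == ST_ALLOC_FAILED; reclaim = reclaim)
    # Widened predicate: memory is reclaimed and the call is retried.
    wide = retry_reclaim(call, retryable; reclaim = reclaim)
    return narrow, wide
end
```

In this simulation, `demo()` returns `(ST_EXECUTION_FAILED, ST_SUCCESS)`: with the narrow predicate the error surfaces immediately, while the widened predicate triggers one reclaim and the retry succeeds. This only illustrates the control flow, and assumes the underlying failure really is memory-related.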
Relevant line in the @check macro: https://github.com/JuliaGPU/CUDA.jl/blob/b3228085bc6bf87a0feb5885fc636f352d0e3f0e/lib/cudnn/error.jl#L28