This is happening intermittently on the Linux worker build-bots, but doesn't present itself on nearly identical drivers and hardware when testing locally.
It shows up as a segfault at process exit for the correctness tests, after the tests have run. When it happens, the Vulkan ICD function pointer chain is invalid, and any call to a Vulkan API method will segfault. If we don't clean up, then the driver itself crashes. The same symptoms appear under JIT and AOT.
It appears to be a bug in Vulkan, the NVIDIA driver, or both. Running under the validation layers and the crash detection layers doesn't reveal anything, and we never receive a device lost error, which makes it difficult to detect or handle.
I can reproduce the same jump to a bad address during NVIDIA driver finalization using llama.cpp by running multiple instances at once (offloading only a few layers to the GPU, so that the instances all fit). Our dev meeting conclusion was to just move Vulkan testing off these bots and onto a Raspberry Pi 5 (partly because #8494 shows us that this is a more useful platform to be testing on anyway).
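For anyone who wants to try the "multiple instances at once" approach against their own binary, here is a rough sketch of a harness. It is not what was actually run: the helper names `run_concurrently` and `count_segfaults` are made up for illustration, and the command to launch is whatever you pass in. It relies only on the POSIX behaviour that a child killed by a signal reports a negative return code in Python's `subprocess`.

```python
import signal
import subprocess
import sys


def run_concurrently(cmd, instances):
    """Start `instances` copies of `cmd` at the same time and return their exit codes."""
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        for _ in range(instances)
    ]
    # Wait for every child; the crash of interest happens at process exit.
    return [p.wait() for p in procs]


def count_segfaults(return_codes):
    """Count children killed by SIGSEGV (on POSIX, signal N shows up as return code -N)."""
    return sum(1 for rc in return_codes if rc == -signal.SIGSEGV)


# Hypothetical usage: python repro.py 4 ./your_vulkan_binary <its arguments>
if __name__ == "__main__" and len(sys.argv) > 2:
    instances = int(sys.argv[1])
    codes = run_concurrently(sys.argv[2:], instances)
    print(f"{count_segfaults(codes)} of {instances} instances died with SIGSEGV")
```

Since the failure is intermittent, a single clean run probably doesn't mean much; looping this a number of times is likely needed before drawing any conclusion.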