Q.fill() improvements fail on gen12 #13787

konradkusiak97 · 2024-05-15T09:09:25Z

Describe the bug

When trying to make implementation of Q.fill() to use piextUSMEnqueueFill in this PR: #13788, the sycl/test-e2e/out_of_order_queue_status.cpp failed on the post-commit CI, on gen12:

FAIL: SYCL :: Basic/out_of_order_queue_status.cpp (293 of 2065)
******************** TEST 'SYCL :: Basic/out_of_order_queue_status.cpp' FAILED ********************
Exit Code: -11

Command Output (stdout):
--
# RUN: at line 1
/__w/llvm/llvm/toolchain/bin//clang++   -fsycl -fsycl-targets=spir64  /__w/llvm/llvm/llvm/sycl/test-e2e/Basic/out_of_order_queue_status.cpp -o /__w/llvm/llvm/build-e2e/Basic/Output/out_of_order_queue_status.cpp.tmp.out
# executed command: /__w/llvm/llvm/toolchain/bin//clang++ -fsycl -fsycl-targets=spir64 /__w/llvm/llvm/llvm/sycl/test-e2e/Basic/out_of_order_queue_status.cpp -o /__w/llvm/llvm/build-e2e/Basic/Output/out_of_order_queue_status.cpp.tmp.out
# note: command had no output on stdout or stderr
# RUN: at line 2
env ONEAPI_DEVICE_SELECTOR=level_zero:gpu  /__w/llvm/llvm/build-e2e/Basic/Output/out_of_order_queue_status.cpp.tmp.out
# executed command: env ONEAPI_DEVICE_SELECTOR=level_zero:gpu /__w/llvm/llvm/build-e2e/Basic/Output/out_of_order_queue_status.cpp.tmp.out
# .---command stderr------------
# | out_of_order_queue_status.cpp.tmp.out: /__w/llvm/llvm/llvm/sycl/test-e2e/Basic/out_of_order_queue_status.cpp:16: void CheckArray(int *, size_t, int): Assertion `x[i] == expected' failed.
# | *** longjmp causes uninitialized stack frame ***: terminated
# | *** longjmp causes uninitialized stack frame ***: terminated
# | *** longjmp causes uninitialized stack frame ***: terminated
# | *** longjmp causes uninitialized stack frame ***: terminated
# | *** longjmp causes uninitialized stack frame ***: terminated
...

The failure is very hard to reproduce since it only fails randomly on Intel(R) Iris(R) Xe Graphics. I managed to reproduce it a few times only with a fresh build and laptop reboot, running all e2e tests for both level_zero:gpu and opencl:fpga which is how the post-commit CI does it.

For now, for level_zero target, the path to piextUSMEnqueueFill was reverted due to this failure and workaround was implemented to use previous implementation of Q.fill() in #13788.

DPC++ build configuration:

python ../llvm/buildbot/configure.py -t Release --ci-defaults --shared-libs --no-assertions \
 --cmake-opt="-DSYCL_ENABLE_STACK_PRINTING=ON" \
 --cmake-opt="-DSYCL_LIB_WITH_DEBUG_SYMBOL=ON" \
 --cmake-opt="-DLLVM_INSTALL_UTILS=ON" \
 --cmake-opt="-DNATIVECPU_USE_OCK=Off" \
 --cmake-opt="-DSYCL_PI_TESTS=OFF" \
-o ./

Environment

OS: Linux
Target device and vendor: Intel(R) Iris(R) Xe Graphics
DPC++ commit: c173fbf
sycl-ls --verbose:

Platform [#4]:
    Version  : 1.3
    Name     : Intel(R) Level-Zero
    Vendor   : Intel(R) Corporation
    Devices  : 1
        Device [#0]:
        Type       : gpu
        Version    : 1.3
        Name       : Intel(R) Iris(R) Xe Graphics
        Vendor     : Intel(R) Corporation
        Driver     : 1.3.29138

The text was updated successfully, but these errors were encountered:

npmiller · 2024-05-21T09:02:47Z

cc @steffenlarsen @aelovikov-intel any ideas on this? We have limited access to gen12 so haven't been able to figure it out.

We could maybe mark the test unsupported/xfail to unblock the patch until someone else can have a look?

KornevNikita · 2024-05-29T09:03:02Z

@konradkusiak97 it's flaky issue, right?

konradkusiak97 · 2024-05-29T09:39:19Z

It might be a flaky issue. Though, I've seen it being reproduced quite reliably on the post-commit CI on gen12 but only there. I don't have access to Intel(R) Iris(R) Xe Graphics so I couldn't check myself but it makes me think that the issue is only reproducible on that hardware with all the conditions that I wrote about above, so DPC++ built with clang and not gcc.

konradkusiak97 · 2024-06-04T08:25:08Z

@KornevNikita, do you think we could we mark the test unsupported/xfail as @npmiller suggested? This is now the only thing that's blocking merging improvements to Queue.fill(), see #13788. We could also merge the patch, and mark the test xfail only if the post-commit CI is still failing

KornevNikita · 2024-06-04T09:28:16Z

@KornevNikita, do you think we could we mark the test unsupported/xfail as @npmiller suggested? This is now the only thing that's blocking merging improvements to Queue.fill(), see #13788. We could also merge the patch, and mark the test xfail only if the post-commit CI is still failing

If your patch introduces this regression, I guess it's not a good idea to turn off the test, isnt't?

Also as I remember it affects the CTS: #13679
Are there any changes to fix this failure? I'd suggest to check if this test passes (I can do it for you) with your patch and then also check the post-commit before merging

Then If the post-commit is green I guess we have no blockers to just merge it and monitor the post-commit.

konradkusiak97 · 2024-06-04T09:43:21Z

Yes, the CTS failure is fixed with this patch: oneapi-src/unified-runtime#1603

konradkusiak97 · 2024-06-04T17:09:25Z

@KornevNikita, do you think we could we mark the test unsupported/xfail as @npmiller suggested? This is now the only thing that's blocking merging improvements to Queue.fill(), see #13788. We could also merge the patch, and mark the test xfail only if the post-commit CI is still failing

If your patch introduces this regression, I guess it's not a good idea to turn off the test, isnt't?

Also as I remember it affects the CTS: #13679 Are there any changes to fix this failure? I'd suggest to check if this test passes (I can do it for you) with your patch and then also check the post-commit before merging

Then If the post-commit is green I guess we have no blockers to just merge it and monitor the post-commit.

I tried again reproducing this failure but without any luck. Would you be able to try out the post-commit with my patch: #13788 @KornevNikita and see if it still fails?

KornevNikita · 2024-06-04T17:16:35Z

I suggest to just trigger it in the PR.

KornevNikita · 2024-06-18T13:49:30Z

@konradkusiak97 hi! As I understand you made a fix / found a workaround, so this issue can be closed once the PR is merged?

konradkusiak97 · 2024-06-18T14:05:56Z

@konradkusiak97 hi! As I understand you made a fix / found a workaround, so this issue can be closed once the PR is merged?

I didn't get to the bottom of what's causing this issue so for now for level_zero target, Q.fill() will behave as it was after #13788 is merged.

I'm fine with closing this.

konradkusiak97 added the bug Something isn't working label May 15, 2024

konradkusiak97 mentioned this issue May 15, 2024

[SYCL][ABI-Break] Improve Queue fill #13788

Merged

konradkusiak97 changed the title ~~Memset failing post-commit CI on GEN12~~ Q.fill() improvements fail on gen12 Jun 18, 2024

konradkusiak97 closed this as completed Jun 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Q.fill() improvements fail on gen12 #13787

Q.fill() improvements fail on gen12 #13787

konradkusiak97 commented May 15, 2024 •

edited

Loading

npmiller commented May 21, 2024

KornevNikita commented May 29, 2024

konradkusiak97 commented May 29, 2024

konradkusiak97 commented Jun 4, 2024

KornevNikita commented Jun 4, 2024

konradkusiak97 commented Jun 4, 2024

konradkusiak97 commented Jun 4, 2024

KornevNikita commented Jun 4, 2024 •

edited

Loading

KornevNikita commented Jun 18, 2024

konradkusiak97 commented Jun 18, 2024

Q.fill() improvements fail on gen12 #13787

Q.fill() improvements fail on gen12 #13787

Comments

konradkusiak97 commented May 15, 2024 • edited Loading

Describe the bug

Environment

npmiller commented May 21, 2024

KornevNikita commented May 29, 2024

konradkusiak97 commented May 29, 2024

konradkusiak97 commented Jun 4, 2024

KornevNikita commented Jun 4, 2024

konradkusiak97 commented Jun 4, 2024

konradkusiak97 commented Jun 4, 2024

KornevNikita commented Jun 4, 2024 • edited Loading

KornevNikita commented Jun 18, 2024

konradkusiak97 commented Jun 18, 2024

konradkusiak97 commented May 15, 2024 •

edited

Loading

KornevNikita commented Jun 4, 2024 •

edited

Loading