Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Q.fill() improvements fail on gen12 #13787

Closed
konradkusiak97 opened this issue May 15, 2024 · 10 comments
Closed

Q.fill() improvements fail on gen12 #13787

konradkusiak97 opened this issue May 15, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@konradkusiak97
Copy link
Contributor

konradkusiak97 commented May 15, 2024

Describe the bug

When trying to make implementation of Q.fill() to use piextUSMEnqueueFill in this PR: #13788, the sycl/test-e2e/out_of_order_queue_status.cpp failed on the post-commit CI, on gen12:

FAIL: SYCL :: Basic/out_of_order_queue_status.cpp (293 of 2065)
******************** TEST 'SYCL :: Basic/out_of_order_queue_status.cpp' FAILED ********************
Exit Code: -11

Command Output (stdout):
--
# RUN: at line 1
/__w/llvm/llvm/toolchain/bin//clang++   -fsycl -fsycl-targets=spir64  /__w/llvm/llvm/llvm/sycl/test-e2e/Basic/out_of_order_queue_status.cpp -o /__w/llvm/llvm/build-e2e/Basic/Output/out_of_order_queue_status.cpp.tmp.out
# executed command: /__w/llvm/llvm/toolchain/bin//clang++ -fsycl -fsycl-targets=spir64 /__w/llvm/llvm/llvm/sycl/test-e2e/Basic/out_of_order_queue_status.cpp -o /__w/llvm/llvm/build-e2e/Basic/Output/out_of_order_queue_status.cpp.tmp.out
# note: command had no output on stdout or stderr
# RUN: at line 2
env ONEAPI_DEVICE_SELECTOR=level_zero:gpu  /__w/llvm/llvm/build-e2e/Basic/Output/out_of_order_queue_status.cpp.tmp.out
# executed command: env ONEAPI_DEVICE_SELECTOR=level_zero:gpu /__w/llvm/llvm/build-e2e/Basic/Output/out_of_order_queue_status.cpp.tmp.out
# .---command stderr------------
# | out_of_order_queue_status.cpp.tmp.out: /__w/llvm/llvm/llvm/sycl/test-e2e/Basic/out_of_order_queue_status.cpp:16: void CheckArray(int *, size_t, int): Assertion `x[i] == expected' failed.
# | *** longjmp causes uninitialized stack frame ***: terminated
# | *** longjmp causes uninitialized stack frame ***: terminated
# | *** longjmp causes uninitialized stack frame ***: terminated
# | *** longjmp causes uninitialized stack frame ***: terminated
# | *** longjmp causes uninitialized stack frame ***: terminated
...

The failure is very hard to reproduce since it only fails randomly on Intel(R) Iris(R) Xe Graphics. I managed to reproduce it a few times only with a fresh build and laptop reboot, running all e2e tests for both level_zero:gpu and opencl:fpga which is how the post-commit CI does it.

For now, for level_zero target, the path to piextUSMEnqueueFill was reverted due to this failure and workaround was implemented to use previous implementation of Q.fill() in #13788.

DPC++ build configuration:

python ../llvm/buildbot/configure.py -t Release --ci-defaults --shared-libs --no-assertions \
 --cmake-opt="-DSYCL_ENABLE_STACK_PRINTING=ON" \
 --cmake-opt="-DSYCL_LIB_WITH_DEBUG_SYMBOL=ON" \
 --cmake-opt="-DLLVM_INSTALL_UTILS=ON" \
 --cmake-opt="-DNATIVECPU_USE_OCK=Off" \
 --cmake-opt="-DSYCL_PI_TESTS=OFF" \
-o ./

Environment

OS: Linux
Target device and vendor: Intel(R) Iris(R) Xe Graphics
DPC++ commit: c173fbf
sycl-ls --verbose:

Platform [#4]:
    Version  : 1.3
    Name     : Intel(R) Level-Zero
    Vendor   : Intel(R) Corporation
    Devices  : 1
        Device [#0]:
        Type       : gpu
        Version    : 1.3
        Name       : Intel(R) Iris(R) Xe Graphics
        Vendor     : Intel(R) Corporation
        Driver     : 1.3.29138
@npmiller
Copy link
Contributor

cc @steffenlarsen @aelovikov-intel any ideas on this? We have limited access to gen12 so haven't been able to figure it out.

We could maybe mark the test unsupported/xfail to unblock the patch until someone else can have a look?

@KornevNikita
Copy link
Contributor

@konradkusiak97 it's flaky issue, right?

@konradkusiak97
Copy link
Contributor Author

It might be a flaky issue. Though, I've seen it being reproduced quite reliably on the post-commit CI on gen12 but only there. I don't have access to Intel(R) Iris(R) Xe Graphics so I couldn't check myself but it makes me think that the issue is only reproducible on that hardware with all the conditions that I wrote about above, so DPC++ built with clang and not gcc.

@konradkusiak97
Copy link
Contributor Author

@KornevNikita, do you think we could we mark the test unsupported/xfail as @npmiller suggested? This is now the only thing that's blocking merging improvements to Queue.fill(), see #13788. We could also merge the patch, and mark the test xfail only if the post-commit CI is still failing

@KornevNikita
Copy link
Contributor

@KornevNikita, do you think we could we mark the test unsupported/xfail as @npmiller suggested? This is now the only thing that's blocking merging improvements to Queue.fill(), see #13788. We could also merge the patch, and mark the test xfail only if the post-commit CI is still failing

If your patch introduces this regression, I guess it's not a good idea to turn off the test, isnt't?

Also as I remember it affects the CTS: #13679
Are there any changes to fix this failure? I'd suggest to check if this test passes (I can do it for you) with your patch and then also check the post-commit before merging

Then If the post-commit is green I guess we have no blockers to just merge it and monitor the post-commit.

@konradkusiak97
Copy link
Contributor Author

Yes, the CTS failure is fixed with this patch: oneapi-src/unified-runtime#1603

@konradkusiak97
Copy link
Contributor Author

@KornevNikita, do you think we could we mark the test unsupported/xfail as @npmiller suggested? This is now the only thing that's blocking merging improvements to Queue.fill(), see #13788. We could also merge the patch, and mark the test xfail only if the post-commit CI is still failing

If your patch introduces this regression, I guess it's not a good idea to turn off the test, isnt't?

Also as I remember it affects the CTS: #13679 Are there any changes to fix this failure? I'd suggest to check if this test passes (I can do it for you) with your patch and then also check the post-commit before merging

Then If the post-commit is green I guess we have no blockers to just merge it and monitor the post-commit.

I tried again reproducing this failure but without any luck. Would you be able to try out the post-commit with my patch: #13788 @KornevNikita and see if it still fails?

@KornevNikita
Copy link
Contributor

KornevNikita commented Jun 4, 2024

I suggest to just trigger it in the PR.

@KornevNikita
Copy link
Contributor

@konradkusiak97 hi! As I understand you made a fix / found a workaround, so this issue can be closed once the PR is merged?

@konradkusiak97 konradkusiak97 changed the title Memset failing post-commit CI on GEN12 Q.fill() improvements fail on gen12 Jun 18, 2024
@konradkusiak97
Copy link
Contributor Author

@konradkusiak97 hi! As I understand you made a fix / found a workaround, so this issue can be closed once the PR is merged?

I didn't get to the bottom of what's causing this issue so for now for level_zero target, Q.fill() will behave as it was after #13788 is merged.

I'm fine with closing this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants