[BUG] Intermittent result discrepancy for NDS SF3K query86 on L40S #11835
Comments
Reran the repro with two more modifications.
Disabled the async pool memory allocator: the diff still reproduces consistently.
Added compute sanitizer with the default memcheck: the default check does not catch errors and seems to change concurrency in a way that the issue stops reproducing.
Pursued a conjecture that the issue only reproduces due to forward compatibility, because we have no cubin sections for compute capability 89. However, a targeted compilation for 89 reproduced the issue just the same.
Running the executors under
There are the following issue classes:
Uninitialized global memory
One instance looks intentional given its name
But the other is not
Unused memory warnings
Describe the bug
The NDS SF3K CI pipeline exhibits intermittent query result validation failures for various queries.
It is difficult to reproduce, but I was able to reduce the scenario to running q36 and q86 one after another, which fails in 90+% of the runs. I dropped LIMIT 100 from q86 to reduce the chances of nondeterminism.
The diff is fairly large, but the result and the diff each contain a single row with lochierarchy=2, so that row is the one to focus on in tests.
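One way to narrow a large result diff down to the lochierarchy=2 row is to compare the two result sets as order-insensitive multisets and then filter the differences. The sketch below is only an illustration: the helper names (`row_key`, `diff_rows`, `with_lochierarchy`), the `total_sum` column, and the sample rows are made up and are not taken from the notebook or from actual q86 output.

```python
from collections import Counter


def row_key(row):
    # Turn a dict row into a hashable tuple so it can be counted.
    return tuple(sorted(row.items()))


def diff_rows(expected, actual):
    """Return (rows only in expected, rows only in actual), ignoring row order."""
    exp = Counter(row_key(r) for r in expected)
    act = Counter(row_key(r) for r in actual)
    only_expected = [dict(k) for k in (exp - act).elements()]
    only_actual = [dict(k) for k in (act - exp).elements()]
    return only_expected, only_actual


def with_lochierarchy(rows, level):
    # Keep only the rows at the given grouping level.
    return [r for r in rows if r.get("lochierarchy") == level]


# Made-up rows for illustration only; not actual query output.
expected = [
    {"lochierarchy": 2, "total_sum": 100},
    {"lochierarchy": 1, "total_sum": 40},
    {"lochierarchy": 1, "total_sum": 60},
]
actual = [
    {"lochierarchy": 2, "total_sum": 99},
    {"lochierarchy": 1, "total_sum": 45},
    {"lochierarchy": 1, "total_sum": 60},
]

only_expected, only_actual = diff_rows(expected, actual)
print(with_lochierarchy(only_expected, 2))
print(with_lochierarchy(only_actual, 2))
```

Comparing as multisets keeps the check independent of output ordering, which matters here because LIMIT 100 was dropped and the row order is not part of what is being validated.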
This issue seems to have been introduced between build 33 and build 42.
Given that the runs are not 100% reproducible, there is a chance the range is longer. I ran build 33 four times without reproducing the issue. Build 42 reproduces the failure quickly.
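How much confidence four clean runs give depends on the per-run failure rate. The sketch below is a rough estimate only: it assumes runs are independent and that an affected build fails at a fixed rate. Both are assumptions, and the function names are hypothetical.

```python
def prob_all_pass(failure_rate, runs):
    """Probability that `runs` independent runs all pass on an affected build."""
    return (1.0 - failure_rate) ** runs


def runs_needed(failure_rate, max_false_clean):
    """Smallest number of clean runs that pushes the false-clean probability
    down to `max_false_clean` or below."""
    if not 0.0 < failure_rate <= 1.0:
        raise ValueError("failure_rate must be in (0, 1]")
    runs = 1
    while prob_all_pass(failure_rate, runs) > max_false_clean:
        runs += 1
    return runs


# If build 33 had the bug at the ~90% rate seen in the reduced repro,
# four clean runs in a row would be very unlikely.
print(prob_all_pass(0.9, 4))

# If the bug showed up less often on an earlier build (30% is an assumed
# figure), four clean runs would be much weaker evidence.
print(prob_all_pass(0.3, 4))
print(runs_needed(0.3, 0.01))
```

Under the 90% assumption, four clean runs leave roughly a 1 in 10,000 chance that build 33 is affected. At a lower assumed rate, more runs are needed before an earlier build can be ruled out.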
Steps/Code to reproduce bug
Open the notebook on a single node with L40S https://github.com/gerashegalov/rapids-shell/blob/25ca477172f8ac45b71d0eed3452369299748284/src/jupyter/nds2-parquet-3k-snappy.ipynb
Expected behavior
Results must continue to match. These tests pass consistently on the same node when it is configured to use an H100 GPU instead.
Environment details (please complete the following information)