[BUG] GPU file writes only test writing a single row group or stripe #11735

Open
jlowe opened this issue Nov 19, 2024 · 3 comments · May be fixed by #11743
Labels: bug (Something isn't working), test (Only impacts tests)

Comments

@jlowe
Member

jlowe commented Nov 19, 2024

We recently ran into rapidsai/cudf#6763, which is triggered when writing booleans with nulls. We test writing booleans to ORC in our integration tests, but those tests did not trigger the issue: they write so few rows that each file contains only a single stripe. If the tests had written enough rows to produce more than one stripe, the bug would have been caught.

@jlowe added the "? - Needs Triage", "bug", and "test" labels Nov 19, 2024
@mattahrens removed the "? - Needs Triage" label Nov 19, 2024
@GaryShen2008 assigned ustcfy and unassigned GaryShen2008 Nov 20, 2024
@ustcfy linked a pull request Nov 21, 2024 that will close this issue
@ustcfy
Collaborator

ustcfy commented Nov 26, 2024

Based on the explanation in rapidsai/cudf#6763 (comment) and my own experiment, the error in #11736 is reproduced by generating multiple row groups, not multiple stripes.

@ustcfy
Collaborator

ustcfy commented Nov 26, 2024

So, is it necessary to generate multiple stripes? The default ORC stripe size is 64 MB, and I'm not sure whether that is too large. 👀
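For a rough sense of scale, here is a back-of-the-envelope sketch in Python of how many rows a test would need to write before the data spills past one stripe. It assumes 64 MB means 64 * 1024 * 1024 bytes and that every row takes a fixed number of bytes after encoding and compression, which real writers will not match exactly. The function name `min_rows_for_stripes` and the 8-bytes-per-row figure are hypothetical and chosen only for illustration.

```python
# Assumption: "64MB" is read as 64 MiB.
ORC_DEFAULT_STRIPE_BYTES = 64 * 1024 * 1024


def min_rows_for_stripes(num_stripes, bytes_per_row,
                         stripe_bytes=ORC_DEFAULT_STRIPE_BYTES):
    """Estimate the fewest rows that cannot fit in fewer than num_stripes stripes.

    Simplified model: each row occupies exactly bytes_per_row bytes on disk
    and a stripe is closed once it holds stripe_bytes of data.
    """
    if num_stripes < 1 or bytes_per_row < 1 or stripe_bytes < 1:
        raise ValueError("all arguments must be positive integers")
    # At least one row fits per stripe, even if a row is wider than a stripe.
    rows_per_stripe = max(1, stripe_bytes // bytes_per_row)
    # Fill (num_stripes - 1) stripes completely, then add one more row.
    return (num_stripes - 1) * rows_per_stripe + 1


# Hypothetical single 8-byte column written with the default stripe size.
print(min_rows_for_stripes(num_stripes=2, bytes_per_row=8))
```

Under these assumptions a single 8-byte column needs roughly 8.4 million rows to reach a second stripe, so a multi-stripe test would likely need either a lot of data or a smaller stripe size, if the writer under test allows one to be configured.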

@jlowe
Member Author

jlowe commented Nov 26, 2024

The point of this issue is not to focus on the specific ORC boolean bug. It was raised because we realized the integration tests do not cover the cases where we need to write more than one row group or stripe. We need tests for both. Yes, there is an issue with ORC booleans across multiple row groups in a single write, and we'll come up with a specific unit test for that once it is fixed. However, there is no test that generates multiple row groups or stripes in a single write, regardless of booleans, and we should have tests to cover that case in general. This issue is about a whole category of testing that has been missed, not the specific boolean failure.
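One way to keep such tests honest is to have them check that the write really produced more than one row group or stripe, so a test cannot quietly fall back to the single-unit case when data sizes or defaults change. The following is only a minimal sketch: the helper name `assert_multiple_units` and the labels are hypothetical, and how the counts are read back from the written files is left to whatever metadata reader the test suite already uses.

```python
def assert_multiple_units(unit_counts):
    """Fail if any written file ended up with fewer than two units.

    unit_counts maps a hypothetical label such as "orc stripes" or
    "parquet row groups" to the count read back from the file metadata.
    """
    too_few = {label: n for label, n in unit_counts.items() if n < 2}
    if too_few:
        raise AssertionError(
            f"expected more than one unit per file, got {too_few}")


# Made-up counts for illustration; a real test would read these
# from the files it has just written.
assert_multiple_units({"orc stripes": 3, "parquet row groups": 2})
```

With a guard like this, a write that collapses to a single stripe or row group fails the test outright instead of passing without exercising the multi-unit path.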

4 participants