[BUG] File written by ORC writer cannot be read in Pandas and Spark #15775
Labels: bug (Something isn't working)
Comments
rapids-bot bot pushed a commit that referenced this issue on May 22, 2024:
Closes #15775

ORC writer encodes null mask bits in multiples of eight to avoid issues with other readers reading partial encoded bytes. When this does not align with row groups, the null mask encode boundaries are moved to align to multiples of eight. There was a bug in the alignment code that caused a pointless shift by 8 bits and, then, issues in encode. This PR fixes the unnecessary shift.

Authors:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Muhammad Haseeb (https://github.com/mhaseeb123)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #15789
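To illustrate the alignment idea described in the commit message, here is a minimal Python sketch. It is not the cudf implementation: the function name `align_encode_boundaries`, its parameters, and the choice to round each boundary down (rather than up) are assumptions made purely for illustration. It only shows how row group start positions that are not multiples of eight could be moved so that every encoded chunk of null mask bits, except possibly the last, covers whole bytes.

```python
def align_encode_boundaries(row_group_starts, num_rows, bits_per_byte=8):
    """Hypothetical helper: move null mask encode boundaries to multiples of eight.

    row_group_starts: sorted start rows of each row group (first entry is 0).
    num_rows: total number of rows, i.e. total number of null mask bits.
    Returns a list of (start, end) bit ranges to encode.
    """
    # Assumption: each boundary is rounded DOWN to the previous multiple of 8.
    aligned = [(start // bits_per_byte) * bits_per_byte for start in row_group_starts]
    # The end of the last range is the total row count; it may be a partial byte.
    edges = aligned + [num_rows]
    return list(zip(edges[:-1], edges[1:]))


# Row groups start at rows 0, 10, and 20 in a 30-row column.
ranges = align_encode_boundaries([0, 10, 20], 30)
for start, end in ranges:
    print(f"encode bits [{start}, {end})")
```

With these inputs, the interior boundaries land on rows 8 and 16 instead of 10 and 20, so no encoded byte straddles two chunks. An extra, unintended shift of eight bits at one of these boundaries is the kind of error the commit message describes.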
For the attached parquet file, if we read it and then write it out to an .orc file using the cudf ORC writer, the output file cannot be read in Pandas or Spark. For more details, the given file contains a map with content like this:
And code to read/write file in cudf:
Related: NVIDIA/spark-rapids#10806.