[BUG] File written by ORC writer cannot be read in Pandas and Spark #15775

Closed
ttnghia opened this issue May 17, 2024 · 0 comments · Fixed by #15789
Labels
bug Something isn't working

ttnghia commented May 17, 2024

Reading the attached Parquet file and writing it out as an .orc file with the cudf ORC writer produces an output file that cannot be read by Pandas or Spark.

In more detail, the file contains a map column with the following content:

+-----------+
|        _c0|
+-----------+
|       null|
|         {}|
|{B -> null}|
|{B -> null}|
|{B -> null}|
|    {A -> }|
|       null|
|{B -> null}|
|{B -> null}|
|    {A -> }|
|       null|
|    {A -> }|
|    {A -> }|
|{B -> null}|
|    {A -> }|
|         {}|
|{B -> null}|
|{B -> null}|
|       null|
|{B -> null}|
+-----------+
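
The four distinct row values in the table above (a null row, an empty map, a key with a null value, and a key with an empty-looking value) carry validity information at two levels: one bit per row, and one bit per map entry. The following is a minimal sketch that models the 20 rows in plain C++ and counts both levels. It makes two assumptions that the issue does not state: the map values are strings, and `{A -> }` denotes key A mapped to an empty string. The names `map_row`, `sample_rows`, and `summarize` are hypothetical and are not part of cudf.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Assumed value type: string. A missing optional models a null.
using string_map = std::map<std::string, std::optional<std::string>>;
using map_row    = std::optional<string_map>;

struct validity_summary {
  std::size_t rows        = 0;  // number of rows in the map column
  std::size_t null_rows   = 0;  // rows where the map itself is null
  std::size_t entries     = 0;  // key/value entries across all non-null rows
  std::size_t null_values = 0;  // entries whose value is null
};

// The 20 rows shown in the table, in the same order.
std::vector<map_row> sample_rows()
{
  map_row const null_row  = std::nullopt;
  map_row const empty_map = string_map{};
  map_row const b_null    = string_map{{"B", std::nullopt}};
  map_row const a_empty   = string_map{{"A", std::string{}}};  // assumption: empty string
  return {null_row, empty_map, b_null, b_null,  b_null,   a_empty, null_row,
          b_null,   b_null,    a_empty, null_row, a_empty, a_empty, b_null,
          a_empty,  empty_map, b_null,  b_null,  null_row, b_null};
}

// Count validity at the row level and at the entry level.
validity_summary summarize(std::vector<map_row> const& rows)
{
  validity_summary s;
  s.rows = rows.size();
  for (auto const& row : rows) {
    if (!row.has_value()) {
      ++s.null_rows;
      continue;
    }
    for (auto const& entry : *row) {
      ++s.entries;
      if (!entry.second.has_value()) { ++s.null_values; }
    }
  }
  return s;
}
```

Under these assumptions, `summarize(sample_rows())` reports 20 rows with 4 null rows, and 14 entries with 9 null values, so the entry level has a different length and a different null pattern from the row level.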

Code to read the file and write it back out in cudf:

#include <cudf/io/orc.hpp>
#include <cudf/io/parquet.hpp>

TEST_F(OrcReaderTest, Write)
{
  // Read the attached Parquet file into a cudf table.
  cudf::io::parquet_reader_options read_opts = cudf::io::parquet_reader_options::builder(
    cudf::io::source_info{"/home/nghiat/Devel/tmp/ORC_DATA/tmp/708769/GPU_pq/input.parquet"});
  auto input = cudf::io::read_parquet(read_opts);

  // Write the table to an ORC file with the chunked ORC writer, using SNAPPY compression.
  cudf::io::chunked_orc_writer_options write_opts =
    cudf::io::chunked_orc_writer_options::builder(
      cudf::io::sink_info{"/home/nghiat/Devel/tmp/ORC_DATA/tmp/708769/GPU_pq/output.orc"})
      .compression(cudf::io::compression_type::SNAPPY);
  cudf::io::orc_chunked_writer(write_opts).write(*input.tbl);
}

Related: NVIDIA/spark-rapids#10806.

ttnghia added the bug label May 17, 2024
rapids-bot bot pushed a commit that referenced this issue May 22, 2024
Closes #15775

The ORC writer encodes null mask bits in multiples of eight so that other readers never have to read partially encoded bytes. When row group boundaries do not fall on a multiple of eight, the null mask encode boundaries are moved so that they do. A bug in the alignment code caused an unnecessary shift by 8 bits, which then caused issues during encoding. This PR removes the unnecessary shift.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #15789
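
The commit message above describes moving null mask encode boundaries so that they fall on multiples of eight. A minimal sketch of that idea follows. It is not the cudf implementation: the `encode_range` type and the `aligned_encode_ranges` function are hypothetical, and rounding each boundary up to the next multiple of eight (rather than down) is an assumption made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Half-open range [begin, end) of null mask bits encoded together.
struct encode_range {
  std::size_t begin;
  std::size_t end;
};

// Given the number of rows in each row group, return one encode range per
// row group whose boundaries are moved to multiples of eight. The final
// boundary is clamped to the total row count, so only the last range may
// have a size that is not a multiple of eight.
std::vector<encode_range> aligned_encode_ranges(std::vector<std::size_t> const& rowgroup_sizes)
{
  std::size_t total = 0;
  for (auto const size : rowgroup_sizes) { total += size; }

  std::vector<encode_range> ranges;
  std::size_t row_end   = 0;  // unaligned end of the current row group
  std::size_t enc_begin = 0;  // aligned start of the current encode range
  for (auto const size : rowgroup_sizes) {
    row_end += size;
    // Assumption: round the boundary up to the next multiple of eight.
    std::size_t const enc_end = std::min((row_end + 7) / 8 * 8, total);
    ranges.push_back({enc_begin, enc_end});
    enc_begin = enc_end;
  }
  return ranges;
}
```

For row groups of 10, 10, and 5 rows, this sketch yields the ranges [0, 16), [16, 24), and [24, 25): every range starts on a multiple of eight, and only the last one ends mid-byte.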