[BUG] File written by ORC writer cannot be read in Pandas and Spark #15775

ttnghia · 2024-05-17T19:13:17Z

Reading the attached Parquet file and writing it out as an .orc file with the cudf ORC writer produces an output file that cannot be read in Pandas or Spark.

In more detail, the given file contains a map column with content like this:

+-----------+
|        _c0|
+-----------+
|       null|
|         {}|
|{B -> null}|
|{B -> null}|
|{B -> null}|
|    {A -> }|
|       null|
|{B -> null}|
|{B -> null}|
|    {A -> }|
|       null|
|    {A -> }|
|    {A -> }|
|{B -> null}|
|    {A -> }|
|         {}|
|{B -> null}|
|{B -> null}|
|       null|
|{B -> null}|
+-----------+

The code to read the file and write it out in cudf:

#include <cudf/io/orc.hpp>
#include <cudf/io/parquet.hpp>

TEST_F(OrcReaderTest, Write)
{
  // Read the attached Parquet file that contains the map column shown above.
  cudf::io::parquet_reader_options read_opts = cudf::io::parquet_reader_options::builder(
    cudf::io::source_info{"/home/nghiat/Devel/tmp/ORC_DATA/tmp/708769/GPU_pq/input.parquet"});
  auto input = cudf::io::read_parquet(read_opts);

  // Write the same table out with the chunked ORC writer, using Snappy compression.
  cudf::io::chunked_orc_writer_options write_opts =
    cudf::io::chunked_orc_writer_options::builder(
      cudf::io::sink_info{"/home/nghiat/Devel/tmp/ORC_DATA/tmp/708769/GPU_pq/output.orc"})
      .compression(cudf::io::compression_type::SNAPPY);
  cudf::io::orc_chunked_writer(write_opts).write(*input.tbl);
}

Related: NVIDIA/spark-rapids#10806.


Closes #15775

ORC writer encodes null mask bits in multiples of eight to avoid issues with other readers reading partially encoded bytes. When this does not align with row groups, the null mask encode boundaries are moved to align to multiples of eight. A bug in the alignment code caused an unnecessary shift by 8 bits, which in turn caused issues in the encode. This PR fixes the unnecessary shift.

Authors:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Muhammad Haseeb (https://github.com/mhaseeb123)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #15789
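
The idea of moving null mask encode boundaries to multiples of eight can be sketched as below. This is a minimal illustration and not the cudf implementation: the helper names `bits_to_next_byte` and `aligned_encode_ends` are hypothetical, and the choice to move each boundary forward (rather than backward) is an assumption made for the example. The trailing `% 8` in the first helper shows one way to keep an already aligned boundary in place; whether a missing step of this kind was the actual cause of the bug is not stated in the PR description.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a cudf API): number of bits needed to round
// `bit_count` up to the next multiple of eight. The trailing `% 8` keeps an
// already aligned count at zero instead of shifting it by a full byte.
constexpr std::size_t bits_to_next_byte(std::size_t bit_count)
{
  return (8 - bit_count % 8) % 8;
}

// Hypothetical helper: given the row index at which each row group ends,
// return the row index at which its null mask encode would end. Each boundary
// is moved forward to the next multiple of eight, capped at `num_rows`.
std::vector<std::size_t> aligned_encode_ends(std::vector<std::size_t> const& rowgroup_ends,
                                             std::size_t num_rows)
{
  std::vector<std::size_t> aligned;
  aligned.reserve(rowgroup_ends.size());
  for (auto const end : rowgroup_ends) {
    assert(end <= num_rows);  // a row group cannot end past the last row
    auto const shifted = end + bits_to_next_byte(end);
    aligned.push_back(shifted < num_rows ? shifted : num_rows);
  }
  return aligned;
}
```

For example, with 20 rows and row groups ending at rows 6, 16 and 20, `aligned_encode_ends({6, 16, 20}, 20)` yields 8, 16 and 20: the first boundary moves forward by two bits, the second is already aligned and stays put, and the last is capped at the row count.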

ttnghia added the bug label (Something isn't working) on May 17, 2024

vuule mentioned this issue on May 20, 2024: Fix row group alignment in ORC writer #15789 (merged)

rapids-bot bot closed this as completed in #15789 May 22, 2024
