GH-45092: [C++][Parquet] Add GetReadRanges function to FileReader #45093

zeroshade · 2024-12-20T16:55:20Z

Rationale for this change

Some consumers find it convenient to retrieve the byte ranges of a Parquet file that are needed to read specific column chunks from row groups, without having to go through a full ReadRangeCache. It's a fairly simple function, since all the infrastructure to compute these ranges is already implemented.

What changes are included in this PR?

A single function, GetReadRanges, is added to parquet::FileReader. It computes and returns the coalesced read ranges for the specified row groups and column indices.
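
To illustrate what "coalesced" means here, the following is a minimal, self-contained sketch of merging nearby byte ranges. It is not the code from this PR: `ByteRange`, `CoalesceRanges`, and the `hole_size_limit` parameter are hypothetical names chosen for the example, and the sketch deliberately omits any cap on the size of a merged range.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a byte range; not the Arrow type.
struct ByteRange {
  int64_t offset;
  int64_t length;
};

// Merge ranges that overlap or are separated by at most `hole_size_limit`
// bytes. Simplified sketch: no upper bound is placed on a merged range.
std::vector<ByteRange> CoalesceRanges(std::vector<ByteRange> ranges,
                                      int64_t hole_size_limit) {
  // Drop empty ranges, then sort by starting offset.
  ranges.erase(std::remove_if(ranges.begin(), ranges.end(),
                              [](const ByteRange& r) { return r.length <= 0; }),
               ranges.end());
  std::sort(ranges.begin(), ranges.end(),
            [](const ByteRange& a, const ByteRange& b) {
              return a.offset < b.offset;
            });

  std::vector<ByteRange> merged;
  for (const ByteRange& r : ranges) {
    if (!merged.empty()) {
      ByteRange& last = merged.back();
      const int64_t last_end = last.offset + last.length;
      // A negative gap means the ranges overlap, which also merges.
      if (r.offset - last_end <= hole_size_limit) {
        const int64_t new_end = std::max(last_end, r.offset + r.length);
        last.length = new_end - last.offset;
        continue;
      }
    }
    merged.push_back(r);
  }
  return merged;
}
```

With a hole limit of 4 bytes, the ranges `{0, 10}` and `{12, 8}` would merge into `{0, 20}`, while a range starting at offset 100 would stay separate. Reading the small gap is usually cheaper than issuing a second request against a high-latency store.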

GitHub Issue: [C++][Parquet] Expose ReadRanges in Parquet FileReader #45092

github-actions · 2024-12-20T16:55:46Z

⚠️ GitHub issue #45092 has been automatically assigned in GitHub to PR creator.

kou

Could you also fix the lint errors?

I don't object to this API, but is there a concrete use case for it?

cpp/src/parquet/file_reader.cc

lidavidm · 2024-12-21T02:00:50Z

I pinged colleagues to give a review and make sure the API fits their use case. But essentially sometimes we want to accomplish pre-buffering (like ReadRangeCache) but without having to go through that API specifically. (Possibly it would work to implement a custom RandomAccessFile too?)

wgtmac · 2024-12-21T06:07:43Z

I think this is more of a utility function: users are already capable of parsing parquet::FileMetaData to compute the same result on their end. In terms of extensibility, if users want a third parameter for requested row ranges in addition to row groups and columns, do we want to support that as well?

FYI there was a discussion on the row range API: https://docs.google.com/document/d/1SeVcYudu6uD9rb9zRAnlLGgdauutaNZlAaS0gVzjkgM and a stale PR: #39731.
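
As a rough sketch of computing the same result from metadata, the byte range of one column chunk can be derived from three fields of the Parquet column metadata: the data page offset, the optional dictionary page offset, and the total compressed size. The `ChunkMeta` struct and `ChunkRange` function below are hypothetical and only mirror those fields; they are not Arrow types, and the sketch assumes a well-formed file with no writer-specific quirks.

```cpp
#include <cstdint>

// Hypothetical plain struct mirroring three Parquet column-chunk fields.
struct ChunkMeta {
  int64_t data_page_offset;
  int64_t dictionary_page_offset;  // 0 when there is no dictionary page
  int64_t total_compressed_size;
};

// Hypothetical stand-in for a byte range; not the Arrow type.
struct ByteRange {
  int64_t offset;
  int64_t length;
};

// The chunk starts at the dictionary page when one precedes the first
// data page; otherwise it starts at the first data page.
ByteRange ChunkRange(const ChunkMeta& m) {
  int64_t start = m.data_page_offset;
  if (m.dictionary_page_offset > 0 && m.dictionary_page_offset < start) {
    start = m.dictionary_page_offset;
  }
  return ByteRange{start, m.total_compressed_size};
}
```

Applying this to every requested (row group, column) pair yields the uncoalesced list of ranges.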

mapleFU · 2024-12-23T02:23:27Z

Generally looks OK to me (but without the page index, I think it might not prune much data).

felipeblazing · 2024-12-23T19:44:35Z

The idea is that we would like to know ahead of time which bytes are going to be read from a Parquet file, without necessarily performing the computation associated with decoding the file at the same time. The high-level use case for us is that sometimes we don't have the computational resources to perform decompression and decoding, but we do have available I/O. We want to be able to keep performing I/O when interacting with object stores and other slow sources of Parquet data, without having to commit to decompression / decoding until a later point in time.

This would allow us to move bytes from object stores to local memory, where they can wait until we have compute resources for decompression / decoding.

zeroshade · 2024-12-23T22:18:15Z

Lint issues fixed and variable renamed as suggested

mapleFU · 2024-12-24T02:31:45Z

This patch LGTM now. Out of curiosity, why not call PreBuffer in SerializedFile instead of adding this new interface?

GH-45092: [C++][Parquet] Add GetReadRanges function to FileReader

b9f1278

zeroshade requested review from kou and lidavidm December 20, 2024 16:55

zeroshade requested a review from wgtmac as a code owner December 20, 2024 16:55

github-actions bot added the Component: Parquet, Component: C++, and awaiting committer review labels Dec 20, 2024

kou reviewed Dec 21, 2024

github-actions bot added the awaiting changes label and removed the awaiting committer review label Dec 21, 2024

fix lint and rename var

5fdd78c

github-actions bot added the awaiting change review label and removed the awaiting changes label Dec 23, 2024
