[FEA] (Discussion) Shall we push down the filter to cudf ParquetReader ? #11881

Open
sperlingxx opened this issue Dec 17, 2024 · 3 comments
Labels
cudf_dependency (An issue or PR with this label depends on a new feature in cudf), feature request (New feature or request)

Comments

@sperlingxx
Collaborator

Is your feature request related to a problem? Please describe.
This is just an early stage discussion without any specific plan.

Currently, the cuDF Parquet reader supports reading with filter expressions (https://github.com/rapidsai/cudf/blob/branch-25.02/cpp/include/cudf/io/parquet.hpp#L251). Although filter pushdown does NOT seem to reduce the cost of materialization by skipping reads at the row level, it might still be helpful as a prerequisite for a potential upcoming feature: pruning the FilterExec and the following CoalescingBatchExec if all filters can be pushed down.
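To make the pruning idea concrete, here is a minimal sketch in plain Python. It is not spark-rapids or cudf code: `Predicate`, `PUSHABLE_OPS`, `split_predicates`, and `build_plan` are hypothetical names, the plan is just a list of strings, and it assumes the reader would apply every pushed predicate exactly at the row level, which is the precondition for dropping the filter.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: the set of operators the reader could evaluate exactly.
PUSHABLE_OPS = frozenset({"=", "!=", "<", "<=", ">", ">="})


@dataclass(frozen=True)
class Predicate:
    """Hypothetical stand-in for one conjunct of the filter condition."""
    column: str
    op: str
    value: object


def split_predicates(
    preds: List[Predicate],
) -> Tuple[List[Predicate], List[Predicate]]:
    """Separate conjuncts the reader can handle from those it cannot."""
    pushed = [p for p in preds if p.op in PUSHABLE_OPS]
    residual = [p for p in preds if p.op not in PUSHABLE_OPS]
    return pushed, residual


def build_plan(preds: List[Predicate]) -> List[str]:
    """Return a toy plan; the filter and coalesce steps are kept only
    when some conjunct could not be pushed into the scan."""
    pushed, residual = split_predicates(preds)
    plan = ["Scan(pushed=%d)" % len(pushed)]
    if residual:
        plan.append("Filter(residual=%d)" % len(residual))
        plan.append("Coalesce")
    return plan
```

With a single pushable conjunct such as `Predicate("a", ">", 1)`, `build_plan` returns only the scan step; adding a conjunct with an unsupported operator keeps the filter and coalesce steps in the plan.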

@sperlingxx sperlingxx added ? - Needs Triage Need team to review and classify feature request New feature or request labels Dec 17, 2024
@sperlingxx
Collaborator Author

sperlingxx commented Dec 17, 2024

May I ask for your perspective on this issue? @revans2 @jlowe @winningsix @GaryShen2008

@sperlingxx sperlingxx added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Dec 17, 2024
@jlowe
Member

jlowe commented Dec 17, 2024

Yes, eventually we want to push the filter predicate into the cudf reader to help avoid materialization. However, there is currently no benefit to doing this work, because cudf only uses the predicate to filter rowgroups, and we're already doing that in the Spark plugin. We're also applying predicates in ways that cudf does not (e.g., filtering dictionaries to see whether the rest of the rowgroup should be skipped).
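For readers unfamiliar with the two techniques mentioned above, the following is a rough sketch of rowgroup pruning for an equality predicate, first by min/max statistics and then by a dictionary check. It is an illustration in plain Python, not the plugin's or cudf's implementation; `ColumnChunkMeta`, `may_contain_equal`, and `select_row_groups` are hypothetical names, and the dictionary is assumed to be available only when the column chunk is fully dictionary-encoded.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class ColumnChunkMeta:
    """Hypothetical per-rowgroup metadata for one integer column."""
    min_value: int
    max_value: int
    # Distinct values when the chunk is fully dictionary-encoded, else None.
    dictionary: Optional[FrozenSet[int]] = None


def may_contain_equal(chunk: ColumnChunkMeta, target: int) -> bool:
    """Return False only when the rowgroup provably has no row == target."""
    # Statistics check: target outside [min, max] cannot match.
    if target < chunk.min_value or target > chunk.max_value:
        return False
    # Dictionary check: a fully dictionary-encoded chunk lists every value.
    if chunk.dictionary is not None and target not in chunk.dictionary:
        return False
    return True


def select_row_groups(chunks: List[ColumnChunkMeta], target: int) -> List[int]:
    """Indices of rowgroups that still have to be read."""
    return [i for i, c in enumerate(chunks) if may_contain_equal(c, target)]
```

The dictionary check can skip rowgroups that statistics alone would keep: a chunk with min 20, max 30, and dictionary {20, 30} passes the range test for target 25 but is still skipped.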

When cudf starts using the filter predicate to avoid decompressing or decoding column pages, then yes, we should definitely translate as much of the predicate to cudf as we can.

@revans2
Collaborator

revans2 commented Dec 17, 2024

My perspective is the same as @jlowe's. We are in the process of working with cudf to come up with a design for decoding data more lazily. In fact, right now we are looking at lazy fetching of data as well. It is still very preliminary, though.

@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Dec 17, 2024