
[FEA] Potential optimization: Batched memset. #15773

Open
nvdbaranec opened this issue May 17, 2024 · 0 comments
Labels
feature request New feature or request Performance Performance related issue

Comments

@nvdbaranec
Contributor

nvdbaranec commented May 17, 2024

In some situations in the Parquet reader (particularly with tables containing many columns or deeply nested columns) we burn a decent amount of time doing cudaMemset() operations on output buffers. Much of this overhead seems to stem from the fact that we're simply launching many tiny kernels. It might be useful to have a batched/multi memset kernel that takes a list of addresses/sizes/values as a single input and does all the work in a single kernel launch, similar to the CUB multi-buffer memcpy or contiguous_split.

@nvdbaranec nvdbaranec added feature request New feature or request Performance Performance related issue labels May 17, 2024
@nvdbaranec nvdbaranec changed the title [FEA] Potential optimization:: Batched memset. [FEA] Potential optimization: Batched memset. May 17, 2024
Projects
Status: In Progress
Development

No branches or pull requests

2 participants
@nvdbaranec and others