
[FEA] Potential optimization: Batched memset. #15773

Open
nvdbaranec opened this issue May 17, 2024 · 0 comments
Labels
feature request New feature or request Performance Performance related issue

Comments

@nvdbaranec
Contributor

nvdbaranec commented May 17, 2024

In some situations in the Parquet reader (particularly with tables containing many columns or deeply nested columns) we burn a decent amount of time doing cudaMemset() operations on output buffers. Much of this overhead seems to stem from the fact that we're simply launching many tiny kernels. It might be useful to have a batched/multi memset kernel that takes a list of addresses/sizes/values as a single input and does all the work in a single kernel launch, similar to the CUB multi-buffer memcpy or contiguous_split.

@nvdbaranec nvdbaranec added feature request New feature or request Performance Performance related issue labels May 17, 2024
@nvdbaranec nvdbaranec changed the title [FEA] Potential optimization:: Batched memset. [FEA] Potential optimization: Batched memset. May 17, 2024
Projects
Status: In Progress
Development

No branches or pull requests

2 participants
@nvdbaranec and others