The above shows a couple of threads trying to read large blocks (~2GB) that were spilled to disk. You can see that the top thread had to wait close to 1 second to get started, and the bottom thread had to wait 3 seconds, because we currently have a single 128MB bounce buffer.
We should instead turn this into a real pool and suballocate something much smaller, say 32MB (128MB / 4 concurrent transfers?). It could be configurable, but ideally we find a good rule of thumb for these disk->device transfers when N threads are all trying to do the same thing.
The device->disk path (the opposite of what is described above) has the exact same problem, and it contends for the same single buffer. Once the pool is in place, we can use it for device->disk as well, or we can have a separate pool for that path. A rough sketch of the pool idea is below.
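For illustration, here is a minimal sketch of what such a pool could look like, assuming fixed-size slices handed out from a blocking queue. `BounceBufferPool`, `acquire`, and `release` are hypothetical names (not the actual spark-rapids API), and plain byte arrays stand in for the pinned host buffers the real code would carve out of one allocation:

```scala
import java.util.concurrent.ArrayBlockingQueue

// Hypothetical pool: splits poolBytes into fixed slices of sliceBytes each,
// so N threads each get a small slice instead of serializing on one big buffer.
class BounceBufferPool(poolBytes: Long, sliceBytes: Long) {
  private val numSlices = (poolBytes / sliceBytes).toInt
  private val slices = new ArrayBlockingQueue[Array[Byte]](numSlices)
  (0 until numSlices).foreach(_ => slices.put(new Array[Byte](sliceBytes.toInt)))

  // Blocks until a slice is free.
  def acquire(): Array[Byte] = slices.take()

  def release(slice: Array[Byte]): Unit = slices.put(slice)
}
```

With the numbers from above, a 128MB pool split into 32MB slices would allow 4 transfers in flight:

```scala
val pool = new BounceBufferPool(128L << 20, 32L << 20)
val slice = pool.acquire()
try {
  // fill slice from disk, then do the h2d copy...
} finally {
  pool.release(slice)
}
```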
What is the bottleneck here? Is it that we only have one buffer and so we cannot double buffer (reading some data from disk while at the same time transferring some of that data to the device)? Is it computation, because we compressed the spill data and need to decompress it as we read it back in? The picture shows that one thread had to wait to start a transfer, but if the reason it was waiting is that we could not read the data from disk fast enough, then it does not matter, because doing two in parallel would not make it faster. But if it is because we could not decompress the data fast enough, then parallelism might be a good win.

At the same time, I want to try to optimize the single-threaded use case before we optimize for multiple threads reading data back in simultaneously. I would assume the single-threaded case is more common, but that is just a guess.
The IO and decompression time is between the memcpys, within the "holding host spill bb" range. So yeah, perhaps in the future we try to accelerate the IO and decompression themselves. This is one step towards that, IMHO.
The very large waits under the orange are just these threads waiting for another thread to release the host spill bounce buffer. We hold onto the bounce buffer while we do all of the IO (the disk bytes trickle into the host buffer, and then we do an h2d). We could allocate a separate host buffer just for the disk->host transfer and then memcpy into the pinned HostMemoryBuffer for the h2d.
If we had a few buffers here instead of one, we could have multiple disk->host ops in flight, at least one per concurrent GPU task. A sketch of that overlap follows.
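As a rough sketch of the double-buffering idea, assuming hypothetical `readChunkFromDisk` and `copyToDevice` helpers (stand-ins, not the actual spark-rapids API), the disk read of chunk i+1 can overlap the h2d copy of chunk i:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Ping-pong pipeline: while chunk i is being copied to the device, the read
// of chunk i+1 is already running on another thread.
def spillReadPipelined(numChunks: Int,
                       readChunkFromDisk: Int => Array[Byte],
                       copyToDevice: Array[Byte] => Unit): Unit = {
  if (numChunks > 0) {
    // Start the first disk read.
    var pending: Future[Array[Byte]] = Future(readChunkFromDisk(0))
    for (i <- 0 until numChunks) {
      val chunk = Await.result(pending, Duration.Inf)
      // Kick off the next read before copying this chunk to the device,
      // so disk IO overlaps with the h2d transfer.
      if (i + 1 < numChunks) pending = Future(readChunkFromDisk(i + 1))
      copyToDevice(chunk)
    }
  }
}
```

With two (or a few) buffers per task, this same pattern generalizes so that IO, decompression, and h2d copies stay busy at the same time.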
abellina changed the title [FEA] spill bounce buffer on disk->device path needs to be a pool → [FEA] spill bounce buffer on disk->device and device->disk needs to be a pool on Dec 18, 2024