
[FEA] spill bounce buffer on disk->device and device->disk needs to be a pool #11888

Open
abellina opened this issue Dec 18, 2024 · 2 comments
Labels: ? - Needs Triage, feature request

Comments

abellina (Collaborator) commented Dec 18, 2024

[Image: profiler timeline showing two threads blocked on the host spill bounce buffer before their disk->device transfers could start]

The above shows a couple of threads trying to read back large blocks (~2GB) that were spilled to disk. You can see how the top thread had to wait close to 1 second to get started, and the bottom thread had to wait 3 seconds to get started. This is because we currently have a single bounce buffer of 128MB.

We should instead turn this into a real pool and suballocate something much smaller, say 32MB (128MB / 4 concurrent threads?). It could be configurable, but ideally we find a good rule of thumb for these disk->device transfers when N threads are all trying to do the same thing.

The device->disk path (the opposite of what is described above) has exactly the same problem, and it contends for the same single buffer. Once the pool is in place, we can use it for device->disk as well, or we can have a separate pool for that path.
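
A minimal sketch of what such a pool could look like, assuming a blocking pool shared by both spill paths; `ByteBuffer.allocateDirect` is only a stand-in for pinned host memory (`HostMemoryBuffer` in the plugin), and the 4 x 32MB split is just the guess above, not a tuned number:

```scala
import java.nio.ByteBuffer
import java.util.concurrent.ArrayBlockingQueue

/**
 * Minimal sketch of a spill bounce-buffer pool. ByteBuffer.allocateDirect is a
 * stand-in for pinned host memory; a real implementation would hand out pinned
 * buffers and free them in close().
 */
class BounceBufferPool(numBuffers: Int, bufferSize: Int) extends AutoCloseable {
  private val buffers = new ArrayBlockingQueue[ByteBuffer](numBuffers)
  (0 until numBuffers).foreach(_ => buffers.add(ByteBuffer.allocateDirect(bufferSize)))

  /** Blocks until a buffer is free, so callers contend per-slot, not on one 128MB buffer. */
  def acquire(): ByteBuffer = buffers.take()

  /** Return a buffer so another disk<->device transfer can proceed. */
  def release(buf: ByteBuffer): Unit = {
    buf.clear()
    buffers.put(buf)
  }

  /** Convenience wrapper that guarantees the buffer goes back to the pool. */
  def withBuffer[T](body: ByteBuffer => T): T = {
    val buf = acquire()
    try body(buf) finally release(buf)
  }

  override def close(): Unit = buffers.clear()
}

// Shared by both spill paths: e.g. 4 x 32MB instead of a single 128MB buffer.
// val pool = new BounceBufferPool(numBuffers = 4, bufferSize = 32 * 1024 * 1024)
```

With something like this, each concurrent spill/unspill holds (and blocks on) only one 32MB slot rather than the whole 128MB buffer.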

abellina added the ? - Needs Triage and feature request labels on Dec 18, 2024
revans2 (Collaborator) commented Dec 18, 2024

What is the bottleneck here? Is it that we only have one buffer, so we cannot double buffer (read some data from disk while at the same time transferring some of that data to the device)? Or is it computation, because we compressed the spill data and need to decompress it as we read it back in? The picture shows that one thread had to wait to start a transfer, but if it was waiting because we could not read the data from disk fast enough, then it does not matter: doing two in parallel would not make it any faster. If, on the other hand, it is because we could not decompress the data fast enough, then parallelism might be a good win.

At the same time, I would want to optimize the single-threaded case before we optimize for multiple threads reading data back in simultaneously. I would assume the single-threaded case is more common, but that is just a guess.
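
For reference, the double-buffering alternative raised here would look roughly like the sketch below (hypothetical helpers, not existing plugin code): with two host buffers, the disk read of chunk N+1 can overlap with the h2d copy of chunk N, which a single bounce buffer cannot do.

```scala
import java.io.RandomAccessFile
import java.nio.ByteBuffer
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

// Hypothetical double-buffered restore: read chunk N+1 from disk while chunk N
// is being copied to the device. h2dCopy stands in for the real host-to-device
// transfer; with a single bounce buffer these two steps have to serialize.
def doubleBufferedRestore(path: String,
                          chunkBytes: Int,
                          h2dCopy: (ByteBuffer, Int) => Unit): Unit = {
  val buffers = Array(ByteBuffer.allocate(chunkBytes), ByteBuffer.allocate(chunkBytes))
  val channel = new RandomAccessFile(path, "r").getChannel
  try {
    var idx = 0
    var pendingH2d: Future[Unit] = Future.successful(())
    var read = channel.read(buffers(idx)) // fill the first buffer
    while (read > 0) {
      val current = buffers(idx)
      val bytes = read
      current.flip()
      // The other buffer's h2d must finish before new data is read into it below.
      Await.result(pendingH2d, Duration.Inf)
      pendingH2d = Future(h2dCopy(current, bytes)) // h2d of chunk N runs async...
      idx = (idx + 1) % 2                          // ...while chunk N+1 is read into the other buffer
      buffers(idx).clear()
      read = channel.read(buffers(idx))
    }
    Await.result(pendingH2d, Duration.Inf)
  } finally {
    channel.close()
  }
}
```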

abellina (Collaborator, Author) commented Dec 18, 2024

The IO and decompression time is between the memcpys, within the "holding host spill bb" range. So yes, perhaps in the future we try to accelerate the IO and decompression themselves; this is one step towards that, IMHO.

The very large waits under the orange are just these threads waiting for another thread to release the host spill bounce buffer. We hold onto the bounce buffer while we do all of the IO (the disk bytes trickle into the host buffer, and then we do an h2d). We could instead allocate a host buffer that is just for the disk->host transfer and then memcpy into the pinned HostMemoryBuffer only for the h2d.

If we had a few buffers here instead of 1, we could have multiple concurrent disk->host ops, at least 1 per concurrent GPU task.
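
A rough sketch of that decoupling (illustrative names, not the plugin's API; plain `ByteBuffer`s again stand in for host memory buffers): the file is read into a cheap per-thread staging buffer, and the shared pinned bounce buffer is only borrowed around the memcpy and h2d launch.

```scala
import java.io.RandomAccessFile
import java.nio.ByteBuffer

// Illustrative only: disk bytes trickle into a cheap per-thread staging buffer,
// and the shared pinned bounce buffer is borrowed only around the memcpy + h2d
// launch instead of for the whole file read. borrowPinned would be something
// like the pool's withBuffer above; h2dCopy stands in for the real transfer.
// Assumes the pinned buffer is at least chunkBytes in size.
def restoreFromDisk(path: String,
                    chunkBytes: Int,
                    borrowPinned: (ByteBuffer => Unit) => Unit,
                    h2dCopy: (ByteBuffer, Int) => Unit): Unit = {
  val staging = ByteBuffer.allocate(chunkBytes) // pageable, not from the shared pool
  val channel = new RandomAccessFile(path, "r").getChannel
  try {
    var read = channel.read(staging)
    while (read > 0) {
      staging.flip()
      borrowPinned { pinned =>
        // The shared pinned buffer is held only for this copy and h2d, so other
        // threads spend far less time blocked on it.
        pinned.clear()
        pinned.put(staging)
        h2dCopy(pinned, read)
      }
      staging.clear()
      read = channel.read(staging)
    }
  } finally {
    channel.close()
  }
}
```

Under this scheme the time each thread holds a pinned buffer shrinks from the duration of the whole file read to one memcpy and h2d per chunk.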

@abellina abellina changed the title [FEA] spill bounce buffer on disk->device path needs to be a pool [FEA] spill bounce buffer on disk->device and device->disk needs to be a pool Dec 18, 2024