The above shows a couple of threads trying to read large blocks (~2GB) that were spilled to disk. You can see that the top thread had to wait close to 1 second to get started, and the bottom thread had to wait 3 seconds, because we currently have a single 128MB bounce buffer.
We should instead turn this into a real pool and suballocate something much smaller, say 32MB (128MB / 4 concurrent transfers?). It could be configurable, but ideally we find a good rule of thumb for these disk->device transfers when N threads are all trying to do the same thing.
The device->disk path (the opposite of what is described above) has the exact same problem, and it contends for the same single buffer. Once the pool is in place, we can use it for device->disk as well, or we can have a separate pool for that path. A rough sketch of the pool idea is below.
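For illustration, here is a minimal sketch of what such a pool could look like, assuming fixed-size slices handed out from a blocking queue. `BounceBufferPool`, `acquire`, and `release` are hypothetical names (not the actual spark-rapids API), and plain byte arrays stand in for the pinned host buffers the real code would carve out of one allocation:

```scala
import java.util.concurrent.ArrayBlockingQueue

// Hypothetical pool: splits poolBytes into fixed slices of sliceBytes each,
// so N threads each get a small slice instead of serializing on one big buffer.
class BounceBufferPool(poolBytes: Long, sliceBytes: Long) {
  private val numSlices = (poolBytes / sliceBytes).toInt
  private val slices = new ArrayBlockingQueue[Array[Byte]](numSlices)
  (0 until numSlices).foreach(_ => slices.put(new Array[Byte](sliceBytes.toInt)))

  // Blocks until a slice is free.
  def acquire(): Array[Byte] = slices.take()

  def release(slice: Array[Byte]): Unit = slices.put(slice)
}
```

With the numbers from above, a 128MB pool split into 32MB slices would allow 4 transfers in flight:

```scala
val pool = new BounceBufferPool(128L << 20, 32L << 20)
val slice = pool.acquire()
try {
  // fill slice from disk, then do the h2d copy...
} finally {
  pool.release(slice)
}
```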
What is the bottleneck here? Is it that we only have one buffer and so we cannot double buffer (reading some data from disk while at the same time transferring some of that data to the device)? Is it computation, because we compressed the spill data and need to decompress it as we read it back in? The picture shows that one thread had to wait to start a transfer, but if the reason it was waiting is that we could not read the data from disk fast enough, then it does not matter, because doing two in parallel would not make it faster. But if it is because we could not decompress the data fast enough, then parallelism might be a good win.

At the same time, I want to try to optimize the single-threaded use case before we optimize for multiple threads reading data back in simultaneously. I would assume the single-threaded case is more common, but that is just a guess.
The IO and decompression time is between the memcpys, within the "holding host spill bb" range. So yeah, perhaps in the future we try to accelerate the IO and decompression themselves. This is one step towards that, IMHO.
The very large waits under the orange are just these threads waiting for another thread to release the host spill bounce buffer. We hold onto the bounce buffer while we do all of the IO (the disk bytes trickle into the host buffer, and then we do an h2d). We could allocate a separate host buffer just for the disk->host transfer and then memcpy into the pinned HostMemoryBuffer for the h2d.
If we had a few buffers here instead of one, we could have multiple disk->host ops in flight, at least one per concurrent GPU task. A sketch of that overlap follows.
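As a rough sketch of the double-buffering idea, assuming hypothetical `readChunkFromDisk` and `copyToDevice` helpers (stand-ins, not the actual spark-rapids API), the disk read of chunk i+1 can overlap the h2d copy of chunk i:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Ping-pong pipeline: while chunk i is being copied to the device, the read
// of chunk i+1 is already running on another thread.
def spillReadPipelined(numChunks: Int,
                       readChunkFromDisk: Int => Array[Byte],
                       copyToDevice: Array[Byte] => Unit): Unit = {
  if (numChunks > 0) {
    // Start the first disk read.
    var pending: Future[Array[Byte]] = Future(readChunkFromDisk(0))
    for (i <- 0 until numChunks) {
      val chunk = Await.result(pending, Duration.Inf)
      // Kick off the next read before copying this chunk to the device,
      // so disk IO overlaps with the h2d transfer.
      if (i + 1 < numChunks) pending = Future(readChunkFromDisk(i + 1))
      copyToDevice(chunk)
    }
  }
}
```

With two (or a few) buffers per task, this same pattern generalizes so that IO, decompression, and h2d copies stay busy at the same time.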
abellina changed the title [FEA] spill bounce buffer on disk->device path needs to be a pool → [FEA] spill bounce buffer on disk->device and device->disk needs to be a pool on Dec 18, 2024