Testing network performance #5258
Thanks for doing this @gjoseph92. This is interesting.
I am actually surprised by both of these. For reference, my historical expectation on a full (non-virtual) machine would be the following:
So adding bandwidths as one does in an inverse-summation kind of way, I would expect us to get something like 700 MB/s through Dask comms. This is obviously hardware dependent though, and certainly VMs are not real "M"s.
@gjoseph92 for the purposes of tuning shuffle computations, I think that we should build up some intuition around the size of dataframes when using comms. I would be curious to know the expected bandwidth as we send dataframes of varying sizes. Do we stay at the 300MB/s level if we shift from 100MB dataframes to 20MB? to 5MB? to 1MB? to 100kB? Understanding that response curve is probably useful for us.
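A rough sketch of how such a response curve could be measured, under assumptions: this uses plain localhost TCP via asyncio rather than Dask comms, with made-up sizes, so it only illustrates the shape of the benchmark, not the numbers we'd actually see through distributed.

```python
# Hypothetical sketch: throughput for varying message sizes over plain localhost TCP.
# This is NOT Dask comms; drain() only waits for the local transport buffer, so treat
# the numbers as a rough upper bound.
import asyncio
import time

SIZES = [100_000, 1_000_000, 5_000_000, 20_000_000, 100_000_000]  # bytes
REPEATS = 10

async def discard(reader, writer):
    # Server side: read and throw away everything the client sends.
    while await reader.read(2**20):
        pass

async def main():
    server = await asyncio.start_server(discard, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    for size in SIZES:
        payload = b"x" * size
        start = time.perf_counter()
        for _ in range(REPEATS):
            writer.write(payload)
            await writer.drain()
        elapsed = time.perf_counter() - start
        print(f"{size / 1e6:8.1f} MB messages: {REPEATS * size / elapsed / 1e9:.2f} GB/s")
    writer.close()
    await writer.wait_closed()
    server.close()
    await server.wait_closed()

asyncio.run(main())
```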
Yeah, I'll admit that a 2 GB/s communication bandwidth feels fast to me for data processing workloads. I'm fine leaving this one alone for a while :)
For the curious, here are py-spy profiles of the sending and receiving workers running locally. On the receive side, 30% is actual socket reading in Tornado, 25% is making the read buffer in TCP comms, and 25% is idle. On the send side, 30% is Tornado socket writing and 42% is idle.
Yeah, with the buffer allocation time, it is worth noting that builtin Python creation operations like `bytearray(n)` zero out the memory they allocate (calloc-style). We could avoid this by using `numpy.empty`, which leaves the buffer uninitialized:

```python
In [1]: import numpy

In [2]: %timeit bytearray(1_000_000)
18.3 µs ± 125 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [3]: %timeit numpy.empty((1_000_000,), dtype="u1")
630 ns ± 7.44 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

I've been thinking we would benefit here from collecting the routines for memory allocation somewhere and then using the most efficient ones when available (#3970).
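As an illustration of the idea (a minimal sketch, not what distributed actually does; `recv_exactly` is a made-up helper), an uninitialized NumPy array can serve directly as a socket receive buffer via the buffer protocol:

```python
import socket
import numpy as np

def recv_exactly(sock: socket.socket, n: int) -> memoryview:
    """Read exactly n bytes into an uninitialized buffer (no zeroing, unlike bytearray)."""
    buf = np.empty((n,), dtype="u1")  # malloc-style allocation; contents start as garbage
    view = memoryview(buf)            # writable view; recv_into fills it in place
    nread = 0
    while nread < n:
        chunk = sock.recv_into(view[nread:], n - nread)
        if chunk == 0:
            raise ConnectionError("socket closed before the full message arrived")
        nread += chunk
    return view
```

The trade-off is that a bug which skips the fill step hands you garbage bytes instead of zeros, so the speedup comes with a little less safety.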
Good to know. Not calloc-ing here definitely seems desirable. I also just realized another discrepancy:
It goes a bit beyond not using …
Running our shuffles, I'm seeing something that looks to me like SSL reads within Tornado could be blocking the event loop. But I haven't dug into Tornado much before, so I might be understanding this incorrectly. Here's a py-spy profile of one worker: the worker should constantly be both reading and writing about the same amount of data, but in left-heavy view, you see …

Tracing through the code from Tornado:

```python
while not self.closed():
    # Read from the socket until we get EWOULDBLOCK or equivalent.
    # SSL sockets do some internal buffering, and if the data is
    # sitting in the SSL object's buffer select() and friends
    # can't see it; the only way to find out if it's there is to
    # try to read it.
    if self._read_to_buffer() == 0:
        break
```

To me that sounds like the only way to know reading from SSL is done is to try reading and see what happens. But what happens when there isn't data sitting in the SSL object's buffer? It seems like that will eventually call … Is it possible that if the SSL buffer was empty, the …?

Again, I'm not very Tornado- or SSL-literate and this was a pretty cursory read-through, so this might be off-base.
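For what it's worth, the standard library does expose how much decrypted data is sitting inside the SSL object: `ssl.SSLSocket.pending()`. Here's a small sketch (assuming a non-blocking TLS socket; this is not how Tornado is structured) of draining both the kernel and SSL-internal buffers without blocking:

```python
import select
import ssl

def drain_without_blocking(sock: ssl.SSLSocket, chunk: int = 64 * 1024) -> bytes:
    """Read whatever is currently available from a non-blocking SSL socket."""
    parts = []
    while True:
        # pending() reports plaintext already decrypted inside the SSL object --
        # bytes that select()/epoll on the raw fd cannot see.
        if sock.pending() == 0 and not select.select([sock], [], [], 0)[0]:
            break
        try:
            data = sock.recv(chunk)
        except ssl.SSLWantReadError:
            # Only a partial TLS record has arrived; nothing readable yet.
            break
        if not data:
            break  # peer closed the connection
        parts.append(data)
    return b"".join(parts)
```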
FWICT this is the underlying …
I took a py-spy profile of the same workload as above running on an equivalent cluster, but using TCP instead of TLS. Anecdotally, I noticed tasks just completed a lot faster, even though the dashboard reported similar average worker bandwidths. In the profile, the worker event loop only gets 15% idle time with TLS, vs 40% idle time with TCP. With TLS, ~60% of the event loop is spent on Tornado's …
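If someone wants to reproduce the TLS vs TCP comparison locally, something along these lines should work (an untested sketch, assuming distributed's `Security.temporary()` helper and the `security=`/`protocol=` keyword arguments on `LocalCluster`):

```python
from distributed import Client, LocalCluster
from distributed.security import Security

def make_client(tls: bool) -> Client:
    # Spin up a 2-worker local cluster speaking either TLS or plain TCP.
    if tls:
        sec = Security.temporary()  # throwaway self-signed certificates for testing
        cluster = LocalCluster(n_workers=2, security=sec, protocol="tls://")
    else:
        cluster = LocalCluster(n_workers=2, protocol="tcp://")
    return Client(cluster)
```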
I ran a version of my script above on identical clusters using TCP vs TLS. This particular trial showed TCP ~1.5x faster; on others it's been up to 1.9x faster.
Full data
Since we are comparing low-level sends, maybe it is worth running some benchmarks with ery as well.
@quasiben I know you and others have looked at cloud network performance a bit already. Is there anyone who might be interested in looking at these performance issues so we don't repeat work?

@jakirkham from reading the SSLSocket read implementation and OpenSSL docs, I'm now not so sure of my theory. Assuming the socket is non-blocking (it is, right?), the … A different interesting thing from the OpenSSL docs, though:

I don't know OpenSSL very well. Would an SSL write on the same socket through the same SSL object count as non-application data? Since this shuffle workload has each worker communicating with all others, there's a lot of concurrent sending and receiving. If reads could somehow be blocking on the writes, that might explain why we see so much event loop blocking in reads. Maybe a good next step would be to have this test script do all-to-all transfers, instead of A->B. Also, I don't think I was py-spying in native mode; I'll try that and see if we can get traces for what's happening inside of …
Got a native profile from py-spy. First I just must say @benfred, so many thanks for this great tool. Being able to get native profiles integrated with the Python callstack, with a single flag and no mucking around in perf, is amazing!

Worker profile (see …). So we can immediately see that 71% of the event loop's time is spent blocking on glibc … One odd thing is that py-spy's line numbers in …

So the problem is the classic https://bugs.python.org/issue7946: contention re-acquiring the GIL when returning from a … But that thread was about the scheduler. It's interesting to see how much this affects workers' performance transferring data. We have much more control to optimize the scheduler; when thinking about networking code, since workers are running user code, we should just assume there could always be high GIL contention.
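To make the bpo-7946 effect concrete, here is a tiny, hypothetical experiment (not from this issue): time loopback recvs with and without a pure-Python thread hogging the GIL. Each `recv` releases the GIL around the syscall and then has to win it back, so the hogging thread can add up to a full switch interval of latency per call:

```python
import socket
import threading
import time

stop = threading.Event()

def gil_hog():
    # Pure-Python loop: holds the GIL for the full switch interval between drops.
    x = 0
    while not stop.is_set():
        x += 1

def mean_recv_seconds(n: int = 500, size: int = 1024) -> float:
    a, b = socket.socketpair()
    start = time.perf_counter()
    for _ in range(n):
        a.send(b"x" * size)
        b.recv(size)  # releases the GIL for the syscall, then re-acquires it
    a.close()
    b.close()
    return (time.perf_counter() - start) / n

print(f"quiet process:  {mean_recv_seconds() * 1e6:8.1f} µs per recv")
threading.Thread(target=gil_hog, daemon=True).start()
print(f"with a GIL hog: {mean_recv_seconds() * 1e6:8.1f} µs per recv")
stop.set()
```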
Yeah, I was looking at the GIL release/acquisition lines earlier and was wondering if that was involved somehow. That said, how much data are we receiving here? Currently we are receiving at most 2GB per read (due to an OpenSSL 1.0.2 bug; see https://bugs.python.org/issue42853). Are we filling more than 2GB and, if so, how much more?
Yeah, I'm curious if releasing the GIL could actually be skipped for non-blocking sockets. I think we're definitely getting less than 2GB per read, but I can confirm later. Our sends aim to be ~5MB for memory reasons.
How hard would this be to try? Is this a small change in CPython followed by a rebuild?
I think the issue is just that network IO is much more serial than we want it to be. You can see in the same py-spy profile that in Dask-DefaultThreads-62-4 and -2, these threads are blocked 80% of the time in our `maybe_pack` function acquiring a semaphore. `maybe_pack` is a function that processes some DataFrames, serializes them, and puts them on a queue for a pool of coroutines to send over the comm. When a coroutine has sent an item, it releases the semaphore. (So this is basically just a bounded queue, just with the blocking points shifted a little for memory reasons.)

This tells me that we can't write data out fast enough to the network. But looking at the event loop, we saw writes only take 7% of the time. The issue is that the sending coroutines don't get scheduled on the event loop often enough, because the read coroutines spend so much time blocking the event loop trying to re-acquire the GIL.
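For readers following along, the pattern being described is roughly a semaphore-bounded queue feeding a pool of sender coroutines. A sketch with hypothetical names (distributed's real shuffle code is more involved; `serialize` and `send` are caller-supplied here):

```python
import asyncio

async def bounded_send_pool(frames, serialize, send, n_senders=4, max_pending=8):
    """Serialize frames and send them, holding at most max_pending serialized
    payloads in memory at once."""
    sem = asyncio.Semaphore(max_pending)
    queue = asyncio.Queue()

    async def produce():
        for frame in frames:
            payload = serialize(frame)      # the "maybe_pack" step
            await sem.acquire()             # where the worker threads were blocked
            await queue.put(payload)
        for _ in range(n_senders):
            await queue.put(None)           # tell each sender to exit

    async def sender():
        while (payload := await queue.get()) is not None:
            await send(payload)             # write to the comm
            sem.release()                   # free a slot for the producer

    await asyncio.gather(produce(), *(sender() for _ in range(n_senders)))
```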
Do we know why the GIL would be hard to reacquire here?
I think so? Alternatively, if we could get …
Probably just because our worker threads are mostly doing GIL-holding things when they're not blocked?
A much simpler test would be to turn SSL off. Have we already done this somewhere (guessing so)? |
See the profile in #5258 (comment). I spent far too much time trying to get an equivalent profile with … Also, TCP sockets still release and retake the GIL on every send, so is that the test we'd want?
Just to add to what Gabe said above on this.
I think this is the same answer that Antoine gave related to …
So presumably something in Python picks up the slack while OpenSSL is doing work. However, once OpenSSL is ready to rejoin, it has to wait until there is a moment when it can reacquire the GIL.
My gut says this would be fairly quick to hack badly into CPython to see if things improve (just stop dropping the GIL for all socket calls). A proper patch would interact more with the asyncio event loop implementation (asyncio or uvloop) to only keep the GIL for async socket operations.
@jcrist agreed. Building CPython (and getting it into a Docker image if we wanted to test on our real shuffle workload) would be the slowest part. I would be very interested to see the results of this!
Saw PR (python/cpython#31492) recently, which looks potentially interesting in this context.
I wrote a quick script to test network performance between workers.
We try 3 approaches:

1. Make the scheduler transfer a DataFrame through task dependencies. This includes client <--> scheduler <--> workers communication in transfer time, so it isn't actually a good measure of bandwidth, but it's maybe still an interesting number, since it's closer to (but still an over-estimation of) a measure of the real-life bandwidth a data transfer would experience. (A rough sketch of this approach is shown after this list.)
2. Add simple comms handlers and send the DataFrame between workers directly.
3. Use `iperf3` to test raw TCP network (and disk) performance, to get an upper bound on what we should expect from the network.

This script was a quick hack (DataFrame size doesn't match the iperf3 transfer size, for instance), but others might still find it interesting.
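As a rough illustration of approach 1 (this is not the author's script, which is in the collapsed "The script" section below; the scheduler address and sizes here are made up):

```python
import time
import numpy as np
import pandas as pd
from distributed import Client, wait

client = Client("tcp://scheduler-address:8786")       # assumed scheduler address
a, b = list(client.scheduler_info()["workers"])[:2]   # pick two worker addresses

def make_df(nbytes):
    return pd.DataFrame({"x": np.random.random(nbytes // 8)})

nbytes = 100_000_000                                   # ~100 MB of float64 data
df = client.submit(make_df, nbytes, workers=[a])       # pin the data to worker a
wait(df)                                               # data now lives on worker a

start = time.time()
result = client.submit(len, df, workers=[b])           # dependency forces an a -> b transfer
result.result()
elapsed = time.time() - start
print(f"~{nbytes / elapsed / 1e6:.0f} MB/s (includes scheduler round-trips)")
```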
The script
Initial results:
On a Coiled cluster (Docker on AWS EC2 VMs; I don't know the exact instance type, but I requested 2 CPUs and 2GiB memory, so something low-end):
So Dask's networking is only half as fast as raw TCP here. That's better than I expected, actually.
Using comms handlers directly is faster, though not hugely. Also not surprising.
On these low-end EC2 nodes, networking is slow. And disk is very slow.
Full results
Locally on my mac:
Full results
Dask lags further behind a very fast network, but it's still much faster than a slow network! Does it need to be faster? How often are we actually bandwidth-constrained (versus being constrained by event loop/GIL/worker threads/things that can do something useful with that bandwidth)?