DGX Nightly Benchmark run 20210217 #109

Open · quasiben opened this issue Feb 17, 2021 · 4 comments

@quasiben (Owner)

Benchmark history

[Image: benchmark history plot]

Raw Data

<Client: 'tcp://127.0.0.1:36573' processes=10 threads=10, memory=540.94 GB>
Distributed Version: 2021.02.0+7.g383ea032
simple       5.552e-01 +/- 4.505e-02
shuffle      2.322e+01 +/- 8.996e-01
rand_access  1.058e-02 +/- 6.584e-04
anom_mean    1.141e+02 +/- 2.758e+00

Raw Values

simple
[0.55093336 0.56203961 0.53239679 0.54476047 0.60462093 0.53786802
0.53799701 0.56506395 0.46722174 0.64862871]
shuffle
[23.48259377 23.3623848 21.50754213 23.67603922 22.89431787 24.37197256
24.03984976 21.67854238 23.82202053 23.35824418]
rand_access
[0.00939989 0.00964594 0.01096702 0.01064014 0.01120782 0.0106523
0.01046085 0.0117898 0.01062775 0.01038837]
anom_mean
[112.48733354 113.56297135 114.61307716 113.53013754 114.78903866
113.29076409 112.27411127 110.57477736 114.84757924 121.52652001]
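
As a quick sanity check, the summary statistics above can be reproduced from these raw values as mean +/- population standard deviation; a minimal sketch, assuming only numpy:

```python
# Reproduce the "simple" summary line from the raw per-run timings above.
# numpy's std() defaults to the population standard deviation (ddof=0),
# which matches the reported +/- values.
import numpy as np

simple = np.array([0.55093336, 0.56203961, 0.53239679, 0.54476047, 0.60462093,
                   0.53786802, 0.53799701, 0.56506395, 0.46722174, 0.64862871])

print(f"simple: {simple.mean():.3e} +/- {simple.std():.3e}")
# simple: 5.552e-01 +/- 4.505e-02
```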

Dask Profiles

Scheduler Execution Graph

[Image: scheduler execution graph]

@jakirkham (Collaborator)

Looking at the shuffle profile, the screenshots below zoom in a bit (though not that much, really) to show how much time is spent in write and extract_serialize respectively. There is one other write call, which is not as large (though still larger than extract_serialize). There are also a couple of read calls that take a fair bit of time, with similarly small from_frames calls associated with them.

[Screenshots: shuffle profile flame graph, captured 2021-02-17 around 7:13 PM]
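
For anyone wanting to reproduce profiles like these, distributed ships a performance_report context manager that writes scheduler/worker profiling to an HTML page. A minimal sketch; the workload below is a hypothetical stand-in, not the benchmark suite's actual shuffle code:

```python
# Sketch: capture a Dask performance report around a shuffle-like workload.
# The timeseries dataset and task-based shuffle here are illustrative only.
import dask
from dask.distributed import Client, performance_report, wait

client = Client()  # local cluster by default; the benchmark runs on a DGX

with performance_report(filename="shuffle-report.html"):
    df = dask.datasets.timeseries()
    shuffled = df.shuffle("id", shuffle="tasks")
    wait(shuffled.persist())
```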

cc @mrocklin

@mrocklin (Contributor)

Hrm, that's odd. In general I feel like each of these profiling technologies is good at identifying a different kind of activity. I notice in the tree view above that extract_serialize has the largest percentage (1.5%) of any leaf node.

@jakirkham (Collaborator)

Yeah it's interesting. Not saying the Dask profile necessarily tells the full story either.

Something else interesting is that the socket send and recv calls take around 0.6% in the call graph, which differs from what we see in viztracer. I wonder if we are missing something here, or if there are limitations of each of these tools that we need to factor in somehow. Antoine seemed to allude to that here ( dask/distributed#4443 (comment) ).
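
For comparison, a minimal viztracer invocation looks something like this; run_shuffle() is a hypothetical placeholder for the benchmark workload, not a function from this repo:

```python
# Sketch: trace a workload with viztracer and dump a trace file for inspection.
from viztracer import VizTracer

with VizTracer(output_file="shuffle_trace.json"):
    run_shuffle()  # hypothetical placeholder for the shuffle benchmark

# View interactively with: vizviewer shuffle_trace.json
```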

Ignoring that for a moment, if we look at the read portion of the call graph, we see read_bytes takes 1.63% and read_into takes 1.81%; walking all the way down these branches to their leaves, recv_into takes 0.67% and isinstance takes 0.59%. Subtracting the leaves from these base read_* functions leaves (1.63% + 1.81%) - (0.67% + 0.59%) = 2.18% of the time accounted for there. This is also larger than from_frames, at 1.35%, on the same read branch.

Agreed that on the write side extract_serialize seems to be the dominating component, whereas on the read side various Tornado functions seem to dominate. So, at least from the call graph, things look balanced between Tornado and serialization overhead, though admittedly other profiling strategies seem to show one or the other as the larger contributor. Thus far my working theory is that these are roughly equal, as the call graph would suggest, but I could be wrong about this.
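
One way to poke at the serialization side in isolation is to time distributed's public serialize/deserialize helpers directly (extract_serialize itself is an internal helper on the same code path). A rough sketch, with a hypothetical payload standing in for the shuffle data:

```python
# Sketch: micro-benchmark distributed's serialization of a numpy payload.
import time

import numpy as np
from distributed.protocol import deserialize, serialize

payload = np.random.random((1000, 1000))  # hypothetical stand-in for shuffle data

t0 = time.perf_counter()
header, frames = serialize(payload)
t1 = time.perf_counter()
roundtripped = deserialize(header, frames)
t2 = time.perf_counter()

print(f"serialize:   {t1 - t0:.6f} s")
print(f"deserialize: {t2 - t1:.6f} s")
```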

@mrocklin (Contributor)

🤷 :)
