
Alternative method to avoid rand contention in highly parallel usage #2

Merged
mroth merged 2 commits from rand-contention into master on Jul 21, 2020

Conversation

mroth (Owner) commented on Jun 23, 2020

While this hasn't been a real-world performance issue in my particular use case, it is a known theoretical issue with this library: the use of global rand, while convenient for users of the API, can cause lock contention, and therefore performance degradation, when selections are made across multiple goroutines simultaneously in high-throughput situations. Since more people seem to be adopting this library, it's worth taking a look.

Initial Profiling

Adding an appropriate RunParallel benchmark and checking across different CPU counts can show us the impact of this (a sketch of such a benchmark appears below, followed by the results):
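
A minimal sketch of this kind of benchmark. It assumes the v0.x-era API, where NewChooser(...Choice) returns a Chooser value directly; the weights used here are arbitrary, not necessarily those of the exact benchmark merged in this PR:

    package weightedrand_test

    import (
        "strconv"
        "testing"

        "github.com/mroth/weightedrand"
    )

    // BenchmarkPickParallel exercises a single shared Chooser from many
    // goroutines at once, across several choice-set sizes.
    func BenchmarkPickParallel(b *testing.B) {
        for n := 10; n <= 1000000; n *= 10 {
            n := n // capture loop variable for the closure (pre-go1.22 semantics)
            b.Run(strconv.Itoa(n), func(b *testing.B) {
                choices := make([]weightedrand.Choice, n)
                for i := range choices {
                    choices[i] = weightedrand.Choice{Item: i, Weight: uint(i) + 1}
                }
                chooser := weightedrand.NewChooser(choices...)
                b.ResetTimer()
                b.RunParallel(func(pb *testing.PB) {
                    for pb.Next() {
                        chooser.Pick() // every goroutine contends on global rand's mutex
                    }
                })
            })
        }
    }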

$ go test -run=^$ -bench=Parallel$ -cpu=1,2,4,8,16 -benchmem            
goos: darwin
goarch: amd64
pkg: github.com/mroth/weightedrand
BenchmarkPickParallel/10        26506533                48.0 ns/op             0 B/op          0 allocs/op
BenchmarkPickParallel/10-2      19374950                63.0 ns/op             0 B/op          0 allocs/op
BenchmarkPickParallel/10-4      12004538               101 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/10-8       9158463               136 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/10-16      7970101               146 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/100               16529026                71.3 ns/op             0 B/op          0 allocs/op
BenchmarkPickParallel/100-2             13610745                86.9 ns/op             0 B/op          0 allocs/op
BenchmarkPickParallel/100-4              9007213               132 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/100-8              6812714               176 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/100-16             6365052               188 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/1000              11643002               102 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/1000-2             9826347               121 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/1000-4             7176842               168 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/1000-8             5629867               213 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/1000-16            5286403               226 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/10000              8969794               129 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/10000-2            7945258               150 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/10000-4            5999011               199 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/10000-8            5052934               236 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/10000-16           4834681               250 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/100000             7166805               170 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/100000-2           6316102               193 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/100000-4           5007319               239 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/100000-8           4734712               255 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/100000-16          4552845               266 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/1000000            3980599               301 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/1000000-2          4381100               279 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/1000000-4          4397823               274 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/1000000-8          4154133               290 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/1000000-16         3985110               301 ns/op               0 B/op          0 allocs/op
PASS
ok      github.com/mroth/weightedrand   45.149s

Regardless of the number of Choices, as we increase the number of parallel CPUs attempting to use a Chooser simultaneously, performance decreases rather than increases. In practice, going from 1 CPU to 16 CPUs roughly triples per-operation time (e.g. 48.0 ns to 146 ns for 10 choices), cutting effective throughput to about a third.

Using CPU profiling and examining the 16 CPU benchmark run via pprof confirms lock contention is indeed blocking compute quite significantly during this highly parallel utilization:
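
For reference, a profile like this can be captured with standard Go tooling; the exact flags below are illustrative:

    $ go test -run=^$ -bench=Parallel$ -cpu=16 -cpuprofile=cpu.out
    $ go tool pprof cpu.out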

(pprof) top20
Showing nodes accounting for 44.14s, 97.35% of 45.34s total
Dropped 81 nodes (cum <= 0.23s)
Showing top 20 nodes out of 54
      flat  flat%   sum%        cum   cum%
    34.94s 77.06% 77.06%     34.96s 77.11%  runtime.usleep
     3.70s  8.16% 85.22%      3.71s  8.18%  runtime.pthread_cond_wait
     1.94s  4.28% 89.50%      1.94s  4.28%  runtime.nanotime1
     1.61s  3.55% 93.05%      1.61s  3.55%  runtime.(*semaRoot).queue
     1.39s  3.07% 96.12%      1.39s  3.07%  runtime.pthread_cond_signal
     0.14s  0.31% 96.43%      0.27s   0.6%  sort.doPivot_func
     0.09s   0.2% 96.63%      4.24s  9.35%  runtime.semacquire1
     0.07s  0.15% 96.78%      4.78s 10.54%  runtime.lock
     0.05s  0.11% 96.89%      5.75s 12.68%  sync.(*Mutex).lockSlow
     0.04s 0.088% 96.98%     30.64s 67.58%  runtime.runqgrab
     0.03s 0.066% 97.04%      8.11s 17.89%  math/rand.(*lockedSource).Int63
     0.03s 0.066% 97.11%     35.28s 77.81%  runtime.findrunnable
     0.03s 0.066% 97.18%      0.59s  1.30%  runtime.resetspinning
     0.02s 0.044% 97.22%      8.13s 17.93%  math/rand.(*Rand).Int31n
     0.02s 0.044% 97.27%      0.53s  1.17%  runtime.checkTimers
     0.01s 0.022% 97.29%      0.30s  0.66%  github.com/mroth/weightedrand.NewChooser
     0.01s 0.022% 97.31%      8.14s 17.95%  math/rand.(*Rand).Intn
     0.01s 0.022% 97.33%      5.76s 12.70%  sync.(*Mutex).Lock (inline)
     0.01s 0.022% 97.35%      2.32s  5.12%  sync.(*Mutex).Unlock (inline)
         0     0% 97.35%      0.47s  1.04%  github.com/mroth/weightedrand.BenchmarkPickParallel.func1

[pprof call graph: profile001]

Patch and Benchmarks

This PR introduces a PickSource(*rand.Rand) method, a variant of Pick() that takes a reference to a source of randomness, allowing us to give each goroutine its own unique rand source and avoid locks entirely. Now, as we add more CPUs, the workload scales. The performance impact shown in this benchmark is quite significant (~2x at 2 CPUs, ~20x at 16 CPUs):

$ benchstat before.sample after.sample
name                     old time/op  new time/op  delta
PickParallel/10          48.0ns ± 6%  44.5ns ± 7%   -7.27%  (p=0.000 n=10+10)
PickParallel/10-2        63.1ns ± 1%  23.0ns ± 6%  -63.56%  (p=0.000 n=8+10)
PickParallel/10-4         103ns ± 2%    12ns ± 7%  -88.68%  (p=0.000 n=10+10)
PickParallel/10-8         138ns ± 2%     6ns ± 4%  -95.72%  (p=0.000 n=9+10)
PickParallel/10-16        148ns ± 7%     4ns ± 2%  -97.19%  (p=0.000 n=9+9)
PickParallel/100         71.0ns ± 1%  68.5ns ± 1%   -3.44%  (p=0.000 n=10+10)
PickParallel/100-2       88.1ns ± 2%  34.8ns ± 2%  -60.52%  (p=0.000 n=10+10)
PickParallel/100-4        135ns ± 1%    18ns ± 1%  -86.89%  (p=0.000 n=10+10)
PickParallel/100-8        178ns ± 0%     9ns ± 1%  -94.94%  (p=0.000 n=10+9)
PickParallel/100-16       190ns ± 0%     6ns ± 1%  -96.73%  (p=0.000 n=8+9)
PickParallel/1000         102ns ± 1%    99ns ± 1%   -2.28%  (p=0.000 n=10+10)
PickParallel/1000-2       122ns ± 0%    50ns ± 1%  -58.89%  (p=0.000 n=10+10)
PickParallel/1000-4       170ns ± 0%    26ns ± 0%  -84.88%  (p=0.000 n=10+6)
PickParallel/1000-8       217ns ± 0%    13ns ± 0%  -94.00%  (p=0.000 n=10+8)
PickParallel/1000-16      229ns ± 1%     9ns ± 1%  -96.14%  (p=0.000 n=10+10)
PickParallel/10000        131ns ± 1%   127ns ± 0%   -2.83%  (p=0.000 n=10+8)
PickParallel/10000-2      150ns ± 0%    64ns ± 1%  -57.08%  (p=0.000 n=9+10)
PickParallel/10000-4      204ns ± 2%    33ns ± 1%  -83.85%  (p=0.000 n=10+10)
PickParallel/10000-8      240ns ± 1%    17ns ± 0%  -93.08%  (p=0.000 n=10+10)
PickParallel/10000-16     251ns ± 1%    11ns ± 0%  -95.54%  (p=0.000 n=10+8)
PickParallel/100000       169ns ± 1%   166ns ± 1%   -1.60%  (p=0.000 n=10+10)
PickParallel/100000-2     192ns ± 2%    85ns ± 3%  -55.83%  (p=0.000 n=10+10)
PickParallel/100000-4     249ns ± 2%    43ns ± 0%  -82.83%  (p=0.000 n=10+9)
PickParallel/100000-8     259ns ± 1%    22ns ± 1%  -91.59%  (p=0.000 n=10+10)
PickParallel/100000-16    266ns ± 1%    14ns ± 0%  -94.56%  (p=0.000 n=10+9)
PickParallel/1000000      305ns ± 1%   306ns ± 2%     ~     (p=0.329 n=9+10)
PickParallel/1000000-2    280ns ± 1%   150ns ± 1%  -46.25%  (p=0.000 n=10+10)
PickParallel/1000000-4    279ns ± 1%    76ns ± 1%  -72.67%  (p=0.000 n=10+10)
PickParallel/1000000-8    295ns ± 1%    38ns ± 1%  -87.02%  (p=0.000 n=10+9)
PickParallel/1000000-16   301ns ± 1%    23ns ± 3%  -92.19%  (p=0.000 n=10+10)

Time is again being spent as it should be:
[pprof call graph: profile002]
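
For context, the shape of such a method is small. A hedged sketch follows; the unexported field names (data, totals, max) and the binary search over cumulative weights are assumptions based on how the library works, not a verbatim copy of the merged change:

    package weightedrand

    import (
        "math/rand"
        "sort"
    )

    // Choice and Chooser here mirror the public API; the unexported
    // fields are assumptions about the internal layout.
    type Choice struct {
        Item   interface{}
        Weight uint
    }

    type Chooser struct {
        data   []Choice
        totals []int // cumulative weight totals, sorted ascending
        max    int   // sum of all weights
    }

    // PickSource behaves like Pick but draws randomness from the provided
    // *rand.Rand, so callers can give each goroutine its own lock-free source.
    func (c Chooser) PickSource(rs *rand.Rand) interface{} {
        r := rs.Intn(c.max) + 1           // uniform in [1, max]
        i := sort.SearchInts(c.totals, r) // binary search the cumulative table
        return c.data[i].Item
    }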

Considerations

Adding a new method complicates the API, especially since it opens the door to potential misuse by developers who are not familiar with the underlying safety issues. Additionally, it is still unconfirmed whether any users of this library currently have a highly parallel use case.

If this is merged, the documentation should make clear in which situations this method should be used, and provide appropriate sample code (a sketch follows below) so that it can be used safely.
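
A hedged sample of the intended safe usage pattern: one private *rand.Rand per goroutine, since math/rand.Rand is not itself safe for concurrent use. The seeding scheme and choice values here are illustrative:

    package main

    import (
        "fmt"
        "math/rand"
        "sync"
        "time"

        "github.com/mroth/weightedrand"
    )

    func main() {
        chooser := weightedrand.NewChooser( // v0.x API: returns a Chooser value
            weightedrand.Choice{Item: "a", Weight: 5},
            weightedrand.Choice{Item: "b", Weight: 3},
        )

        var wg sync.WaitGroup
        for i := 0; i < 8; i++ {
            wg.Add(1)
            go func(seed int64) {
                defer wg.Done()
                // One private source per goroutine: no shared lock, no contention.
                rng := rand.New(rand.NewSource(seed))
                for j := 0; j < 1000; j++ {
                    _ = chooser.PickSource(rng)
                }
            }(time.Now().UnixNano() + int64(i))
        }
        wg.Wait()
        fmt.Println("done")
    }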

mroth merged commit fcfd837 into master on Jul 21, 2020
mroth deleted the rand-contention branch on Jul 21, 2020
mroth (Owner, Author) commented on Aug 13, 2023

For people coming across this PR in 2023 and beyond: usage of PickSource is no longer required for this performance benefit when using go1.21 or greater; you can simply use Pick, as long as you aren't manually seeding global rand (which is no longer recommended). The changes are documented in #28.
