
Alternative method to avoid rand contention in highly parallel usage #2

Merged
mroth merged 2 commits from rand-contention into master on Jul 21, 2020

Conversation

mroth (Owner) commented on Jun 23, 2020

While this hasn't been a real-world performance issue in my particular use case, it is a known theoretical issue with this library: the use of global rand, while convenient for users of the API, can cause lock contention, and therefore performance degradation, when selections are made across multiple goroutines simultaneously in high-throughput situations. Since more people seem to be adopting this library, it's worth taking a look.

Initial Profiling

Adding an appropriate RunParallel benchmark and checking across different CPU counts can show us the impact of this (a sketch of such a benchmark appears below, followed by the results):
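
A minimal sketch of this kind of benchmark. It assumes the v0.x-era API, where NewChooser(...Choice) returns a Chooser value directly; the weights used here are arbitrary, not necessarily those of the exact benchmark merged in this PR:

    package weightedrand_test

    import (
        "strconv"
        "testing"

        "github.com/mroth/weightedrand"
    )

    // BenchmarkPickParallel exercises a single shared Chooser from many
    // goroutines at once, across several choice-set sizes.
    func BenchmarkPickParallel(b *testing.B) {
        for n := 10; n <= 1000000; n *= 10 {
            n := n // capture loop variable for the closure (pre-go1.22 semantics)
            b.Run(strconv.Itoa(n), func(b *testing.B) {
                choices := make([]weightedrand.Choice, n)
                for i := range choices {
                    choices[i] = weightedrand.Choice{Item: i, Weight: uint(i) + 1}
                }
                chooser := weightedrand.NewChooser(choices...)
                b.ResetTimer()
                b.RunParallel(func(pb *testing.PB) {
                    for pb.Next() {
                        chooser.Pick() // every goroutine contends on global rand's mutex
                    }
                })
            })
        }
    }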

$ go test -run=^$ -bench=Parallel$ -cpu=1,2,4,8,16 -benchmem            
goos: darwin
goarch: amd64
pkg: github.com/mroth/weightedrand
BenchmarkPickParallel/10        26506533                48.0 ns/op             0 B/op          0 allocs/op
BenchmarkPickParallel/10-2      19374950                63.0 ns/op             0 B/op          0 allocs/op
BenchmarkPickParallel/10-4      12004538               101 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/10-8       9158463               136 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/10-16      7970101               146 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/100               16529026                71.3 ns/op             0 B/op          0 allocs/op
BenchmarkPickParallel/100-2             13610745                86.9 ns/op             0 B/op          0 allocs/op
BenchmarkPickParallel/100-4              9007213               132 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/100-8              6812714               176 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/100-16             6365052               188 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/1000              11643002               102 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/1000-2             9826347               121 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/1000-4             7176842               168 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/1000-8             5629867               213 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/1000-16            5286403               226 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/10000              8969794               129 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/10000-2            7945258               150 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/10000-4            5999011               199 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/10000-8            5052934               236 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/10000-16           4834681               250 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/100000             7166805               170 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/100000-2           6316102               193 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/100000-4           5007319               239 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/100000-8           4734712               255 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/100000-16          4552845               266 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/1000000            3980599               301 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/1000000-2          4381100               279 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/1000000-4          4397823               274 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/1000000-8          4154133               290 ns/op               0 B/op          0 allocs/op
BenchmarkPickParallel/1000000-16         3985110               301 ns/op               0 B/op          0 allocs/op
PASS
ok      github.com/mroth/weightedrand   45.149s

Regardless of the number of Choices, as we increase the number of parallel CPUs attempting to use a Chooser simultaneously, performance decreases rather than increases. In practice, going from 1 CPU to 16 CPUs roughly triples per-operation time (e.g. 48.0 ns to 146 ns for 10 choices), cutting effective throughput to about a third.

Using CPU profiling and examining the 16 CPU benchmark run via pprof confirms lock contention is indeed blocking compute quite significantly during this highly parallel utilization:
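
For reference, a profile like this can be captured with standard Go tooling; the exact flags below are illustrative:

    $ go test -run=^$ -bench=Parallel$ -cpu=16 -cpuprofile=cpu.out
    $ go tool pprof cpu.out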

(pprof) top20
Showing nodes accounting for 44.14s, 97.35% of 45.34s total
Dropped 81 nodes (cum <= 0.23s)
Showing top 20 nodes out of 54
      flat  flat%   sum%        cum   cum%
    34.94s 77.06% 77.06%     34.96s 77.11%  runtime.usleep
     3.70s  8.16% 85.22%      3.71s  8.18%  runtime.pthread_cond_wait
     1.94s  4.28% 89.50%      1.94s  4.28%  runtime.nanotime1
     1.61s  3.55% 93.05%      1.61s  3.55%  runtime.(*semaRoot).queue
     1.39s  3.07% 96.12%      1.39s  3.07%  runtime.pthread_cond_signal
     0.14s  0.31% 96.43%      0.27s   0.6%  sort.doPivot_func
     0.09s   0.2% 96.63%      4.24s  9.35%  runtime.semacquire1
     0.07s  0.15% 96.78%      4.78s 10.54%  runtime.lock
     0.05s  0.11% 96.89%      5.75s 12.68%  sync.(*Mutex).lockSlow
     0.04s 0.088% 96.98%     30.64s 67.58%  runtime.runqgrab
     0.03s 0.066% 97.04%      8.11s 17.89%  math/rand.(*lockedSource).Int63
     0.03s 0.066% 97.11%     35.28s 77.81%  runtime.findrunnable
     0.03s 0.066% 97.18%      0.59s  1.30%  runtime.resetspinning
     0.02s 0.044% 97.22%      8.13s 17.93%  math/rand.(*Rand).Int31n
     0.02s 0.044% 97.27%      0.53s  1.17%  runtime.checkTimers
     0.01s 0.022% 97.29%      0.30s  0.66%  github.com/mroth/weightedrand.NewChooser
     0.01s 0.022% 97.31%      8.14s 17.95%  math/rand.(*Rand).Intn
     0.01s 0.022% 97.33%      5.76s 12.70%  sync.(*Mutex).Lock (inline)
     0.01s 0.022% 97.35%      2.32s  5.12%  sync.(*Mutex).Unlock (inline)
         0     0% 97.35%      0.47s  1.04%  github.com/mroth/weightedrand.BenchmarkPickParallel.func1

[pprof call graph: profile001]

Patch and Benchmarks

This PR introduces a PickSource(*rand.Rand) method, a variant of Pick() that takes a reference to a source of randomness, allowing us to give each goroutine its own unique rand source and avoid locks entirely. Now, as we add more CPUs, the workload scales. The performance impact shown in this benchmark is quite significant (~2x at 2 CPUs, ~20x at 16 CPUs):

$ benchstat before.sample after.sample
name                     old time/op  new time/op  delta
PickParallel/10          48.0ns ± 6%  44.5ns ± 7%   -7.27%  (p=0.000 n=10+10)
PickParallel/10-2        63.1ns ± 1%  23.0ns ± 6%  -63.56%  (p=0.000 n=8+10)
PickParallel/10-4         103ns ± 2%    12ns ± 7%  -88.68%  (p=0.000 n=10+10)
PickParallel/10-8         138ns ± 2%     6ns ± 4%  -95.72%  (p=0.000 n=9+10)
PickParallel/10-16        148ns ± 7%     4ns ± 2%  -97.19%  (p=0.000 n=9+9)
PickParallel/100         71.0ns ± 1%  68.5ns ± 1%   -3.44%  (p=0.000 n=10+10)
PickParallel/100-2       88.1ns ± 2%  34.8ns ± 2%  -60.52%  (p=0.000 n=10+10)
PickParallel/100-4        135ns ± 1%    18ns ± 1%  -86.89%  (p=0.000 n=10+10)
PickParallel/100-8        178ns ± 0%     9ns ± 1%  -94.94%  (p=0.000 n=10+9)
PickParallel/100-16       190ns ± 0%     6ns ± 1%  -96.73%  (p=0.000 n=8+9)
PickParallel/1000         102ns ± 1%    99ns ± 1%   -2.28%  (p=0.000 n=10+10)
PickParallel/1000-2       122ns ± 0%    50ns ± 1%  -58.89%  (p=0.000 n=10+10)
PickParallel/1000-4       170ns ± 0%    26ns ± 0%  -84.88%  (p=0.000 n=10+6)
PickParallel/1000-8       217ns ± 0%    13ns ± 0%  -94.00%  (p=0.000 n=10+8)
PickParallel/1000-16      229ns ± 1%     9ns ± 1%  -96.14%  (p=0.000 n=10+10)
PickParallel/10000        131ns ± 1%   127ns ± 0%   -2.83%  (p=0.000 n=10+8)
PickParallel/10000-2      150ns ± 0%    64ns ± 1%  -57.08%  (p=0.000 n=9+10)
PickParallel/10000-4      204ns ± 2%    33ns ± 1%  -83.85%  (p=0.000 n=10+10)
PickParallel/10000-8      240ns ± 1%    17ns ± 0%  -93.08%  (p=0.000 n=10+10)
PickParallel/10000-16     251ns ± 1%    11ns ± 0%  -95.54%  (p=0.000 n=10+8)
PickParallel/100000       169ns ± 1%   166ns ± 1%   -1.60%  (p=0.000 n=10+10)
PickParallel/100000-2     192ns ± 2%    85ns ± 3%  -55.83%  (p=0.000 n=10+10)
PickParallel/100000-4     249ns ± 2%    43ns ± 0%  -82.83%  (p=0.000 n=10+9)
PickParallel/100000-8     259ns ± 1%    22ns ± 1%  -91.59%  (p=0.000 n=10+10)
PickParallel/100000-16    266ns ± 1%    14ns ± 0%  -94.56%  (p=0.000 n=10+9)
PickParallel/1000000      305ns ± 1%   306ns ± 2%     ~     (p=0.329 n=9+10)
PickParallel/1000000-2    280ns ± 1%   150ns ± 1%  -46.25%  (p=0.000 n=10+10)
PickParallel/1000000-4    279ns ± 1%    76ns ± 1%  -72.67%  (p=0.000 n=10+10)
PickParallel/1000000-8    295ns ± 1%    38ns ± 1%  -87.02%  (p=0.000 n=10+9)
PickParallel/1000000-16   301ns ± 1%    23ns ± 3%  -92.19%  (p=0.000 n=10+10)

Time is again being spent as it should be:
[pprof call graph: profile002]
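
For context, the shape of such a method is small. A hedged sketch follows; the unexported field names (data, totals, max) and the binary search over cumulative weights are assumptions based on how the library works, not a verbatim copy of the merged change:

    package weightedrand

    import (
        "math/rand"
        "sort"
    )

    // Choice and Chooser here mirror the public API; the unexported
    // fields are assumptions about the internal layout.
    type Choice struct {
        Item   interface{}
        Weight uint
    }

    type Chooser struct {
        data   []Choice
        totals []int // cumulative weight totals, sorted ascending
        max    int   // sum of all weights
    }

    // PickSource behaves like Pick but draws randomness from the provided
    // *rand.Rand, so callers can give each goroutine its own lock-free source.
    func (c Chooser) PickSource(rs *rand.Rand) interface{} {
        r := rs.Intn(c.max) + 1           // uniform in [1, max]
        i := sort.SearchInts(c.totals, r) // binary search the cumulative table
        return c.data[i].Item
    }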

Considerations

Adding a new method complicates the API, especially since it opens the door to potential misuse by developers who are not familiar with the underlying safety issues. Additionally, it is still unconfirmed whether any users of this library currently have a highly parallel use case.

If this is merged, the documentation should make clear in which situations this method should be used, and provide appropriate sample code (a sketch follows below) so that it can be used safely.
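
A hedged sample of the intended safe usage pattern: one private *rand.Rand per goroutine, since math/rand.Rand is not itself safe for concurrent use. The seeding scheme and choice values here are illustrative:

    package main

    import (
        "fmt"
        "math/rand"
        "sync"
        "time"

        "github.com/mroth/weightedrand"
    )

    func main() {
        chooser := weightedrand.NewChooser( // v0.x API: returns a Chooser value
            weightedrand.Choice{Item: "a", Weight: 5},
            weightedrand.Choice{Item: "b", Weight: 3},
        )

        var wg sync.WaitGroup
        for i := 0; i < 8; i++ {
            wg.Add(1)
            go func(seed int64) {
                defer wg.Done()
                // One private source per goroutine: no shared lock, no contention.
                rng := rand.New(rand.NewSource(seed))
                for j := 0; j < 1000; j++ {
                    _ = chooser.PickSource(rng)
                }
            }(time.Now().UnixNano() + int64(i))
        }
        wg.Wait()
        fmt.Println("done")
    }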

mroth merged commit fcfd837 into master on Jul 21, 2020
mroth deleted the rand-contention branch on Jul 21, 2020
mroth (Owner, Author) commented on Aug 13, 2023

For people coming across this PR in 2023 and beyond: usage of PickSource is no longer required for this performance benefit when using go1.21 or greater; you can simply use Pick, as long as you aren't manually seeding global rand (which is no longer recommended). The changes are documented in #28.
