v2.0.0. Performance Review #74

thrawn01 · 2020-10-28T00:30:07Z

Purpose

In production we are seeing 300ms response times during very high request volumes. (Response times are usually in the 2-5ms range.)
Profile the distribution of hit updates when using Behavior=GLOBAL. Compare our implementation with that of https://ipfs.io.

TODO

Profile running gubernators in production.
Profile GLOBAL update behavior.


valer-cara · 2021-01-27T18:08:14Z

Have you considered any optimizations in the use of FanOut in GetRateLimits? (e.g. not fanning out for local cache hits, ...)

I've been trying to use a Gubernator cluster taking in ~40-80k QPS, with ~5 items in each request's list, and I've been hitting a ceiling (image below).

I tried a number of things: load balancing with Envoy, various cluster sizes (from 1 to 5 machines of 16 cores each), etc. However, I wasn't able to saturate those machines, so I went hunting for blocking points. I initially thought it might be the global mutex on the cache and tried a sync.Map alternative, but it made no difference.

I've taken some blocking profiles, and quite a lot of time is spent in FanOut/ChanRecv (even for local requests; for remote requests it's expected).
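For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of collecting a blocking profile with Go's standard `runtime` and `runtime/pprof` packages. The helper names `captureBlockProfile` and `blockedReceive` are made up for illustration and are not part of Gubernator; the workload is just a goroutine waiting on a channel, standing in for the FanOut/ChanRecv wait described above.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
	"time"
)

// captureBlockProfile is a hypothetical helper (not Gubernator code).
// It turns on block profiling, runs work, and returns the profile as text.
func captureBlockProfile(work func()) (string, error) {
	// A rate of 1 records every blocking event. This is costly, so it is
	// only suitable for short diagnostic runs.
	runtime.SetBlockProfileRate(1)
	defer runtime.SetBlockProfileRate(0) // turn profiling back off

	work()

	var buf bytes.Buffer
	// debug=1 writes a human-readable report instead of the binary format.
	if err := pprof.Lookup("block").WriteTo(&buf, 1); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// blockedReceive is a stand-in workload: the caller blocks on a channel
// receive until another goroutine closes the channel.
func blockedReceive() {
	done := make(chan struct{})
	go func() {
		time.Sleep(10 * time.Millisecond)
		close(done)
	}()
	<-done
}

func main() {
	report, err := captureBlockProfile(blockedReceive)
	if err != nil {
		fmt.Println("profile failed:", err)
		return
	}
	fmt.Println(report)
}
```

In a long-running server it is usually more convenient to import `net/http/pprof` and fetch the block profile over HTTP, but the rate still has to be set with `runtime.SetBlockProfileRate` first.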

As a quick, dirty WIP hack, I eliminated the FanOut for local cache hits (and disabled remote requests). I only tested this locally on a single instance (which made the most sense, given that I stripped out all GetPeerRatelimit calls for the quick proof of concept). I was able to go from 25k QPS to 40k QPS, which suggests a complete FanOut optimization is worth trying.
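To make the idea concrete, here is a rough sketch of skipping the fan-out for locally owned keys. It is not Gubernator's actual code: the types `Request`, `Response`, and `limiter`, and the functions `ownsKey`, `checkLocal`, `checkRemote`, and `checkAll` are all hypothetical stand-ins. Locally owned items are answered inline on the calling goroutine, and only items owned by another peer pay for a goroutine and a wait.

```go
package main

import (
	"fmt"
	"sync"
)

// Request and Response are hypothetical stand-ins for rate limit messages.
type Request struct{ Key string }

type Response struct {
	Key    string
	Source string // "local" or "remote", for illustration only
}

// limiter is a hypothetical type; the three funcs are assumed hooks.
type limiter struct {
	ownsKey     func(key string) bool  // true if this instance owns the key
	checkLocal  func(Request) Response // local cache lookup, no network
	checkRemote func(Request) Response // forwards to the owning peer
}

// checkAll answers local items inline and fans out only the remote ones.
func (l *limiter) checkAll(reqs []Request) []Response {
	out := make([]Response, len(reqs))
	var wg sync.WaitGroup

	for i, r := range reqs {
		if l.ownsKey(r.Key) {
			// Fast path: no goroutine, no channel, no scheduler hop.
			out[i] = l.checkLocal(r)
			continue
		}
		wg.Add(1)
		go func(i int, r Request) {
			defer wg.Done()
			// Each goroutine writes a distinct slice index, so no lock is needed.
			out[i] = l.checkRemote(r)
		}(i, r)
	}

	wg.Wait() // returns immediately when every item was local
	return out
}

func main() {
	l := &limiter{
		ownsKey:     func(k string) bool { return k == "a" },
		checkLocal:  func(r Request) Response { return Response{r.Key, "local"} },
		checkRemote: func(r Request) Response { return Response{r.Key, "remote"} },
	}
	for _, resp := range l.checkAll([]Request{{"a"}, {"b"}}) {
		fmt.Println(resp.Key, resp.Source)
	}
}
```

Responses are written by index, so the output order matches the request order regardless of which path served each item.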

Not sure if there's light at the end of this tunnel, but that 1.6x increase in QPS on the local machine definitely caught my attention.

thrawn01 · 2021-01-29T17:45:25Z

Thank you for doing this analysis! (I kept seeing FanOut show up in my CPU profiles, but never followed up on it.) Avoiding fanout for local cache hits is a great optimization! My current optimization research is looking into how we can use GLOBAL behavior to avoid the network requests to owning peers, but it's stalled because work priorities are leaving me no free time for it. If you are interested, a PR with this optimization would be most welcome!

thrawn01 added this to the v1.0.0 milestone Oct 28, 2020

thrawn01 self-assigned this Oct 28, 2020

thrawn01 changed the title ~~1.0.0. Performance Review~~ v1.0.0. Performance Review Oct 28, 2020

thrawn01 mentioned this issue Oct 28, 2020

v2.0.0 Release Roadmap #75

Open

9 tasks

thrawn01 mentioned this issue Mar 22, 2021

Date to release 1.0 final #88

Closed

thrawn01 mentioned this issue Jun 4, 2021

Performance improvements in GetRateLimits() #93

Merged

thrawn01 changed the title ~~v1.0.0. Performance Review~~ v2.0.0. Performance Review Aug 20, 2021

thrawn01 modified the milestones: v1.0.0, v2.0.0 Aug 20, 2021

Dreamsorcerer mentioned this issue May 10, 2023

Scheduling over-limit requests #172

Open
