AliCloud(Perf): Optimize AliCloudProvider NodeGroupForNode Function #6749
What type of PR is this?
/kind feature
What this PR does / why we need it:
If a large number of nodes are backed by instances that are not in managed ASGs, every call to
`NodeGroupForNode->GetAsgForInstance->FindForInstance`
triggers a `regenerateCache`. If the managed ASGs also contain a significant number of instances at that time, `regenerateCache` is called frequently, and each call carries considerable time overhead.

For example, suppose there are 1000 machines in managed ASGs and 2000 instances not in managed ASGs are suddenly added:
`UpdateNodes->updateReadinessStats->NodeGroupForNode`

This call chain is executed on every `runOnce`, and `NodeGroupForNode` is invoked once per node. Each of the 2000 newly added instances that are not in managed ASGs triggers `regenerateCache`. Although `regenerateCache` only collects instances in managed ASGs, it is still executed 2000 times. Assuming each `regenerateCache` call takes 10 seconds, this function takes:

2000 instances * 10 seconds/instance = 20000 seconds (roughly 5.5 hours).
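
The cost estimate above can be reproduced with a small model. This is only a sketch under simplifying assumptions, not the provider's code: `missRegenCache`, `findForInstance`, and `countRegenerations` are hypothetical names, the `instancesNotInManagedAsg` bookkeeping is ignored, and regenerations are counted instead of timed.

```go
package main

import "fmt"

// missRegenCache is a hypothetical stand-in for the ASG cache described above.
// It only indexes instances that belong to managed ASGs.
type missRegenCache struct {
	managed       map[string]string // instance ID -> ASG ID (source of truth)
	instanceToAsg map[string]string // cached index rebuilt by regenerate
	regenerations int               // how many times the cache was rebuilt
}

// regenerate rebuilds the index from the managed instances only.
func (c *missRegenCache) regenerate() {
	c.regenerations++
	c.instanceToAsg = make(map[string]string, len(c.managed))
	for id, asg := range c.managed {
		c.instanceToAsg[id] = asg
	}
}

// findForInstance models the problematic behavior: every cache miss
// triggers a full regeneration before giving up.
func (c *missRegenCache) findForInstance(id string) (string, bool) {
	if asg, ok := c.instanceToAsg[id]; ok {
		return asg, true
	}
	c.regenerate()
	asg, ok := c.instanceToAsg[id]
	return asg, ok
}

// countRegenerations looks up every node once, as a single loop iteration
// would, and returns how many regenerations that caused.
func countRegenerations(managedCount, unmanagedCount int) int {
	c := &missRegenCache{managed: map[string]string{}}
	for i := 0; i < managedCount; i++ {
		c.managed[fmt.Sprintf("managed-%d", i)] = "asg-1"
	}
	c.regenerate() // start from a warm cache
	c.regenerations = 0

	for i := 0; i < managedCount; i++ {
		c.findForInstance(fmt.Sprintf("managed-%d", i))
	}
	for i := 0; i < unmanagedCount; i++ {
		c.findForInstance(fmt.Sprintf("unmanaged-%d", i))
	}
	return c.regenerations
}

func main() {
	n := countRegenerations(1000, 2000)
	// Assumes 10 seconds per regeneration, as in the example above.
	fmt.Printf("regenerations: %d, estimated time: %d seconds\n", n, n*10)
}
```

In this model the 1000 managed instances never cause a rebuild, while each of the 2000 unmanaged instances causes exactly one.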
A delay of this size significantly extends the duration of a single `runOnce` operation. If new pending pods appear during this period, the Cluster Autoscaler might not function as expected, which poses a severe risk.

Proposed Solution:
The `regenerateCache` operation will be moved to `func (ali *aliCloudProvider) Refresh() error`, which is called on every `runOnce`. To reduce the frequency of calls, a minimum interval between calls will be introduced. Additionally, the caching mechanism for `instancesNotInManagedAsg` will be removed, and instances not in `instanceToAsg` will return nil. Because `regenerateCache` is refreshed on every `runOnce` (subject to the minimum interval), the correct results will eventually be obtained. This modification ensures that the normal operation of the Cluster Autoscaler (CA) is not affected by the appearance of a large number of instances not in managed ASGs.

This approach is inspired by the latest implementation in the AWS CloudProvider 867
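
A minimal sketch of the proposed shape, assuming a rate-limited `Refresh` and a lookup that never regenerates. The type `asgCache`, its fields, the injectable `now` and `list` functions, and the helper `runLoops` are all hypothetical and are not taken from the actual change. A miss is reported here as `("", false)` rather than a nil node group.

```go
package main

import (
	"fmt"
	"time"
)

// asgCache is a hypothetical cache that is rebuilt only from Refresh.
type asgCache struct {
	instanceToAsg map[string]string
	lastRefresh   time.Time
	minInterval   time.Duration                     // assumed minimum gap between rebuilds
	now           func() time.Time                  // injectable clock
	list          func() (map[string]string, error) // stand-in for the cloud API listing
	regenerations int
}

// Refresh rebuilds the cache unless the previous rebuild was too recent.
func (c *asgCache) Refresh() error {
	now := c.now()
	if !c.lastRefresh.IsZero() && now.Sub(c.lastRefresh) < c.minInterval {
		return nil // rate limited: keep serving the existing cache
	}
	fresh, err := c.list()
	if err != nil {
		return err
	}
	c.instanceToAsg = fresh
	c.lastRefresh = now
	c.regenerations++
	return nil
}

// FindForInstance is a pure lookup; a miss never triggers a rebuild.
func (c *asgCache) FindForInstance(id string) (string, bool) {
	asg, ok := c.instanceToAsg[id]
	return asg, ok
}

// runLoops simulates several loop iterations with a fake clock and returns
// how many rebuilds happened in total.
func runLoops(loops, stepSeconds, minIntervalSeconds, unmanaged int) int {
	clock := time.Unix(0, 0)
	c := &asgCache{
		minInterval: time.Duration(minIntervalSeconds) * time.Second,
		now:         func() time.Time { return clock },
		list: func() (map[string]string, error) {
			return map[string]string{"i-managed": "asg-a"}, nil
		},
	}
	for l := 0; l < loops; l++ {
		_ = c.Refresh() // the stub list function never returns an error
		for i := 0; i < unmanaged; i++ {
			c.FindForInstance(fmt.Sprintf("i-unmanaged-%d", i))
		}
		clock = clock.Add(time.Duration(stepSeconds) * time.Second)
	}
	return c.regenerations
}

func main() {
	// Three iterations 10 seconds apart with a 60 second minimum interval.
	fmt.Println("regenerations:", runLoops(3, 10, 60, 2000))
}
```

With this structure the number of rebuilds depends on elapsed time and the minimum interval, not on how many unmanaged instances are looked up.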
Which issue(s) this PR fixes:
Fixes #6748
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: