CPUThrottlingHigh false positives #108
Comments
I have the same situation on my personal cluster with the overall load ( /cc @gouthamve |
It would be nice to actually debug this on the CFS (completely fail scheduler) layer. |
Sorry for the slow response. This alert was added exactly for this reason: with low limits, spiky workloads can have low averages and still be throttled. Consider this: if we sample every 15s, and do a What we've found is that raising our limits on container CPU (whilst keeping container CPU requests close to 95%-ile "average" usage*) has allowed us to have lower throttling and decent utilisation. If you don't want this, you can set the threshold to something >25% in the
|
I've found this issue here: kubernetes/kubernetes#67577 I might dig into this later though. See this for some more info: https://twitter.com/putadent/status/1047808685840334848 |
Alright. Thanks a bunch for the further info. I'll look into that for my personal cluster and try to get a better idea. |
Hmm, that's interesting. But from what I read that's actually a bug in the kernel/CFS. Especially with https://gist.github.com/bobrik/2030ff040fad360327a5fab7a09c4ff1 in mind, spiky workloads are throttled for no reason. I don't see how to mitigate that alert though. Afaict the only real mitigation is to just disable limits (which is not really an option). Imho the alert is misleading / too trigger-friendly as long as the mentioned bug(s) is/are not fixed (thanks for linking those). |
Talking to @gouthamve again, I am now running this alert in my cluster with kubernetes-mixin/config.libsonnet Line 44 in f7ca48c
Therefore the question: Should we set |
@metalmatze I think your process is still throttled and it may affect its performance. So it is just hiding the real issue. |
Sure. What do you propose instead @szymonpk? |
@metalmatze Disabling cfs-quota or removing cpu limits for containers with small limits and spiky workloads. |
@chiluk's recent talk at KubeCon19 revealed all the intricate details of CFS and throttling. Details about the kernel patch are now widely documented (see kubernetes/kubernetes#67577), but the bit that caught my eye is that calculating the throttling percentage based on seconds is apparently wrong: throttling seconds accumulate on the counter for every running thread. As such, one cannot come up with a percentage value without also knowing the number of threads at the time. Instead, the alert should be based on the ratio of periods, which are global for all cores, not seconds. I'm thinking the alert in this repo should be changed in that direction. Thoughts? |
@bgagnon this alert is already based on periods, so I don't think we need to change that. |
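To make the period-based calculation concrete, here is a minimal sketch. It assumes the two inputs are the cAdvisor counters `container_cpu_cfs_throttled_periods_total` and `container_cpu_cfs_periods_total`, sampled at the start and end of a window. The function names and sample values are hypothetical; the 25% threshold is the one mentioned in this thread. This is not the mixin's actual rule, only an illustration of why the ratio does not depend on the number of threads.

```python
def throttled_period_ratio(periods_start, periods_end,
                           throttled_start, throttled_end):
    """Fraction of CFS enforcement periods in the window that were throttled.

    Both counters count periods, not seconds, so the result is independent
    of how many threads were running.
    """
    elapsed_periods = periods_end - periods_start
    if elapsed_periods <= 0:
        # No enforcement periods elapsed: nothing could have been throttled.
        return 0.0
    return (throttled_end - throttled_start) / elapsed_periods


def should_alert(ratio, threshold=0.25):
    """True when the throttled ratio exceeds the threshold (25% by default)."""
    return ratio > threshold


if __name__ == "__main__":
    # Hypothetical counter samples taken at the two ends of a window.
    ratio = throttled_period_ratio(1000, 4000, 200, 1100)
    print(f"throttled ratio: {ratio:.0%}, alert: {should_alert(ratio)}")
```

A container throttled in 900 of 3000 periods gets a ratio of 30% whether it ran one thread or sixteen, which is the property the seconds-based calculation lacks.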
The math @benjaminhuo pointed at looks correct. I suspect @bgagnon is probably hitting the inadvertent throttling covered in my talk resulting in the increased throttled period percentages he's seeing. I suspect installing kernels with the fixes will likely alleviate some of the throttling such that the monitor can be decreased. Hopefully if these patches ever get accepted, bursty applications can have tighter limits with decreased throttling. |
Thanks @benjaminhuo and @chiluk, I must have misread the alert definition! |
Am I reading it correctly that your CPU request and CPU limit are set to .07 and .08 respectively? Think about what is going on here. Whenever your application is runnable it is only able to execute for 8ms every 100ms on only 1 CPU before it hits throttling. Assuming a 3 GHz CPU clock this is similar to giving your application a 240 MHz single-core CPU (this reminds me of my days back in the 90's with a Cyrix 166+). Depending on what it is or isn't doing, the in-kernel context switch time could potentially be that expensive without your application doing anything (you can thank Spectre/Meltdown for that). Basically your requests and limits are bounded too tightly. They are set well below the threshold that the kernel can reliably account for with useful results. I don't know what the minimum limit should be set to, but I do think you are well below that based on the throttling percentages you are seeing. This issue is solved. Your expectations of what can be reasonably accomplished with existing kernel constructs and hardware need to be re-evaluated. |
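The arithmetic behind that comparison can be sketched as follows, assuming the default 100 ms CFS period and the 3 GHz clock from the comment. The helper names are hypothetical. With these inputs the 0.08 limit maps to roughly 240 MHz and the 0.07 request to roughly 210 MHz.

```python
CFS_PERIOD_MS = 100.0  # assumed default CFS enforcement period


def runtime_per_period_ms(cpu_limit_cores, period_ms=CFS_PERIOD_MS):
    """CPU time the container may consume in each period before throttling."""
    return cpu_limit_cores * period_ms


def equivalent_clock_mhz(cpu_cores, clock_ghz=3.0):
    """Rough 'effective clock' if the CPU share were one slower core."""
    return cpu_cores * clock_ghz * 1000.0


if __name__ == "__main__":
    print(f"limit 0.08   -> {runtime_per_period_ms(0.08):.0f} ms per 100 ms period")
    print(f"limit 0.08   -> ~{equivalent_clock_mhz(0.08):.0f} MHz equivalent")
    print(f"request 0.07 -> ~{equivalent_clock_mhz(0.07):.0f} MHz equivalent")
```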
Thank you for your detailed information. I should have thought about that... At first I used the kube-prometheus default resources (102m request/250 limit), but I still experienced throttling. So I increased resources to 800m, which solved the issue, but then noticed node-exporter does not use it (it needs something ~0.1m). So I reduced the resources back, and now I'm fighting with this alert. |
So does this explanation mean that the default values set by kube-prometheus are nonsensical / wrong? |
I don't know to be honest... there is a discussion here prometheus-operator/kube-prometheus#214. It looks like the values make sense: node exporter uses almost no CPU, so giving it very few resources is reasonable. But I do wonder why it still gets throttled... |
Correct me if I'm wrong: As far as I understand, the node_exporter only uses CPU cycles when being scraped (which for a default Prometheus setting should be every 15 or 30 seconds). This means that on average the pod has a very flat line of CPU usage. But the problem is that the To be honest, I have no idea how to build general-approach alerts for this then (or if it even is relevant), since it heavily depends on the application. In the case of node-exporter it should be irrelevant (since Prometheus doesn't care about some overhead in scraping and I've seen no cluster yet where the node_exporter scrape was slowed to more than 3 seconds). |
So maybe we can add a selector for excluding containers from the alerts? So users could easily ignore such containers? |
We do not use prometheus, so I don't know about the default values etc. @cbeneke has the right idea. Since the pod only uses cpu sporadically it always hits throttling when it is running. If you don't care about the response times of this pod or how long it takes to "gather metrics" I would leave requests where they are, and increase the limit until you no longer see throttling. That way the pod would only be scheduled by the kernel when nothing else is able to run. This is similar to how we schedule our batch jobs with low requests, but high limit. That way they rarely pre-empt latency sensitive applications, but they are allowed to use a ton of cpu time that would otherwise be sacrificed away to the idle process gods. |
While this is a very useful alert to have, especially during debugging, it is also a very chatty one (as can be seen by the number of issues linked here). In many cases this alert is not actionable (apart from silencing it) because the application is not latency-sensitive and can work without problems even when throttled. Additionally, this alert is based on a cause and not a symptom. I propose to reduce alert severity to |
Info level severity sounds good to me |
Everybody, please leave a review on #453. Thanks! |
FYI there is also already the cpuThrottlingSelector configuration that allows you to scope or exclude certain containers/namespaces/etc. |
The issues described here https://engineering.indeedblog.com/blog/2019/12/cpu-throttling-regression-fix/ seem to be what this alert shows. |
I think the title is a little misleading; I don't think they're false positives. Even with an updated kernel, applications still suffer from throttled CPU periods and perform much slower when many processes or threads are running at the same time. The situation is much worse for applications that handle each request in a separate thread or process, such as php-fpm based apps, or whose average response time is more than (available quota periods in ms) / (number of threads or processes running). For Golang apps such as node-exporter, you can set GOMAXPROCS to a lower value than the node's CPU cores or use Uber's automaxprocs library to mitigate the CPU throttling issue: https://github.com/uber-go/automaxprocs Benchmarks: |
I totally agree with @alibo, this is not misleading. I initially disabled the alert and then found myself hunting down the reason for extremely slow and/or failing pods! CPU throttling is a serious issue in clusters, and blindly removing limits can cause further problems. A very nice-to-have feature in dashboards would be a graph showing CPU waste, based on CPU requests. |
@irizzant and @alibo are correct. It's highly unlikely that you are receiving false positives. However, it is likely that you are getting positives for very short bursts. I don't know enough about the monitor, but it might be useful to put some threshold on the monitor where it only triggers if the application is throttled for more than x% of the last many periods. I'd expect most well-written applications to be throttled at some point in time. It also might be useful to be able to put such a threshold in the pod spec itself so it could be twiddled per pod. Alright, that's my attempt at thought leadering here. Hopefully cgroups v2 will make some of this mess "better" without creating a whole new range of issues. If you'd rather not read the long blog post I wrote that @KlavsKlavsen linked, I also gave a talk on this subject a few years back. |
Another possibility would be to create a kernel scheduler config such that runnable throttled applications would receive run time when the idle process would otherwise be run. That might really muddy the accounting metrics in the kernel, and would probably take a herculean effort to get scheduler dev approval. |
@chiluk The burstable CFS controller introduced in kernel 5.14 (it's not released yet!) can mitigate this issue a little bit; in particular it improves the P90+ response time a lot, based on the benchmarks provided:
however, it's not implemented in CRI-based container runtimes yet: |
I guess we need to define what we call "false positive" here. IMO a false positive in this context is an alert that is not actionable, e.g. not indicative of a real problem that requires an action. So far I was not able to deduce why those alerts randomly trigger and disappear many times a day and how they help me. |
In that context, and when the application is not experiencing any issues manifested with other alerts, it is a false positive. For exactly this reason CPUThrottlingHigh is shipped with |
This isn't a false alarm and it isn't due to CFS kernel bugs! I've written a whole wiki page on this and how to respond to each subset of this alert. The gist of it is that processes are being deprived of CPU when they need it, and that can happen even when CPU is available. I know people consider it a best practice to set CPU limits, but if you use CPU requests for everything then the simple and safe action here is to remove the limits. Unless this is happening on metrics-server, in which case it's a whole different story... |
In case anyone is curious, I'll elaborate on the previous comment with a specific example: Here's a numerical example of throttling when average CPU is far below the limit. Assumptions
Outcome
This is not a false positive alert. There is a real user facing impact. A server is getting one http request per second. It should take only 30ms to handle it. Yet that request takes 204ms instead! There was real latency introduced here. Performance got worse by 6.8x. Despite the pod having a limit of 130m which is far above average CPU of 3%. In short, as always, remove those darn limits if you can. |
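The figures quoted above (30 ms of CPU work, a 130m limit, 204 ms observed latency) can be reproduced with a very simple model. The original assumption list did not survive in this thread, so the assumptions below are a reconstruction rather than the author's: a single-threaded handler, the default 100 ms CFS period, the request arriving exactly at the start of a period, and no other CPU use in the container. The function name is hypothetical.

```python
def wall_clock_latency_ms(cpu_needed_ms, limit_millicores, period_ms=100.0):
    """Wall-clock time to finish cpu_needed_ms of single-threaded work
    under a CFS quota, with the work starting at a period boundary."""
    quota_ms = limit_millicores / 1000.0 * period_ms  # runtime allowed per period
    if quota_ms <= 0:
        raise ValueError("limit must be positive")
    elapsed_ms = 0.0
    remaining_ms = cpu_needed_ms
    # Each pass models one full period: run until the quota is used up,
    # then sit throttled until the next period starts.
    while remaining_ms > quota_ms:
        remaining_ms -= quota_ms
        elapsed_ms += period_ms
    # The last slice fits inside the quota, so no further throttling occurs.
    return elapsed_ms + remaining_ms


if __name__ == "__main__":
    latency = wall_clock_latency_ms(cpu_needed_ms=30.0, limit_millicores=130)
    print(f"latency: {latency:.0f} ms, slowdown: {latency / 30.0:.1f}x")
    # One 30 ms request per second averages 30 / 1000 = 3% of a core.
```

Under these assumptions the handler runs for 13 ms, is throttled for 87 ms, repeats that once more, and needs 4 ms of the third period: 100 + 100 + 4 = 204 ms.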
@aantn understands. However, removing the limits is not strictly "safe" if you have untrustworthy apps or poor developers. |
How exactly is this unsafe? |
Without limits, a misbehaving or crashlooping application can theoretically eat all the available CPU, which would adversely affect the performance of other applications on the system. Even with requests operating correctly as a minimum guarantee, an application using 100% of all cores of a CPU would cause thermal throttling on the CPU itself, which can lead to lower performance for collocated well-behaved applications. Additionally, it might cause a scheduling delay of a CPU time slice (~5ms) for a well-behaved application. For this reason, my recommendation for interactive/request-servicing applications is to set CPU limits large enough to avoid throttling, but not so large that a misbehaving application can eat an entire box. |
This issue has not had any activity in the past 30 days, so the
Thank you for your contributions! |
Hi,
since the Alert CPUThrottlingHigh got added, it is firing in my Cluster for a lot of pods. As most of the affected pods are not even at their assigned CPU limit, I assume the expression for the alert is wrong (either miscalculation or [what seems to be more likely] container_cpu_cfs_throttled_periods_total includes different types of throttle).
This needs further investigating to be sure where this comes from, but like this the alert is not useful. (With about 250 pods running and a 25% limit I observe >100 alerts, 50% limit ~20 alerts.)