Description of problem:
Alertmanager shows the CPUThrottlingHigh alert even when CPU usage in the container is very low.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Observe the CPUThrottlingHigh alert on Monitoring > Alerting in the OpenShift console.
2. Observe CPU usage of the container
Actual results:
1. The CPUThrottlingHigh alert is firing:
94.59% throttling of CPU in namespace apim27 for container backend-cron in pod backend-cron-1-4dmdr.
2. The container is using very low CPU resources (1m out of the 150m limit):
# kubectl top pod backend-cron-1-4dmdr --containers
POD                    NAME                CPU(cores)   MEMORY(bytes)
backend-cron-1-4dmdr   backend-cron        1m           14Mi
backend-cron-1-4dmdr   backend-redis-svc   0m           0Mi
3. CPU limits for the container
Expected results:
The CPUThrottlingHigh alert should not fire.
I found a discussion about CPUThrottlingHigh false positives.
I have attached screenshots of the OpenShift console.
Setting target release to current development version (4.6) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.
> As false alerts are unexpected and issue still persists with OCP v4.3
This is a warning-type alert, and as such any fixes won't be backported.
> Can someone explain the meaning of the below formula?
> * sum by(container_name, pod_name, namespace) (increase(container_cpu_cfs_throttled_periods_total[5m]))
> / sum by(container_name, pod_name, namespace) (increase(container_cpu_cfs_periods_total[5m]))
> > 25
It checks whether more than 25% of the CPU access periods granted by the kernel CFS (Completely Fair Scheduler) were throttled due to cgroup constraints. Such throttling can go unnoticed when looking only at CPU utilization if the load is "spiky": for example, if Prometheus collects CPU utilization data every 30s but the application is very active for less than a second and idle the rest of the time, then CPU utilization will look low even though throttling happens (as explained in https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/108#issuecomment-432796867).
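The "spiky" scenario can be sketched numerically. This is a minimal illustration (not OpenShift or kernel code), assuming the kernel's default 100 ms CFS period, the 150m CPU limit from this report, and my reading that `container_cpu_cfs_periods_total` only counts periods in which the cgroup actually had runnable tasks:

```python
# Sketch: a workload that bursts briefly and idles the rest of a 30 s scrape
# window can be throttled in every period it runs, while average CPU stays tiny.
# Assumptions (not from the bug report): 100 ms CFS period, per-period demand
# model, nr_periods counting only active periods.

CFS_PERIOD_MS = 100          # default kernel CFS enforcement period
CPU_LIMIT_MILLICORES = 150   # container limit from the report (150m)
QUOTA_MS = CFS_PERIOD_MS * CPU_LIMIT_MILLICORES / 1000  # 15 ms of CPU per period

def simulate(demand_ms):
    """Return (throttled_fraction, avg_utilization) for a list of per-period
    CPU demands in milliseconds."""
    active = [d for d in demand_ms if d > 0]            # periods the cgroup ran
    throttled = sum(1 for d in active if d > QUOTA_MS)  # periods hitting the quota
    used = sum(min(d, QUOTA_MS) for d in demand_ms)     # CPU time actually granted
    return throttled / len(active), used / (len(demand_ms) * CFS_PERIOD_MS)

# Spiky load: the app wants a full core (100 ms/period) for 3 s, then idles
# for the remaining 27 s of a 30 s window.
demand = [100.0] * 30 + [0.0] * 270
throttled_frac, avg_util = simulate(demand)
print(f"throttled periods: {throttled_frac:.0%}")    # 100%
print(f"average CPU usage: {avg_util * 1000:.0f}m")  # 15m, tiny in `kubectl top`
```

Under these assumptions every active period is throttled (100%, far above the alert's 25% threshold) while the averaged usage is only 15m, which matches the pattern reported here.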
> One of our customers on case 02681301 is observing the CPUThrottlingHigh alert for a pod that is only consuming around 70m CPU on average out of its 500m limit, and raising the limits did not help suppress the false alerts.
This alert is based on kernel data, and in its current form I don't see how it could be "false". It is entirely possible that the limits were set too low and the application is active for limited periods and idle the rest of the time. This results in low average CPU consumption in the graphs while the kernel CFS still throttles the application during its bursts of activity.
Solving this issue requires observing the `container_cpu_cfs_throttled_periods_total` metric and increasing the CPU limits to the point where throttling no longer happens.
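When tuning limits, the same ratio the alert computes can be checked by hand from two samples of each counter. A minimal sketch (the helper name and sample values are mine, not from the kubernetes-mixin) of that arithmetic:

```python
# Mirrors the CPUThrottlingHigh expression: the window increase of
# container_cpu_cfs_throttled_periods_total divided by the window increase of
# container_cpu_cfs_periods_total, compared against the 25% threshold.
# Hypothetical helper for illustration only.

def throttling_ratio(throttled_start, throttled_end, periods_start, periods_end):
    """Fraction of CFS periods that were throttled between two counter samples."""
    delta_periods = periods_end - periods_start
    if delta_periods == 0:
        return 0.0  # the container saw no CFS periods in the window
    return (throttled_end - throttled_start) / delta_periods

# Counter deltas shaped like this report: 94.59% of periods throttled in 5 min.
ratio = throttling_ratio(50_000, 50_000 + 9_459, 120_000, 120_000 + 10_000)
print(f"{ratio:.2%}")                     # 94.59%
print("fires" if ratio > 0.25 else "ok")  # fires: well above the 25% threshold
```

If raising the CPU limit brings this ratio under 0.25 over the 5-minute window, the alert stops firing.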
Tested with 4.6.0-0.nightly-2020-07-07-141639; the CPUThrottlingHigh alert has been removed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.