Description of problem:
Alertmanager shows the CPUThrottlingHigh alert even when CPU usage in the container is very low.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Observe the CPUThrottlingHigh alert on Monitoring > Alerting in the OpenShift console.
2. Observe CPU usage of the container
Actual results:
1. The CPUThrottlingHigh alert is firing:
94.59% throttling of CPU in namespace apim27 for container backend-cron in pod backend-cron-1-4dmdr.
2. The container is using very low CPU resources (1m out of the 150m limit):
# kubectl top pod backend-cron-1-4dmdr --containers
POD                    NAME                CPU(cores)   MEMORY(bytes)
backend-cron-1-4dmdr   backend-cron        1m           14Mi
backend-cron-1-4dmdr   backend-redis-svc   0m           0Mi
3. CPU limits for the container
Expected results:
The CPUThrottlingHigh alert should not fire.
I found a discussion about CPUThrottlingHigh false positives.
I have attached screenshots of the OpenShift console.
Setting target release to current development version (4.6) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.
> As false alerts are unexpected and issue still persists with OCP v4.3
This is a warning-type alert, and as such any fixes won't be backported.
> Can someone explain the meaning of the below formula?
> * sum by(container_name, pod_name, namespace) (increase(container_cpu_cfs_throttled_periods_total[5m]))
> / sum by(container_name, pod_name, namespace) (increase(container_cpu_cfs_periods_total[5m]))
> > 25
It checks whether more than 25% of the CPU access periods granted by the kernel CFS (Completely Fair Scheduler) were throttled due to cgroup constraints. Such throttling can go unnoticed when looking only at CPU utilization if the load is "spiky": for example, if Prometheus collects CPU utilization data every 30s but the application is very active for less than a second and idle the rest of the time, then CPU utilization will look low even though throttling happens (as explained in https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/108#issuecomment-432796867).
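The "spiky" scenario can be sketched numerically. This is a minimal illustration (not OpenShift or kernel code), assuming the kernel's default 100 ms CFS period, the 150m CPU limit from this report, and my reading that `container_cpu_cfs_periods_total` only counts periods in which the cgroup actually had runnable tasks:

```python
# Sketch: a workload that bursts briefly and idles the rest of a 30 s scrape
# window can be throttled in every period it runs, while average CPU stays tiny.
# Assumptions (not from the bug report): 100 ms CFS period, per-period demand
# model, nr_periods counting only active periods.

CFS_PERIOD_MS = 100          # default kernel CFS enforcement period
CPU_LIMIT_MILLICORES = 150   # container limit from the report (150m)
QUOTA_MS = CFS_PERIOD_MS * CPU_LIMIT_MILLICORES / 1000  # 15 ms of CPU per period

def simulate(demand_ms):
    """Return (throttled_fraction, avg_utilization) for a list of per-period
    CPU demands in milliseconds."""
    active = [d for d in demand_ms if d > 0]            # periods the cgroup ran
    throttled = sum(1 for d in active if d > QUOTA_MS)  # periods hitting the quota
    used = sum(min(d, QUOTA_MS) for d in demand_ms)     # CPU time actually granted
    return throttled / len(active), used / (len(demand_ms) * CFS_PERIOD_MS)

# Spiky load: the app wants a full core (100 ms/period) for 3 s, then idles
# for the remaining 27 s of a 30 s window.
demand = [100.0] * 30 + [0.0] * 270
throttled_frac, avg_util = simulate(demand)
print(f"throttled periods: {throttled_frac:.0%}")    # 100%
print(f"average CPU usage: {avg_util * 1000:.0f}m")  # 15m, tiny in `kubectl top`
```

Under these assumptions every active period is throttled (100%, far above the alert's 25% threshold) while the averaged usage is only 15m, which matches the pattern reported here.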
> One of our customers on case 02681301 is observing the CPUThrottlingHigh alert for a pod that is only consuming around 70m CPU on average out of its 500m limit, and raising the limits did not help suppress the false alerts.
This alert is based on kernel data, and in its current form I don't see how it could be "false". It is entirely possible that the limits were set too low and the application is active for limited periods and idle the rest of the time. This results in low average CPU consumption in the graphs while the kernel CFS still throttles the application during its bursts of activity.
Solving this issue requires observing the `container_cpu_cfs_throttled_periods_total` metric and increasing the CPU limits to the point where throttling no longer happens.
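When tuning limits, the same ratio the alert computes can be checked by hand from two samples of each counter. A minimal sketch (the helper name and sample values are mine, not from the kubernetes-mixin) of that arithmetic:

```python
# Mirrors the CPUThrottlingHigh expression: the window increase of
# container_cpu_cfs_throttled_periods_total divided by the window increase of
# container_cpu_cfs_periods_total, compared against the 25% threshold.
# Hypothetical helper for illustration only.

def throttling_ratio(throttled_start, throttled_end, periods_start, periods_end):
    """Fraction of CFS periods that were throttled between two counter samples."""
    delta_periods = periods_end - periods_start
    if delta_periods == 0:
        return 0.0  # the container saw no CFS periods in the window
    return (throttled_end - throttled_start) / delta_periods

# Counter deltas shaped like this report: 94.59% of periods throttled in 5 min.
ratio = throttling_ratio(50_000, 50_000 + 9_459, 120_000, 120_000 + 10_000)
print(f"{ratio:.2%}")                     # 94.59%
print("fires" if ratio > 0.25 else "ok")  # fires: well above the 25% threshold
```

If raising the CPU limit brings this ratio under 0.25 over the 5-minute window, the alert stops firing.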
Tested with 4.6.0-0.nightly-2020-07-07-141639; the CPUThrottlingHigh alert has been removed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.