Bug 1843346 - Alert manager shows CPUThrottlingHigh alert even if CPU usage is very low in the container
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Pawel Krupa
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-06-03 05:23 UTC by Hyosun Kim
Modified: 2020-12-10 16:30 UTC (History)
18 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:04:37 UTC
Target Upstream Version:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 835 0 None closed Bug 1843346: Removing CPUThrottlingHigh alert 2021-02-20 06:44:20 UTC
Red Hat Bugzilla 1850717 0 unspecified CLOSED CPUThrottlingHigh and other alerts lack namespace restrictions 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1851873 0 low CLOSED CPUThrottlingHigh and other alerts lack namespace restrictions 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1851920 0 low CLOSED CPUThrottlingHigh and other alerts lack namespace restrictions 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:04:59 UTC

Description Hyosun Kim 2020-06-03 05:23:24 UTC
Description of problem:
Alert manager shows CPUThrottlingHigh alert even if CPU usage is very low in the container

Version-Release number of selected component (if applicable):
OCP 4.3.18
Alert Manager

Steps to Reproduce:
1. Observe the CPUThrottlingHigh alert on Monitoring > Alerting in the OpenShift console.
2. Observe the CPU usage of the container.

Actual results:

1. CPUThrottlingHigh alert is firing
94.59% throttling of CPU in namespace apim27 for container backend-cron in pod backend-cron-1-4dmdr.

2. The container is using very low CPU resources (1m of a 150m limit)
# kubectl top pod backend-cron-1-4dmdr --containers
POD                    NAME                CPU(cores)   MEMORY(bytes)
backend-cron-1-4dmdr   backend-cron        1m           14Mi
backend-cron-1-4dmdr   backend-redis-svc   0m           0Mi

3. CPU limits for the container
resources:
  limits:
    cpu: 150m
    memory: 80Mi

Expected results:

CPUThrottlingHigh alert should not fire


Additional info:

I found a discussion about CPUThrottlingHigh false positives:
https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/108

I attached screenshots of the OpenShift console.

Comment 8 Stephen Cuppett 2020-06-24 17:09:10 UTC
Setting target release to current development version (4.6) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.

Comment 12 Pawel Krupa 2020-06-30 10:54:22 UTC
> As false alerts are unexpected and issue still persists with OCP v4.3

This is a warning-level alert, and as such any fixes won't be backported.

> Can someone explain the meaning of the below formula?
> 
> 100
>   * sum by(container_name, pod_name, namespace) (increase(container_cpu_cfs_throttled_periods_total[5m]))
>   / sum by(container_name, pod_name, namespace) (increase(container_cpu_cfs_periods_total[5m]))
>   > 25

It checks whether more than 25% of the CPU periods granted by the kernel CFS (Completely Fair Scheduler) were throttled due to cgroup constraints. It is entirely possible that such throttling goes unnoticed when looking only at CPU utilization, in cases where the load is "spiky". For example, if Prometheus collects CPU utilization data every 30s but an application is very active for less than a second and idle the rest of the time, CPU utilization will appear low even though throttling is happening (as noted in https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/108#issuecomment-432796867).
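The spiky-load scenario above can be sketched numerically. This is a hypothetical simulation (not taken from the bug): it assumes the default 100ms CFS enforcement period, a `cpu: 150m` limit (15ms of CPU time per period), and the fact that the kernel only counts periods in which the cgroup actually had runnable tasks.

```python
PERIOD_MS = 100   # default CFS enforcement period (assumption)
QUOTA_MS = 15     # "cpu: 150m" limit -> 15ms of CPU time per 100ms period

def simulate(window_s=300, burst_every_s=30, burst_len_s=1):
    """Model an app that wants a full core for 1s out of every 30s."""
    total_periods = throttled_periods = used_ms = 0
    for t_ms in range(0, window_s * 1000, PERIOD_MS):
        in_burst = (t_ms % (burst_every_s * 1000)) < burst_len_s * 1000
        demand_ms = PERIOD_MS if in_burst else 0  # wants a full core while bursting
        if demand_ms == 0:
            continue  # idle period: the kernel does not count it
        total_periods += 1
        used_ms += min(demand_ms, QUOTA_MS)
        if demand_ms > QUOTA_MS:
            throttled_periods += 1
    avg_millicores = used_ms / window_s  # ms of CPU per second == millicores
    throttle_pct = 100 * throttled_periods / total_periods
    return avg_millicores, throttle_pct

avg, pct = simulate()
print(f"average usage ~{avg:.0f}m of a 150m limit, {pct:.0f}% of periods throttled")
```

Under these assumptions the average usage comes out to only a few millicores, yet every counted period is throttled, which is the same pattern the reporter observed (1m usage, ~95% throttling).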

> One of our cu on case 02681301 is observing CPUThrottlingHigh alert for a pod which is only consuming average around 70m CPU out of 500m limit and raising the limits did not help in suppressing the false alerts,

This alert is based on kernel data, and in its current form I don't see how it can be "false". It is entirely possible that the limits were set too low and the application is active for a limited period of time and idle the rest. This results in low average CPU consumption in graphs, while the kernel CFS still throttles the application during its bursts of activity.

Solving this issue requires observing `container_cpu_cfs_throttled_periods_total` metric and increasing CPU limits to a point where throttling doesn't happen anymore.
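As a minimal sketch of what "observing the metric" amounts to: the alert's `increase()` ratio can be reproduced from two raw samples of the throttled/total period counters taken at the start and end of the 5m window. The function and the sample values below are made up for illustration.

```python
def throttle_percent(throttled_start, throttled_end, periods_start, periods_end):
    """100 * increase(throttled) / increase(periods), mirroring the alert rule."""
    d_throttled = throttled_end - throttled_start
    d_periods = periods_end - periods_start
    if d_periods == 0:
        return 0.0  # no counted periods in the window -> nothing to report
    return 100.0 * d_throttled / d_periods

# Counter values sampled 5 minutes apart (made-up numbers):
pct = throttle_percent(1200, 1484, 1500, 1800)
print(f"{pct:.2f}% throttled -> alert would fire: {pct > 25}")
```

Re-running such a check after each limit increase shows when the percentage drops below the 25% threshold, i.e. when throttling has effectively stopped.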

Comment 16 Junqi Zhao 2020-07-08 07:25:30 UTC
Tested with 4.6.0-0.nightly-2020-07-07-141639; the CPUThrottlingHigh alert has been removed.

Comment 22 errata-xmlrpc 2020-10-27 16:04:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

