Bug 1843346
Summary: | Alert manager shows CPUThrottlingHigh alert even if CPU usage is very low in the container | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Hyosun Kim <hyoskim> |
Component: | Monitoring | Assignee: | Pawel Krupa <pkrupa> |
Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4.3.z | CC: | alegrand, anpicker, asheth, christopher.obrien, cruhm, erooth, fgrosjea, juzhao, kakkoyun, lcosic, mloibl, pkrupa, scuppett, snalawad, spasquie, surbania, travi, wking |
Target Milestone: | --- | Keywords: | Reopened |
Target Release: | 4.6.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Last Closed: | 2020-10-27 16:04:37 UTC | Type: | Bug |
Description
Hyosun Kim
2020-06-03 05:23:24 UTC
Setting target release to current development version (4.6) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.

> As false alerts are unexpected and issue still persists with OCP v4.3

This is a warning-level alert, and as such any fixes won't be backported.

> Can someone explain the meaning of the below formula?
>
> 100
> * sum by(container_name, pod_name, namespace) (increase(container_cpu_cfs_throttled_periods_total[5m]))
> / sum by(container_name, pod_name, namespace) (increase(container_cpu_cfs_periods_total[5m]))
> > 25

It checks whether, over the last 5 minutes, more than 25% of the periods accounted by the kernel CFS (Completely Fair Scheduler) for a container were throttled due to cgroup constraints. Such throttling can easily go unnoticed when looking only at CPU utilization if the load is "spiky". For example, if Prometheus collects CPU utilization data every 30s, but an application is very active for less than a second and idle for the rest, then CPU utilization will be low, but throttling can still happen (as said in https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/108#issuecomment-432796867).

> One of our cu on case 02681301 is observing CPUThrottlingHigh alert for a pod which is only consuming average around 70m CPU out of 500m limit and raising the limits did not help in suppressing the false alerts,

This alert is based on kernel data, and in its current form I don't see any way for it to be "false". It is entirely possible that the limits were set too low and that the application is active for a limited period of time and "idle" for the rest. This results in low average CPU consumption in graphs, while the kernel CFS still throttles the application during its bursts of high activity. Solving this issue requires observing the `container_cpu_cfs_throttled_periods_total` metric and increasing CPU limits to the point where throttling no longer happens.
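To make the "spiky load" explanation concrete, here is a minimal Python sketch of a toy CFS quota model. It is not how the kernel or Prometheus computes anything; it only mimics the ratio used by the alert expression. The function names (`simulate_cfs`, `throttled_percent`) and the burst pattern (70 ms of CPU work arriving once per second) are hypothetical and chosen only to match the 70m average / 500m limit figures quoted above. The 100 ms period is the kernel's default CFS period. The model also assumes that only periods in which the container has work to run are counted, which is a simplification.

```python
PERIOD_MS = 100  # default CFS period (cpu.cfs_period_us = 100000)


def simulate_cfs(demand_ms, quota_ms):
    """Toy model of CFS bandwidth control.

    demand_ms: CPU work (ms) arriving at the start of each period.
    quota_ms:  CPU time (ms) the container may use per period,
               e.g. a 500m limit corresponds to 50 ms per 100 ms period.

    Returns (periods, throttled_periods, used_ms).
    Assumption: periods with nothing to run are not counted, and work
    that does not fit into the quota carries over to the next period.
    Work still pending after the last period is ignored.
    """
    if quota_ms <= 0:
        raise ValueError("quota_ms must be positive")
    periods = throttled = used = backlog = 0
    for work in demand_ms:
        backlog += work
        if backlog == 0:
            continue  # idle period: not counted in this model
        periods += 1
        ran = min(backlog, quota_ms)
        used += ran
        backlog -= ran
        if backlog > 0:
            throttled += 1  # quota exhausted while work remained
    return periods, throttled, used


def throttled_percent(periods, throttled):
    """Same ratio as the alert expression: 100 * throttled / periods."""
    return 100 * throttled / periods if periods else 0.0


# Hypothetical workload: a 70 ms burst once per second for 5 minutes.
demand = ([70] + [0] * 9) * 300
window_ms = len(demand) * PERIOD_MS

for limit_millicores in (500, 1000):
    quota = limit_millicores * PERIOD_MS // 1000
    periods, throttled, used = simulate_cfs(demand, quota)
    print(
        f"limit={limit_millicores}m "
        f"avg_usage={1000 * used / window_ms:.0f}m "
        f"throttled={throttled_percent(periods, throttled):.0f}%"
    )
```

Under these assumptions, the 500m limit yields an average usage of 70m with 50% of counted periods throttled, well above the 25% threshold, while the 1000m limit yields the same average usage with 0% throttled. The exact numbers depend entirely on the made-up burst pattern; the point is only that low average utilization and a high throttling ratio can coexist.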
Tested with 4.6.0-0.nightly-2020-07-07-141639, CPUThrottlingHigh alert is removed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196