Bug 1813077

Summary: How to handle alert KubeCPUOvercommit
Product: OpenShift Container Platform
Reporter: Hongkai Liu <hongkliu>
Component: Monitoring
Assignee: Sergiusz Urbaniak <surbania>
Status: CLOSED DUPLICATE
QA Contact: Junqi Zhao <juzhao>
Severity: unspecified
Priority: unspecified
Version: 4.3.0
CC: alegrand, anpicker, erooth, kakkoyun, lcosic, mloibl, pkrupa, surbania
Last Closed: 2020-03-13 11:29:55 UTC
Type: Bug

Description Hongkai Liu 2020-03-12 21:40:59 UTC
oc get clusterversion --context build01
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-03-04-222846   True        False         7d5h    Cluster version is 4.3.0-0.nightly-2020-03-04-222846


Created attachment 1669779
alert.KubeCPUOvercommit

Alertmanager is set up on a CI build farm cluster (OCP 4.3) to send notifications to Slack.

Recently, alerts have been firing, and I am not sure how to debug or fix them.


[FIRING:1] KubeCPUOvercommit (openshift-monitoring/k8s warning)
Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.
https://coreos.slack.com/archives/CV1UZU53R/p1584031232061900


Are we supposed to silence this alert when the autoscaler is working properly (since the autoscaler should add more nodes when the cluster is short of CPU)? If not, what should the debugging procedure be?
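
As a starting point for debugging, it may help to see what the alert is probably measuring. The sketch below assumes the rule has the shape used in the upstream kubernetes-mixin project: the sum of CPU requests across Pods, divided by the sum of allocatable CPU across nodes, compared against (nodes - 1) / nodes. The rule shipped with this cluster version may differ and should be checked against the rule definition in the cluster. The function name and the sample numbers are hypothetical.

```python
def cpu_requests_overcommitted(requested_cores, node_allocatable_cores):
    """Return True if the summed CPU requests exceed what the cluster
    could hold after losing one node.

    ASSUMPTION: mirrors the upstream kubernetes-mixin rule shape,
        requests / allocatable > (nodes - 1) / nodes
    and is not taken from this cluster's actual rule.
    """
    nodes = len(node_allocatable_cores)
    if nodes == 0:
        raise ValueError("at least one node is required")
    total_allocatable = sum(node_allocatable_cores)
    # Cross-multiplied form of the inequality, which avoids comparing
    # two rounded floating-point quotients.
    return requested_cores * nodes > total_allocatable * (nodes - 1)


# Hypothetical cluster: three nodes with 4 allocatable cores each.
nodes = [4, 4, 4]
print(cpu_requests_overcommitted(9, nodes))  # requests above 8 cores
print(cpu_requests_overcommitted(8, nodes))  # requests exactly at 8 cores
```

Under this assumption the alert compares CPU requests, not actual CPU usage, against capacity. It can therefore fire on a cluster whose nodes are mostly idle, and it should clear once requests drop or nodes are added.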

Another (not closely related) issue:

There are 2 alerts with the same name, "KubeCPUOvercommit". See the attached snapshot.
Is this intended?

Comment 1 Hongkai Liu 2020-03-12 21:42:28 UTC
Related to, and separated from, https://bugzilla.redhat.com/show_bug.cgi?id=1813069

Comment 2 Paul Gier 2020-03-13 01:38:04 UTC
Work is under way to improve the resource settings for monitoring components, which may help with this: https://bugzilla.redhat.com/show_bug.cgi?id=1812719

Comment 3 Sergiusz Urbaniak 2020-03-13 11:29:55 UTC

*** This bug has been marked as a duplicate of bug 1812999 ***