1813077 – How to handle alert KubeCPUOvercommit

Summary: How to handle alert KubeCPUOvercommit

Keywords:
Status:	CLOSED DUPLICATE of bug 1812999
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	---
Assignee:	Sergiusz Urbaniak
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-03-12 21:40 UTC by Hongkai Liu
Modified:	2020-03-13 11:29 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-03-13 11:29:55 UTC
Target Upstream Version:
Embargoed:

Description Hongkai Liu 2020-03-12 21:40:59 UTC

oc get clusterversion --context build01
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-03-04-222846   True        False         7d5h    Cluster version is 4.3.0-0.nightly-2020-03-04-222846


Created attachment 1669779
alert.KubeCPUOvercommit

Alertmanager is set up on a CI build farm cluster (OCP 4.3) to send notifications to Slack.

Alerts have been firing recently, and I am not sure how to debug or fix them.


[FIRING:1] KubeCPUOvercommit (openshift-monitoring/k8s warning)
Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.
https://coreos.slack.com/archives/CV1UZU53R/p1584031232061900


Are we supposed to silence it when the autoscaler is working properly (since the autoscaler should add more nodes when the cluster is short of CPU)? If not, what should the debugging procedure be?
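
For context, here is a rough sketch of the condition the alert message describes. It assumes (this is not confirmed anywhere in this report) that the rule compares the total CPU requested by pods with the allocatable CPU that would remain if one node's share were lost. The function name and the numbers are hypothetical, for illustration only.

```python
def cpu_overcommitted(requested_cores, allocatable_per_node):
    """Return True if pod CPU requests could not tolerate losing one node.

    Assumed rule (not taken from this report):
        requested / total_allocatable > (n - 1) / n
    where n is the number of nodes.
    """
    n = len(allocatable_per_node)
    if n == 0:
        raise ValueError("need at least one node")
    total = sum(allocatable_per_node)
    # Cross-multiplied form of the inequality, to avoid float division.
    return requested_cores * n > total * (n - 1)


# Hypothetical cluster: three nodes with 4 allocatable cores each.
print(cpu_overcommitted(10, [4, 4, 4]))  # requests exceed two nodes' worth
print(cpu_overcommitted(7, [4, 4, 4]))   # requests fit on two nodes
```

Under this reading, a working autoscaler raises the allocatable total, so the condition would clear once new nodes join, provided requests do not keep growing at the same pace.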

Another (not very related) issue:

There are 2 alerts with the same name, "KubeCPUOvercommit". See the attached snapshot.
Is this intended?

Comment 1 Hongkai Liu 2020-03-12 21:42:28 UTC

Related to, but separated from, https://bugzilla.redhat.com/show_bug.cgi?id=1813069

Comment 2 Paul Gier 2020-03-13 01:38:04 UTC

Work is under way to improve the resource settings for monitoring components, which may help with this: https://bugzilla.redhat.com/show_bug.cgi?id=1812719

Comment 3 Sergiusz Urbaniak 2020-03-13 11:29:55 UTC


*** This bug has been marked as a duplicate of bug 1812999 ***
