Bug 1945431

Summary: alerts: SystemMemoryExceedsReservation triggers too quickly
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Node
Assignee: Seth Jennings <sjenning>
Node sub component: Kubelet
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: medium
CC: aos-bugs, david.karlsen, maupadhy, micmurph, oarribas, sople, vjaypurk
Version: 4.6
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 1992688
Environment:
Last Closed: 2021-07-27 22:57:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1992687, 1992688
Attachments:
Description Flags
SystemMemoryExceedsReservation none

Description W. Trevor King 2021-03-31 22:45:47 UTC
The machine-config operator got a manifest entry for SystemMemoryExceedsReservation in time for 4.6's GA via [1] and bug 1881208.  However, the warning-severity alert should only fire on sustained usage, because this is a symptom that may peak briefly during startup, and there's no sense in getting folks all excited and then saying "ah, never mind, we're fine after all".  There should be a delay of 15m or similar before the alert fires, to reduce the number of false alarms.
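
To illustrate why a hold duration suppresses brief startup peaks, here is a minimal Python sketch of the idea. It is a toy model, not the operator's actual rule: it assumes the 15m delay behaves like a Prometheus-style "for" duration (the condition must stay true across consecutive evaluations before the alert fires), and the names PendingAlert and simulate, as well as the one-minute evaluation interval, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PendingAlert:
    """Toy model of an alert with a hold duration (hypothetical, for illustration)."""

    hold_seconds: float
    active_since: Optional[float] = None  # time the condition first became true

    def evaluate(self, now: float, condition: bool) -> str:
        """Return 'inactive', 'pending', or 'firing' for one evaluation."""
        if not condition:
            # Any dip below the threshold resets the timer.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.hold_seconds:
            return "firing"
        return "pending"


def simulate(samples: List[bool], hold_seconds: float, interval: float = 60.0) -> List[str]:
    """Evaluate the alert once per interval; samples[i] is the condition at tick i."""
    alert = PendingAlert(hold_seconds)
    return [alert.evaluate(i * interval, s) for i, s in enumerate(samples)]


if __name__ == "__main__":
    # Five minutes over the reservation during startup, then back to normal.
    startup_spike = [True] * 5 + [False] * 5
    print(simulate(startup_spike, hold_seconds=0))        # fires right away
    print(simulate(startup_spike, hold_seconds=15 * 60))  # only ever pending
```

With no hold duration the startup spike fires immediately and then resolves, which is the false alarm described above; with a 15-minute hold the same spike never gets past the pending state, while usage that stays high for the full 15 minutes still fires.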

Probably deserves backporting to 4.7, hence this bug, and once we have the backport train in motion, seems like we might as well take it all the way back to 4.6 (and since the alert didn't exist in 4.5, there's no need to go further than that).

[1]: https://github.com/openshift/machine-config-operator/pull/2033

Comment 3 Sunil Choudhary 2021-04-14 08:45:04 UTC
Verified on 4.8.0-0.nightly-2021-04-13-171608.

SystemMemoryExceedsReservation alert added.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-13-171608   True        False         68m     Cluster version is 4.8.0-0.nightly-2021-04-13-171608

Comment 4 Sunil Choudhary 2021-04-14 08:47:20 UTC
Created attachment 1771806
SystemMemoryExceedsReservation

Comment 7 errata-xmlrpc 2021-07-27 22:57:00 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 8 Mike Murphy 2021-08-11 14:02:37 UTC
Do we know if this fix will be backported to OCP 4.7?

Comment 10 Shivkumar Ople 2021-08-25 19:37:39 UTC
Hi Mike and Vedanti,

Yes, this has been backported for 4.7.25; the backport Bugzilla [1] is in the VERIFIED state. The fix should arrive in the next minor release, 4.7.25.


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1992687

- Below is the backport PR for release-4.7:

https://github.com/openshift/machine-config-operator/pull/2710

- These are the changes in 4.7:

https://github.com/openshift/machine-config-operator/blob/release-4.7/install/0000_90_machine-config-operator_01_prometheus-rules.yaml#L50-L59

Best,
Shivkumar Ople

Comment 11 Shivkumar Ople 2021-08-28 08:46:18 UTC
Hi,

One update here.

OpenShift engineering has decided NOT to ship 4.7.25 due to a blocker bug, so this backport should be part of the next minor release.

Thanks!

Best,
Shivkumar Ople

Comment 12 oarribas 2021-09-10 13:46:29 UTC
Already backported to OCP 4.7.28 as BZ 1992687 [1].



[1] https://bugzilla.redhat.com/show_bug.cgi?id=1992687