Bug 1945431 - alerts: SystemMemoryExceedsReservation triggers too quickly
Summary: alerts: SystemMemoryExceedsReservation triggers too quickly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Seth Jennings
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks: 1992687 1992688
Reported: 2021-03-31 22:45 UTC by W. Trevor King
Modified: 2021-09-14 10:01 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1992688
Environment:
Last Closed: 2021-07-27 22:57:00 UTC
Target Upstream Version:
Embargoed:


Attachments
SystemMemoryExceedsReservation (152.42 KB, image/png)
2021-04-14 08:47 UTC, Sunil Choudhary


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift machine-config-operator pull 2508 | 0 | None | open | alerts: SystemMemoryExceedsReservation triggers too quickly | 2021-08-30 16:53:39 UTC
Red Hat Knowledge Base (Solution) | 5788171 | 0 | None | None | None | 2021-09-03 16:44:44 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 22:57:28 UTC

Description W. Trevor King 2021-03-31 22:45:47 UTC
The machine-config operator got a manifest entry for SystemMemoryExceedsReservation for 4.6's GA via [1] and bug 1881208.  However, the warning-severity alert should only fire on sustained usage: high system memory usage is a symptom that may peak briefly during startup, and there's no sense in getting folks all excited and then saying "ah, never mind, we're fine after all".  The alert should require the condition to hold for 15m or similar before firing, to reduce false alarms.
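
To illustrate the request, here is a minimal Python sketch of "fire only after the condition has held for a while". It is a toy model, not Prometheus's implementation or the actual rule in the machine-config-operator manifest; the names `SustainedAlert`, `observe`, and `states`, and the 60-second sampling interval, are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SustainedAlert:
    """Toy alert that fires only once a condition has held for hold_seconds."""

    hold_seconds: float
    _since: Optional[float] = None  # time at which the condition last became true

    def observe(self, now: float, condition: bool) -> str:
        """Record one sample and return 'inactive', 'pending', or 'firing'."""
        if not condition:
            # Any sample below the threshold resets the hold timer.
            self._since = None
            return "inactive"
        if self._since is None:
            self._since = now
        held_for = now - self._since
        return "firing" if held_for >= self.hold_seconds else "pending"


def states(samples: List[bool], hold_seconds: float = 15 * 60, step: float = 60) -> List[str]:
    """Feed evenly spaced samples (hypothetical 60s interval) through a fresh alert."""
    alert = SustainedAlert(hold_seconds)
    return [alert.observe(i * step, sample) for i, sample in enumerate(samples)]


# A brief startup spike: over the reservation for 5 samples, then back under.
startup_spike = [True] * 5 + [False] * 3
# Sustained pressure: over the reservation for 16 consecutive samples.
sustained = [True] * 16

print(states(startup_spike))
print(states(sustained)[-1])
```

With a 15-minute hold, the startup spike only ever reaches the pending state, while the sustained case moves from pending to firing once the condition has been true for the full hold period.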

Probably deserves backporting to 4.7, hence this bug, and once we have the backport train in motion, seems like we might as well take it all the way back to 4.6 (and since the alert didn't exist in 4.5, there's no need to go further than that).

[1]: https://github.com/openshift/machine-config-operator/pull/2033

Comment 3 Sunil Choudhary 2021-04-14 08:45:04 UTC
Verified on 4.8.0-0.nightly-2021-04-13-171608.

SystemMemoryExceedsReservation alert added.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-13-171608   True        False         68m     Cluster version is 4.8.0-0.nightly-2021-04-13-171608

Comment 4 Sunil Choudhary 2021-04-14 08:47:20 UTC
Created attachment 1771806
SystemMemoryExceedsReservation

Comment 7 errata-xmlrpc 2021-07-27 22:57:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 8 Mike Murphy 2021-08-11 14:02:37 UTC
Do we know if this fix will be backported to OCP 4.7 ?

Comment 10 Shivkumar Ople 2021-08-25 19:37:39 UTC
Hi Mike and Vedanti,

Yes, this is backported to 4.7.25; the backport bugzilla [1] is in the VERIFIED state. The fix should arrive in the next patch release, 4.7.25.


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1992687

- Below is the backport PR for release-4.7 

https://github.com/openshift/machine-config-operator/pull/2710

- These are the changes in 4.7

https://github.com/openshift/machine-config-operator/blob/release-4.7/install/0000_90_machine-config-operator_01_prometheus-rules.yaml#L50-L59

Best,
Shivkumar Ople

Comment 11 Shivkumar Ople 2021-08-28 08:46:18 UTC
Hi,

One update here.

OpenShift engineering has decided NOT to ship 4.7.25 due to a blocker bug, so this backport should be part of the next patch release.

Thanks!

Best,
Shivkumar Ople

Comment 12 oarribas 2021-09-10 13:46:29 UTC
Already backported to OCP 4.7.28 as BZ 1992687 [1].



[1] https://bugzilla.redhat.com/show_bug.cgi?id=1992687

