Bug 1807140 - Alert on OOMKills on the cluster as a symptom of disruptive workloads or bugs
Summary: Alert on OOMKills on the cluster as a symptom of disruptive workloads or bugs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Lili Cosic
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On: 1807139
Blocks: 1807141
Reported: 2020-02-25 17:09 UTC by Clayton Coleman
Modified: 2020-05-04 11:42 UTC (History)

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: New MultipleContainersOOMKilled alert within monitoring. Reason: Alerts when multiple containers are OOM-killed. This can help diagnose an overloaded cluster. Result:
Clone Of: 1807139
Clones: 1807141
Environment:
Last Closed: 2020-05-04 11:42:23 UTC
Target Upstream Version:


Links
System | ID | Priority | Status | Summary | Last Updated
Github | openshift cluster-monitoring-operator pull 676 | None | closed | Bug 1807140: jsonnet/rules.jsonnet: Add MultipleContainersOOMKilled alert | 2020-08-18 08:42:03 UTC
Red Hat Product Errata | RHBA-2020:0581 | None | None | None | 2020-05-04 11:42:46 UTC

Description Clayton Coleman 2020-02-25 17:09:55 UTC
+++ This bug was initially created as a clone of Bug #1807139 +++

An OOMKill on a cluster can be disruptive to workloads and infrastructure both immediately and over time (if a component partially fails).  We should alert when a significant number of OOMKills have occurred.

As a starting point, we should pick a rate that we believe is likely to indicate serious problems and tune it down (to catch more issues) after we assess the impact in the field.  The alert should be at 'info' level for now in order to allow time for assessment.
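
A minimal Python sketch of the thresholding idea described above: count the OOMKills seen in a recent window and raise an 'info' level alert once the count exceeds a chosen limit. The 15-minute window and the threshold of 5 are placeholder assumptions, not values taken from this bug, and `Alert` and `check_oom_kills` are hypothetical names. The shipped alert is a Prometheus rule defined in jsonnet/rules.jsonnet, which is not reproduced here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record; not part of any real monitoring API."""
    name: str
    severity: str
    count: int


def check_oom_kills(
    kill_times: Iterable[datetime],
    now: datetime,
    window: timedelta = timedelta(minutes=15),  # assumed window
    threshold: int = 5,                         # assumed threshold
) -> Optional[Alert]:
    """Return an info-level alert if more than `threshold` OOMKills
    occurred within `window` before `now`, otherwise None."""
    recent = sum(1 for t in kill_times if now - window <= t <= now)
    if recent > threshold:
        return Alert("MultipleContainersOOMKilled", "info", recent)
    return None


if __name__ == "__main__":
    now = datetime(2020, 2, 25, 17, 0)
    # Six kills inside the window, one well outside it.
    kills = [now - timedelta(minutes=m) for m in (1, 2, 3, 5, 8, 13, 40)]
    print(check_oom_kills(kills, now))
```

Lowering `threshold` later corresponds to "tuning it down" to catch more issues once the impact in the field has been assessed.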

This should be backported to 4.3, where OOMKills may have caused significant production issues for a few customers.

Comment 5 errata-xmlrpc 2020-05-04 11:42:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

