+++ This bug was initially created as a clone of Bug #1807140 +++
+++ This bug was initially created as a clone of Bug #1807139 +++
An OOMKill on a cluster can be disruptive to workloads and infrastructure, both immediately and over time (for example, if a component partially fails). We should alert when a significant number of OOMKills have occurred.
As a starting point, we should pick a rate that we believe is likely to indicate serious problems, then lower that threshold (to catch more issues) once we have assessed the impact in the field. The alert should be at 'info' level for now to allow time for that assessment.
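To make the proposal concrete, here is a minimal sketch of the thresholding logic, assuming OOMKills are exposed as a cumulative counter sampled over time. The window length, threshold, alert name, and function names are hypothetical placeholders, not values taken from this bug or from any shipped rule.

```python
from typing import List, Optional, Tuple

# Hypothetical values for illustration only; this bug does not specify them.
WINDOW_SECONDS = 600         # look-back window for counting OOMKills
THRESHOLD_KILLS = 3          # start high, lower later to catch more issues
SEVERITY = "info"            # 'info' for now, to allow time for assessment
ALERT_NAME = "ExampleOOMKillRateHigh"  # hypothetical alert name


def oom_kills_in_window(samples: List[Tuple[float, int]], now: float,
                        window: float = WINDOW_SECONDS) -> int:
    """Return the increase of a cumulative OOMKill counter over the window.

    samples: (unix_timestamp, cumulative_count) pairs sorted by timestamp.
    """
    in_window = [count for ts, count in samples if now - window <= ts <= now]
    if len(in_window) < 2:
        return 0  # not enough data points to measure an increase
    # Crude handling of a counter reset: never report a negative increase.
    return max(in_window[-1] - in_window[0], 0)


def evaluate_alert(samples: List[Tuple[float, int]], now: float,
                   threshold: int = THRESHOLD_KILLS) -> Optional[dict]:
    """Return an alert description if the threshold is reached, else None."""
    kills = oom_kills_in_window(samples, now)
    if kills >= threshold:
        return {"alert": ALERT_NAME, "severity": SEVERITY, "kills": kills}
    return None
```

Tuning the alert down after field assessment would then only mean lowering `THRESHOLD_KILLS` (or shortening the window), without changing the severity logic.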
This should be backported to 4.3, where OOMKills may have caused significant production issues for a few customers.
I will not have time to do this backport because of other, higher-priority Bugzilla bugs and tasks. Resetting to the node team.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.