1858010 – KubePodCrashLooping is alerting on critical severity

Summary: KubePodCrashLooping is alerting on critical severity

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	4.5.z
Assignee:	Pawel Krupa
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:	1858008
Blocks:

Reported:	2020-07-16 20:23 UTC by Rick Rackow
Modified:	2020-11-10 14:54 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1858008
Environment:
Last Closed:	2020-11-10 14:53:52 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	kubernetes-monitoring kubernetes-mixin pull 501	None	closed	Backport #414	2020-12-07 11:15:28 UTC
Github	openshift cluster-monitoring-operator pull 926	None	closed	Bug 1858010: decrease alerts severity	2020-12-07 11:15:29 UTC
Github	openshift cluster-monitoring-operator pull 952	None	closed	Bug 1858010: Backport fix for KubePodCrashLooping alert	2020-12-07 11:15:29 UTC
Red Hat Product Errata	RHBA-2020:4425	None	None	None	2020-11-10 14:54:10 UTC

Description Rick Rackow 2020-07-16 20:23:49 UTC

+++ This bug was initially created as a clone of Bug #1858008 +++

Description of problem:
KubePodCrashLooping is alerting on critical severity.
Under current best practices this should be a warning-level alert instead, since it is a cause-based alert rather than a symptom-based one.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Cause a crashloop

Actual results:
`severity: critical`

Expected results:
`severity: warning`

Additional info:
This has already been fixed upstream [1] and needs to be carried over into cluster monitoring.


[1] https://github.com/kubernetes-monitoring/kubernetes-mixin/commit/050dedeba07b0ebd782beebef63f6c0168713ff3

Comment 3 Junqi Zhao 2020-09-14 08:52:19 UTC

In 4.6.0-0.nightly-2020-09-12-230035, the time range for the KubePodCrashLooping expr is 5m:
        expr: |
          rate(kube_pod_container_status_restarts_total{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics"}[5m]) * 60 * 5 > 0

In 4.5.0-0.nightly-2020-09-12-063044, the time range for the KubePodCrashLooping expr is 15m. I think it would be better to change it to 5m:
        expr: |
          rate(kube_pod_container_status_restarts_total{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics"}[15m]) * 60 * 5 > 0
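To illustrate what the range selector changes, here is a minimal Python sketch of the `rate(...) * 60 * 5 > 0` arithmetic. It is not how Prometheus computes `rate()`: the `simple_rate` helper, the 30s scrape interval, and the sample data are all assumptions made up for this example, and extrapolation and counter resets are ignored. It only shows that, after a single restart, the expression stays above zero for roughly the length of the window.

```python
def simple_rate(samples, now, window_s):
    """Approximate per-second rate of a counter over [now - window_s, now].

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    Simplification: no extrapolation and no counter-reset handling,
    unlike Prometheus's real rate().
    """
    in_window = [(t, v) for t, v in samples if now - window_s <= t <= now]
    if len(in_window) < 2:
        return 0.0
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)


def restarts_per_5m(samples, now, window_s):
    # Mirrors the "* 60 * 5" in the alert expression:
    # per-second rate -> restarts per 5 minutes.
    return simple_rate(samples, now, window_s) * 60 * 5


# Hypothetical scrape data: one sample every 30s for 30 minutes,
# with a single container restart at t = 600s.
samples = [(t, 1 if t >= 600 else 0) for t in range(0, 1801, 30)]


def seconds_above_zero(window_s):
    # Evaluate the expression every 30s and add up the time it is > 0.
    return sum(
        30
        for now in range(0, 1801, 30)
        if restarts_per_5m(samples, now, window_s) > 0
    )


print(seconds_above_zero(5 * 60))   # [5m] window
print(seconds_above_zero(15 * 60))  # [15m] window
```

Under these assumptions the single restart keeps the expression above zero for 300 seconds with the [5m] window and for 900 seconds with the [15m] window.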

Comment 4 Junqi Zhao 2020-09-14 08:57:17 UTC

4.5.0-0.nightly-2020-09-12-063044
      - alert: KubePodCrashLooping
        annotations:
          message: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container
            }}) is restarting {{ printf "%.2f" $value }} times / 5 minutes.
        expr: |
          rate(kube_pod_container_status_restarts_total{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics"}[15m]) * 60 * 5 > 0
        for: 15m
        labels:
          severity: warning

Comment 9 Junqi Zhao 2020-11-02 01:55:32 UTC

The fix is in 4.5.0-0.nightly-2020-10-31-200727. Since we had already verified it with the not-yet-merged PR, moving it to VERIFIED.

Comment 12 errata-xmlrpc 2020-11-10 14:53:52 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.18 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4425
