Bug 1858010 - KubePodCrashLooping is alerting on critical severity
Summary: KubePodCrashLooping is alerting on critical severity
Keywords:
Status: POST
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.5.z
Assignee: Pawel Krupa
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 1858008
Blocks:
Reported: 2020-07-16 20:23 UTC by Rick Rackow
Modified: 2020-09-14 09:11 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1858008
Environment:
Last Closed:
Target Upstream Version:




Links
System | ID | Priority | Status | Summary | Last Updated
Github | kubernetes-monitoring kubernetes-mixin pull 501 | None | closed | Backport #414 | 2020-09-18 13:56:39 UTC
Github | openshift cluster-monitoring-operator pull 926 | None | closed | Bug 1858010: decrease alerts severity | 2020-09-21 13:26:10 UTC

Description Rick Rackow 2020-07-16 20:23:49 UTC
+++ This bug was initially created as a clone of Bug #1858008 +++

Description of problem:
KubePodCrashLooping fires at critical severity.
Under current best practices it should fire at warning severity instead, because it is a cause-based alert rather than a symptom-based alert.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Cause a crashloop

Actual results:
`severity: critical`

Expected results:
`severity: warning`

Additional info:
This has already been fixed upstream [1] and needs to be brought into cluster monitoring.


[1] https://github.com/kubernetes-monitoring/kubernetes-mixin/commit/050dedeba07b0ebd782beebef63f6c0168713ff3

Comment 3 Junqi Zhao 2020-09-14 08:52:19 UTC
In 4.6.0-0.nightly-2020-09-12-230035, the time range in the KubePodCrashLooping expr is 5m:
        expr: |
          rate(kube_pod_container_status_restarts_total{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics"}[5m]) * 60 * 5 > 0

In 4.5.0-0.nightly-2020-09-12-063044, the time range in the KubePodCrashLooping expr is 15m; I think it would be better to change it to 5m:
        expr: |
          rate(kube_pod_container_status_restarts_total{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics"}[15m]) * 60 * 5 > 0
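
To make the difference between the two range selectors concrete, here is a rough Python sketch of what the expression computes. It is an illustration only, not Prometheus's implementation: `simple_rate` is a hypothetical helper that ignores the extrapolation and counter-reset handling the real rate() performs, and the sample series (30s scrapes, a single restart at t=600s) is invented for the example.

```python
def simple_rate(samples, now, window_s):
    """Per-second increase of a counter over [now - window_s, now].

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    Simplification: no extrapolation and no counter-reset handling,
    both of which Prometheus's real rate() performs.
    """
    in_window = [(t, v) for t, v in samples if now - window_s <= t <= now]
    if len(in_window) < 2:
        return 0.0
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)


def restarts_per_5m(samples, now, window_s):
    # Mirrors the "* 60 * 5" in the rule: per-second rate scaled to 5 minutes.
    return simple_rate(samples, now, window_s) * 60 * 5


# Hypothetical series: scraped every 30s for 30 minutes, one restart at t=600s.
samples = [(t, 0 if t < 600 else 1) for t in range(0, 1801, 30)]


def seconds_nonzero(window_s):
    # Total time the expression stays > 0 after the single restart,
    # evaluated on the same 30s grid as the scrapes.
    return sum(30 for now in range(0, 1801, 30)
               if restarts_per_5m(samples, now, window_s) > 0)


if __name__ == "__main__":
    print("5m window: ", seconds_nonzero(5 * 60), "seconds above zero")
    print("15m window:", seconds_nonzero(15 * 60), "seconds above zero")
```

Under these assumptions, a single restart keeps the expression above zero for roughly the length of the range selector: about 5 minutes with [5m] and about 15 minutes with [15m]. Combined with a `for: 15m` clause, the [15m] form can therefore sit at the edge of firing on one restart, while the [5m] form needs restarts to keep recurring.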

Comment 4 Junqi Zhao 2020-09-14 08:57:17 UTC
4.5.0-0.nightly-2020-09-12-063044
      - alert: KubePodCrashLooping
        annotations:
          message: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container
            }}) is restarting {{ printf "%.2f" $value }} times / 5 minutes.
        expr: |
          rate(kube_pod_container_status_restarts_total{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics"}[15m]) * 60 * 5 > 0
        for: 15m
        labels:
          severity: warning

