Bug 2030698 - KubePodCrashLooping may fire when pod is not in CrashLoopBackOff
Summary: KubePodCrashLooping may fire when pod is not in CrashLoopBackOff
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: All
OS: All
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.8.z
Assignee: Arunprasad Rajkumar
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 2013617
Blocks:
Reported: 2021-12-09 14:15 UTC by Philip Gough
Modified: 2022-04-27 11:46 UTC (History)
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2013617
Environment:
Last Closed: 2022-04-27 11:46:15 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | kubernetes-monitoring kubernetes-mixin pull 721 | 0 | None | Merged | alerts:KubePodCrashLooping: Adjust alert to avoid non firing when fla… | 2022-03-30 18:03:00 UTC
Github | openshift cluster-monitoring-operator pull 1619 | 0 | None | open | Bug 2030698: KubePodCrashLooping may fire when pod is not in CrashLoopBackOff | 2022-03-30 18:03:03 UTC
Red Hat Product Errata | RHBA-2022:1427 | 0 | None | None | None | 2022-04-27 11:46:26 UTC

Comment 1 W. Trevor King 2021-12-10 00:08:52 UTC
4.9 had a guard that kept it from firing, so we were reducing the number of false positives, and the old bug summary made sense there.  But 4.8 doesn't have that guard today, so in 4.8 we are reducing the number of false positives.  I'm adjusting the summary to reflect that.  It will still be the same alert logic that we brought to 4.9, the summary change just reflects the different starting point that each branch is moving from.

Comment 2 W. Trevor King 2021-12-10 00:09:37 UTC
> 4.9 had a guard that kept it from firing, so we were reducing the number of false positives...

Hit send too soon, I meant we were reducing false negatives in 4.9.

Comment 4 Junqi Zhao 2022-04-13 03:03:20 UTC
Tested with the PR. The KubePodCrashLooping expr is changed to the one below. I used the deployment file from https://bugzilla.redhat.com/show_bug.cgi?id=2006767#c1 and watched for a few minutes; the result of the expr is continuous, with no gaps (see the picture).
        - alert: KubePodCrashLooping
          annotations:
            description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container
              }}) is in waiting state (reason: "CrashLoopBackOff").'
            summary: Pod is crash looping.
          expr: |
            max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics"}[5m]) >= 1
          for: 15m
          labels:
            severity: warning

Comment 11 errata-xmlrpc 2022-04-27 11:46:15 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.39 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1427
