Bug 2006767 - KubePodCrashLooping may not fire
Summary: KubePodCrashLooping may not fire
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.10
Hardware: All
OS: All
Target Milestone: ---
: 4.10.0
Assignee: Philip Gough
QA Contact: Junqi Zhao
Depends On:
Blocks: 2013617
Reported: 2021-09-22 11:22 UTC by Philip Gough
Modified: 2022-03-30 12:15 UTC (History)
5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2013617
Last Closed: 2022-03-10 16:12:32 UTC
Target Upstream Version:

Attachments (Terms of Use)
Screenshot of metric that needs to match AND condition of alert (309.06 KB, image/png)
2021-09-22 11:25 UTC, Philip Gough

System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1423 0 None open BUG 2006767: Updates KubePodCrashLooping expression 2021-10-11 08:51:15 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:13:04 UTC

Description Philip Gough 2021-09-22 11:22:31 UTC
Description of problem:

As reported upstream in https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/645 - the KubePodCrashLooping alert may never fire for a pod that is crash looping.

This happens because the condition `kube_pod_container_status_waiting{} == 1` can flap between 0 and 1 as the pod alternates between crashing and restarting, which resets the alert's `for` timer before it can fire.
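For reference, the pre-fix rule had roughly the following shape (a sketch based on the upstream mixin; the exact label matchers and range in the shipped rule may differ):

```promql
# Sketch of the pre-fix rule shape (assumption -- exact matchers/ranges may
# differ): both sides of the `and` must hold at every evaluation, so any
# scrape where the container is not "waiting" resets the `for: 15m` timer.
rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[5m]) * 60 * 5 > 0
and
kube_pod_container_status_waiting{job="kube-state-metrics"} == 1
```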

Version-Release number of selected component (if applicable):
4.10 nightly

How reproducible:

Frequently, but not guaranteed; it depends on the cause of the crash, the scrape interval, and timing.

Steps to Reproduce:
1. Create the provided 'Crashing deployment' in an openshift-* namespace
2. Observe that the pod moves in and out of the CrashLoopBackOff state for more than 15 minutes
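The 'Crashing deployment' attachment is not reproduced here; a minimal deployment of the kind described (an illustrative sketch, assuming any always-failing command; the name, namespace, and image are placeholders) could look like:

```yaml
# Hypothetical crash-looping Deployment: the container exits after a few
# seconds, so the kubelet restarts it with increasing CrashLoopBackOff delays.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: crashing-deployment
  namespace: openshift-monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: crasher
  template:
    metadata:
      labels:
        app: crasher
    spec:
      containers:
      - name: crasher
        image: registry.access.redhat.com/ubi8/ubi-minimal
        command: ["/bin/sh", "-c", "sleep 5; exit 1"]
```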

Actual results:

In the Alertmanager UI, the alert does not fire when there are gaps in the metric.

Expected results:

The alert fires whenever a pod crash loops for longer than the `for` duration.

Additional info:

Comment 3 Philip Gough 2021-09-22 11:25:57 UTC
Created attachment 1825263 [details]
Screenshot of metric that needs to match AND condition of alert

Comment 6 Junqi Zhao 2021-10-13 11:51:17 UTC
Tested with the PR. The KubePodCrashLooping expression is changed as shown below; watched for a few minutes, and the result of the expression is continuous:
    - alert: KubePodCrashLooping
      annotations:
        description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container
          }}) is in waiting state (reason: "CrashLoopBackOff").'
        summary: Pod is crash looping.
      expr: |
        max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics"}[5m]) >= 1
      for: 15m
      labels:
        severity: warning
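Why `max_over_time` closes the gaps can be sketched with a toy simulation (an illustration only, not the actual Prometheus evaluation engine; the sample period and flap pattern are assumptions): a flapping 0/1 series defeats an instantaneous `== 1` check, but stays continuously positive once smoothed over a 5m window.

```python
# Toy model: one sample every 30s, so a 5m window is 10 samples.
# 1 = container waiting with reason CrashLoopBackOff, 0 = briefly running
# between restarts. The pod flaps: 9 waiting samples, then 1 running sample.
def max_over_time(series, i, window):
    """Max of the `window` samples ending at index i (inclusive)."""
    lo = max(0, i - window + 1)
    return max(series[lo:i + 1])

series = ([1] * 9 + [0]) * 4  # ~20 minutes of flapping

# Old-style condition: kube_pod_container_status_waiting == 1.
# Every 0 sample makes it false, resetting the alert's `for: 15m` timer.
old_pending = [s == 1 for s in series]

# New-style condition: max_over_time(...[5m]) >= 1 over a 10-sample window.
new_pending = [max_over_time(series, i, 10) >= 1 for i in range(len(series))]

print(all(old_pending[10:]))  # False: gaps keep resetting the timer
print(all(new_pending))       # True: the smoothed condition never drops
```

The smoothed expression stays true through the brief running intervals, so the `for: 15m` countdown is never interrupted and the alert eventually fires.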

Comment 9 Junqi Zhao 2021-10-18 02:50:38 UTC
Checked with 4.10.0-0.nightly-2021-10-16-173656; the fix is in the payload. Along with Comment 6, setting to VERIFIED.

Comment 12 errata-xmlrpc 2022-03-10 16:12:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

