Description of problem:
As reported upstream in https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/645, the KubePodCrashLooping alert may never fire for a pod that is crash looping. This is due to the condition `kube_pod_container_status_waiting{} == 1` potentially flapping as a pod crashes.

Version-Release number of selected component (if applicable):
4.10 nightly

How reproducible:
Frequently but not guaranteed, depending on the cause of the crash, the scrape interval and timings.

Steps to Reproduce:
1. Create the provided 'Crashing deployment' in an openshift-* namespace (a sketch of such a deployment is included under Additional info below)
2. Observe that the pod comes in and out of CrashLoopBackOff state for > 15 minutes

Actual results:
The alert does not fire in the Alertmanager UI when there are gaps in the waiting-state metric.

Expected results:
The alert fires in all cases.

Additional info:
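The attached 'Crashing deployment' is not reproduced in this report; the following is only a minimal sketch of a deployment that crash loops in the way step 1 requires. The name, namespace and image are illustrative placeholders, not taken from the attachment.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: crasher                     # placeholder name
      namespace: openshift-crash-test   # any openshift-* namespace
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: crasher
      template:
        metadata:
          labels:
            app: crasher
        spec:
          containers:
          - name: crasher
            image: registry.access.redhat.com/ubi8/ubi-minimal   # placeholder image
            # Exit immediately with a non-zero code so the kubelet keeps
            # restarting the container and it cycles in and out of
            # CrashLoopBackOff.
            command: ["/bin/sh", "-c", "exit 1"]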
Created attachment 1825263 [details]
Screenshot of the metric that needs to match the AND condition of the alert
Tested with the PR. The KubePodCrashLooping expr is changed to the below; watched for a few minutes and the result for the expr is continuous.

    - alert: KubePodCrashLooping
      annotations:
        description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is in waiting state (reason: "CrashLoopBackOff").'
        summary: Pod is crash looping.
      expr: |
        max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics"}[5m]) >= 1
      for: 15m
      labels:
        severity: warning
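For context, the behavioral difference between the instantaneous condition quoted in the description and the fixed expression is roughly as follows (label selectors trimmed here for readability, not the exact pre-fix rule):

    # Instantaneous waiting-state condition quoted in the description: the
    # series drops out whenever the container briefly leaves the waiting
    # state, which resets the alert's 15m "for" timer so it may never fire.
    kube_pod_container_status_waiting{job="kube-state-metrics"} == 1

    # Expression from the PR: taking the 5m maximum bridges short gaps
    # between scrapes, so the result stays >= 1 continuously and the 15m
    # "for" clause can be satisfied.
    max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", job="kube-state-metrics"}[5m]) >= 1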
Checked with 4.10.0-0.nightly-2021-10-16-173656; the fix is in the payload. Along with Comment 6, setting to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056