Bug 2006767

Summary: KubePodCrashLooping may not fire
Product: OpenShift Container Platform
Component: Monitoring
Version: 4.10
Target Release: 4.10.0
Hardware: All
OS: All
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Philip Gough <pgough>
Assignee: Philip Gough <pgough>
QA Contact: Junqi Zhao <juzhao>
CC: amuller, anpicker, aos-bugs, erooth, wking
Type: Bug
Last Closed: 2022-03-10 16:12:32 UTC
Bug Blocks: 2013617
Attachments: Screenshot of metric that needs to match AND condition of alert (flags: none)

Description Philip Gough 2021-09-22 11:22:31 UTC
Description of problem:

As reported upstream in https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/645, the KubePodCrashLooping alert may never fire for a pod that is crash looping.

This happens because the condition `kube_pod_container_status_waiting{} == 1` can flap as the pod crashes and restarts. Each time the condition stops matching, the alert's pending period starts over, so the alert may never reach the firing state.
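
The effect of a flapping condition on an alert with a `for` clause can be illustrated with a small simulation. This is only a simplified sketch, not how Prometheus is implemented: the function `alert_fires`, the one-evaluation-per-minute cadence, and the sample data are all assumptions made up for illustration.

```python
from typing import Sequence


def alert_fires(condition: Sequence[bool], for_steps: int) -> bool:
    """Return True if the condition holds for `for_steps` consecutive evaluations.

    Simplified model of a `for:` clause: any evaluation where the
    condition is false resets the pending counter to zero.
    """
    pending = 0
    for active in condition:
        pending = pending + 1 if active else 0
        if pending >= for_steps:
            return True
    return False


# Hypothetical data: one rule evaluation per minute for 30 minutes.
# The waiting condition drops out every 4th minute while the container restarts.
flapping = [minute % 4 != 3 for minute in range(30)]
steady = [True] * 30

print(alert_fires(flapping, for_steps=15))  # False
print(alert_fires(steady, for_steps=15))    # True
```

In this model the flapping series never accumulates more than three consecutive matches, so a 15-step pending period is never satisfied even though the pod is crash looping the whole time.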


Version-Release number of selected component (if applicable):
4.10 nightly

How reproducible:

Frequently, but not guaranteed. It depends on the cause of the crash, the scrape interval, and timing.

Steps to Reproduce:
1. Create the provided 'Crashing deployment' in an openshift-* namespace 
2. Observe that the pod moves in and out of the CrashLoopBackOff state for more than 15 minutes

Actual results:

In the Alertmanager UI, the alert does not fire if there are gaps in that metric.

Expected results:

The alert fires in all cases.

Additional info:

Comment 3 Philip Gough 2021-09-22 11:25:57 UTC
Created attachment 1825263 [details]
Screenshot of metric that needs to match AND condition of alert

Comment 6 Junqi Zhao 2021-10-13 11:51:17 UTC
Tested with the PR. The KubePodCrashLooping expr is changed to the rule below. After watching it for a few minutes, the result of the expr is continuous:
    - alert: KubePodCrashLooping
      annotations:
        description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container
          }}) is in waiting state (reason: "CrashLoopBackOff").'
        summary: Pod is crash looping.
      expr: |
        max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics"}[5m]) >= 1
      for: 15m
      labels:
        severity: warning
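
Why wrapping the metric in `max_over_time(...[5m])` gives a continuous result can be sketched as follows. This is a rough model under stated assumptions, not Prometheus code: the helper `max_over_time` below is a hypothetical stand-in that works on a plain list with one sample per minute, and the sample data is invented.

```python
from typing import List, Sequence


def max_over_time(samples: Sequence[int], window: int) -> List[int]:
    """For each step, return the max of the last `window` samples (current one included)."""
    return [max(samples[max(0, i - window + 1): i + 1]) for i in range(len(samples))]


# Hypothetical data: one sample per minute; 1 = waiting with reason CrashLoopBackOff.
# The raw series drops to 0 every 4th minute while the container restarts.
raw = [0 if minute % 4 == 3 else 1 for minute in range(30)]
smoothed = max_over_time(raw, window=5)

print(all(v == 1 for v in raw))       # False: the raw series has gaps
print(all(v >= 1 for v in smoothed))  # True: every 5-sample window contains a 1
```

As long as the gaps are shorter than the window, the smoothed series stays at 1, so a `for: 15m` pending period is no longer interrupted by brief drops.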

Comment 9 Junqi Zhao 2021-10-18 02:50:38 UTC
Checked with 4.10.0-0.nightly-2021-10-16-173656: the fix is in the payload. Together with the test in Comment 6, setting to VERIFIED.

Comment 12 errata-xmlrpc 2022-03-10 16:12:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056