Bug 2006767

Summary:

KubePodCrashLooping may not fire

Product:

OpenShift Container Platform

Reporter:

Philip Gough <pgough>

Component:

Monitoring

Assignee:

Philip Gough <pgough>

Status:

CLOSED ERRATA

QA Contact:

Junqi Zhao <juzhao>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

4.10

CC:

amuller, anpicker, aos-bugs, erooth, wking

Target Milestone:

---

Target Release:

4.10.0

Hardware:

All

OS:

All

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Clones:

2013617

Environment:

Last Closed:

2022-03-10 16:12:32 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

2013617

Attachments:

Description	Flags
Screenshot of metric that needs to match AND condition of alert	none

Description Philip Gough 2021-09-22 11:22:31 UTC

Description of problem:

As reported upstream in https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/645 - the KubePodCrashLooping alert may never fire for a pod that is crash looping.

This happens because the condition `kube_pod_container_status_waiting{} == 1` can flap while a pod is crash looping, so the alert expression does not hold continuously for the required duration.
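
To illustrate why a flapping condition can keep the alert from firing, here is a minimal sketch of how a `for` clause behaves, assuming the pending timer restarts whenever the expression returns nothing at an evaluation. The function name `first_firing_time`, the 30-second evaluation interval, and the flapping pattern are all made up for illustration; they are not taken from the cluster.

```python
def first_firing_time(active, eval_interval_s, for_s):
    """Return the evaluation time (in seconds) at which an alert with a
    `for` clause first fires, or None if it never fires.

    active: one boolean per rule evaluation (True = the expr returned a sample).
    """
    pending_since = None
    for i, is_active in enumerate(active):
        now = i * eval_interval_s
        if not is_active:
            # A single empty evaluation resets the pending timer.
            pending_since = None
            continue
        if pending_since is None:
            pending_since = now
        if now - pending_since >= for_s:
            return now
    return None


EVAL_INTERVAL_S = 30   # assumed evaluation interval
FOR_S = 15 * 60        # a 15 minute `for` duration

# Condition holds on every evaluation for 20 minutes.
steady = [True] * 40
# Hypothetical flapping: the condition drops out on every 10th evaluation.
flapping = [(i % 10) != 9 for i in range(40)]

steady_result = first_firing_time(steady, EVAL_INTERVAL_S, FOR_S)
flapping_result = first_firing_time(flapping, EVAL_INTERVAL_S, FOR_S)
print(steady_result)    # 900
print(flapping_result)  # None
```

With the steady series the alert fires after 15 minutes. With the flapping series the longest uninterrupted run is under 5 minutes, so the alert stays pending and never fires.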


Version-Release number of selected component (if applicable):
4.10 nightly

How reproducible:

Frequently, but not guaranteed. It depends on the cause of the crash, the scrape interval, and timing.

Steps to Reproduce:
1. Create the provided 'Crashing deployment' in an openshift-* namespace.
2. Observe that the pod moves in and out of the CrashLoopBackOff state for more than 15 minutes.

Actual results:

In the Alertmanager UI, the alert does not fire if there are gaps in the metric.

Expected results:

The alert fires in all cases where a pod is crash looping.

Additional info:

Comment 3 Philip Gough 2021-09-22 11:25:57 UTC

Created attachment 1825263
Screenshot of metric that needs to match AND condition of alert

Comment 6 Junqi Zhao 2021-10-13 11:51:17 UTC

Tested with the PR. The KubePodCrashLooping expr is changed to the one below. After watching for a few minutes, the result of the expr is continuous:
    - alert: KubePodCrashLooping
      annotations:
        description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container
          }}) is in waiting state (reason: "CrashLoopBackOff").'
        summary: Pod is crash looping.
      expr: |
        max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics"}[5m]) >= 1
      for: 15m
      labels:
        severity: warning
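
As a rough illustration of why wrapping the metric in `max_over_time(...[5m])` gives a continuous result, the sketch below applies a simplified 5 minute max window to a made-up series. The helper `max_over_time`, the 30-second scrape interval, and the sample pattern are assumptions for illustration only. The series is modelled as 0 when the container is not waiting, though an absent sample would behave the same way under the `>= 1` comparison.

```python
def max_over_time(samples, window_s, at_s):
    """Simplified range max: the largest value among samples whose timestamp
    lies in (at_s - window_s, at_s], or None if the window is empty."""
    in_window = [v for t, v in samples if at_s - window_s < t <= at_s]
    return max(in_window) if in_window else None


SCRAPE_INTERVAL_S = 30  # assumed scrape interval
WINDOW_S = 5 * 60       # the [5m] range from the new expr

# Hypothetical gauge: 1 while the container waits in CrashLoopBackOff,
# 0 on every 6th scrape while it briefly runs before crashing again.
samples = [(i * SCRAPE_INTERVAL_S, 0 if i % 6 == 5 else 1) for i in range(40)]

# Raw condition (value >= 1) evaluated at each scrape time.
raw_active = [v >= 1 for _, v in samples]
# Smoothed condition: max over the trailing window >= 1.
smoothed_active = [
    max_over_time(samples, WINDOW_S, t) >= 1 for t, _ in samples
]

print(all(raw_active))       # False: the raw condition has gaps
print(all(smoothed_active))  # True: the windowed max holds at every evaluation
```

As long as the gaps are shorter than the window, the windowed max stays at 1, so the `for: 15m` timer is not interrupted.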

Comment 9 Junqi Zhao 2021-10-18 02:50:38 UTC

Checked with 4.10.0-0.nightly-2021-10-16-173656. The fix is in the payload. Based on this and Comment 6, setting to VERIFIED.

Comment 12 errata-xmlrpc 2022-03-10 16:12:32 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056