Description of problem:
As reported upstream in https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/645, the KubePodCrashLooping alert may never fire for a pod that is crash looping. This is due to the condition `kube_pod_container_status_waiting{} == 1` potentially flapping as a pod crashes.

Version-Release number of selected component (if applicable):
4.10 nightly

How reproducible:
Frequently but not guaranteed, depending on the cause of the crash, the scrape interval and timings.

Steps to Reproduce:
1. Create the provided 'Crashing deployment' in an openshift-* namespace (a sketch of such a deployment is included under Additional info below)
2. Observe that the pod comes in and out of CrashLoopBackOff state for > 15 minutes

Actual results:
The alert does not fire in the Alertmanager UI when there are gaps in the waiting-state metric.

Expected results:
The alert fires in all cases.

Additional info:
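The attached 'Crashing deployment' is not reproduced in this report; the following is only a minimal sketch of a deployment that crash loops in the way step 1 requires. The name, namespace and image are illustrative placeholders, not taken from the attachment.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: crasher                     # placeholder name
      namespace: openshift-crash-test   # any openshift-* namespace
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: crasher
      template:
        metadata:
          labels:
            app: crasher
        spec:
          containers:
          - name: crasher
            image: registry.access.redhat.com/ubi8/ubi-minimal   # placeholder image
            # Exit immediately with a non-zero code so the kubelet keeps
            # restarting the container and it cycles in and out of
            # CrashLoopBackOff.
            command: ["/bin/sh", "-c", "exit 1"]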
Created attachment 1825263 [details]
Screenshot of the metric that needs to match the AND condition of the alert
Tested with the PR. The KubePodCrashLooping expr is changed to the below; watched for a few minutes and the result for the expr is continuous.

    - alert: KubePodCrashLooping
      annotations:
        description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is in waiting state (reason: "CrashLoopBackOff").'
        summary: Pod is crash looping.
      expr: |
        max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", namespace=~"(openshift-.*|kube-.*|default)",job="kube-state-metrics"}[5m]) >= 1
      for: 15m
      labels:
        severity: warning
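For context, the behavioral difference between the instantaneous condition quoted in the description and the fixed expression is roughly as follows (label selectors trimmed here for readability, not the exact pre-fix rule):

    # Instantaneous waiting-state condition quoted in the description: the
    # series drops out whenever the container briefly leaves the waiting
    # state, which resets the alert's 15m "for" timer so it may never fire.
    kube_pod_container_status_waiting{job="kube-state-metrics"} == 1

    # Expression from the PR: taking the 5m maximum bridges short gaps
    # between scrapes, so the result stays >= 1 continuously and the 15m
    # "for" clause can be satisfied.
    max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", job="kube-state-metrics"}[5m]) >= 1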
Checked with 4.10.0-0.nightly-2021-10-16-173656; the fix is in the payload. Along with Comment 6, setting to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056