Bug 1943667 - KubeDaemonSetRolloutStuck fires during upgrades too often because it does not accurately detect progress
Summary: KubeDaemonSetRolloutStuck fires during upgrades too often because it does not accurately detect progress
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-26 18:56 UTC by Clayton Coleman
Modified: 2021-11-04 07:03 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: On smaller clusters, DaemonSet instances are rolled out at a steady rate, so the number of unavailable instances stays constant for the duration of the upgrade. Consequence: KubeDaemonSetRolloutStuck fires during upgrades. Fix: The "for" duration of the alert has been increased to 30 minutes. Result: KubeDaemonSetRolloutStuck should no longer fire before the upgrade completes.
Clone Of:
Environment:
Last Closed: 2021-07-27 22:56:00 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift cluster-monitoring-operator pull 1094 (open): Bug 1943667: increase for duration of KubeDaemonSetRolloutStuck - 2021-03-30 07:30:52 UTC
Github openshift cluster-monitoring-operator pull 1100 (open): Bug 1943667: fix alert description - 2021-04-01 14:33:43 UTC
Red Hat Product Errata RHSA-2021:2438 - 2021-07-27 22:56:36 UTC

Description Clayton Coleman 2021-03-26 18:56:11 UTC
KubeDaemonSetRolloutStuck is firing during a large number of small cluster upgrades because it can't really differentiate between:

1. stuck on a set of nodes
2. an efficient and orderly rollout where we smoothly and rapidly upgrade all the nodes, but the same number of pods is always unavailable

The alert KubeDaemonSetRolloutStuck looks at the number of scheduled/ready/unavailable pods and also looks for changes.  However, in small/medium clusters a smoothly progressing rollout will continuously have a few pods down:

6 nodes, 2 updated at a time (1 master / 1 worker):

Master A / Worker A get drained
Master A / Worker A get rebooted
2 Pods in DS are down, daemonset looks like scheduled: 6, ready/available: 4
Master A / Worker A start up again, pods go ready
Master B / Worker B get drained fast and rebooted within 30-45s
2 Pods in DS are down, daemonset looks like scheduled: 6, ready/available: 4
...
Last upgrade completes, daemonset looks like scheduled: 6, ready/available: 6

This error is happening in about 25% of CI runs because we are SO GOOD at upgrading quickly and efficiently that the alert's assumption (that the number of down pods will change) does not hold, and at roughly 4-5m per chunk of nodes across 3 chunks we end up hitting the alert.

The alert needs to be changed to wait significantly longer (30m?) in the short run; in the long run it needs to be based on DaemonSet conditions (workload team priority) and on the Progressing-deadline behavior that StatefulSets and Deployments have (no progress made in X minutes is calculated by the operator, not by metrics).
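For reference, a simplified sketch of the kind of expression involved (the actual expr is redacted in the rule dump below; this only approximates the upstream kubernetes-mixin rule using standard kube-state-metrics series, so treat it as illustrative rather than the literal rule shipped in 4.8):

      - alert: KubeDaemonSetRolloutStuck
        expr: |
          # "not fully rolled out": some pods unavailable or not yet updated
          (
            kube_daemonset_status_number_available{job="kube-state-metrics"}
              != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
            or
            kube_daemonset_status_updated_number_scheduled{job="kube-state-metrics"}
              != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
          )
          and
          # "no progress": the updated-pod count has not moved recently
          changes(kube_daemonset_status_updated_number_scheduled{job="kube-state-metrics"}[5m]) == 0
        for: 15m

The only progress signal in an expression of this shape is changes() on the updated-pod count; a node-drain rollout that keeps a constant two pods unavailable satisfies the left-hand side continuously without ever resetting the "for" timer, which is exactly the failure mode described above.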

Comment 1 Clayton Coleman 2021-03-26 18:57:35 UTC
Debugged in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1375374519794929664

The cri-o fixes to wait for a correct shutdown increased the time a reboot of a healthy node takes from 3m to 4-5m and pushed us over the 15m threshold.

Note that on larger clusters the odds that all the reboots line up go down, so the counts should change over time. The increase in "for" is just a hack that reduces the probability of the alert firing.
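The shape of that change (per the linked PRs and the rule dumps in the comments below; sketched here, not quoted from the actual patch) is simply a larger hold duration plus a matching annotation:

      - alert: KubeDaemonSetRolloutStuck
        annotations:
          description: DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has not
            finished or progressed for at least 30 minutes.
          summary: DaemonSet rollout is stuck.
        ...
        for: 30m         # previously 15m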

Comment 5 Junqi Zhao 2021-04-01 11:19:08 UTC
Tested with 4.8.0-0.nightly-2021-03-31-211319; the description annotation should also be updated to say "at least 30 minutes":
# oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -oyaml | grep KubeDaemonSetRolloutStuck -A29
      - alert: KubeDaemonSetRolloutStuck
        annotations:
          description: DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has not
            finished or progressed for at least 15 minutes.
          summary: DaemonSet rollout is stuck.
        expr:  *******
        for: 30m

Comment 8 Junqi Zhao 2021-04-06 01:03:11 UTC
The issue is fixed with 4.8.0-0.nightly-2021-04-05-174735:
# oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -oyaml | grep KubeDaemonSetRolloutStuck
      - alert: KubeDaemonSetRolloutStuck
        annotations:
          description: DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has not
            finished or progressed for at least 30 minutes.
          summary: DaemonSet rollout is stuck.
        ...
        for: 30m
        labels:
          severity: warning

Comment 11 errata-xmlrpc 2021-07-27 22:56:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

