KubeDaemonSetRolloutStuck is firing during a large number of small-cluster upgrades because it cannot really differentiate between:

1. a rollout that is stuck on a set of nodes, and
2. an efficient and orderly rollout where we smoothly and rapidly upgrade all the nodes, but the same number of pods is unavailable at any given moment.

The alert looks at the number of scheduled/ready/unavailable pods and also checks whether those numbers are changing. However, in small/medium clusters a smoothly progressing rollout will continuously have a few pods down. With 6 nodes, 2 updated at a time (1 master / 1 worker):

- Master A / Worker A get drained
- Master A / Worker A get rebooted
- 2 pods in the DaemonSet are down; the DaemonSet looks like scheduled: 6, ready/available: 4
- Master A / Worker A start up again, pods go ready
- Master B / Worker B get drained fast and rebooted within 30-45s
- 2 pods in the DaemonSet are down; the DaemonSet looks like scheduled: 6, ready/available: 4
- ...
- the last upgrade completes; the DaemonSet looks like scheduled: 6, ready/available: 6

This error is happening in about 25% of CI runs because we upgrade quickly and efficiently enough that the alert's assumption (that the number of down pods will fluctuate) does not hold, and at roughly 4-5m per chunk of nodes times 3 chunks we end up hitting the alert. In the short run the alert needs to wait significantly longer (30m?); in the long run it needs to be based on DaemonSet conditions (workload team priority) and the Progressing-deadline behavior that Deployments/StatefulSets have, where "no progress made in X minutes" is calculated by the operator rather than derived from metrics.
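For reference, here is a minimal sketch of the shape of rule being discussed, assuming the upstream kube-state-metrics DaemonSet metrics (kube_daemonset_status_desired_number_scheduled, kube_daemonset_updated_number_scheduled, kube_daemonset_status_number_available) and the proposed 30m "for" window. The exact expression shipped in 4.8 is redacted in the rule output quoted later in this bug, so treat this as illustrative only, not as the shipped rule:

  - alert: KubeDaemonSetRolloutStuck
    annotations:
      description: DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has not finished or progressed for at least 30 minutes.
      summary: DaemonSet rollout is stuck.
    # Illustrative expression only; the shipped 4.8 expr is redacted below.
    expr: |
      (
        # rollout incomplete: not every desired pod is updated and available
        kube_daemonset_updated_number_scheduled{job="kube-state-metrics"}
          != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
        or
        kube_daemonset_status_number_available{job="kube-state-metrics"}
          != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
      )
      and
      # "no progress": the updated-pod count has not moved in the last 5 minutes
      changes(kube_daemonset_updated_number_scheduled{job="kube-state-metrics"}[5m]) == 0
    for: 30m
    labels:
      severity: warning

The changes(...) guard is the "looks for changes" assumption described above: in the reboot scenario the availability term can stay non-zero for the whole upgrade even though nothing is stuck, which is why the short-term fix is a longer "for" and the long-term fix is operator-computed conditions.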
Debugged in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1375374519794929664

The cri-o fixes to wait for a correct shutdown increased the time a reboot of a happy node takes from 3m to 4-5m (three chunks at 4-5m each is roughly 12-15m of continuous unavailability), which pushed us over the 15m threshold. Note that on larger clusters the odds that all the reboots line up go down, so the pod counts should show changes there. The increase in "for" is just a hack; it only reduces the probability of the alert firing.
Tested with 4.8.0-0.nightly-2021-03-31-211319. The "for" duration is now 30m, but the description annotation should also be updated to say "at least 30 minutes":

# oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -oyaml | grep KubeDaemonSetRolloutStuck -A29
  - alert: KubeDaemonSetRolloutStuck
    annotations:
      description: DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has not finished or progressed for at least 15 minutes.
      summary: DaemonSet rollout is stuck.
    expr: *******
    for: 30m
The issue is fixed with 4.8.0-0.nightly-2021-04-05-174735:

# oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -oyaml | grep KubeDaemonSetRolloutStuck
  - alert: KubeDaemonSetRolloutStuck
    annotations:
      description: DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has not finished or progressed for at least 30 minutes.
      summary: DaemonSet rollout is stuck.
    ...
    for: 30m
    labels:
      severity: warning
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438