Bug 1943667

Summary: KubeDaemonSetRolloutStuck fires during upgrades too often because it does not accurately detect progress
Product: OpenShift Container Platform
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Target Milestone: ---
Target Release: 4.8.0
Reporter: Clayton Coleman <ccoleman>
Assignee: Simon Pasquier <spasquie>
QA Contact: Junqi Zhao <juzhao>
Docs Contact:
CC: alegrand, anpicker, erooth, kakkoyun, lcosic, pkrupa, scuppett, spasquie, travi, wking
Keywords: Upgrades
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: on smaller clusters, DaemonSet instances are rolled out at a steady rate, meaning that the number of unavailable instances stays the same for the duration of the upgrade. Consequence: KubeDaemonSetRolloutStuck fires during upgrades. Fix: the "for" duration of the alert has been increased to 30 minutes. Result: KubeDaemonSetRolloutStuck shouldn't fire before the upgrade completes.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-27 22:56:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Clayton Coleman 2021-03-26 18:56:11 UTC
KubeDaemonSetRolloutStuck is firing during a large number of small-cluster upgrades because it can't differentiate between:

1. stuck on a set of nodes
2. an efficient and orderly rollout where we smoothly and rapidly upgrade all the nodes, but the same number of pods is always unavailable

The alert KubeDaemonSetRolloutStuck looks at the number of scheduled/ready/unavailable pods and also checks whether those numbers are changing.  However, in small and medium clusters a smoothly progressing rollout will continuously have a few pods down:

6 nodes, 2 updated at a time (1 master / 1 worker):

Master A / Worker A get drained
Master A / Worker A get rebooted
2 Pods in DS are down, daemonset looks like scheduled: 6, ready/available: 4
Master A / Worker A start up again, pods go ready
Master B / Worker B get drained fast and rebooted within 30-45s
2 Pods in DS are down, daemonset looks like scheduled: 6, ready/available: 4
...
Last upgrade completes, daemonset looks like scheduled: 6, ready/available: 6
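The arithmetic of the timeline above can be sketched as a toy model (illustrative Python, not OCP code; the 6-node / 2-per-chunk numbers mirror the example):

```python
# Toy model of the rollout described above: 6 nodes upgraded in chunks
# of 2, with each chunk's DaemonSet pods down while that chunk reboots.
def unavailable_per_step(total_nodes, chunk_size):
    """Return the DaemonSet's 'unavailable' count sampled once per chunk."""
    steps = []
    for _ in range(0, total_nodes, chunk_size):
        steps.append(chunk_size)  # this chunk's pods are down right now
    steps.append(0)               # rollout done, everything ready again
    return steps

samples = unavailable_per_step(6, 2)
# Mid-rollout the metric is flat at 2: an alert that waits for the
# unavailable count to *change* cannot tell this from a stuck rollout.
assert samples == [2, 2, 2, 0]
```

The flat mid-rollout value is exactly why "look for changes" fails as a stuck-ness signal on small clusters.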

This error is happening in about 25% of CI runs: we are SO GOOD at upgrading quickly and efficiently that the alert's assumption (that the number of down pods will change) does not hold, and at about 4-5m per chunk of nodes x 3 chunks we end up hitting the alert.

In the short run, the alert needs to wait significantly longer (30m?); in the long run, it needs to be based on DaemonSet conditions (workload team priority) and on the progressing-deadline behavior that StatefulSets and Deployments have (where "no progress made in X min" is calculated by the operator, not by metrics).
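For reference, the upstream kubernetes-mixin expression behind this alert has roughly the following shape (a simplified sketch for illustration, not the exact shipped rule; the metric names come from kube-state-metrics, and the real expression carries additional clauses and label matchers):

```promql
(
  # counts disagree: some pods are unavailable or not yet updated...
  kube_daemonset_status_number_available
    != kube_daemonset_status_desired_number_scheduled
  or
  kube_daemonset_status_updated_number_scheduled
    != kube_daemonset_status_desired_number_scheduled
)
and
# ...and the updated count has not changed recently -- the "looks for
# changes" part that a smooth, constant-unavailability rollout defeats
changes(kube_daemonset_status_updated_number_scheduled[5m]) == 0
```

Raising the alert's "for" duration only requires this condition to hold longer before firing; it does not change the detection logic itself.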

Comment 1 Clayton Coleman 2021-03-26 18:57:35 UTC
Debugged in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1375374519794929664

The CRI-O fixes to wait for correct shutdown increased the time a reboot of a happy node took from 3m to 4-5m and pushed us over the 15m threshold.

Note that on larger clusters the odds that all the reboots line up go down, so there should be changes in the numbers.  The increase in "for" is just a hack and reduces the probability of it firing.

Comment 5 Junqi Zhao 2021-04-01 11:19:08 UTC
Tested with 4.8.0-0.nightly-2021-03-31-211319; the annotations part should also be updated to say "at least 30 minutes":
# oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -oyaml | grep KubeDaemonSetRolloutStuck -A29
      - alert: KubeDaemonSetRolloutStuck
        annotations:
          description: DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has not
            finished or progressed for at least 15 minutes.
          summary: DaemonSet rollout is stuck.
        expr:  *******
        for: 30m

Comment 8 Junqi Zhao 2021-04-06 01:03:11 UTC
The issue is fixed with 4.8.0-0.nightly-2021-04-05-174735:
# oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -oyaml | grep KubeDaemonSetRolloutStuck
      - alert: KubeDaemonSetRolloutStuck
        annotations:
          description: DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has not
            finished or progressed for at least 30 minutes.
          summary: DaemonSet rollout is stuck.
        ...
        for: 30m
        labels:
          severity: warning

Comment 11 errata-xmlrpc 2021-07-27 22:56:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438