Bug 1943667
Summary: | KubeDaemonSetRolloutStuck fires during upgrades too often because it does not accurately detect progress | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
Component: | Monitoring | Assignee: | Simon Pasquier <spasquie> |
Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 4.8 | CC: | alegrand, anpicker, erooth, kakkoyun, lcosic, pkrupa, scuppett, spasquie, travi, wking |
Target Milestone: | --- | Keywords: | Upgrades |
Target Release: | 4.8.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause: On smaller clusters, daemonset instances are rolled out at a regular rate, so the number of unavailable instances stays the same for the duration of the upgrade.
Consequence: KubeDaemonSetRolloutStuck fires during upgrades.
Fix: The "for" duration of the alert has been increased to 30 minutes.
Result: KubeDaemonSetRolloutStuck shouldn't fire before the upgrade finishes.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2021-07-27 22:56:00 UTC | Type: | Bug |
Description
Clayton Coleman
2021-03-26 18:56:11 UTC
Debugged in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1375374519794929664

The cri-o fixes to wait to shut down correctly increased the time a reboot of a happy node takes from 3m to 4-5m, which pushed us over the 15m threshold.

Note that on larger clusters the odds that all the reboots line up go down, and so there should be changes in numbers. The increase in "for" is just a hack and reduces the probability of it firing.

Tested with 4.8.0-0.nightly-2021-03-31-211319. The annotations part should also be updated to "at least 30 minutes":

    # oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -oyaml | grep KubeDaemonSetRolloutStuck -A29
    - alert: KubeDaemonSetRolloutStuck
      annotations:
        description: DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has not finished or progressed for at least 15 minutes.
        summary: DaemonSet rollout is stuck.
      expr: *******
      for: 30m

The issue is fixed with 4.8.0-0.nightly-2021-04-05-174735:

    # oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -oyaml | grep KubeDaemonSetRolloutStuck
    - alert: KubeDaemonSetRolloutStuck
      annotations:
        description: DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has not finished or progressed for at least 30 minutes.
        summary: DaemonSet rollout is stuck.
      ...
      for: 30m
      labels:
        severity: warning

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
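
The mismatch caught during verification (the rule's "for" was already 30m while the description still said "at least 15 minutes") is the kind of drift a small check can flag. The following is a minimal sketch only: the helper names are hypothetical and not part of any OpenShift or Prometheus tooling, the duration parser handles only simple values such as 30m or 1h, and the description strings are copied from the rule output quoted in this bug.

```python
import re

# Hypothetical helpers for illustration; not part of any OpenShift/Prometheus tool.
_UNIT_MINUTES = {"m": 1, "h": 60}


def for_minutes(duration: str) -> int:
    """Convert a simple duration such as '30m' or '1h' to minutes.

    Assumption: only a single integer followed by 'm' or 'h' is supported.
    """
    match = re.fullmatch(r"(\d+)([mh])", duration.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * _UNIT_MINUTES[match.group(2)]


def described_minutes(description: str):
    """Return N from the phrase 'at least N minutes', or None if absent."""
    match = re.search(r"at least (\d+) minutes", description)
    return int(match.group(1)) if match else None


def annotation_matches_for(description: str, duration: str) -> bool:
    """True when the description quotes the same duration as the 'for' clause."""
    return described_minutes(description) == for_minutes(duration)


# Description annotations as quoted from the two nightly builds in this bug.
DESC_BEFORE = (
    "DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} "
    "has not finished or progressed for at least 15 minutes."
)
DESC_AFTER = (
    "DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} "
    "has not finished or progressed for at least 30 minutes."
)

if __name__ == "__main__":
    print(annotation_matches_for(DESC_BEFORE, "30m"))  # False: text lags behind "for"
    print(annotation_matches_for(DESC_AFTER, "30m"))   # True: text and "for" agree
```

A check like this only compares the wording against the configured duration; it says nothing about whether 30 minutes is the right threshold for a given cluster size.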