Bug 1824988 - [4.3 upgrade][alert] KubeDaemonSetRolloutStuck: Only 96.43% of the desired Pods of DaemonSet openshift-dns/dns-default are scheduled and ready.
Summary: [4.3 upgrade][alert] KubeDaemonSetRolloutStuck: Only 96.43% of the desired Pods of DaemonSet openshift-dns/dns-default are scheduled and ready.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-04-16 19:32 UTC by Hongkai Liu
Modified: 2020-10-27 15:58 UTC
CC: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Some alerts did not have the right severity set or were incorrect, causing upgrade issues. The fixes include:
- Adjusting the severity of many cause-based alerts from critical to warning
- Adjusting KubeStatefulSetUpdateNotRolledOut and KubeDaemonSetRolloutStuck
- Removing KubeAPILatencyHigh and KubeAPIErrorsHigh
Clone Of:
Environment:
Last Closed: 2020-10-27 15:57:47 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
- GitHub kubernetes-monitoring/kubernetes-mixin pull 476 (closed): alerts: Improve DaemonSet rollout alert taking progress into account (last updated 2021-02-10 14:43:52 UTC)
- GitHub openshift/cluster-monitoring-operator pull 898 (closed): Bug 1846805: Remove kube-mixin direct dependancy (last updated 2021-02-10 14:43:52 UTC)
- Red Hat Product Errata RHBA-2020:4196 (last updated 2020-10-27 15:58:06 UTC)

Internal Links: 1824981

Description Hongkai Liu 2020-04-16 19:32:48 UTC
During an upgrade of a cluster in the CI build farm, we saw a sequence of alerts and failure messages from clusterversion.

oc --context build01 adm upgrade --allow-explicit-upgrade --to-image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-04-13-190424 --force=true

Eventually the upgrade completed successfully (which is so nice).
But those alerts and messages are frightening.

I would like to create a bug for each of them so we can feel better about the next upgrade.

https://coreos.slack.com/archives/CHY2E1BL4/p1587060400457400


[FIRING:1] KubeDaemonSetRolloutStuck kube-state-metrics (dns-default https-main 10.128.237.134:8443 openshift-dns kube-state-metrics-66dfc9f94f-qdp5d openshift-monitoring/k8s kube-state-metrics critical)
Only 96.43% of the desired Pods of DaemonSet openshift-dns/dns-default are scheduled and ready.

must-gather after upgrade:
http://file.rdu.redhat.com/~hongkliu/test_result/upgrade/upgrade0416/aaa/
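
For reference, the percentage in the alert message is derived from kube-state-metrics DaemonSet metrics. A query along these lines reproduces the ratio in the cluster's Prometheus (metric names are from kube-state-metrics; the label selectors are assumptions for this cluster, shown only as a sketch):

    100
      * kube_daemonset_status_number_ready{namespace="openshift-dns", daemonset="dns-default"}
      / kube_daemonset_status_desired_number_scheduled{namespace="openshift-dns", daemonset="dns-default"}

During the rolling node reboots of an upgrade this dips below 100% whenever a node carrying a dns-default pod is down.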

Comment 1 Dan Mace 2020-04-21 14:15:23 UTC
Sorry, I'm not understanding what specific bug is being reported here.

After the upgrade, DNS reports healthy:

http://file.rdu.redhat.com/~hongkliu/test_result/upgrade/upgrade0416/aaa/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-a37e5bef81b80c86ab3864c3395d69c0867ab0aa58e1150eceb85be8951def49/cluster-scoped-resources/operator.openshift.io/dnses/default.yaml

The kube-state-metrics error looks related to monitoring (in the openshift-monitoring namespace) but it's not clear whether the alert was still firing after the upgrade or how to check that.

What specifically do you think we should be doing with DNS here? It looks like the upgrade succeeded and everything's okay. What's the DNS bug, and what justifies the "high" severity?

Comment 2 W. Trevor King 2020-04-21 18:22:00 UTC
[1] has:

    updateStrategy:
      rollingUpdate:
        maxUnavailable: 1
      type: RollingUpdate
  status:
    currentNumberScheduled: 29
    desiredNumberScheduled: 29
    numberAvailable: 29
    numberMisscheduled: 0
    numberReady: 29
    observedGeneration: 6
    updatedNumberScheduled: 29

And 28/29 = 96.55...% (which is not quite the 27/28 = 96.43...% in the alert message).  So I'm pretty sure this alert was "one of the nodes is down for longer than I'd have liked, and its dns-default pod cannot be scheduled".  I think the solution is tuning the alert to allow for expected node-reboot times.  And possibly increasing maxUnavailable, because the machine-config logic reboots control plane and compute node sets in parallel, right?  So similar to [2].

[1]: http://file.rdu.redhat.com/~hongkliu/test_result/upgrade/upgrade0416/aaa/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-a37e5bef81b80c86ab3864c3395d69c0867ab0aa58e1150eceb85be8951def49/namespaces/openshift-dns/apps/daemonsets.yaml
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1824986#c2
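
For illustration only, letting the DaemonSet tolerate more simultaneously-unavailable pods would look roughly like this in the spec quoted above (the 10% value is hypothetical, and the DNS operator manages this object, so such a change would have to land in the operator rather than be patched by hand):

    updateStrategy:
      rollingUpdate:
        maxUnavailable: 10%
      type: RollingUpdate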

Comment 3 Hongkai Liu 2020-04-27 15:45:42 UTC
Hey Dan,

Yes, the upgrade was successful and no such alerts fired after the upgrade finished.

This bug, like the others linked to https://bugzilla.redhat.com/show_bug.cgi?id=1824981, is about the principle that no critical alerts should fire during an upgrade unless ci-admins need to act on them.

I am not sure which team owns the alert. Apologies if it does not belong to the DNS component.

Comment 4 W. Trevor King 2020-04-27 18:36:57 UTC
> This bug, as the others linked to https://bugzilla.redhat.com/show_bug.cgi?id=1824981, is about that no critical alerts should be fired during the upgrade unless ci-admins need to act on it.

Better link for generic "no critical alerts should fire during healthy updates" is bug 1828427.  That alert fires after 15m [1], and yeah it looks like the monitoring operator is the current owner.  Moving this over to them, so they can talk to the MCO team and decide if 15m is appropriate given our measured node-reboot timing.  Looks like it's been 15m since at least [2], so the reasoning behind the existing 15m choice may be lost to time.

[1]: https://github.com/openshift/cluster-monitoring-operator/blob/b9d67a775b31dbe008885cfb19889266007c0ef0/assets/prometheus-k8s/rules.yaml#L1185-L1195
[2]: https://github.com/openshift/cluster-monitoring-operator/commit/140059365bcf4ef592dba2855b1482a1034fd185#diff-1dca87d186c04a487d72e52ab0b4dde5R198-R202
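
For context, the rule linked in [1] is roughly of this shape (a paraphrase, not the exact 4.3 wording): it fires once the ready/desired ratio has stayed below 100% for 15 minutes, at critical severity:

    - alert: KubeDaemonSetRolloutStuck
      expr: |
        kube_daemonset_status_number_ready{job="kube-state-metrics"}
          /
        kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} < 1.00
      for: 15m
      labels:
        severity: critical

A single unschedulable dns-default pod on a rebooting node is enough to keep that expression true, which matches the behavior seen during the upgrade.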

Comment 10 W. Trevor King 2020-06-15 20:32:19 UTC
Reopening.  Silencing alerts during updates is ok, but is orthogonal to actually fixing the alerting condition or twitchy alert.  Also, this is a critical alert, and silencing critical alerts is less likely than silencing warning or info alerts.  I still think the issue here is that the alert logic should be tuned so that it doesn't fire when we are slow to push DaemonSet pods onto nodes during rolling reboots of large machine pools.

Comment 16 Frederic Branczyk 2020-07-23 13:30:16 UTC
Opened a PR upstream to introduce some improvements. https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/476
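
For reference, the upstream change reworks the expression so the alert only fires when the DaemonSet is out of shape and the rollout has stopped making progress. The result is roughly of this form (paraphrased from the PR, not a verbatim copy; metric names, duration, and severity shown here are illustrative and may differ by version):

    - alert: KubeDaemonSetRolloutStuck
      expr: |
        (
          kube_daemonset_status_current_number_scheduled{job="kube-state-metrics"}
            != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
          or
          kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} != 0
          or
          kube_daemonset_status_updated_number_scheduled{job="kube-state-metrics"}
            != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
          or
          kube_daemonset_status_number_available{job="kube-state-metrics"}
            != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
        ) and (
          changes(kube_daemonset_status_updated_number_scheduled{job="kube-state-metrics"}[5m]) == 0
        )
      for: 15m
      labels:
        severity: warning

The key difference is the trailing "and changes(...) == 0" clause: as long as the rollout keeps updating pods (as it does while nodes reboot one by one), the alert stays quiet.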

Comment 17 Frederic Branczyk 2020-07-28 08:45:20 UTC
The upstream PR has merged; I will now work on getting this into OpenShift.

Comment 24 errata-xmlrpc 2020-10-27 15:57:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

