Bug 1824988

Summary: [4.3 upgrade][alert] KubeDaemonSetRolloutStuck: Only 96.43% of the desired Pods of DaemonSet openshift-dns/dns-default are scheduled and ready.
Product: OpenShift Container Platform
Component: Monitoring
Version: 4.3.0
Target Release: 4.6.0
Target Milestone: ---
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Keywords: Reopened, Upgrades
Reporter: Hongkai Liu <hongkliu>
Assignee: Sergiusz Urbaniak <surbania>
QA Contact: Junqi Zhao <juzhao>
CC: alegrand, anpicker, aos-bugs, ccoleman, erooth, kakkoyun, lcosic, mloibl, pkrupa, surbania, wking
Doc Type: Bug Fix
Doc Text:
Some alerts did not have the right severity set or were incorrect, causing upgrade issues. The fixes include:
- Adjusts severity levels of many alerts from critical to warning, as they were cause-based alerts
- Adjusts KubeStatefulSetUpdateNotRolledOut and KubeDaemonSetRolloutStuck
- Removes KubeAPILatencyHigh and KubeAPIErrorsHigh
Last Closed: 2020-10-27 15:57:47 UTC
Type: Bug

Description Hongkai Liu 2020-04-16 19:32:48 UTC
During an upgrade of a cluster in the CI build farm, we saw a sequence of alerts and failure messages from clusterversion.

oc --context build01 adm upgrade --allow-explicit-upgrade --to-image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-04-13-190424 --force=true

Eventually the upgrade completed successfully (which is so nice).
But those alerts and messages are quite alarming.

I would like to create a bug for each of them so we can feel better about the next upgrade.

https://coreos.slack.com/archives/CHY2E1BL4/p1587060400457400


[FIRING:1] KubeDaemonSetRolloutStuck kube-state-metrics (dns-default https-main 10.128.237.134:8443 openshift-dns kube-state-metrics-66dfc9f94f-qdp5d openshift-monitoring/k8s kube-state-metrics critical)
Only 96.43% of the desired Pods of DaemonSet openshift-dns/dns-default are scheduled and ready.
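
For reference, the rule behind this message was, at the time, roughly the following (a reconstruction for context, not a verbatim copy; the shipped rule lives in the cluster-monitoring-operator assets):

    - alert: KubeDaemonSetRolloutStuck
      expr: |
        kube_daemonset_status_number_ready{job="kube-state-metrics"}
          /
        kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} * 100 < 100
      for: 15m
      labels:
        severity: critical

So a single node staying down during a rolling reboot keeps the ready/desired ratio below 100% and, after 15 minutes, fires a critical alert.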

must-gather after upgrade:
http://file.rdu.redhat.com/~hongkliu/test_result/upgrade/upgrade0416/aaa/

Comment 1 Dan Mace 2020-04-21 14:15:23 UTC
Sorry, I'm not understanding what specific bug is being reported here.

After the upgrade, DNS reports healthy:

http://file.rdu.redhat.com/~hongkliu/test_result/upgrade/upgrade0416/aaa/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-a37e5bef81b80c86ab3864c3395d69c0867ab0aa58e1150eceb85be8951def49/cluster-scoped-resources/operator.openshift.io/dnses/default.yaml

The kube-state-metrics error looks related to monitoring (in the openshift-monitoring namespace) but it's not clear whether the alert was still firing after the upgrade or how to check that.
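
One way to check (a sketch; it assumes curl is available inside the prometheus container and that the in-pod API on localhost:9090 is reachable) is to ask Prometheus for currently active alerts:

    oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- \
      curl -s http://localhost:9090/api/v1/alerts | grep -c KubeDaemonSetRolloutStuck

A zero count after the upgrade would suggest the alert resolved on its own.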

What specifically do you think we should be doing with DNS here? It looks like the upgrade succeeded and everything's okay. What's the DNS bug, and what justifies the "high" severity?

Comment 2 W. Trevor King 2020-04-21 18:22:00 UTC
[1] has:

    updateStrategy:
      rollingUpdate:
        maxUnavailable: 1
      type: RollingUpdate
  status:
    currentNumberScheduled: 29
    desiredNumberScheduled: 29
    numberAvailable: 29
    numberMisscheduled: 0
    numberReady: 29
    observedGeneration: 6
    updatedNumberScheduled: 29

And 28/29 = 96.55...% (which is not quite 27/28 = 96.43...%).  So I'm pretty sure this alert was "one of the nodes is down for longer than I'd have liked, and its dns-default pod cannot be scheduled".  I think the solution is tuning the alert to allow for expected node-reboot times.  And possibly increasing maxUnavailable, because the machine-config logic reboots the control-plane and compute node sets in parallel, right?  So similar to [2].
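
For illustration only, a more reboot-tolerant update strategy might look something like this (a sketch; whether the DNS operator should actually raise this, and to what value, is its own discussion):

    updateStrategy:
      type: RollingUpdate
      rollingUpdate:
        # Hypothetical value: tolerate roughly 10% of the DaemonSet pods being
        # unavailable at once, instead of a single pod, so parallel node reboots
        # don't stall the rollout long enough to trip the alert.
        maxUnavailable: "10%"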

[1]: http://file.rdu.redhat.com/~hongkliu/test_result/upgrade/upgrade0416/aaa/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-a37e5bef81b80c86ab3864c3395d69c0867ab0aa58e1150eceb85be8951def49/namespaces/openshift-dns/apps/daemonsets.yaml
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1824986#c2

Comment 3 Hongkai Liu 2020-04-27 15:45:42 UTC
Hey Dan,

Yes, the upgrade was successful, and no such alerts fired after the upgrade process.

This bug, like the others linked to https://bugzilla.redhat.com/show_bug.cgi?id=1824981, is about the expectation that no critical alerts should fire during the upgrade unless ci-admins need to act on them.

I am not sure which team owns the alert. Apologies if it does not belong to the DNS component.

Comment 4 W. Trevor King 2020-04-27 18:36:57 UTC
> This bug, like the others linked to https://bugzilla.redhat.com/show_bug.cgi?id=1824981, is about the expectation that no critical alerts should fire during the upgrade unless ci-admins need to act on them.

A better link for the generic "no critical alerts should fire during healthy updates" issue is bug 1828427.  That alert fires after 15m [1], and yeah, it looks like the monitoring operator is the current owner.  Moving this over to them, so they can talk to the MCO team and decide whether 15m is appropriate given our measured node-reboot timing.  It looks like it's been 15m since at least [2], so the reasoning behind the existing 15m choice may be lost to time.
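
As a starting point for that measurement, something along these lines (a sketch, assuming the kube-state-metrics node-condition series are scraped) gives a rough per-node count of minutes spent not Ready over the last hour:

    # Approximate minutes in the last hour that each node's Ready condition was
    # not "true" (i.e. false or unknown), as a rough proxy for reboot duration.
    60 - 60 * avg_over_time(kube_node_status_condition{condition="Ready", status="true"}[1h])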

[1]: https://github.com/openshift/cluster-monitoring-operator/blob/b9d67a775b31dbe008885cfb19889266007c0ef0/assets/prometheus-k8s/rules.yaml#L1185-L1195
[2]: https://github.com/openshift/cluster-monitoring-operator/commit/140059365bcf4ef592dba2855b1482a1034fd185#diff-1dca87d186c04a487d72e52ab0b4dde5R198-R202

Comment 10 W. Trevor King 2020-06-15 20:32:19 UTC
Reopening.  Silencing alerts during updates is OK, but it is orthogonal to actually fixing the alerting condition or the twitchy alert.  Also, this is a critical alert, and critical alerts are less likely to be silenced than warning or info alerts.  I still think the issue here is that the alert logic should be tuned so that it doesn't fire when we are slow to push DaemonSet pods onto nodes during rolling reboots of large machine pools.

Comment 16 Frederic Branczyk 2020-07-23 13:30:16 UTC
Opened a PR upstream to introduce some improvements. https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/476
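
The general direction of the rework (a sketch of the approach only; the exact expression, duration, and severity are in the PR) is to replace the single ready/desired ratio with explicit "rollout is actually stuck" conditions and to give the rollout more time before firing, at a lower severity:

    - alert: KubeDaemonSetRolloutStuck
      expr: |
        (
          kube_daemonset_status_current_number_scheduled{job="kube-state-metrics"}
            != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
        ) or (
          kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} != 0
        ) or (
          kube_daemonset_updated_number_scheduled{job="kube-state-metrics"}
            != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
        ) or (
          kube_daemonset_status_number_available{job="kube-state-metrics"}
            != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
        )
      for: 30m  # illustrative; the merged rule defines the actual window
      labels:
        severity: warning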

Comment 17 Frederic Branczyk 2020-07-28 08:45:20 UTC
The upstream PR has merged; I will now work on getting this into OpenShift.

Comment 24 errata-xmlrpc 2020-10-27 15:57:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196