Bug 1824988 - [4.3 upgrade][alert] KubeDaemonSetRolloutStuck: Only 96.43% of the desired Pods of DaemonSet openshift-dns/dns-default are scheduled and ready.
Summary: [4.3 upgrade][alert] KubeDaemonSetRolloutStuck: Only 96.43% of the desired Pods of DaemonSet openshift-dns/dns-default are scheduled and ready.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-04-16 19:32 UTC by Hongkai Liu
Modified: 2020-10-27 15:58 UTC
CC: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Some alerts did not have the right severity set or were incorrect, causing upgrade issues. The fixes include:
- Adjusting the severity of many cause-based alerts from critical to warning
- Adjusting KubeStatefulSetUpdateNotRolledOut and KubeDaemonSetRolloutStuck
- Removing KubeAPILatencyHigh and KubeAPIErrorsHigh
Clone Of:
Environment:
Last Closed: 2020-10-27 15:57:47 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
- GitHub kubernetes-monitoring/kubernetes-mixin pull 476 (closed): alerts: Improve DaemonSet rollout alert taking progress into account (last updated 2021-02-10 14:43:52 UTC)
- GitHub openshift/cluster-monitoring-operator pull 898 (closed): Bug 1846805: Remove kube-mixin direct dependancy (last updated 2021-02-10 14:43:52 UTC)
- Red Hat Product Errata RHBA-2020:4196 (last updated 2020-10-27 15:58:06 UTC)

Internal Links: 1824981

Description Hongkai Liu 2020-04-16 19:32:48 UTC
During an upgrade of a cluster in the CI build farm, we saw a sequence of alerts and failure messages from clusterversion.

oc --context build01 adm upgrade --allow-explicit-upgrade --to-image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-04-13-190424 --force=true

Eventually the upgrade completed successfully (which is so nice).
But those alerts and messages are frightening.

I would like to create a bug for each of them so we can feel better about the next upgrade.

https://coreos.slack.com/archives/CHY2E1BL4/p1587060400457400


[FIRING:1] KubeDaemonSetRolloutStuck kube-state-metrics (dns-default https-main 10.128.237.134:8443 openshift-dns kube-state-metrics-66dfc9f94f-qdp5d openshift-monitoring/k8s kube-state-metrics critical)
Only 96.43% of the desired Pods of DaemonSet openshift-dns/dns-default are scheduled and ready.

must-gather after upgrade:
http://file.rdu.redhat.com/~hongkliu/test_result/upgrade/upgrade0416/aaa/
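
For reference, the percentage in the alert message is derived from kube-state-metrics DaemonSet metrics. A query along these lines reproduces the ratio in the cluster's Prometheus (metric names are from kube-state-metrics; the label selectors are assumptions for this cluster, shown only as a sketch):

    100
      * kube_daemonset_status_number_ready{namespace="openshift-dns", daemonset="dns-default"}
      / kube_daemonset_status_desired_number_scheduled{namespace="openshift-dns", daemonset="dns-default"}

During the rolling node reboots of an upgrade this dips below 100% whenever a node carrying a dns-default pod is down.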

Comment 1 Dan Mace 2020-04-21 14:15:23 UTC
Sorry, I'm not understanding what specific bug is being reported here.

After the upgrade, DNS reports healthy:

http://file.rdu.redhat.com/~hongkliu/test_result/upgrade/upgrade0416/aaa/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-a37e5bef81b80c86ab3864c3395d69c0867ab0aa58e1150eceb85be8951def49/cluster-scoped-resources/operator.openshift.io/dnses/default.yaml

The kube-state-metrics error looks related to monitoring (in the openshift-monitoring namespace) but it's not clear whether the alert was still firing after the upgrade or how to check that.

What specifically do you think we should be doing with DNS here? It looks like the upgrade succeeded and everything's okay. What's the DNS bug, and what justifies the "high" severity?

Comment 2 W. Trevor King 2020-04-21 18:22:00 UTC
[1] has:

    updateStrategy:
      rollingUpdate:
        maxUnavailable: 1
      type: RollingUpdate
  status:
    currentNumberScheduled: 29
    desiredNumberScheduled: 29
    numberAvailable: 29
    numberMisscheduled: 0
    numberReady: 29
    observedGeneration: 6
    updatedNumberScheduled: 29

And 28/29 = 96.55...% (which is not quite the 27/28 = 96.43...% in the alert message).  So I'm pretty sure this alert was "one of the nodes is down for longer than I'd have liked, and its dns-default pod cannot be scheduled".  I think the solution is tuning the alert to allow for expected node-reboot times.  And possibly increasing maxUnavailable, because the machine-config logic reboots control plane and compute node sets in parallel, right?  So similar to [2].

[1]: http://file.rdu.redhat.com/~hongkliu/test_result/upgrade/upgrade0416/aaa/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-a37e5bef81b80c86ab3864c3395d69c0867ab0aa58e1150eceb85be8951def49/namespaces/openshift-dns/apps/daemonsets.yaml
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1824986#c2
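
For illustration only, letting the DaemonSet tolerate more simultaneously-unavailable pods would look roughly like this in the spec quoted above (the 10% value is hypothetical, and the DNS operator manages this object, so such a change would have to land in the operator rather than be patched by hand):

    updateStrategy:
      rollingUpdate:
        maxUnavailable: 10%
      type: RollingUpdate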

Comment 3 Hongkai Liu 2020-04-27 15:45:42 UTC
Hey Dan,

Yes, the upgrade was successful and no such alerts fired after the upgrade finished.

This bug, like the others linked to https://bugzilla.redhat.com/show_bug.cgi?id=1824981, is about the principle that no critical alerts should fire during an upgrade unless ci-admins need to act on them.

I am not sure which team owns the alert. Apologies if it does not belong to the DNS component.

Comment 4 W. Trevor King 2020-04-27 18:36:57 UTC
> This bug, as the others linked to https://bugzilla.redhat.com/show_bug.cgi?id=1824981, is about that no critical alerts should be fired during the upgrade unless ci-admins need to act on it.

Better link for generic "no critical alerts should fire during healthy updates" is bug 1828427.  That alert fires after 15m [1], and yeah it looks like the monitoring operator is the current owner.  Moving this over to them, so they can talk to the MCO team and decide if 15m is appropriate given our measured node-reboot timing.  Looks like it's been 15m since at least [2], so the reasoning behind the existing 15m choice may be lost to time.

[1]: https://github.com/openshift/cluster-monitoring-operator/blob/b9d67a775b31dbe008885cfb19889266007c0ef0/assets/prometheus-k8s/rules.yaml#L1185-L1195
[2]: https://github.com/openshift/cluster-monitoring-operator/commit/140059365bcf4ef592dba2855b1482a1034fd185#diff-1dca87d186c04a487d72e52ab0b4dde5R198-R202
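
For context, the rule linked in [1] is roughly of this shape (a paraphrase, not the exact 4.3 wording): it fires once the ready/desired ratio has stayed below 100% for 15 minutes, at critical severity:

    - alert: KubeDaemonSetRolloutStuck
      expr: |
        kube_daemonset_status_number_ready{job="kube-state-metrics"}
          /
        kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} < 1.00
      for: 15m
      labels:
        severity: critical

A single unschedulable dns-default pod on a rebooting node is enough to keep that expression true, which matches the behavior seen during the upgrade.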

Comment 10 W. Trevor King 2020-06-15 20:32:19 UTC
Reopening.  Silencing alerts during updates is ok, but is orthogonal to actually fixing the alerting condition or twitchy alert.  Also, this is a critical alert, and silencing critical alerts is less likely than silencing warning or info alerts.  I still think the issue here is that the alert logic should be tuned so that it doesn't fire when we are slow to push DaemonSet pods onto nodes during rolling reboots of large machine pools.

Comment 16 Frederic Branczyk 2020-07-23 13:30:16 UTC
Opened a PR upstream to introduce some improvements. https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/476
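
For reference, the upstream change reworks the expression so the alert only fires when the DaemonSet is out of shape and the rollout has stopped making progress. The result is roughly of this form (paraphrased from the PR, not a verbatim copy; metric names, duration, and severity shown here are illustrative and may differ by version):

    - alert: KubeDaemonSetRolloutStuck
      expr: |
        (
          kube_daemonset_status_current_number_scheduled{job="kube-state-metrics"}
            != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
          or
          kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} != 0
          or
          kube_daemonset_status_updated_number_scheduled{job="kube-state-metrics"}
            != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
          or
          kube_daemonset_status_number_available{job="kube-state-metrics"}
            != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
        ) and (
          changes(kube_daemonset_status_updated_number_scheduled{job="kube-state-metrics"}[5m]) == 0
        )
      for: 15m
      labels:
        severity: warning

The key difference is the trailing "and changes(...) == 0" clause: as long as the rollout keeps updating pods (as it does while nodes reboot one by one), the alert stays quiet.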

Comment 17 Frederic Branczyk 2020-07-28 08:45:20 UTC
The upstream PR has merged; I will now work on getting this into OpenShift.

Comment 24 errata-xmlrpc 2020-10-27 15:57:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

