During an upgrade of a cluster in the CI build farm, we saw a sequence of alerts and failure messages from clusterversion:

  oc --context build01 adm upgrade --allow-explicit-upgrade --to-image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-04-13-190424 --force=true

The upgrade eventually completed successfully (which is nice), but those alerts and messages are frightening. I would like to create a bug for each of them, so I can feel better about the next upgrade.

https://coreos.slack.com/archives/CHY2E1BL4/p1587060400457400

  [FIRING:1] KubeDaemonSetRolloutStuck kube-state-metrics (dns-default https-main 10.128.237.134:8443 openshift-dns kube-state-metrics-66dfc9f94f-qdp5d openshift-monitoring/k8s kube-state-metrics critical)
  Only 96.43% of the desired Pods of DaemonSet openshift-dns/dns-default are scheduled and ready.

must-gather after upgrade: http://file.rdu.redhat.com/~hongkliu/test_result/upgrade/upgrade0416/aaa/
Sorry, I'm not understanding what specific bug is being reported here. After the upgrade, DNS reports healthy: http://file.rdu.redhat.com/~hongkliu/test_result/upgrade/upgrade0416/aaa/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-a37e5bef81b80c86ab3864c3395d69c0867ab0aa58e1150eceb85be8951def49/cluster-scoped-resources/operator.openshift.io/dnses/default.yaml The kube-state-metrics error looks related to monitoring (in the openshift-monitoring namespace) but it's not clear whether the alert was still firing after the upgrade or how to check that. What specifically do you think we should be doing with DNS here? It looks like the upgrade succeeded and everything's okay. What's the DNS bug, and what justifies the "high" severity?
[1] has:

  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
  status:
    currentNumberScheduled: 29
    desiredNumberScheduled: 29
    numberAvailable: 29
    numberMisscheduled: 0
    numberReady: 29
    observedGeneration: 6
    updatedNumberScheduled: 29

And 28/29 = 96.55...% (which is not quite the reported 27/28 = 96.43...%). So I'm pretty sure this alert was "one of the nodes is down for longer than I'd have liked, and its dns-default pod cannot be scheduled". I think the solution is tuning the alert to allow for expected node-reboot times. And possibly increasing maxUnavailable, because the machine-config logic reboots control-plane and compute node sets in parallel, right? So similar to [2].

[1]: http://file.rdu.redhat.com/~hongkliu/test_result/upgrade/upgrade0416/aaa/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-a37e5bef81b80c86ab3864c3395d69c0867ab0aa58e1150eceb85be8951def49/namespaces/openshift-dns/apps/daemonsets.yaml
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1824986#c2
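As a sketch of the maxUnavailable idea (illustrative only; the percentage value here is an assumption, not a tested recommendation), the DaemonSet's update strategy could tolerate more simultaneously-unavailable pods on large node pools:

```yaml
# Hypothetical tuning of openshift-dns/dns-default.
# A percentage scales with cluster size, so parallel reboots of several
# nodes would not immediately trip the "pods unavailable" accounting.
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 10%   # assumed value for illustration; current value is 1
```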
Hey Dan, yes, the upgrade was successful and no such alerts fired after the upgrade process. This bug, like the others linked to https://bugzilla.redhat.com/show_bug.cgi?id=1824981, is about the expectation that no critical alerts should fire during the upgrade unless ci-admins need to act on them. I am not sure which team owns the alert. Apologies if this is not a DNS component.
> This bug, as the others linked to https://bugzilla.redhat.com/show_bug.cgi?id=1824981, is about that no critical alerts should be fired during the upgrade unless ci-admins need to act on it.

A better link for the generic "no critical alerts should fire during healthy updates" is bug 1828427.

That alert fires after 15m [1], and yeah, it looks like the monitoring operator is the current owner. Moving this over to them, so they can talk to the MCO team and decide if 15m is appropriate given our measured node-reboot timing. Looks like it's been 15m since at least [2], so the reasoning behind the existing 15m choice may be lost to time.

[1]: https://github.com/openshift/cluster-monitoring-operator/blob/b9d67a775b31dbe008885cfb19889266007c0ef0/assets/prometheus-k8s/rules.yaml#L1185-L1195
[2]: https://github.com/openshift/cluster-monitoring-operator/commit/140059365bcf4ef592dba2855b1482a1034fd185#diff-1dca87d186c04a487d72e52ab0b4dde5R198-R202
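For reference, the rule linked in [1] has roughly this shape (a paraphrased sketch, not an exact copy; see the link for the authoritative version):

```yaml
- alert: KubeDaemonSetRolloutStuck
  expr: |
    kube_daemonset_status_number_ready{job="kube-state-metrics"}
      /
    kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
      < 1.00
  for: 15m        # the window under discussion: is 15m enough for node reboots?
  labels:
    severity: critical
```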
Reopening. Silencing alerts during updates is ok, but is orthogonal to actually fixing the alerting condition or twitchy alert. Also, this is a critical alert, and silencing critical alerts is less likely than silencing warning or info alerts. I still think the issue here is that the alert logic should be tuned so that it doesn't fire when we are slow to push DaemonSet pods onto nodes during rolling reboots of large machine pools.
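One way to express that tuning (illustrative only; metric names follow kube-state-metrics conventions, and the exact condition merged upstream may differ) is to fire only when the DaemonSet both has unready pods and is making no scheduling progress:

```yaml
# Sketch of a tuned rule: tolerate transient unreadiness during rolling
# reboots, and alert only when the rollout appears genuinely stuck.
- alert: KubeDaemonSetRolloutStuck
  expr: |
    (
      kube_daemonset_status_number_ready{job="kube-state-metrics"}
        !=
      kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
    )
    and
    (
      # no scheduling progress in the recent window (assumed 5m for illustration)
      changes(kube_daemonset_updated_number_scheduled{job="kube-state-metrics"}[5m]) == 0
    )
  for: 15m
```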
Opened a PR upstream to introduce some improvements. https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/476
Upstream PR merged; will now work on getting this into OpenShift.
PR https://github.com/openshift/cluster-monitoring-operator/pull/898
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196