Bug 1824988
| Summary: | [4.3 upgrade][alert] KubeDaemonSetRolloutStuck: Only 96.43% of the desired Pods of DaemonSet openshift-dns/dns-default are scheduled and ready. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Hongkai Liu <hongkliu> |
| Component: | Monitoring | Assignee: | Sergiusz Urbaniak <surbania> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.3.0 | CC: | alegrand, anpicker, aos-bugs, ccoleman, erooth, kakkoyun, lcosic, mloibl, pkrupa, surbania, wking |
| Target Milestone: | --- | Keywords: | Reopened, Upgrades |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Some alerts did not have the right severity set or were incorrect, causing upgrade issues. The fixes adjust the severity of many cause-based alerts from critical to warning, adjust KubeStatefulSetUpdateNotRolledOut and KubeDaemonSetRolloutStuck, and remove KubeAPILatencyHigh and KubeAPIErrorsHigh. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-27 15:57:47 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Hongkai Liu
2020-04-16 19:32:48 UTC
Sorry, I'm not understanding what specific bug is being reported here. After the upgrade, DNS reports healthy: http://file.rdu.redhat.com/~hongkliu/test_result/upgrade/upgrade0416/aaa/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-a37e5bef81b80c86ab3864c3395d69c0867ab0aa58e1150eceb85be8951def49/cluster-scoped-resources/operator.openshift.io/dnses/default.yaml

The kube-state-metrics error looks related to monitoring (in the openshift-monitoring namespace), but it's not clear whether the alert was still firing after the upgrade or how to check that. What specifically do you think we should be doing with DNS here? It looks like the upgrade succeeded and everything's okay. What's the DNS bug, and what justifies the "high" severity?

[1] has:
updateStrategy:
  rollingUpdate:
    maxUnavailable: 1
  type: RollingUpdate
status:
  currentNumberScheduled: 29
  desiredNumberScheduled: 29
  numberAvailable: 29
  numberMisscheduled: 0
  numberReady: 29
  observedGeneration: 6
  updatedNumberScheduled: 29
And 28/29 = 96.55...% (which is not quite 27/28 = 96.43...%). So I'm pretty sure this alert was "one of the nodes is down for longer than I'd have liked, and its dns-default pod cannot be scheduled". I think the solution is tuning the alert to allow for expected node-reboot times (see the rule sketch after the links below), and possibly increasing maxUnavailable, because the machine-config logic reboots the control-plane and compute node sets in parallel, right? So similar to [2].
[1]: http://file.rdu.redhat.com/~hongkliu/test_result/upgrade/upgrade0416/aaa/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-a37e5bef81b80c86ab3864c3395d69c0867ab0aa58e1150eceb85be8951def49/namespaces/openshift-dns/apps/daemonsets.yaml
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1824986#c2
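For context, the rule shipped at the time (see the cluster-monitoring-operator rules.yaml link later in this bug) is essentially a ready/desired percentage check. A minimal sketch of roughly that shape, assuming the standard kube-state-metrics DaemonSet metrics; the exact annotations and selectors in cluster-monitoring-operator may differ:

    - alert: KubeDaemonSetRolloutStuck
      annotations:
        message: Only {{ $value }}% of the desired Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are scheduled and ready.
      expr: |
        # percentage of ready pods vs. pods the DaemonSet wants to run
        kube_daemonset_status_number_ready{job="kube-state-metrics"}
          /
        kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} * 100 < 100
      for: 15m
      labels:
        severity: critical

With 27 of 28 desired pods ready the expression evaluates to roughly 96.43, matching the number in the summary, so a single node staying down for longer than the 15-minute `for` window during a rolling reboot is enough to fire it.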
Hey Dan, yes, the upgrade was successful and no such alerts were fired after the upgrade process. This bug, like the others linked to https://bugzilla.redhat.com/show_bug.cgi?id=1824981, is about the fact that no critical alerts should fire during the upgrade unless ci-admins need to act on them. I am not sure which team owns the alert; apologies if this is not a DNS component issue.

> This bug, like the others linked to https://bugzilla.redhat.com/show_bug.cgi?id=1824981, is about the fact that no critical alerts should fire during the upgrade unless ci-admins need to act on them.

A better link for the generic "no critical alerts should fire during healthy updates" issue is bug 1828427. That alert fires after 15m [1], and yeah, it looks like the monitoring operator is the current owner. Moving this over to them, so they can talk to the MCO team and decide whether 15m is appropriate given our measured node-reboot timing. Looks like it's been 15m since at least [2], so the reasoning behind the existing 15m choice may be lost to time.

[1]: https://github.com/openshift/cluster-monitoring-operator/blob/b9d67a775b31dbe008885cfb19889266007c0ef0/assets/prometheus-k8s/rules.yaml#L1185-L1195
[2]: https://github.com/openshift/cluster-monitoring-operator/commit/140059365bcf4ef592dba2855b1482a1034fd185#diff-1dca87d186c04a487d72e52ab0b4dde5R198-R202

Reopening. Silencing alerts during updates is okay, but it is orthogonal to actually fixing the alerting condition or the twitchy alert. Also, this is a critical alert, and silencing critical alerts is less likely than silencing warning or info alerts. I still think the issue here is that the alert logic should be tuned so that it doesn't fire when we are slow to push DaemonSet pods onto nodes during rolling reboots of large machine pools.

Opened a PR upstream to introduce some improvements: https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/476

The upstream PR has merged; will now work on getting this into OpenShift.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
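For reference, the direction of the upstream change is to stop alerting on a raw ready/desired ratio and instead fire only when a rollout has stopped making progress. A sketch of that style of rule, assuming kube-state-metrics metric names of that era (names vary slightly between kube-state-metrics versions, and the exact expression merged via the kubernetes-mixin PR may differ):

    - alert: KubeDaemonSetRolloutStuck
      annotations:
        message: DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has not finished or progressed for at least 15 minutes.
      expr: |
        (
          (
            # not every node that should run the DaemonSet has a pod scheduled
            kube_daemonset_status_current_number_scheduled{job="kube-state-metrics"}
              !=
            kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
          ) or (
            # pods are running on nodes where they should not be
            kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} != 0
          ) or (
            # not every scheduled pod runs the updated template yet
            kube_daemonset_updated_number_scheduled{job="kube-state-metrics"}
              !=
            kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
          ) or (
            # not every desired pod is available
            kube_daemonset_status_number_available{job="kube-state-metrics"}
              !=
            kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
          )
        ) and (
          # ...and the count of updated pods has not moved recently,
          # i.e. the rollout is stalled rather than merely slow
          changes(kube_daemonset_updated_number_scheduled{job="kube-state-metrics"}[5m]) == 0
        )
      for: 15m
      labels:
        severity: warning

The key difference from the old rule is the changes(...) == 0 guard: a DaemonSet that is merely rolling slowly across a large, rebooting machine pool keeps making progress and stays quiet, while one that has genuinely stopped progressing for the full window still alerts.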