Bug 1824996
| Summary: | [4.3 upgrade][alert] KubeNodeUnreachable: ip-10-0-159-123.ec2.internal is unreachable and some workloads may be rescheduled. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Hongkai Liu <hongkliu> |
| Component: | Node | Assignee: | Joel Smith <joelsmith> |
| Status: | CLOSED WONTFIX | QA Contact: | MinLi <minmli> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.3.0 | CC: | aos-bugs, ccoleman, eparis, jokerman, nagrawal, nmalik, scuppett, sjenning, tsweeney, wking |
| Target Milestone: | --- | Keywords: | Reopened, ServiceDeliveryImpact, Upgrades |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-27 23:40:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Hongkai Liu
2020-04-16 19:45:40 UTC
Not sure which component this should be attached to. ip-10-0-159-123.ec2.internal is a master node.

Setting the target release to the current development version (4.5) for investigation. Where fixes (if any) are required or requested for prior versions, cloned BZs will be created as appropriate.

We have a few fixes in place for the CI cluster. Closing.

Reopening. CI-side fixes don't help customers who are seeing this in the wild, and Telemetry shows many born-in-4.4 clusters alerting KubeNodeUnreachable today. Internal discussion suggests that IOPS saturation might be the root cause, although I have a hard time imagining us saturating IOPS for multiple kubelet heartbeats (but maybe I'm just misunderstanding). Could we have a separate alert that tracks IOPS throughput to detect when we're pegged, as opposed to any other cause that might be behind KubeNodeUnreachable (a rough sketch of such a rule is included below)? Or, if IOPS are often the culprit, at least mention that in the alert or link to docs discussing it, because the current alert text gives little guidance about possible causes or resolutions. What action should a cluster admin take when they see this alert?

This alert fires on upgrade as nodes are upgraded. It doesn't have a `for` property, so it fires immediately; note that the similar KubeNodeNotReady alert uses `for: 15m`. Observing recent 4.4.11 upgrades for OSD, I see that with a similar `for: 15m` the alert would no longer fire during upgrade but would still fire eventually, which is good enough for me. Could we consider updating the alert to not fire immediately (see the rule sketch below)?

Following Naveen's idea, I have posted https://github.com/openshift/cluster-monitoring-operator/pull/893, which seems reasonable to me.

It looks like my PR targets the wrong repo; the alert rule is also defined upstream. Opened https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/491
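For illustration only, here is a minimal sketch of what a separate IOPS-saturation alert could look like, built on node_exporter's `node_disk_io_time_seconds_total`. The alert name, threshold, and `for` duration are assumptions for the sake of the example, not an existing rule in cluster-monitoring-operator or kubernetes-mixin:

```yaml
# Hypothetical rule (not shipped anywhere): flags a disk that has spent more
# than 95% of wall-clock time doing I/O for 10 minutes, i.e. it is saturated.
- alert: NodeDiskIOSaturated
  expr: rate(node_disk_io_time_seconds_total{job="node-exporter"}[5m]) > 0.95
  for: 10m
  labels:
    severity: warning
  annotations:
    message: 'Disk {{ $labels.device }} on {{ $labels.instance }} has been saturated with I/O for 10 minutes, which may explain KubeNodeUnreachable.'
```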
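And a sketch of the change the two PRs above propose, assuming the upstream kubernetes-mixin expression for KubeNodeUnreachable at the time (the exact expression, labels, and annotations in the merged PRs may differ):

```yaml
- alert: KubeNodeUnreachable
  expr: kube_node_spec_taint{job="kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"} == 1
  # Added: give the node 15 minutes to recover (e.g. during an upgrade reboot)
  # before firing, mirroring KubeNodeNotReady's `for: 15m`.
  for: 15m
  labels:
    severity: warning
  annotations:
    message: '{{ $labels.node }} is unreachable and some workloads may be rescheduled.'
```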