During an upgrade of a cluster in the CI build farm, we saw a sequence of alerts and failure messages from clusterversion.

oc --context build01 adm upgrade --allow-explicit-upgrade --to-image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-04-13-190424 --force=true

Eventually the upgrade completed successfully (which is nice), but those alerts and messages are frightening. I would like to create a bug for each of them so the next upgrade is less alarming.

https://coreos.slack.com/archives/CHY2E1BL4/p1587060450460700

[FIRING:1] KubeNodeUnreachable kube-state-metrics (NoSchedule https-main 10.128.237.134:8443 node.kubernetes.io/unreachable openshift-monitoring ip-10-0-159-123.ec2.internal kube-state-metrics-66dfc9f94f-qdp5d openshift-monitoring/k8s kube-state-metrics warning)

ip-10-0-159-123.ec2.internal is unreachable and some workloads may be rescheduled.

must-gather after upgrade: http://file.rdu.redhat.com/~hongkliu/test_result/upgrade/upgrade0416/aaa/
Not sure which component this should be attached to. ip-10-0-159-123.ec2.internal is a master node.
Setting target release to current development version (4.5) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.
We have a few fixes in place for the CI cluster. Closing.
Reopening. CI-side fixes don't help customers who are seeing this in the wild, and Telemetry shows many born-in-4.4 clusters alerting KubeNodeUnreachable today.
Internal discussion suggests that IOPS saturation might be the root cause, although I have a hard time imagining us saturating IOPS for multiple consecutive kubelet heartbeats (but maybe I'm just misunderstanding). Could we have a separate alert that tracks disk I/O saturation, so we can tell when we're pegged versus some other cause behind KubeNodeUnreachable? (A rough sketch follows below.) Or, if IOPS are often the culprit, at least mention that in the alert text or link out to docs discussing the alert; the current text gives little guidance about possible causes or resolutions. What action should a cluster admin take when they see this alert?
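For illustration only, a disk-saturation alert along those lines could be expressed as a Prometheus rule over node_exporter's I/O-time counter. This is a sketch, not an existing rule: the alert name, threshold, and `for` duration are made up here, and the `node_disk_io_time_seconds_total` metric and `node-exporter` job label are assumed to be available in the cluster's monitoring stack.

```yaml
# Hypothetical example; not shipped by cluster-monitoring-operator today.
# rate(node_disk_io_time_seconds_total[5m]) close to 1 means the device was
# busy for nearly 100% of the sampled window, i.e. I/O is saturated.
groups:
- name: node-disk-saturation-example
  rules:
  - alert: NodeDiskSaturated          # hypothetical alert name
    expr: rate(node_disk_io_time_seconds_total{job="node-exporter"}[5m]) > 0.95
    for: 10m                          # illustrative duration
    labels:
      severity: warning
    annotations:
      message: 'Disk {{ $labels.device }} on {{ $labels.instance }} has been ~100% busy for 10 minutes; I/O saturation may be delaying kubelet heartbeats.'
```

Something like this, firing alongside KubeNodeUnreachable, would at least let an admin distinguish "disk is pegged" from other unreachable-node causes.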
This alert fires during upgrades as nodes are upgraded. It doesn't have a `for` property, so it fires immediately; note that the similar KubeNodeNotReady alert has `for: 15m`. Observing the latest 4.4.11 upgrades for OSD, if we set a similar `for: 15m` the alert would no longer fire during the upgrade but would still fire eventually if a node stays unreachable, which is good enough for me. Could we consider updating the alert to not fire immediately?
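Roughly what the proposed change would look like, based on the alert text quoted above; this is a sketch, and the exact upstream expression in kubernetes-mixin may differ:

```yaml
# Sketch of the proposed change: same rule, with a `for` duration added so the
# unreachable taint must persist (e.g. past a node reboot during upgrade) before firing.
- alert: KubeNodeUnreachable
  expr: kube_node_spec_taint{job="kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"} == 1
  for: 15m
  labels:
    severity: warning
  annotations:
    message: '{{ $labels.node }} is unreachable and some workloads may be rescheduled.'
```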
Following Naveen's idea, I have posted https://github.com/openshift/cluster-monitoring-operator/pull/893 which seems reasonable to me.
It looks like my PR targets the wrong repo; the alert is also defined upstream in kubernetes-mixin. Opened https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/491 instead.