Bug 1824996

Summary: [4.3 upgrade][alert] KubeNodeUnreachable: ip-10-0-159-123.ec2.internal is unreachable and some workloads may be rescheduled.
Product: OpenShift Container Platform
Reporter: Hongkai Liu <hongkliu>
Component: Node
Assignee: Joel Smith <joelsmith>
Status: CLOSED WONTFIX
QA Contact: MinLi <minmli>
Severity: medium
Priority: medium
Docs Contact:
Version: 4.3.0
CC: aos-bugs, ccoleman, eparis, jokerman, nagrawal, nmalik, scuppett, sjenning, tsweeney, wking
Target Milestone: ---
Keywords: Reopened, ServiceDeliveryImpact, Upgrades
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-27 23:40:16 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Hongkai Liu 2020-04-16 19:45:40 UTC
During an upgrade of a cluster in the CI build farm, we saw a sequence of alerts and failure messages from clusterversion.

oc --context build01 adm upgrade --allow-explicit-upgrade --to-image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-04-13-190424 --force=true

Eventually the upgrade completed successfully (which is nice), but the alerts and messages along the way were alarming.

I would like to create a bug for each of them so the next upgrade feels less worrying.

https://coreos.slack.com/archives/CHY2E1BL4/p1587060450460700


[FIRING:1] KubeNodeUnreachable kube-state-metrics (NoSchedule https-main 10.128.237.134:8443 node.kubernetes.io/unreachable openshift-monitoring ip-10-0-159-123.ec2.internal kube-state-metrics-66dfc9f94f-qdp5d openshift-monitoring/k8s kube-state-metrics warning)
ip-10-0-159-123.ec2.internal is unreachable and some workloads may be rescheduled.

must-gather after upgrade:
http://file.rdu.redhat.com/~hongkliu/test_result/upgrade/upgrade0416/aaa/

Comment 1 Hongkai Liu 2020-04-16 19:46:25 UTC
Not sure which component this should be attached to.

ip-10-0-159-123.ec2.internal is a master node.

Comment 2 Stephen Cuppett 2020-04-17 12:32:33 UTC
Setting target release to current development version (4.5) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.

Comment 3 Ryan Phillips 2020-05-14 17:54:09 UTC
We have a few fixes in place for the CI cluster. Closing.

Comment 5 W. Trevor King 2020-06-15 20:46:01 UTC
Reopening.  CI-side fixes don't help customers who are seeing this in the wild, and Telemetry shows many born-in-4.4 clusters alerting KubeNodeUnreachable today.

Comment 8 W. Trevor King 2020-06-17 01:54:22 UTC
Internal discussion suggests that IOPS saturation might be the root cause, although I have a hard time imagining us saturating IOPS for multiple kubelet heartbeats (but maybe I'm just misunderstanding).  Can we have a separate alert that tracks IOPS throughput to detect when we're pegged, versus any other causes that might be behind KubeNodeUnreachable?  Or, if IOPS are often the culprit, at least mention that in the alert or link out to docs discussing it, because the current alert text does not give much guidance about possible causes or resolutions.  What action should a cluster-admin take when they see this alert?
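
For illustration only, a sketch of what a separate disk-saturation alert could look like, assuming node_exporter's node_disk_io_time_seconds_total counter is available (its per-second rate approaches 1 when the device is busy nearly all of the time). The alert name NodeDiskIOSaturated and the 0.95 / 15m thresholds are hypothetical and not anything that has been proposed or committed:

# Hypothetical rule, not part of any shipped mixin: fire when a disk spends
# nearly all of its time servicing I/O, which is one way "pegged IOPS" shows up.
groups:
- name: hypothetical-disk-saturation
  rules:
  - alert: NodeDiskIOSaturated   # hypothetical name
    expr: rate(node_disk_io_time_seconds_total{job="node-exporter"}[5m]) > 0.95
    for: 15m                     # ignore short bursts
    labels:
      severity: warning
    annotations:
      message: 'Disk {{ $labels.device }} on {{ $labels.instance }} has been more than 95% busy for 15 minutes.'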

Comment 12 Naveen Malik 2020-07-29 18:52:19 UTC
This alert fires on upgrade as nodes are upgraded.  It doesn't have a `for` property, so it fires immediately; note that the similar KubeNodeNotReady alert uses `for: 15m`.  Observing the latest 4.4.11 upgrades for OSD, I see that if we set a similar `for: 15m`, the alert would no longer fire during upgrade but would still fire eventually, which is good enough for me.  Could we consider updating the alert so it does not fire immediately?
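
As a rough sketch of the proposed change (the authoritative definition lives in kubernetes-mixin / cluster-monitoring-operator, so treat the expr below as an approximation rather than the shipped rule), adding `for: 15m` would make KubeNodeUnreachable behave like KubeNodeNotReady; the labels and message mirror the alert text quoted in the description:

# Approximate shape of the KubeNodeUnreachable rule with the proposed `for` added;
# the expr is a simplification of the upstream kubernetes-mixin definition.
- alert: KubeNodeUnreachable
  expr: kube_node_spec_taint{job="kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"} == 1
  for: 15m   # proposed: tolerate the taint briefly applied while a node reboots during upgrade
  labels:
    severity: warning
  annotations:
    message: '{{ $labels.node }} is unreachable and some workloads may be rescheduled.'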

Comment 13 Joel Smith 2020-07-30 21:21:00 UTC
Following Naveen's idea, I have posted https://github.com/openshift/cluster-monitoring-operator/pull/893, which seems reasonable to me.

Comment 14 Joel Smith 2020-08-24 11:28:56 UTC
It looks like my PR targets the wrong repo. The check is also found upstream.  Opened https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/491