During an upgrade of a cluster in the CI build farm, we saw a sequence of alerts and failure messages from clusterversion.

oc --context build01 adm upgrade --allow-explicit-upgrade --to-image registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2020-04-13-190424 --force=true

Eventually the upgrade completed successfully (which is nice), but those alerts and messages are frightening. I would like to create a bug for each of them so the next upgrade is less alarming.

https://coreos.slack.com/archives/CHY2E1BL4/p1587060450460700

[FIRING:1] KubeNodeUnreachable kube-state-metrics (NoSchedule https-main 10.128.237.134:8443 node.kubernetes.io/unreachable openshift-monitoring ip-10-0-159-123.ec2.internal kube-state-metrics-66dfc9f94f-qdp5d openshift-monitoring/k8s kube-state-metrics warning)

ip-10-0-159-123.ec2.internal is unreachable and some workloads may be rescheduled.

must-gather after upgrade: http://file.rdu.redhat.com/~hongkliu/test_result/upgrade/upgrade0416/aaa/
Not sure which component this should be attached to. ip-10-0-159-123.ec2.internal is a master node.
Setting target release to current development version (4.5) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.
We have a few fixes in place for the CI cluster. Closing.
Reopening. CI-side fixes don't help customers who are seeing this in the wild, and Telemetry shows many born-in-4.4 clusters alerting KubeNodeUnreachable today.
Internal discussion suggests that IOPS saturation might be the root cause, although I have a hard time imagining us saturating IOPS for multiple consecutive kubelet heartbeats (but maybe I'm just misunderstanding). Could we have a separate alert that tracks disk I/O saturation, so we can tell when we're pegged versus some other cause behind KubeNodeUnreachable? (A rough sketch follows below.) Or, if IOPS are often the culprit, at least mention that in the alert text or link out to docs discussing the alert; the current text gives little guidance about possible causes or resolutions. What action should a cluster admin take when they see this alert?
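For illustration only, a disk-saturation alert along those lines could be expressed as a Prometheus rule over node_exporter's I/O-time counter. This is a sketch, not an existing rule: the alert name, threshold, and `for` duration are made up here, and the `node_disk_io_time_seconds_total` metric and `node-exporter` job label are assumed to be available in the cluster's monitoring stack.

```yaml
# Hypothetical example; not shipped by cluster-monitoring-operator today.
# rate(node_disk_io_time_seconds_total[5m]) close to 1 means the device was
# busy for nearly 100% of the sampled window, i.e. I/O is saturated.
groups:
- name: node-disk-saturation-example
  rules:
  - alert: NodeDiskSaturated          # hypothetical alert name
    expr: rate(node_disk_io_time_seconds_total{job="node-exporter"}[5m]) > 0.95
    for: 10m                          # illustrative duration
    labels:
      severity: warning
    annotations:
      message: 'Disk {{ $labels.device }} on {{ $labels.instance }} has been ~100% busy for 10 minutes; I/O saturation may be delaying kubelet heartbeats.'
```

Something like this, firing alongside KubeNodeUnreachable, would at least let an admin distinguish "disk is pegged" from other unreachable-node causes.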
This alert fires during upgrades as nodes are upgraded. It doesn't have a `for` property, so it fires immediately; note that the similar KubeNodeNotReady alert has `for: 15m`. Observing the latest 4.4.11 upgrades for OSD, if we set a similar `for: 15m` the alert would no longer fire during the upgrade but would still fire eventually if a node stays unreachable, which is good enough for me. Could we consider updating the alert to not fire immediately?
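Roughly what the proposed change would look like, based on the alert text quoted above; this is a sketch, and the exact upstream expression in kubernetes-mixin may differ:

```yaml
# Sketch of the proposed change: same rule, with a `for` duration added so the
# unreachable taint must persist (e.g. past a node reboot during upgrade) before firing.
- alert: KubeNodeUnreachable
  expr: kube_node_spec_taint{job="kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"} == 1
  for: 15m
  labels:
    severity: warning
  annotations:
    message: '{{ $labels.node }} is unreachable and some workloads may be rescheduled.'
```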
Following Naveen's idea, I have posted https://github.com/openshift/cluster-monitoring-operator/pull/893 which seems reasonable to me.
It looks like my PR targets the wrong repo; the alert is also defined upstream in kubernetes-mixin. Opened https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/491 instead.