Description of problem:

This happened during an update from 4.5.14 to 4.6.0-rc.4. Nodes that were already part of the cluster were drained and rebooted to deploy the new machine configuration during the cluster upgrade. As a result, user workloads were rescheduled a number of times to other nodes in the cluster. Some of these workloads did not set CPU or RAM requests, and when they were all rescheduled to the same target node at once, the thundering herd caused system OOMs. The kubelet on that node would then become unresponsive and remain in that state.

Conditions:
  Type                Status   LastHeartbeatTime                LastTransitionTime               Reason             Message
  ----                ------   -----------------                ------------------               ------             -------
  NetworkUnavailable  False    Mon, 01 Jan 0001 00:00:00 +0000  Tue, 20 Oct 2020 11:16:12 -0700  RouteCreated       openshift-sdn cleared kubelet-set NoRouteCreated
  MemoryPressure      Unknown  Wed, 21 Oct 2020 07:22:05 -0700  Wed, 21 Oct 2020 07:22:52 -0700  NodeStatusUnknown  Kubelet stopped posting node status.
  DiskPressure        Unknown  Wed, 21 Oct 2020 07:22:05 -0700  Wed, 21 Oct 2020 07:22:52 -0700  NodeStatusUnknown  Kubelet stopped posting node status.
  PIDPressure         Unknown  Wed, 21 Oct 2020 07:22:05 -0700  Wed, 21 Oct 2020 07:22:52 -0700  NodeStatusUnknown  Kubelet stopped posting node status.
  Ready               Unknown  Wed, 21 Oct 2020 07:22:05 -0700  Wed, 21 Oct 2020 07:22:52 -0700  NodeStatusUnknown  Kubelet stopped posting node status.

This is the last node that was stuck in NotReady:

  build0-gstfj-w-b-grtqb.c.openshift-ci-build-farm.internal   NotReady   worker   26d   v1.19.0+d59ce34
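For anyone hitting the same thing: the pods that piled onto the node carried no CPU/memory requests, so the scheduler had nothing to count against node capacity when they were rescheduled. Below is a minimal sketch of one way to enforce defaults with a LimitRange, built from the k8s.io/api types and printed as YAML. This is only an illustration, not the content of the workaround mentioned later in this bug; the namespace name and the sizes are made up.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// Hypothetical LimitRange: any container in the namespace that omits
	// CPU/memory requests gets these defaults, so the scheduler counts the
	// pods against node allocatable instead of over-committing one node.
	// Namespace name and sizes are placeholders.
	lr := corev1.LimitRange{
		TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "LimitRange"},
		ObjectMeta: metav1.ObjectMeta{Name: "default-requests", Namespace: "example-ci-namespace"},
		Spec: corev1.LimitRangeSpec{
			Limits: []corev1.LimitRangeItem{{
				Type: corev1.LimitTypeContainer,
				DefaultRequest: corev1.ResourceList{
					corev1.ResourceCPU:    resource.MustParse("100m"),
					corev1.ResourceMemory: resource.MustParse("200Mi"),
				},
				Default: corev1.ResourceList{
					corev1.ResourceMemory: resource.MustParse("1Gi"),
				},
			}},
		},
	}
	// Print as YAML so it can be reviewed and applied per namespace.
	out, err := yaml.Marshal(lr)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}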
This was diagnosed as an MCO issue.
No, this was never related to the MCO. This is an issue with the kubelet and/or how the system responds to OOM events.
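For completeness, here is a small sketch (only illustrative, assuming a kubeconfig at the default path) that lists nodes whose Ready condition is no longer True, which is how the stuck worker in the description shows up once the kubelet stops posting status:

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	// A node whose Ready condition is Unknown is one the kubelet has stopped
	// posting status for, like the worker in the description above.
	for _, n := range nodes.Items {
		for _, c := range n.Status.Conditions {
			if c.Type == corev1.NodeReady && c.Status != corev1.ConditionTrue {
				fmt.Printf("%s Ready=%s reason=%s since=%s\n",
					n.Name, c.Status, c.Reason, c.LastTransitionTime)
			}
		}
	}
}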
Applied the workaround: https://github.com/openshift/release/pull/13140. I will update here if it happens again. So far, things have been good since last night.