Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1890684

Summary: [4.5 upgrade]Node got stuck with NotReady
Product: OpenShift Container Platform
Reporter: Hongkai Liu <hongkliu>
Component: Node
Assignee: Ryan Phillips <rphillips>
Node sub component: Kubelet
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
CC: aaleman, aos-bugs, jokerman, skuznets
Version: 4.5
Keywords: Reopened, Upgrades
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Doc Type: If docs needed, set a value
Last Closed: 2020-11-10 15:21:42 UTC
Type: Bug

Description Hongkai Liu 2020-10-22 17:46:17 UTC
Description of problem:
This happened during the update from 4.5.14 to 4.6.0-rc.4.

Nodes that were already part of the cluster were drained and rebooted to deploy the new machine configuration during the cluster upgrade. As a result, user workloads were rescheduled a number of times to other nodes in the cluster. Some of these workloads did not set CPU or memory requests, and when they were all rescheduled to the same target node at once, the resulting thundering herd caused system OOMs. After that, the kubelet on the node became unresponsive and stayed in that state.
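
For anyone trying to find the workloads involved, here is a minimal sketch of a check for containers that set no CPU or memory request. It assumes the pod objects have already been loaded as dictionaries (for example from JSON output of the cluster API) and relies only on the standard spec.containers[].resources.requests layout. The function name and the sample pods are hypothetical and do not come from the affected cluster.

```python
from typing import Dict, List


def containers_missing_requests(pod: Dict) -> List[str]:
    """Return the names of containers that lack a CPU or memory request."""
    missing = []
    for container in pod.get("spec", {}).get("containers", []):
        # "resources" or "requests" may be absent or null in the manifest.
        resources = container.get("resources") or {}
        requests = resources.get("requests") or {}
        if "cpu" not in requests or "memory" not in requests:
            missing.append(container["name"])
    return missing


# Hypothetical sample data, for illustration only.
sample_pods = [
    {
        "metadata": {"name": "with-requests"},
        "spec": {"containers": [
            {"name": "app",
             "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}}},
        ]},
    },
    {
        "metadata": {"name": "no-requests"},
        "spec": {"containers": [{"name": "job", "resources": {}}]},
    },
]

for pod in sample_pods:
    names = containers_missing_requests(pod)
    if names:
        print(pod["metadata"]["name"], "is missing requests on:", names)
```

Without requests, the scheduler has nothing to count against the node's capacity for these pods, so many of them can be placed on one node at once.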

Conditions:
 Type                 Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
 ----                 ------    -----------------                 ------------------                ------              -------
 NetworkUnavailable   False     Mon, 01 Jan 0001 00:00:00 +0000   Tue, 20 Oct 2020 11:16:12 -0700   RouteCreated        openshift-sdn cleared kubelet-set NoRouteCreated
 MemoryPressure       Unknown   Wed, 21 Oct 2020 07:22:05 -0700   Wed, 21 Oct 2020 07:22:52 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
 DiskPressure         Unknown   Wed, 21 Oct 2020 07:22:05 -0700   Wed, 21 Oct 2020 07:22:52 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
 PIDPressure          Unknown   Wed, 21 Oct 2020 07:22:05 -0700   Wed, 21 Oct 2020 07:22:52 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
 Ready                Unknown   Wed, 21 Oct 2020 07:22:05 -0700   Wed, 21 Oct 2020 07:22:52 -0700   NodeStatusUnknown   Kubelet stopped posting node status.



This is the last node that was stuck in NotReady.
build0-gstfj-w-b-grtqb.c.openshift-ci-build-farm.internal   NotReady   worker   26d   v1.19.0+d59ce34
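
The conditions above can also be checked programmatically. The following is a rough sketch, assuming the node object is available as a dictionary with the standard status.conditions list. The function name is hypothetical, and the sample node contains only the fields shown in the conditions table.

```python
from typing import Dict


def kubelet_stopped_reporting(node: Dict) -> bool:
    """Return True if the node's Ready condition has status "Unknown".

    The node controller sets that status when the kubelet stops posting
    node status. A node with no Ready condition at all is reported as
    False here; that choice is an assumption of this sketch.
    """
    for condition in node.get("status", {}).get("conditions", []):
        if condition.get("type") == "Ready":
            return condition.get("status") == "Unknown"
    return False


# Sample built from the conditions table in this report.
stuck_node = {
    "metadata": {
        "name": "build0-gstfj-w-b-grtqb.c.openshift-ci-build-farm.internal"
    },
    "status": {"conditions": [
        {"type": "NetworkUnavailable", "status": "False",
         "reason": "RouteCreated"},
        {"type": "MemoryPressure", "status": "Unknown",
         "reason": "NodeStatusUnknown"},
        {"type": "Ready", "status": "Unknown",
         "reason": "NodeStatusUnknown",
         "message": "Kubelet stopped posting node status."},
    ]},
}

if kubelet_stopped_reporting(stuck_node):
    print(stuck_node["metadata"]["name"], "has stopped posting node status")
```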

Comment 2 Ryan Phillips 2020-10-22 19:22:35 UTC
This was diagnosed as an MCO issue.

Comment 3 Steve Kuznetsov 2020-10-26 14:23:26 UTC
No, this was never related to the MCO issue. This is an issue with the kubelet and/or the system responding to OOM events.

Comment 4 Hongkai Liu 2020-10-27 13:48:35 UTC
Applied the workaround: https://github.com/openshift/release/pull/13140
I will update here if it happens again.
So far, it has been fine since last night.