Bug 1945739 - Endurance cluster has NotReady and SchedulingDisabled nodes after upgrade
Summary: Endurance cluster has NotReady and SchedulingDisabled nodes after upgrade
Keywords:
Status: CLOSED DUPLICATE of bug 1929463
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Elana Hashman
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks: 1946306
 
Reported: 2021-04-01 19:53 UTC by Ben Parees
Modified: 2021-04-27 21:27 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1946306
Environment:
Last Closed: 2021-04-27 21:27:23 UTC
Target Upstream Version:
Embargoed:



Description Ben Parees 2021-04-01 19:53:42 UTC
Description of problem:
Cluster was upgraded from
4.6.0-0.nightly-2021-03-21-131139
to
4.6.0-0.nightly-2021-03-27-052141
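
For reference, the upgrade path can be confirmed from the ClusterVersion history, e.g. (field names per the ClusterVersion API; output omitted):

$ oc get clusterversion version -o jsonpath='{range .status.history[*]}{.state}{" "}{.version}{"\n"}{end}'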

Cluster now has two NotReady nodes and one with scheduling disabled:

$ oc get nodes
NAME                                         STATUS                     ROLES    AGE   VERSION
ip-10-0-136-59.us-east-2.compute.internal    NotReady                   worker   20d   v1.19.0+263ee0d
ip-10-0-147-192.us-east-2.compute.internal   Ready                      master   20d   v1.19.0+a5a0987
ip-10-0-178-43.us-east-2.compute.internal    NotReady                   worker   20d   v1.19.0+263ee0d
ip-10-0-191-180.us-east-2.compute.internal   Ready                      master   20d   v1.19.0+a5a0987
ip-10-0-214-241.us-east-2.compute.internal   Ready                      master   20d   v1.19.0+a5a0987
ip-10-0-246-183.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   20d   v1.19.0+263ee0d
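
For reference, the SchedulingDisabled node's MCO state can be read from its annotations, e.g.:

$ oc get node ip-10-0-246-183.us-east-2.compute.internal -o yaml | grep machineconfiguration.openshift.io

(The drain-failure annotation quoted under Additional info below was presumably pulled this way.)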



Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2021-03-27-052141

How reproducible:
unknown


Additional info:


The kubelet appears to have died on two nodes:

Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Tue, 30 Mar 2021 02:14:51 -0400   Tue, 30 Mar 2021 02:13:27 -0400   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Tue, 30 Mar 2021 02:14:51 -0400   Tue, 30 Mar 2021 02:13:27 -0400   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Tue, 30 Mar 2021 02:14:51 -0400   Tue, 30 Mar 2021 02:13:27 -0400   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Tue, 30 Mar 2021 02:14:51 -0400   Tue, 30 Mar 2021 02:15:32 -0400   NodeStatusUnknown   Kubelet stopped posting node status.



Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Tue, 30 Mar 2021 13:05:29 -0400   Tue, 30 Mar 2021 13:06:20 -0400   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Tue, 30 Mar 2021 13:05:29 -0400   Tue, 30 Mar 2021 13:06:20 -0400   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Tue, 30 Mar 2021 13:05:29 -0400   Tue, 30 Mar 2021 13:06:20 -0400   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Tue, 30 Mar 2021 13:05:29 -0400   Tue, 30 Mar 2021 13:06:20 -0400   NodeStatusUnknown   Kubelet stopped posting node status.
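
To confirm the kubelet state on an affected node, something like the following is the usual first step, though with the kubelet fully down these calls may not respond and direct SSH to the host may be needed:

$ oc adm node-logs ip-10-0-136-59.us-east-2.compute.internal -u kubelet
$ oc debug node/ip-10-0-136-59.us-east-2.compute.internal -- chroot /host systemctl status kubelet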



One node is also failing to drain:
                    machineconfiguration.openshift.io/reason:
                      failed to drain node (5 tries): timed out waiting for the condition: [error when evicting pod "pod-submit-status-2-10": pods "pod-submit-s...
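
The pod blocking the eviction (and any PodDisruptionBudget holding it) can be located with, e.g.:

$ oc get pods -A -o wide | grep pod-submit-status
$ oc get pdb -A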

