
Bug 1936724

Summary:	Workers remain in SchedulingDisabled state after being upgraded
Product:	OpenShift Container Platform
Component:	Machine Config Operator
Machine Config Operator sub component:	Machine Config Operator
Version:	4.7
Status:	CLOSED INSUFFICIENT_DATA
Severity:	medium
Priority:	unspecified
Keywords:	ServiceDeliveryImpact
Hardware:	Unspecified
OS:	Unspecified
Type:	Bug
Reporter:	Matt Bargenquast <mbargenq>
Assignee:	Yu Qi Zhang <jerzhang>
QA Contact:	Rio Liu <rioliu>
CC:	aos-bugs, apjagtap, mkrejci, rioliu, travi, wking, yanyang
Last Closed:	2021-11-01 15:54:11 UTC

Description Matt Bargenquast 2021-03-09 01:53:57 UTC

Description of problem:

A cluster was upgraded from 4.6.18 to 4.7.1.

During the upgrade of worker nodes, it was observed that a worker upgraded, rebooted, and then came back in a Ready state but remained cordoned/SchedulingDisabled. It remained in this state for several hours.

This halted the upgrade of the other workers in the pool. To our knowledge, the node had not been manually cordoned, which would otherwise explain this state.

The exact same situation then occurred on a second node in the same cluster during the upgrade.

Uncordoning the node immediately allowed the worker pool upgrade to progress onto a new node.
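
The stuck state described above (node Ready, already upgraded, but still cordoned) can be detected from the node objects. The following is a minimal sketch, not part of the original report. It assumes the output of `oc get nodes -o json` has been loaded into a dict, and it assumes the Machine Config Operator node annotations `machineconfiguration.openshift.io/currentConfig`, `desiredConfig`, and `state` are the right signal that a node has finished updating. The function names are hypothetical.

```python
# Prefix of the Machine Config Operator node annotations (assumed signal
# for "this node has finished applying its machine config").
MCO_PREFIX = "machineconfiguration.openshift.io/"


def is_ready(node):
    """Return True if the node has a Ready condition with status "True"."""
    conditions = node.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


def find_stuck_nodes(node_list):
    """Return names of nodes that are Ready, cordoned, and on their desired config.

    node_list is the parsed JSON of a Kubernetes NodeList.
    """
    stuck = []
    for node in node_list.get("items", []):
        meta = node.get("metadata", {})
        annotations = meta.get("annotations", {})
        current = annotations.get(MCO_PREFIX + "currentConfig")
        desired = annotations.get(MCO_PREFIX + "desiredConfig")
        state = annotations.get(MCO_PREFIX + "state")
        # "Updated" here means current == desired and the daemon reports Done.
        updated = current is not None and current == desired and state == "Done"
        # spec.unschedulable is what shows up as SchedulingDisabled.
        cordoned = node.get("spec", {}).get("unschedulable", False)
        if is_ready(node) and cordoned and updated:
            stuck.append(meta.get("name"))
    return stuck
```

Feeding the parsed JSON to `find_stuck_nodes` would list candidates for `oc adm uncordon <node>`, the manual workaround used in this case.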

Version-Release number of selected component (if applicable):

4.7.1

How reproducible:

It occurred on two separate nodes during this cluster upgrade.

Expected results:

Worker node should be uncordoned after upgrading.

Additional info:

The associated must-gather (see first comment) was taken while the issue was affecting the first node (ip-10-0-135-168.ec2.internal).

The associated machine-config-daemon pod (machine-config-daemon-cbh6c) indicates that the node had upgraded.

Comment 3 Yang Yang 2021-09-06 06:19:56 UTC

Experiencing a similar issue when upgrading from 4.8.10 to 4.9.0-0.nightly-2021-09-04-210501. All the cluster operators (COs) have rolled out to 4.9, but one worker remains Ready/SchedulingDisabled.


09-05 12:05:27.339  NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
09-05 12:05:27.339  master   rendered-master-c9eba5d9e4b66c30f7696753a359098f   True      False      False      3              3                   3                     0                      3h18m
09-05 12:05:27.339  worker   rendered-worker-80ee580b99c8aafe0dd64537e4cc8fc6   False     True       False      3              0                   0                     0                      3h18m

09-05 12:05:27.340  Post action: #oc get node: NAME                                         STATUS                     ROLES    AGE     VERSION                INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
09-05 12:05:27.340  ip-10-0-140-222.us-east-2.compute.internal   Ready                      master   3h18m   v1.22.0-rc.0+f8f58dc   10.0.140.222   <none>        Red Hat Enterprise Linux CoreOS 49.84.202109040851-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.22.0-68.rhaos4.9.git011c10a.el8
09-05 12:05:27.340  ip-10-0-143-143.us-east-2.compute.internal   Ready                      worker   3h11m   v1.21.1+9807387        10.0.143.143   <none>        Red Hat Enterprise Linux CoreOS 48.84.202108301459-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-15.rhaos4.8.gitcdc4f56.el8
09-05 12:05:27.340  ip-10-0-175-63.us-east-2.compute.internal    Ready                      worker   3h11m   v1.21.1+9807387        10.0.175.63    <none>        Red Hat Enterprise Linux CoreOS 48.84.202108301459-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-15.rhaos4.8.gitcdc4f56.el8
09-05 12:05:27.340  ip-10-0-185-153.us-east-2.compute.internal   Ready                      master   3h14m   v1.22.0-rc.0+f8f58dc   10.0.185.153   <none>        Red Hat Enterprise Linux CoreOS 49.84.202109040851-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.22.0-68.rhaos4.9.git011c10a.el8
09-05 12:05:27.340  ip-10-0-201-87.us-east-2.compute.internal    Ready,SchedulingDisabled   worker   3h11m   v1.21.1+9807387        10.0.201.87    <none>        Red Hat Enterprise Linux CoreOS 48.84.202108301459-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.21.2-15.rhaos4.8.gitcdc4f56.el8
09-05 12:05:27.340  ip-10-0-223-164.us-east-2.compute.internal   Ready                      master   3h19m   v1.22.0-rc.0+f8f58dc   10.0.223.164   <none>        Red Hat Enterprise Linux CoreOS 49.84.202109040851-0 (Ootpa)   4.18.0-305.12.1.el8_4.x86_64   cri-o://1.22.0-68.rhaos4.9.git011c10a.el8
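
For longer logs of this shape, the cordoned rows can be picked out mechanically. The following is a rough sketch, not taken from the test tooling that produced the log. It assumes every line carries the `MM-DD HH:MM:SS.mmm` prefix seen above and that node rows follow the `NAME STATUS ROLES AGE VERSION ...` column order. The function name `cordoned_nodes` is hypothetical.

```python
import re

# Matches the "MM-DD HH:MM:SS.mmm" prefix that the log adds to every line.
TIMESTAMP = re.compile(r"^\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+\s+")


def cordoned_nodes(log_text):
    """Return (name, status, kubelet_version) for rows marked SchedulingDisabled."""
    rows = []
    for line in log_text.splitlines():
        fields = TIMESTAMP.sub("", line).split()
        # Node rows: NAME STATUS ROLES AGE VERSION ...
        # STATUS is a comma-separated list such as "Ready,SchedulingDisabled".
        if len(fields) >= 5 and "SchedulingDisabled" in fields[1].split(","):
            rows.append((fields[0], fields[1], fields[4]))
    return rows
```

Applied to the output above, this would report only ip-10-0-201-87.us-east-2.compute.internal, together with its kubelet version.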
