Bug 1792139

Summary: RHEL7 worker nodes may go to NotReady,SchedulingDisabled while upgrading from 4.2.12 to 4.3.0
Product: OpenShift Container Platform Reporter: Gaoyun Pei <gpei>
Component: Installer Assignee: Russell Teague <rteague>
Installer sub component: openshift-ansible QA Contact: Gaoyun Pei <gpei>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: amurdaca, aos-bugs, gpei, jialiu, jokerman, juzhao, kalexand, mifiedle, rphillips, scuppett, sdodson, sjenning, vigoyal, weinliu, wsun, xtian
Version: 4.3.0 Keywords: Regression, TestBlocker
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Machine config was not properly updated by MCO because package installs updated files on disk.
Consequence: MCO would not process config updates on RHEL nodes.
Fix: Added machine config apply back to upgrade steps and added proxy config for image pulls.
Result: Machine configs were properly applied after package updates during upgrade.
Story Points: ---
Clone Of: 1786274 Environment:
Last Closed: 2020-05-04 11:24:47 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1786274    
Bug Blocks:    

Comment 2 Scott Dodson 2020-01-30 20:02:43 UTC
https://github.com/openshift/openshift-ansible/pull/12069 fixed the application of machine config, so the workflow described in the docs should now succeed.

Comment 4 Gaoyun Pei 2020-02-07 12:25:17 UTC
While upgrading a 4.3.1 cluster (with RHCOS and RHEL workers) to 4.4, no RHEL worker got stuck at NotReady,SchedulingDisabled, unlike in the 4.2 to 4.3 upgrade.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-02-07-012035   True        False         4m27s   Cluster version is 4.4.0-0.nightly-2020-02-07-012035

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.0-0.nightly-2020-02-07-012035   True        False         False      3h55m
cloud-credential                           4.4.0-0.nightly-2020-02-07-012035   True        False         False      4h23m
cluster-autoscaler                         4.4.0-0.nightly-2020-02-07-012035   True        False         False      4h12m
console                                    4.4.0-0.nightly-2020-02-07-012035   True        False         False      33m
csi-snapshot-controller                    4.4.0-0.nightly-2020-02-07-012035   True        False         False      3m56s
dns                                        4.4.0-0.nightly-2020-02-07-012035   True        False         False      4h16m
etcd                                       4.4.0-0.nightly-2020-02-07-012035   True        False         False      19m
image-registry                             4.4.0-0.nightly-2020-02-07-012035   True        False         False      7m40s
ingress                                    4.4.0-0.nightly-2020-02-07-012035   True        False         False      3m44s
insights                                   4.4.0-0.nightly-2020-02-07-012035   True        False         False      4h18m
kube-apiserver                             4.4.0-0.nightly-2020-02-07-012035   True        False         False      4h15m
kube-controller-manager                    4.4.0-0.nightly-2020-02-07-012035   True        False         False      4h16m
kube-scheduler                             4.4.0-0.nightly-2020-02-07-012035   True        False         False      4h16m
kube-storage-version-migrator              4.4.0-0.nightly-2020-02-07-012035   True        False         False      7m46s
machine-api                                4.4.0-0.nightly-2020-02-07-012035   True        False         False      4h17m
machine-config                             4.4.0-0.nightly-2020-02-07-012035   True        False         False      21m
marketplace                                4.4.0-0.nightly-2020-02-07-012035   True        False         False      23m
monitoring                                 4.4.0-0.nightly-2020-02-07-012035   True        False         False      8m11s
network                                    4.4.0-0.nightly-2020-02-07-012035   True        False         False      4h18m
node-tuning                                4.4.0-0.nightly-2020-02-07-012035   True        False         False      3h27m
openshift-apiserver                        4.4.0-0.nightly-2020-02-07-012035   True        False         False      23m
openshift-controller-manager               4.4.0-0.nightly-2020-02-07-012035   True        False         False      4h16m
openshift-samples                          4.4.0-0.nightly-2020-02-07-012035   True        False         False      3h27m
operator-lifecycle-manager                 4.4.0-0.nightly-2020-02-07-012035   True        False         False      4h17m
operator-lifecycle-manager-catalog         4.4.0-0.nightly-2020-02-07-012035   True        False         False      4h17m
operator-lifecycle-manager-packageserver   4.4.0-0.nightly-2020-02-07-012035   True        False         False      33m
service-ca                                 4.4.0-0.nightly-2020-02-07-012035   True        False         False      4h18m
service-catalog-apiserver                  4.4.0-0.nightly-2020-02-07-012035   True        False         False      4h15m
service-catalog-controller-manager         4.4.0-0.nightly-2020-02-07-012035   True        False         False      4h15m
storage                                    4.4.0-0.nightly-2020-02-07-012035   True        False         False      3h27m
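
The check above was done by eye. A minimal Python sketch of the same check follows, assuming the plain-text output of `oc get co` is available as a string and that every column is populated (an empty VERSION column would shift the fields). The function name `unhealthy_operators` and the shortened `SAMPLE` table are illustrative only, not part of any OpenShift tooling.

```python
def unhealthy_operators(table_text):
    """Return names of operators that are not Available=True,
    Progressing=False, Degraded=False in `oc get co` text output."""
    bad = []
    lines = [ln for ln in table_text.splitlines() if ln.strip()]
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue  # not a complete operator row
        name, _version, available, progressing, degraded = fields[:5]
        if (available, progressing, degraded) != ("True", "False", "False"):
            bad.append(name)
    return bad


# Two rows copied from the output above, used as sample input.
SAMPLE = """\
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.4.0-0.nightly-2020-02-07-012035   True        False         False      3h55m
machine-config   4.4.0-0.nightly-2020-02-07-012035   True        False         False      21m
"""

print(unhealthy_operators(SAMPLE))
```

An empty list means every listed operator reported the expected state.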

# oc get clusterversion -o json|jq -r '.items[0].status.history[]|.startedTime + "|" + .completionTime + "|" + .state + "|" + .version'
2020-02-07T08:36:05Z|2020-02-07T12:09:47Z|Completed|4.4.0-0.nightly-2020-02-07-012035
2020-02-07T07:50:36Z|2020-02-07T08:18:35Z|Completed|4.3.1
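
To read upgrade durations out of the history lines above, something like the following Python sketch could be used. It assumes the same `startedTime|completionTime|state|version` layout produced by the jq filter, with an empty second field when an update has not completed yet. The function name `history_durations` is hypothetical.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def history_durations(lines):
    """Return (version, state, elapsed) tuples for each history line.

    elapsed is a timedelta, or None when completionTime is empty.
    """
    results = []
    for line in lines:
        started, completed, state, version = line.strip().split("|")
        if not completed:
            results.append((version, state, None))
            continue
        elapsed = (datetime.strptime(completed, TIME_FORMAT)
                   - datetime.strptime(started, TIME_FORMAT))
        results.append((version, state, elapsed))
    return results


# The two history entries from the output above.
HISTORY = [
    "2020-02-07T08:36:05Z|2020-02-07T12:09:47Z|Completed|4.4.0-0.nightly-2020-02-07-012035",
    "2020-02-07T07:50:36Z|2020-02-07T08:18:35Z|Completed|4.3.1",
]

for version, state, elapsed in history_durations(HISTORY):
    print(version, state, elapsed)
```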


After the cluster upgrade finished, the upgrade playbook was run against all the RHEL workers.
The whole cluster was upgraded to the expected state in the end.

# oc get node -o wide
NAME                                        STATUS   ROLES    AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
ip-10-0-49-70.us-east-2.compute.internal    Ready    worker   3h44m   v1.17.1   10.0.49.70    <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)                    3.10.0-1062.12.1.el7.x86_64   cri-o://1.17.0-0.4.rc1.rhaos4.4.git5842752.el7-rc1
ip-10-0-51-166.us-east-2.compute.internal   Ready    worker   4h11m   v1.17.1   10.0.51.166   <none>        Red Hat Enterprise Linux CoreOS 44.81.202002061902-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.16.2-15.dev.rhaos4.3.gita83f883.el8
ip-10-0-57-164.us-east-2.compute.internal   Ready    master   4h23m   v1.17.1   10.0.57.164   <none>        Red Hat Enterprise Linux CoreOS 44.81.202002061902-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.16.2-15.dev.rhaos4.3.gita83f883.el8
ip-10-0-57-166.us-east-2.compute.internal   Ready    master   4h23m   v1.17.1   10.0.57.166   <none>        Red Hat Enterprise Linux CoreOS 44.81.202002061902-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.16.2-15.dev.rhaos4.3.gita83f883.el8
ip-10-0-59-5.us-east-2.compute.internal     Ready    worker   3h44m   v1.17.1   10.0.59.5     <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)                    3.10.0-1062.12.1.el7.x86_64   cri-o://1.17.0-0.4.rc1.rhaos4.4.git5842752.el7-rc1
ip-10-0-67-153.us-east-2.compute.internal   Ready    master   4h23m   v1.17.1   10.0.67.153   <none>        Red Hat Enterprise Linux CoreOS 44.81.202002061902-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.16.2-15.dev.rhaos4.3.gita83f883.el8
ip-10-0-69-163.us-east-2.compute.internal   Ready    worker   4h10m   v1.17.1   10.0.69.163   <none>        Red Hat Enterprise Linux CoreOS 44.81.202002061902-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.16.2-15.dev.rhaos4.3.gita83f883.el8
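
The symptom this bug tracks is a node whose STATUS is anything other than plain Ready, for example NotReady,SchedulingDisabled. A minimal Python sketch for spotting such nodes in `oc get node` text output follows. It reads only the first two columns, so it should also work with `-o wide`. The function name `stuck_nodes` and the trimmed `SAMPLE` table are illustrative assumptions.

```python
def stuck_nodes(table_text):
    """Map node name -> list of status conditions for every node
    whose STATUS column is not exactly "Ready"."""
    stuck = {}
    rows = [ln for ln in table_text.splitlines() if ln.strip()]
    for row in rows[1:]:  # skip the header row
        fields = row.split()
        if len(fields) < 2:
            continue  # not a complete node row
        name, status = fields[0], fields[1]
        conditions = status.split(",")
        if conditions != ["Ready"]:
            stuck[name] = conditions
    return stuck


# Rows taken from the output above, trimmed to the default columns.
SAMPLE = """\
NAME                                        STATUS   ROLES    AGE     VERSION
ip-10-0-49-70.us-east-2.compute.internal    Ready    worker   3h44m   v1.17.1
ip-10-0-57-164.us-east-2.compute.internal   Ready    master   4h23m   v1.17.1
"""

print(stuck_nodes(SAMPLE))
```

An empty dictionary corresponds to the result reported here: no worker left at NotReady,SchedulingDisabled.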

Marking this bug as verified with openshift-ansible-4.4.0-202002070656.git.178.3e1c275.el7.noarch.rpm.

Comment 6 errata-xmlrpc 2020-05-04 11:24:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581