Bug 1793078

Summary: RHEL worker upgrade playbook leads to MCO being out of sync
Product: OpenShift Container Platform Reporter: Russell Teague <rteague>
Component: Installer Assignee: Russell Teague <rteague>
Installer sub component: openshift-ansible QA Contact: weiwei jiang <wjiang>
Status: CLOSED ERRATA Docs Contact:
Severity: unspecified    
Priority: unspecified CC: wjiang
Version: 4.4   
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Machine config was not properly updated by MCO because package installs updated files on disk.
Consequence: MCO would not process config updates on RHEL nodes.
Fix: Added machine config apply back to upgrade steps and added proxy config for image pulls.
Result: Machine configs were properly applied after package updates during upgrade.
Story Points: ---
Clone Of:
: 1793093 Environment:
Last Closed: 2020-05-04 11:25:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1793093    

Description Russell Teague 2020-01-20 16:03:15 UTC
Description of problem:
During a cluster upgrade, a RHEL node could go Not Ready due to an incompatibility between the kube config and the version of the kubelet running on the node.  Recovering from this requires upgrading the kubelet by running the RHEL upgrade playbooks.  The playbooks install new RPMs, which could modify files managed by the MCD and put the node in a Degraded state.

Version-Release number of the following components:
4.2 to 4.3

How reproducible:

Steps to Reproduce:
1. Install OCP 4.2
2. Upgrade cluster to 4.3
3. RHEL node is Not Ready
4. Upgrade RHEL nodes
5. MCO machine config rollout is blocked because on-disk files do not match the config

Actual results:
RHEL node Not Ready due to kube version skew:
hyperkube[2508]: F0117 14:48:12.003999    2508 server.go:206] unrecognized feature gate: LegacyNodeRoleBehavior
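The log line above uses the standard klog header (severity letter, month and day, time, PID, source file and line). The leading `F` marks a fatal entry, which fits the kubelet exiting and the node going Not Ready. A minimal sketch for pulling those fields out of such a line; the function name `parse_klog` and the regular expression are illustrative, not part of any OpenShift tooling:

```python
import re

# klog header layout: <severity><mmdd> <hh:mm:ss.uuuuuu> <pid> <file>:<line>] <message>
KLOG_RE = re.compile(
    r"(?P<severity>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<pid>\d+) "
    r"(?P<source>[^:\s]+:\d+)\] (?P<message>.*)"
)

SEVERITIES = {"I": "INFO", "W": "WARNING", "E": "ERROR", "F": "FATAL"}


def parse_klog(line):
    """Return the klog header fields as a dict, or None if no header is found."""
    match = KLOG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["severity"] = SEVERITIES[fields["severity"]]
    return fields


# The journal line quoted in this report.
line = ("hyperkube[2508]: F0117 14:48:12.003999    2508 server.go:206] "
        "unrecognized feature gate: LegacyNodeRoleBehavior")
entry = parse_klog(line)
print(entry["severity"], entry["source"], entry["message"])
```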

After RHEL upgrade, MCD reporting:
content mismatch for file /etc/containers/storage.conf
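The mismatch reported above can be pictured as a byte-for-byte comparison between the content the machine config expects and what is actually on disk. The following is only a sketch of that idea, not the MCD implementation: `find_content_mismatches`, the temporary file standing in for /etc/containers/storage.conf, and the file contents are all assumptions made for illustration.

```python
import os
import tempfile


def find_content_mismatches(expected_files):
    """Compare expected file contents (path -> bytes) with what is on disk."""
    problems = []
    for path, expected in sorted(expected_files.items()):
        try:
            with open(path, "rb") as handle:
                actual = handle.read()
        except FileNotFoundError:
            problems.append(f"missing file {path}")
            continue
        if actual != expected:
            problems.append(f"content mismatch for file {path}")
    return problems


with tempfile.TemporaryDirectory() as workdir:
    # Hypothetical stand-in for a config-managed file; contents are made up.
    path = os.path.join(workdir, "storage.conf")
    rendered = b'[storage]\ndriver = "overlay"\n'
    with open(path, "wb") as handle:
        handle.write(rendered)
    before = find_content_mismatches({path: rendered})

    # Simulate a package install rewriting the file behind the config's back.
    with open(path, "wb") as handle:
        handle.write(b'[storage]\ndriver = "overlay"\n# packaged default\n')
    after = find_content_mismatches({path: rendered})

print(before)
print(len(after))
```

In this picture, the fix described in the Doc Text (applying the machine config again after the package updates) corresponds to rewriting the file with the expected content so that the comparison passes again.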


Expected results:
Upgrade to complete successfully.

Comment 2 Johnny Liu 2020-01-21 03:07:54 UTC
@Russell, will this also fix BZ#1792139?

Comment 3 weiwei jiang 2020-01-21 08:38:47 UTC
Checked the 4.2 -> 4.4 upgrade path with a proxy on a cluster with restricted networking

verified version: openshift-ansible-4.4.0-202001201746.git.178.e31d324.el7.noarch.rpm

# before upgrade
$ oc get clusterversion                                                                                                                                                                                                                                                       
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS                                                                                                                                                                                          
version   4.2.0-0.nightly-2020-01-20-195638   True        False         115m    Cluster version is 4.2.0-0.nightly-2020-01-20-195638     

# after triggering `oc adm upgrade`
$ oc get nodes -o wide && oc get clusterversion  && oc get co
NAME                           STATUS                        ROLES    AGE     VERSION             INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
wj42bz-bgmlk-compute-0         Ready                         worker   4h6m    v1.17.1             10.0.98.208   <none>        Red Hat Enterprise Linux CoreOS 44.81.202001202331.0 (Ootpa)   4.18.0-147.3.1.el8_1.x86_64   cri-o://1.17.0-0.2.rc1.rhaos4.4.gitb89a5fc.el8-rc1
wj42bz-bgmlk-compute-1         Ready                         worker   4h7m    v1.14.6+97c81d00e   10.0.98.35    <none>        Red Hat Enterprise Linux CoreOS 42.81.20200114.0 (Ootpa)       4.18.0-147.3.1.el8_1.x86_64   cri-o://1.14.11-6.dev.rhaos4.2.git627b85c.el8        
wj42bz-bgmlk-control-plane-0   Ready                         master   4h19m   v1.17.1             10.0.98.128   <none>        Red Hat Enterprise Linux CoreOS 44.81.202001202331.0 (Ootpa)   4.18.0-147.3.1.el8_1.x86_64   cri-o://1.17.0-0.2.rc1.rhaos4.4.gitb89a5fc.el8-rc1
wj42bz-bgmlk-control-plane-1   Ready                         master   4h19m   v1.17.1             10.0.96.127   <none>        Red Hat Enterprise Linux CoreOS 44.81.202001202331.0 (Ootpa)   4.18.0-147.3.1.el8_1.x86_64   cri-o://1.17.0-0.2.rc1.rhaos4.4.gitb89a5fc.el8-rc1
wj42bz-bgmlk-control-plane-2   Ready                         master   4h19m   v1.17.1             10.0.96.156   <none>        Red Hat Enterprise Linux CoreOS 44.81.202001202331.0 (Ootpa)   4.18.0-147.3.1.el8_1.x86_64   cri-o://1.17.0-0.2.rc1.rhaos4.4.gitb89a5fc.el8-rc1   
wj42bz-bgmlk-rhel-0            NotReady,SchedulingDisabled   worker   114m    v1.14.6+c383847f6   10.0.96.188   <none>        Red Hat Enterprise Linux Server 7.7 (Maipo)                    3.10.0-1062.9.1.el7.x86_64    cri-o://1.14.11-9.dev.rhaos4.2.git983e00f.el7        
wj42bz-bgmlk-rhel-1            Ready                         worker   114m    v1.14.6+c383847f6   10.0.96.72    <none>        Red Hat Enterprise Linux Server 7.7 (Maipo)                    3.10.0-1062.9.1.el7.x86_64    cri-o://1.14.11-9.dev.rhaos4.2.git983e00f.el7        
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS                                                  
version   4.2.0-0.nightly-2020-01-20-195638   True        True          86m     Unable to apply 4.4.0-0.nightly-2020-01-21-012409: the cluster operator monitoring is degraded                                                                                                  
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE               
authentication                             4.4.0-0.nightly-2020-01-21-012409   True        False         False      3h57m                                                                                                                                                       
cloud-credential                           4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h17m
cluster-autoscaler                         4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h7m
console                                    4.4.0-0.nightly-2020-01-21-012409   True        False         False      44m
dns                                        4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h13m
image-registry                             4.4.0-0.nightly-2020-01-21-012409   True        False         False      3h59m
ingress                                    4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h4m
insights                                   4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h14m
kube-apiserver                             4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h11m
kube-controller-manager                    4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h9m
kube-scheduler                             4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h11m
kube-storage-version-migrator              4.4.0-0.nightly-2020-01-21-012409   True        False         False      76m
machine-api                                4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h17m
machine-config                             4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h12m
marketplace                                4.4.0-0.nightly-2020-01-21-012409   True        False         False      43m
monitoring                                 4.4.0-0.nightly-2020-01-21-012409   False       True          True       47m
network                                    4.4.0-0.nightly-2020-01-21-012409   True        True          True       4h12m
node-tuning                                4.4.0-0.nightly-2020-01-21-012409   True        False         False      76m
openshift-apiserver                        4.4.0-0.nightly-2020-01-21-012409   True        False         False      39m
openshift-controller-manager               4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h13m
openshift-samples                          4.4.0-0.nightly-2020-01-21-012409   True        False         False      66m
operator-lifecycle-manager                 4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h12m
operator-lifecycle-manager-catalog         4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h12m
operator-lifecycle-manager-packageserver   4.4.0-0.nightly-2020-01-21-012409   True        False         False      38m
service-ca                                 4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h14m
service-catalog-apiserver                  4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h10m
service-catalog-controller-manager         4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h10m
storage                                    4.4.0-0.nightly-2020-01-21-012409   True        False         False      76m
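The node listing above shows both symptoms at once: one RHEL worker is NotReady, and several nodes still run a v1.14 kubelet while the rest are on v1.17. A rough sketch of flagging such nodes from the text output follows. It assumes columns are separated by at least two spaces, as in the listing, and the helper names are hypothetical. The sample is a subset of the rows above, trimmed to the first five columns.

```python
import re


def parse_table(text):
    """Split column-aligned CLI output into a list of dicts keyed by header."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = re.split(r"\s{2,}", lines[0].strip())
    return [dict(zip(header, re.split(r"\s{2,}", line.strip())))
            for line in lines[1:]]


def kubelet_minor(version):
    """Turn 'v1.14.6+c383847f6' into the comparable tuple (1, 14)."""
    match = re.match(r"v(\d+)\.(\d+)", version)
    return int(match.group(1)), int(match.group(2))


def nodes_needing_attention(rows):
    """Map node name -> reasons it is NotReady or behind the newest kubelet."""
    newest = max(kubelet_minor(row["VERSION"]) for row in rows)
    flagged = {}
    for row in rows:
        reasons = []
        if "NotReady" in row["STATUS"].split(","):
            reasons.append("NotReady")
        if kubelet_minor(row["VERSION"]) < newest:
            reasons.append("kubelet older than newest node")
        if reasons:
            flagged[row["NAME"]] = reasons
    return flagged


SAMPLE = """\
NAME                           STATUS                        ROLES    AGE     VERSION
wj42bz-bgmlk-compute-0         Ready                         worker   4h6m    v1.17.1
wj42bz-bgmlk-compute-1         Ready                         worker   4h7m    v1.14.6+97c81d00e
wj42bz-bgmlk-rhel-0            NotReady,SchedulingDisabled   worker   114m    v1.14.6+c383847f6
wj42bz-bgmlk-rhel-1            Ready                         worker   114m    v1.14.6+c383847f6
"""

flagged = nodes_needing_attention(parse_table(SAMPLE))
for name, reasons in flagged.items():
    print(name, reasons)
```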


# then run the upgrade playbook for the cluster; after that the cluster is back in service
$ oc get nodes -o wide && oc get clusterversion  && oc get co
NAME                           STATUS   ROLES    AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
wj42bz-bgmlk-compute-0         Ready    worker   4h52m   v1.17.1   10.0.98.208   <none>        Red Hat Enterprise Linux CoreOS 44.81.202001202331.0 (Ootpa)   4.18.0-147.3.1.el8_1.x86_64   cri-o://1.17.0-0.2.rc1.rhaos4.4.gitb89a5fc.el8-rc1
wj42bz-bgmlk-compute-1         Ready    worker   4h53m   v1.17.1   10.0.98.35    <none>        Red Hat Enterprise Linux CoreOS 44.81.202001202331.0 (Ootpa)   4.18.0-147.3.1.el8_1.x86_64   cri-o://1.17.0-0.2.rc1.rhaos4.4.gitb89a5fc.el8-rc1
wj42bz-bgmlk-control-plane-0   Ready    master   5h5m    v1.17.1   10.0.98.128   <none>        Red Hat Enterprise Linux CoreOS 44.81.202001202331.0 (Ootpa)   4.18.0-147.3.1.el8_1.x86_64   cri-o://1.17.0-0.2.rc1.rhaos4.4.gitb89a5fc.el8-rc1
wj42bz-bgmlk-control-plane-1   Ready    master   5h5m    v1.17.1   10.0.96.127   <none>        Red Hat Enterprise Linux CoreOS 44.81.202001202331.0 (Ootpa)   4.18.0-147.3.1.el8_1.x86_64   cri-o://1.17.0-0.2.rc1.rhaos4.4.gitb89a5fc.el8-rc1
wj42bz-bgmlk-control-plane-2   Ready    master   5h5m    v1.17.1   10.0.96.156   <none>        Red Hat Enterprise Linux CoreOS 44.81.202001202331.0 (Ootpa)   4.18.0-147.3.1.el8_1.x86_64   cri-o://1.17.0-0.2.rc1.rhaos4.4.gitb89a5fc.el8-rc1
wj42bz-bgmlk-rhel-0            Ready    worker   160m    v1.17.1   10.0.96.188   <none>        Red Hat Enterprise Linux Server 7.7 (Maipo)                    3.10.0-1062.9.1.el7.x86_64    cri-o://1.17.0-0.3.rc1.rhaos4.4.gitb89a5fc.el7-rc1
wj42bz-bgmlk-rhel-1            Ready    worker   160m    v1.17.1   10.0.96.72    <none>        Red Hat Enterprise Linux Server 7.7 (Maipo)                    3.10.0-1062.9.1.el7.x86_64    cri-o://1.17.0-0.3.rc1.rhaos4.4.gitb89a5fc.el7-rc1
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-01-21-012409   True        False         39m     Cluster version is 4.4.0-0.nightly-2020-01-21-012409
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h43m
cloud-credential                           4.4.0-0.nightly-2020-01-21-012409   True        False         False      5h3m
cluster-autoscaler                         4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h52m
console                                    4.4.0-0.nightly-2020-01-21-012409   True        False         False      89m
dns                                        4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h59m
image-registry                             4.4.0-0.nightly-2020-01-21-012409   True        False         False      39m
ingress                                    4.4.0-0.nightly-2020-01-21-012409   True        False         False      39m
insights                                   4.4.0-0.nightly-2020-01-21-012409   True        False         False      5h
kube-apiserver                             4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h57m
kube-controller-manager                    4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h55m
kube-scheduler                             4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h57m
kube-storage-version-migrator              4.4.0-0.nightly-2020-01-21-012409   True        False         False      39m
machine-api                                4.4.0-0.nightly-2020-01-21-012409   True        False         False      5h3m
machine-config                             4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h58m
marketplace                                4.4.0-0.nightly-2020-01-21-012409   True        False         False      89m
monitoring                                 4.4.0-0.nightly-2020-01-21-012409   True        False         False      17m
network                                    4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h58m
node-tuning                                4.4.0-0.nightly-2020-01-21-012409   True        False         False      122m
openshift-apiserver                        4.4.0-0.nightly-2020-01-21-012409   True        False         False      85m
openshift-controller-manager               4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h59m
openshift-samples                          4.4.0-0.nightly-2020-01-21-012409   True        False         False      112m
operator-lifecycle-manager                 4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h58m
operator-lifecycle-manager-catalog         4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h58m
operator-lifecycle-manager-packageserver   4.4.0-0.nightly-2020-01-21-012409   True        False         False      84m
service-ca                                 4.4.0-0.nightly-2020-01-21-012409   True        False         False      5h
service-catalog-apiserver                  4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h55m
service-catalog-controller-manager         4.4.0-0.nightly-2020-01-21-012409   True        False         False      4h55m
storage                                    4.4.0-0.nightly-2020-01-21-012409   True        False         False      122m
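The verification above amounts to every cluster operator reporting AVAILABLE=True, PROGRESSING=False and DEGRADED=False. A small sketch of checking that from `oc get co` text output follows; the function name is hypothetical, and the sample rows are taken from the listing captured before the playbook run, where monitoring and network had not yet settled.

```python
def unsettled_operators(text):
    """Return operators that are not Available=True/Progressing=False/Degraded=False."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    unsettled = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        state = (row["AVAILABLE"], row["PROGRESSING"], row["DEGRADED"])
        if state != ("True", "False", "False"):
            unsettled.append(row["NAME"])
    return unsettled


# Rows copied from the `oc get co` output taken before the RHEL playbook run.
BEFORE_PLAYBOOK = """\
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.0-0.nightly-2020-01-21-012409   True        False         False      3h57m
monitoring                                 4.4.0-0.nightly-2020-01-21-012409   False       True          True       47m
network                                    4.4.0-0.nightly-2020-01-21-012409   True        True          True       4h12m
"""

pending = unsettled_operators(BEFORE_PLAYBOOK)
print(pending)
```

Run against the listing taken after the playbook, the same check would return an empty list.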

Comment 4 Johnny Liu 2020-01-21 09:24:50 UTC
Per comment 3, the user still needs to intervene by running the openshift-ansible RHEL worker upgrade during the `oc adm upgrade` process in order to complete the whole cluster upgrade. So I am clearing the needinfo flag from comment 2 and keeping BZ#1792139 to track future improvement of the upgrade process for mixed RHCOS + RHEL worker clusters.

Comment 6 errata-xmlrpc 2020-05-04 11:25:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581