Description of problem:

After creating an ImageContentSourcePolicy on a v4.7.8 cluster for a disconnected upgrade, a new machine config was created but failed to be applied; the machine-config cluster operator and one of the MCPs went DEGRADED.

# ./oc get co machine-config
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.7.8     False       False         True       25m

# ./oc get mcp worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-7c3a33c5cd425b9a3d272a984ff80457   False     True       True       5              0                   0                     1                      127m

Status:
  Conditions:
    Last Transition Time:  2021-04-22T04:37:42Z
    Message:               Cluster version is 4.7.8
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2021-04-22T06:02:09Z
    Message:               One or more machine config pool is degraded, please see `oc get mcp` for further details and resolve before upgrading
    Reason:                DegradedPool
    Status:                False
    Type:                  Upgradeable
    Last Transition Time:  2021-04-22T06:17:45Z
    Message:               Failed to resync 4.7.8 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool worker is not ready, retrying.
    Status: (pool degraded: true total: 5, ready 0, updated: 0, unavailable: 1)
    Reason:                RequiredPoolsFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-04-22T06:17:45Z
    Message:               Cluster not available for 4.7.8
    Status:                False
    Type:                  Available
  Extension:
    Master:  all 3 nodes are at latest configuration rendered-master-89aa86f3649a9d041f006636ad1549eb
    Worker:  pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node jiajliu221220-dbf5r-rhel-0 is reporting: \"error removing staged deployment: error running rpm-ostree cleanup -p: : exec: \\\"rpm-ostree\\\": executable file not found in $PATH: updating kernel on non-RHCOS nodes is not supported\""

# ./oc get node | grep rhel
jiajliu221220-dbf5r-rhel-0   Ready,SchedulingDisabled   worker   56m   v1.20.0+7d0a2b2
jiajliu221220-dbf5r-rhel-1   Ready                      worker   56m   v1.20.0+7d0a2b2

Version-Release number of selected component (if applicable):
v4.7.8

How reproducible:
always

Steps to Reproduce:
1. Perform a disconnected install of OCP v4.7 and scale up two RHEL worker nodes
2. Create an ImageContentSourcePolicy for the upgrade
3.

Actual results:
The RHEL worker is SchedulingDisabled

Expected results:
RHEL workers should not be affected

Additional info:
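For reference, the ImageContentSourcePolicy created in step 2 would look roughly like the sketch below. The mirror registry host and the metadata name are hypothetical placeholders, not the values used in this reproduction; creating any ICSP causes the MCO to render a new machine config (the generated registries config), which is what triggers the rollout to the worker pool.

```yaml
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: example-mirror            # hypothetical name
spec:
  repositoryDigestMirrors:
  - mirrors:
    - mirror.registry.example.com/ocp4/openshift4   # placeholder mirror registry
    source: quay.io/openshift-release-dev/ocp-release
  - mirrors:
    - mirror.registry.example.com/ocp4/openshift4
    source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
```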
This looks like a bug in our MCO code. Since the MCO doesn't support switching kernelType on RHEL nodes, instead of returning an error on RHEL nodes it should log a message and return nil. We will fix the problem and backport it to affected releases (I suspect backports down to 4.6 may be needed).
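A minimal sketch of the intended behavior, with hypothetical type and function names (this is not the actual MCO source): a kernelType change on a non-RHCOS node becomes a logged no-op instead of an error that degrades the pool.

```go
package main

import "fmt"

// node is a simplified stand-in for the information the MCD has
// about the host it runs on.
type node struct {
	name string
	os   string // "rhcos" or "rhel"
}

// updateKernelType mirrors the proposed fix: on non-RHCOS nodes,
// log and return nil rather than returning an error, since
// rpm-ostree (and thus kernel switching) only exists on RHCOS.
func updateKernelType(n node, kernelType string) error {
	if n.os != "rhcos" {
		fmt.Printf("node %s: kernelType %q is not supported on non-RHCOS nodes, skipping\n",
			n.name, kernelType)
		return nil // previously this path returned an error, degrading the pool
	}
	fmt.Printf("node %s: switching kernelType to %q via rpm-ostree\n", n.name, kernelType)
	return nil
}

func main() {
	for _, n := range []node{
		{name: "worker-rhcos-0", os: "rhcos"},
		{name: "worker-rhel-0", os: "rhel"},
	} {
		if err := updateKernelType(n, "realtime"); err != nil {
			fmt.Println("sync error:", err)
		}
	}
}
```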
Setting this as a blocker because this bug could affect upgrades when there are RHEL nodes in a cluster with the RT kernel applied to that pool.
Verified on 4.8.0-0.nightly-2021-05-13-104422. Created a cluster with two RHEL workers. Used an MC with extensions, kernel type, and kernel argument changes. RHCOS nodes updated successfully and RHEL nodes were unaffected. The MCP did not go degraded.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-13-104422   True        False         17h     Cluster version is 4.8.0-0.nightly-2021-05-13-104422

$ vi trifecta.yaml
$ cat trifecta.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: worker-extensions-usbguard
spec:
  config:
    ignition:
      version: 3.2.0
  extensions:
  - usbguard
  kernelType: realtime
  kernelArguments:
  - 'z=10'

$ oc create -f trifecta.yaml
machineconfig.machineconfiguration.openshift.io/worker-extensions-usbguard created

$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          ec3c68e3d9a795af38120abdbf20e592e5c463f8   3.2.0             17h
00-worker                                          ec3c68e3d9a795af38120abdbf20e592e5c463f8   3.2.0             17h
01-master-container-runtime                        ec3c68e3d9a795af38120abdbf20e592e5c463f8   3.2.0             17h
01-master-kubelet                                  ec3c68e3d9a795af38120abdbf20e592e5c463f8   3.2.0             17h
01-worker-container-runtime                        ec3c68e3d9a795af38120abdbf20e592e5c463f8   3.2.0             17h
01-worker-kubelet                                  ec3c68e3d9a795af38120abdbf20e592e5c463f8   3.2.0             17h
99-master-generated-registries                     ec3c68e3d9a795af38120abdbf20e592e5c463f8   3.2.0             17h
99-master-ssh                                                                                 3.2.0             17h
99-worker-generated-registries                     ec3c68e3d9a795af38120abdbf20e592e5c463f8   3.2.0             17h
99-worker-ssh                                                                                 3.2.0             17h
rendered-master-23ea8f9ea10dfbe0129dd65c8034521e   ec3c68e3d9a795af38120abdbf20e592e5c463f8   3.2.0             17h
rendered-worker-9fc1bd32db6e9e11d0a192e132cbab5e   ec3c68e3d9a795af38120abdbf20e592e5c463f8   3.2.0             17h
worker-extensions-usbguard                                                                    3.2.0             3s

$ oc get mcp/worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-9fc1bd32db6e9e11d0a192e132cbab5e   False     True       False      4              0                   0                      0                      17h
$ oc get nodes
NAME                                        STATUS                     ROLES    AGE   VERSION
ip-10-0-61-192.us-east-2.compute.internal   Ready                      master   17h   v1.21.0-rc.0+41625cd
ip-10-0-61-194.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   17h   v1.21.0-rc.0+41625cd
ip-10-0-62-147.us-east-2.compute.internal   Ready                      master   17h   v1.21.0-rc.0+41625cd
ip-10-0-62-189.us-east-2.compute.internal   Ready                      worker   14m   v1.21.0-rc.0+6998007
ip-10-0-63-9.us-east-2.compute.internal     Ready                      worker   14m   v1.21.0-rc.0+6998007
ip-10-0-76-163.us-east-2.compute.internal   Ready                      master   17h   v1.21.0-rc.0+41625cd
ip-10-0-79-153.us-east-2.compute.internal   Ready                      worker   17h   v1.21.0-rc.0+41625cd

$ oc get mcp/worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-a19afb0b045cb8fef7141d0ac31b684a   True      False      False      4              4                   4                      0                      17h

$ oc get pods -A --field-selector spec.nodeName=ip-10-0-61-194.us-east-2.compute.internal | grep machine-config-daemon
openshift-machine-config-operator   machine-config-daemon-55bg8   2/2   Running   2   17h

$ oc debug node/ip-10-0-61-194.us-east-2.compute.internal
Starting pod/ip-10-0-61-194us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm -qa | grep kernel
kernel-rt-kvm-4.18.0-293.rt7.59.el8.x86_64
kernel-rt-core-4.18.0-293.rt7.59.el8.x86_64
kernel-rt-modules-extra-4.18.0-293.rt7.59.el8.x86_64
kernel-rt-modules-4.18.0-293.rt7.59.el8.x86_64
sh-4.4# uname -a
Linux ip-10-0-61-194 4.18.0-293.rt7.59.el8.x86_64 #1 SMP PREEMPT_RT Mon Mar 1 15:40:34 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
sh-4.4# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-f36c0a3f3d7ae96ca5ab98a46baaf6b02be0c41793fb2e811233282062cf345d/vmlinuz-4.18.0-293.rt7.59.el8.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ostree=/ostree/boot.1/rhcos/f36c0a3f3d7ae96ca5ab98a46baaf6b02be0c41793fb2e811233282062cf345d/0 ignition.platform.id=aws root=UUID=ce9f3fc4-2602-4671-b333-75b3f910271b rw rootflags=prjquota z=10
sh-4.4# exit
exit
sh-4.2# exit
exit
Removing debug pod ...

$ oc debug node/ip-10-0-63-9.us-east-2.compute.internal
Starting pod/ip-10-0-63-9us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.2# rpm -qa | grep kernel
kernel-tools-3.10.0-1127.el7.x86_64
kernel-tools-libs-3.10.0-1127.el7.x86_64
kernel-3.10.0-1160.25.1.el7.x86_64
kernel-3.10.0-1127.el7.x86_64
sh-4.2# uname -a
Linux ip-10-0-63-9.us-east-2.compute.internal 3.10.0-1160.25.1.el7.x86_64 #1 SMP Tue Apr 13 18:55:45 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
sh-4.2# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-3.10.0-1160.25.1.el7.x86_64 root=UUID=5a000634-a1fc-467d-8ef4-5fcf5dbc6033 ro console=ttyS0,115200n8 console=tty0 net.ifnames=0 rd.blacklist=nouveau nvme_core.io_timeout=4294967295 crashkernel=auto LANG=en_US.UTF-8
sh-4.2# exit
exit
sh-4.2# exit
exit
Removing debug pod ...
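The per-node check above boils down to inspecting the kernel release string: the realtime kernel carries an `.rt` marker (e.g. `4.18.0-293.rt7.59.el8`), while the RHEL worker stays on the standard kernel. A small hypothetical helper (not part of any OpenShift tooling) that classifies a `uname -r` string:

```shell
#!/bin/sh
# Classify a kernel release string (as printed by `uname -r`)
# as realtime or standard, based on the ".rt" marker used by
# kernel-rt packages. Sketch only; not an official tool.
is_rt_kernel() {
  case "$1" in
    *.rt*) echo realtime ;;
    *)     echo standard ;;
  esac
}

is_rt_kernel "4.18.0-293.rt7.59.el8.x86_64"   # realtime (the RHCOS node above)
is_rt_kernel "3.10.0-1160.25.1.el7.x86_64"    # standard (the RHEL node above)
```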
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438