Bug 1952368
Summary: worker pool went degraded due to no rpm-ostree on RHEL worker while applying new MC

Product: OpenShift Container Platform
Component: Machine Config Operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: liujia <jiajliu>
Assignee: Sinny Kumari <skumari>
QA Contact: Michael Nguyen <mnguyen>
CC: skumari
Target Milestone: ---
Target Release: 4.8.0
Type: Bug
Regression: ---
Clones: 1953493
Bug Blocks: 1953475
Last Closed: 2021-07-27 23:02:52 UTC

Doc Type: Bug Fix
Doc Text:
Cause: rpm-ostree-related operations were not handled properly on non-CoreOS nodes such as RHEL.
Consequence: As a result, RHEL nodes went degraded when an operation such as switching the kernel was applied to a pool containing RHEL nodes.
Fix: The Machine Config Daemon now logs a message whenever an unsupported operation is performed on a non-CoreOS node such as RHEL, and then returns nil instead of an error.
Result: RHEL nodes in the pool proceed as expected when an unsupported operation, such as switching the kernel, is performed via MachineConfig.
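The fix described in the Doc Text (skip unsupported rpm-ostree operations on non-CoreOS nodes with a log message and a nil error, rather than degrading the node) can be sketched in Go. This is a minimal illustration of the pattern only; `node`, `isCoreOS`, and `applyKernelType` are hypothetical stand-ins, not the actual Machine Config Daemon API.

```go
package main

import "fmt"

// node is a simplified stand-in for the daemon's view of a host.
type node struct {
	osID       string // e.g. "rhcos" or "rhel" (illustrative values)
	kernelType string
}

// isCoreOS reports whether the node runs an rpm-ostree based OS.
func isCoreOS(n *node) bool {
	return n.osID == "rhcos" || n.osID == "fcos"
}

// applyKernelType mimics the fixed behavior: on non-CoreOS nodes the
// unsupported operation is logged and skipped with a nil error, instead
// of returning an error that would degrade the pool.
func applyKernelType(n *node, kernelType string) error {
	if !isCoreOS(n) {
		fmt.Printf("updating kernelType is not supported on %s nodes; skipping\n", n.osID)
		return nil // pre-fix code returned an error here
	}
	n.kernelType = kernelType
	return nil
}

func main() {
	rhel := &node{osID: "rhel"}
	rhcos := &node{osID: "rhcos"}
	fmt.Println(applyKernelType(rhel, "realtime"))  // <nil>: skipped, pool not degraded
	fmt.Println(applyKernelType(rhcos, "realtime")) // <nil>: applied
	fmt.Println(rhcos.kernelType)                   // realtime
}
```

The key design point is that "unsupported" is treated as a no-op rather than a failure, so mixed RHCOS/RHEL pools can roll out a kernel-type change without the RHEL members blocking the pool.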
Description (liujia, 2021-04-22 07:20:07 UTC)
This looks like a bug in our MCO code. Since MCO doesn't support switching kernelType on RHEL nodes, it should log a message and return nil on those nodes instead of returning an error. We will fix the problem and backport it to affected releases (I suspect a backport down to 4.6 may be needed). Setting this as a blocker because this bug could affect upgrades when there are RHEL nodes in a cluster with the RT kernel applied to that pool.

Verified on 4.8.0-0.nightly-2021-05-13-104422. Created a cluster with two RHEL workers and applied an MC with extensions, kernel type, and kernel argument changes. The RHCOS nodes updated successfully, the RHEL nodes were unaffected, and the MCP did not go degraded.

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-13-104422   True        False         17h     Cluster version is 4.8.0-0.nightly-2021-05-13-104422

$ vi trifecta.yaml
$ cat trifecta.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: worker-extensions-usbguard
spec:
  config:
    ignition:
      version: 3.2.0
  extensions:
    - usbguard
  kernelType: realtime
  kernelArguments:
    - 'z=10'

$ oc create -f trifecta.yaml
machineconfig.machineconfiguration.openshift.io/worker-extensions-usbguard created

$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          ec3c68e3d9a795af38120abdbf20e592e5c463f8   3.2.0             17h
00-worker                                          ec3c68e3d9a795af38120abdbf20e592e5c463f8   3.2.0             17h
01-master-container-runtime                        ec3c68e3d9a795af38120abdbf20e592e5c463f8   3.2.0             17h
01-master-kubelet                                  ec3c68e3d9a795af38120abdbf20e592e5c463f8   3.2.0             17h
01-worker-container-runtime                        ec3c68e3d9a795af38120abdbf20e592e5c463f8   3.2.0             17h
01-worker-kubelet                                  ec3c68e3d9a795af38120abdbf20e592e5c463f8   3.2.0             17h
99-master-generated-registries                     ec3c68e3d9a795af38120abdbf20e592e5c463f8   3.2.0             17h
99-master-ssh                                                                                 3.2.0             17h
99-worker-generated-registries                     ec3c68e3d9a795af38120abdbf20e592e5c463f8   3.2.0             17h
99-worker-ssh                                                                                 3.2.0             17h
rendered-master-23ea8f9ea10dfbe0129dd65c8034521e   ec3c68e3d9a795af38120abdbf20e592e5c463f8   3.2.0             17h
rendered-worker-9fc1bd32db6e9e11d0a192e132cbab5e   ec3c68e3d9a795af38120abdbf20e592e5c463f8   3.2.0             17h
worker-extensions-usbguard                                                                    3.2.0             3s

$ oc get mcp/worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-9fc1bd32db6e9e11d0a192e132cbab5e   False     True       False      4              0                   0                     0                      17h

$ oc get nodes
NAME                                        STATUS                     ROLES    AGE   VERSION
ip-10-0-61-192.us-east-2.compute.internal   Ready                      master   17h   v1.21.0-rc.0+41625cd
ip-10-0-61-194.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   17h   v1.21.0-rc.0+41625cd
ip-10-0-62-147.us-east-2.compute.internal   Ready                      master   17h   v1.21.0-rc.0+41625cd
ip-10-0-62-189.us-east-2.compute.internal   Ready                      worker   14m   v1.21.0-rc.0+6998007
ip-10-0-63-9.us-east-2.compute.internal     Ready                      worker   14m   v1.21.0-rc.0+6998007
ip-10-0-76-163.us-east-2.compute.internal   Ready                      master   17h   v1.21.0-rc.0+41625cd
ip-10-0-79-153.us-east-2.compute.internal   Ready                      worker   17h   v1.21.0-rc.0+41625cd

$ oc get mcp/worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-a19afb0b045cb8fef7141d0ac31b684a   True      False      False      4              4                   4                     0                      17h

$ oc get pods -A --field-selector spec.nodeName=ip-10-0-61-194.us-east-2.compute.internal | grep machine-config-daemon
openshift-machine-config-operator   machine-config-daemon-55bg8   2/2   Running   2   17h

$ oc debug node/ip-10-0-61-194.us-east-2.compute.internal
Starting pod/ip-10-0-61-194us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm -qa | grep kernel
kernel-rt-kvm-4.18.0-293.rt7.59.el8.x86_64
kernel-rt-core-4.18.0-293.rt7.59.el8.x86_64
kernel-rt-modules-extra-4.18.0-293.rt7.59.el8.x86_64
kernel-rt-modules-4.18.0-293.rt7.59.el8.x86_64
sh-4.4# uname -a
Linux ip-10-0-61-194 4.18.0-293.rt7.59.el8.x86_64 #1 SMP PREEMPT_RT Mon Mar 1 15:40:34 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
sh-4.4# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-f36c0a3f3d7ae96ca5ab98a46baaf6b02be0c41793fb2e811233282062cf345d/vmlinuz-4.18.0-293.rt7.59.el8.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ostree=/ostree/boot.1/rhcos/f36c0a3f3d7ae96ca5ab98a46baaf6b02be0c41793fb2e811233282062cf345d/0 ignition.platform.id=aws root=UUID=ce9f3fc4-2602-4671-b333-75b3f910271b rw rootflags=prjquota z=10
sh-4.4# exit
exit
sh-4.2# exit
exit
Removing debug pod ...

$ oc debug node/ip-10-0-63-9.us-east-2.compute.internal
Starting pod/ip-10-0-63-9us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.2# rpm -qa | grep kernel
kernel-tools-3.10.0-1127.el7.x86_64
kernel-tools-libs-3.10.0-1127.el7.x86_64
kernel-3.10.0-1160.25.1.el7.x86_64
kernel-3.10.0-1127.el7.x86_64
sh-4.2# uname -a
Linux ip-10-0-63-9.us-east-2.compute.internal 3.10.0-1160.25.1.el7.x86_64 #1 SMP Tue Apr 13 18:55:45 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
sh-4.2# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-3.10.0-1160.25.1.el7.x86_64 root=UUID=5a000634-a1fc-467d-8ef4-5fcf5dbc6033 ro console=ttyS0,115200n8 console=tty0 net.ifnames=0 rd.blacklist=nouveau nvme_core.io_timeout=4294967295 crashkernel=auto LANG=en_US.UTF-8
sh-4.2# exit
exit
sh-4.2# exit
exit
Removing debug pod ...
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438