Bug 1952368
| Summary: | worker pool went degraded due to no rpm-ostree on rhel worker during applying new mc |
|---|---|
| Product: | OpenShift Container Platform |
| Reporter: | liujia <jiajliu> |
| Component: | Machine Config Operator |
| Assignee: | Sinny Kumari <skumari> |
| Status: | CLOSED ERRATA |
| QA Contact: | Michael Nguyen <mnguyen> |
| Severity: | high |
| Priority: | high |
| Docs Contact: | |
| Version: | 4.7 |
| CC: | skumari |
| Target Milestone: | --- |
| Target Release: | 4.8.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | Bug Fix |
| Story Points: | --- |
| Clone Of: | |
| : | 1953493 (view as bug list) |
| Environment: | |
| Last Closed: | 2021-07-27 23:02:52 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Bug Depends On: | |
| Bug Blocks: | 1953475 |

Doc Text:
Cause: rpm-ostree-related operations were not handled properly on non-CoreOS nodes such as RHEL.
Consequence: RHEL nodes went degraded when an operation such as switching the kernel type was applied to a pool containing RHEL nodes.
Fix: The Machine Config Daemon now logs a message whenever an unsupported operation is requested on a non-CoreOS node such as RHEL, and then returns nil instead of an error.
Result: RHEL nodes in the pool proceed as expected when an unsupported operation, such as switching the kernel type, is performed via MachineConfig.
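
To illustrate the behaviour described in the Doc Text, here is a minimal, self-contained Go sketch of the kind of guard the fix adds. The MCO is written in Go, but the names below (checkKernelType, OperatingSystem) are hypothetical illustrations, not the actual MCO code.

package main

import (
	"fmt"
	"log"
)

// OperatingSystem is a stand-in for however the daemon identifies the node OS.
type OperatingSystem int

const (
	OSRHCOS OperatingSystem = iota // CoreOS-based node, managed via rpm-ostree
	OSRHEL                         // traditional RHEL worker, no rpm-ostree available
)

// checkKernelType sketches the fixed behaviour: on non-CoreOS nodes an
// unsupported operation such as switching kernelType is logged and skipped
// (nil is returned) instead of surfacing an error that degrades the pool.
func checkKernelType(nodeOS OperatingSystem, kernelType string) error {
	if nodeOS != OSRHCOS {
		if kernelType != "" && kernelType != "default" {
			log.Printf("updating kernelType %q is not supported on non-CoreOS nodes; ignoring", kernelType)
		}
		return nil // previously this path returned an error and degraded the node
	}
	// On RHCOS the daemon would go on to run the rpm-ostree operations here.
	return fmt.Errorf("rpm-ostree handling for kernelType %q is not shown in this sketch", kernelType)
}

func main() {
	// A RHEL worker asked to switch to the realtime kernel: logged and ignored.
	if err := checkKernelType(OSRHEL, "realtime"); err != nil {
		log.Fatal(err)
	}
	fmt.Println("RHEL node proceeds without degrading the pool")
}
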
This looks like a bug in our MCO code. Since MCO doesn't support switching kernelType on RHEL nodes, instead of returning an error on RHEL nodes it should log a message and return nil. We will fix the problem and backport it to affected releases (I suspect backports down to 4.6 may be needed). Setting this as a blocker because this bug could affect upgrades when there are RHEL nodes in a cluster with the RT kernel applied to that pool.

Verified on 4.8.0-0.nightly-2021-05-13-104422. Created a cluster with two RHEL workers. Used an MC with extensions, kernel type, and kernel argument changes. RHCOS nodes updated successfully and RHEL nodes were unaffected. The MCP did not go degraded.
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.8.0-0.nightly-2021-05-13-104422 True False 17h Cluster version is 4.8.0-0.nightly-2021-05-13-104422
$ vi trifecta.yaml
$ cat trifecta.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: worker-extensions-usbguard
spec:
  config:
    ignition:
      version: 3.2.0
  extensions:
    - usbguard
  kernelType: realtime
  kernelArguments:
    - 'z=10'
$ oc create -f trifecta.yaml
machineconfig.machineconfiguration.openshift.io/worker-extensions-usbguard created
$ oc get mc
NAME GENERATEDBYCONTROLLER IGNITIONVERSION AGE
00-master ec3c68e3d9a795af38120abdbf20e592e5c463f8 3.2.0 17h
00-worker ec3c68e3d9a795af38120abdbf20e592e5c463f8 3.2.0 17h
01-master-container-runtime ec3c68e3d9a795af38120abdbf20e592e5c463f8 3.2.0 17h
01-master-kubelet ec3c68e3d9a795af38120abdbf20e592e5c463f8 3.2.0 17h
01-worker-container-runtime ec3c68e3d9a795af38120abdbf20e592e5c463f8 3.2.0 17h
01-worker-kubelet ec3c68e3d9a795af38120abdbf20e592e5c463f8 3.2.0 17h
99-master-generated-registries ec3c68e3d9a795af38120abdbf20e592e5c463f8 3.2.0 17h
99-master-ssh 3.2.0 17h
99-worker-generated-registries ec3c68e3d9a795af38120abdbf20e592e5c463f8 3.2.0 17h
99-worker-ssh 3.2.0 17h
rendered-master-23ea8f9ea10dfbe0129dd65c8034521e ec3c68e3d9a795af38120abdbf20e592e5c463f8 3.2.0 17h
rendered-worker-9fc1bd32db6e9e11d0a192e132cbab5e ec3c68e3d9a795af38120abdbf20e592e5c463f8 3.2.0 17h
worker-extensions-usbguard 3.2.0 3s
$ oc get mcp/worker
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
worker rendered-worker-9fc1bd32db6e9e11d0a192e132cbab5e False True False 4 0 0 0 17h
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-61-192.us-east-2.compute.internal Ready master 17h v1.21.0-rc.0+41625cd
ip-10-0-61-194.us-east-2.compute.internal Ready,SchedulingDisabled worker 17h v1.21.0-rc.0+41625cd
ip-10-0-62-147.us-east-2.compute.internal Ready master 17h v1.21.0-rc.0+41625cd
ip-10-0-62-189.us-east-2.compute.internal Ready worker 14m v1.21.0-rc.0+6998007
ip-10-0-63-9.us-east-2.compute.internal Ready worker 14m v1.21.0-rc.0+6998007
ip-10-0-76-163.us-east-2.compute.internal Ready master 17h v1.21.0-rc.0+41625cd
ip-10-0-79-153.us-east-2.compute.internal Ready worker 17h v1.21.0-rc.0+41625cd
$ oc get mcp/worker
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
worker rendered-worker-a19afb0b045cb8fef7141d0ac31b684a True False False 4 4 4 0 17h
$ oc get pods -A --field-selector spec.nodeName=ip-10-0-61-194.us-east-2.compute.internal | grep machine-config-daemon
openshift-machine-config-operator machine-config-daemon-55bg8 2/2 Running 2 17h
$ oc debug node/ip-10-0-61-194.us-east-2.compute.internal
Starting pod/ip-10-0-61-194us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm -qa | grep kernel
kernel-rt-kvm-4.18.0-293.rt7.59.el8.x86_64
kernel-rt-core-4.18.0-293.rt7.59.el8.x86_64
kernel-rt-modules-extra-4.18.0-293.rt7.59.el8.x86_64
kernel-rt-modules-4.18.0-293.rt7.59.el8.x86_64
sh-4.4# uname -a
Linux ip-10-0-61-194 4.18.0-293.rt7.59.el8.x86_64 #1 SMP PREEMPT_RT Mon Mar 1 15:40:34 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
sh-4.4# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-f36c0a3f3d7ae96ca5ab98a46baaf6b02be0c41793fb2e811233282062cf345d/vmlinuz-4.18.0-293.rt7.59.el8.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ostree=/ostree/boot.1/rhcos/f36c0a3f3d7ae96ca5ab98a46baaf6b02be0c41793fb2e811233282062cf345d/0 ignition.platform.id=aws root=UUID=ce9f3fc4-2602-4671-b333-75b3f910271b rw rootflags=prjquota z=10
sh-4.4# exit
exit
sh-4.2# exit
exit
Removing debug pod ...
$ oc debug node/ip-10-0-63-9.us-east-2.compute.internal
Starting pod/ip-10-0-63-9us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.2# rpm -qa | grep kernel
kernel-tools-3.10.0-1127.el7.x86_64
kernel-tools-libs-3.10.0-1127.el7.x86_64
kernel-3.10.0-1160.25.1.el7.x86_64
kernel-3.10.0-1127.el7.x86_64
sh-4.2# uname -a
Linux ip-10-0-63-9.us-east-2.compute.internal 3.10.0-1160.25.1.el7.x86_64 #1 SMP Tue Apr 13 18:55:45 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
sh-4.2# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-3.10.0-1160.25.1.el7.x86_64 root=UUID=5a000634-a1fc-467d-8ef4-5fcf5dbc6033 ro console=ttyS0,115200n8 console=tty0 net.ifnames=0 rd.blacklist=nouveau nvme_core.io_timeout=4294967295 crashkernel=auto LANG=en_US.UTF-8
sh-4.2# exit
exit
sh-4.2# exit
exit
Removing debug pod ...
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
Description of problem:
After creating an ImageContentSourcePolicy on a v4.7.8 cluster for a disconnected upgrade, new machine configs were created but failed to be applied; the MCO and one of the MCPs went DEGRADED.

# ./oc get co machine-config
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.7.8     False       False         True       25m

# ./oc get mcp worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-7c3a33c5cd425b9a3d272a984ff80457   False     True       True       5              0                   0                     1                      127m

Status:
  Conditions:
    Last Transition Time:  2021-04-22T04:37:42Z
    Message:               Cluster version is 4.7.8
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2021-04-22T06:02:09Z
    Message:               One or more machine config pool is degraded, please see `oc get mcp` for further details and resolve before upgrading
    Reason:                DegradedPool
    Status:                False
    Type:                  Upgradeable
    Last Transition Time:  2021-04-22T06:17:45Z
    Message:               Failed to resync 4.7.8 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool worker is not ready, retrying. Status: (pool degraded: true total: 5, ready 0, updated: 0, unavailable: 1)
    Reason:                RequiredPoolsFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-04-22T06:17:45Z
    Message:               Cluster not available for 4.7.8
    Status:                False
    Type:                  Available
  Extension:
    Master: all 3 nodes are at latest configuration rendered-master-89aa86f3649a9d041f006636ad1549eb
    Worker: pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node jiajliu221220-dbf5r-rhel-0 is reporting: \"error removing staged deployment: error running rpm-ostree cleanup -p: : exec: \\\"rpm-ostree\\\": executable file not found in $PATH: updating kernel on non-RHCOS nodes is not supported\""

# ./oc get node|grep rhel
jiajliu221220-dbf5r-rhel-0   Ready,SchedulingDisabled   worker   56m   v1.20.0+7d0a2b2
jiajliu221220-dbf5r-rhel-1   Ready                      worker   56m   v1.20.0+7d0a2b2

Version-Release number of selected component (if applicable):
v4.7.8

How reproducible:
always

Steps to Reproduce:
1. Do a disconnected install of OCP v4.7 and scale up two RHEL worker nodes
2. Create an ImageContentSourcePolicy for the upgrade
3.

Actual results:
The RHEL worker is SchedulingDisabled.

Expected results:
The RHEL workers should not be affected.

Additional info:
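For illustration of step 2 above, an ImageContentSourcePolicy for a disconnected upgrade has roughly the following shape; the mirror registry hostname below is a placeholder, not the value used in this cluster.

apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: example-mirror          # placeholder name
spec:
  repositoryDigestMirrors:
    - source: quay.io/openshift-release-dev/ocp-release
      mirrors:
        - mirror.registry.example.com/ocp4/openshift4   # placeholder mirror registry
    - source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
      mirrors:
        - mirror.registry.example.com/ocp4/openshift4   # placeholder mirror registry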