Bug 1952368
| Summary: | worker pool went degraded due to no rpm-ostree on rhel worker during applying new mc |
|---|---|
| Product: | OpenShift Container Platform |
| Reporter: | liujia <jiajliu> |
| Component: | Machine Config Operator |
| Assignee: | Sinny Kumari <skumari> |
| Status: | CLOSED ERRATA |
| QA Contact: | Michael Nguyen <mnguyen> |
| Severity: | high |
| Priority: | high |
| Docs Contact: | |
| Version: | 4.7 |
| CC: | skumari |
| Target Milestone: | --- |
| Target Release: | 4.8.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | Bug Fix |
| Story Points: | --- |
| Clone Of: | |
| : | 1953493 (view as bug list) |
| Environment: | |
| Last Closed: | 2021-07-27 23:02:52 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Bug Depends On: | |
| Bug Blocks: | 1953475 |

Doc Text:
Cause: rpm-ostree-related operations were not handled properly on non-CoreOS nodes such as RHEL.
Consequence: RHEL nodes went degraded when an operation such as switching the kernel type was applied to a pool containing RHEL nodes.
Fix: The Machine Config Daemon now logs a message whenever an unsupported operation is requested on a non-CoreOS node such as RHEL, and then returns nil instead of an error.
Result: RHEL nodes in the pool proceed as expected when an unsupported operation, such as switching the kernel type, is performed via MachineConfig.
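
To illustrate the behaviour described in the Doc Text, here is a minimal, self-contained Go sketch of the kind of guard the fix adds. The MCO is written in Go, but the names below (checkKernelType, OperatingSystem) are hypothetical illustrations, not the actual MCO code.

package main

import (
	"fmt"
	"log"
)

// OperatingSystem is a stand-in for however the daemon identifies the node OS.
type OperatingSystem int

const (
	OSRHCOS OperatingSystem = iota // CoreOS-based node, managed via rpm-ostree
	OSRHEL                         // traditional RHEL worker, no rpm-ostree available
)

// checkKernelType sketches the fixed behaviour: on non-CoreOS nodes an
// unsupported operation such as switching kernelType is logged and skipped
// (nil is returned) instead of surfacing an error that degrades the pool.
func checkKernelType(nodeOS OperatingSystem, kernelType string) error {
	if nodeOS != OSRHCOS {
		if kernelType != "" && kernelType != "default" {
			log.Printf("updating kernelType %q is not supported on non-CoreOS nodes; ignoring", kernelType)
		}
		return nil // previously this path returned an error and degraded the node
	}
	// On RHCOS the daemon would go on to run the rpm-ostree operations here.
	return fmt.Errorf("rpm-ostree handling for kernelType %q is not shown in this sketch", kernelType)
}

func main() {
	// A RHEL worker asked to switch to the realtime kernel: logged and ignored.
	if err := checkKernelType(OSRHEL, "realtime"); err != nil {
		log.Fatal(err)
	}
	fmt.Println("RHEL node proceeds without degrading the pool")
}
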
This looks like a bug in our MCO code. Since MCO doesn't support switching kernelType on RHEL nodes, instead of returning an error on RHEL nodes it should log a message and return nil. We will fix the problem and backport it to affected releases (I suspect backports down to 4.6 may be needed). Setting this as a blocker because this bug could affect upgrades when there are RHEL nodes in a cluster with the RT kernel applied to that pool.

Verified on 4.8.0-0.nightly-2021-05-13-104422. Created a cluster with two RHEL workers. Used an MC with extensions, kernel type, and kernel argument changes. RHCOS nodes updated successfully and RHEL nodes were unaffected. The MCP did not go degraded.
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.8.0-0.nightly-2021-05-13-104422 True False 17h Cluster version is 4.8.0-0.nightly-2021-05-13-104422
$ vi trifecta.yaml
$ cat trifecta.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: worker-extensions-usbguard
spec:
  config:
    ignition:
      version: 3.2.0
  extensions:
    - usbguard
  kernelType: realtime
  kernelArguments:
    - 'z=10'
$ oc create -f trifecta.yaml
machineconfig.machineconfiguration.openshift.io/worker-extensions-usbguard created
$ oc get mc
NAME GENERATEDBYCONTROLLER IGNITIONVERSION AGE
00-master ec3c68e3d9a795af38120abdbf20e592e5c463f8 3.2.0 17h
00-worker ec3c68e3d9a795af38120abdbf20e592e5c463f8 3.2.0 17h
01-master-container-runtime ec3c68e3d9a795af38120abdbf20e592e5c463f8 3.2.0 17h
01-master-kubelet ec3c68e3d9a795af38120abdbf20e592e5c463f8 3.2.0 17h
01-worker-container-runtime ec3c68e3d9a795af38120abdbf20e592e5c463f8 3.2.0 17h
01-worker-kubelet ec3c68e3d9a795af38120abdbf20e592e5c463f8 3.2.0 17h
99-master-generated-registries ec3c68e3d9a795af38120abdbf20e592e5c463f8 3.2.0 17h
99-master-ssh 3.2.0 17h
99-worker-generated-registries ec3c68e3d9a795af38120abdbf20e592e5c463f8 3.2.0 17h
99-worker-ssh 3.2.0 17h
rendered-master-23ea8f9ea10dfbe0129dd65c8034521e ec3c68e3d9a795af38120abdbf20e592e5c463f8 3.2.0 17h
rendered-worker-9fc1bd32db6e9e11d0a192e132cbab5e ec3c68e3d9a795af38120abdbf20e592e5c463f8 3.2.0 17h
worker-extensions-usbguard 3.2.0 3s
$ oc get mcp/worker
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
worker rendered-worker-9fc1bd32db6e9e11d0a192e132cbab5e False True False 4 0 0 0 17h
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-61-192.us-east-2.compute.internal Ready master 17h v1.21.0-rc.0+41625cd
ip-10-0-61-194.us-east-2.compute.internal Ready,SchedulingDisabled worker 17h v1.21.0-rc.0+41625cd
ip-10-0-62-147.us-east-2.compute.internal Ready master 17h v1.21.0-rc.0+41625cd
ip-10-0-62-189.us-east-2.compute.internal Ready worker 14m v1.21.0-rc.0+6998007
ip-10-0-63-9.us-east-2.compute.internal Ready worker 14m v1.21.0-rc.0+6998007
ip-10-0-76-163.us-east-2.compute.internal Ready master 17h v1.21.0-rc.0+41625cd
ip-10-0-79-153.us-east-2.compute.internal Ready worker 17h v1.21.0-rc.0+41625cd
$ oc get mcp/worker
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
worker rendered-worker-a19afb0b045cb8fef7141d0ac31b684a True False False 4 4 4 0 17h
$ oc get pods -A --field-selector spec.nodeName=ip-10-0-61-194.us-east-2.compute.internal | grep machine-config-daemon
openshift-machine-config-operator machine-config-daemon-55bg8 2/2 Running 2 17h
$ oc debug node/ip-10-0-61-194.us-east-2.compute.internal
Starting pod/ip-10-0-61-194us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm -qa | grep kernel
kernel-rt-kvm-4.18.0-293.rt7.59.el8.x86_64
kernel-rt-core-4.18.0-293.rt7.59.el8.x86_64
kernel-rt-modules-extra-4.18.0-293.rt7.59.el8.x86_64
kernel-rt-modules-4.18.0-293.rt7.59.el8.x86_64
sh-4.4# uname -a
Linux ip-10-0-61-194 4.18.0-293.rt7.59.el8.x86_64 #1 SMP PREEMPT_RT Mon Mar 1 15:40:34 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
sh-4.4# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-f36c0a3f3d7ae96ca5ab98a46baaf6b02be0c41793fb2e811233282062cf345d/vmlinuz-4.18.0-293.rt7.59.el8.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ostree=/ostree/boot.1/rhcos/f36c0a3f3d7ae96ca5ab98a46baaf6b02be0c41793fb2e811233282062cf345d/0 ignition.platform.id=aws root=UUID=ce9f3fc4-2602-4671-b333-75b3f910271b rw rootflags=prjquota z=10
sh-4.4# exit
exit
sh-4.2# exit
exit
Removing debug pod ...
$ oc debug node/ip-10-0-63-9.us-east-2.compute.internal
Starting pod/ip-10-0-63-9us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.2# rpm -qa | grep kernel
kernel-tools-3.10.0-1127.el7.x86_64
kernel-tools-libs-3.10.0-1127.el7.x86_64
kernel-3.10.0-1160.25.1.el7.x86_64
kernel-3.10.0-1127.el7.x86_64
sh-4.2# uname -a
Linux ip-10-0-63-9.us-east-2.compute.internal 3.10.0-1160.25.1.el7.x86_64 #1 SMP Tue Apr 13 18:55:45 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
sh-4.2# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-3.10.0-1160.25.1.el7.x86_64 root=UUID=5a000634-a1fc-467d-8ef4-5fcf5dbc6033 ro console=ttyS0,115200n8 console=tty0 net.ifnames=0 rd.blacklist=nouveau nvme_core.io_timeout=4294967295 crashkernel=auto LANG=en_US.UTF-8
sh-4.2# exit
exit
sh-4.2# exit
exit
Removing debug pod ...
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
Description of problem:
After creating an ImageContentSourcePolicy on a v4.7.8 cluster for a disconnected upgrade, new machine configs were created but failed to be applied; the MCO and one of the MCPs went DEGRADED.

# ./oc get co machine-config
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.7.8     False       False         True       25m

# ./oc get mcp worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-7c3a33c5cd425b9a3d272a984ff80457   False     True       True       5              0                   0                     1                      127m

Status:
  Conditions:
    Last Transition Time:  2021-04-22T04:37:42Z
    Message:               Cluster version is 4.7.8
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2021-04-22T06:02:09Z
    Message:               One or more machine config pool is degraded, please see `oc get mcp` for further details and resolve before upgrading
    Reason:                DegradedPool
    Status:                False
    Type:                  Upgradeable
    Last Transition Time:  2021-04-22T06:17:45Z
    Message:               Failed to resync 4.7.8 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool worker is not ready, retrying. Status: (pool degraded: true total: 5, ready 0, updated: 0, unavailable: 1)
    Reason:                RequiredPoolsFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-04-22T06:17:45Z
    Message:               Cluster not available for 4.7.8
    Status:                False
    Type:                  Available
  Extension:
    Master: all 3 nodes are at latest configuration rendered-master-89aa86f3649a9d041f006636ad1549eb
    Worker: pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node jiajliu221220-dbf5r-rhel-0 is reporting: \"error removing staged deployment: error running rpm-ostree cleanup -p: : exec: \\\"rpm-ostree\\\": executable file not found in $PATH: updating kernel on non-RHCOS nodes is not supported\""

# ./oc get node|grep rhel
jiajliu221220-dbf5r-rhel-0   Ready,SchedulingDisabled   worker   56m   v1.20.0+7d0a2b2
jiajliu221220-dbf5r-rhel-1   Ready                      worker   56m   v1.20.0+7d0a2b2

Version-Release number of selected component (if applicable):
v4.7.8

How reproducible:
always

Steps to Reproduce:
1. Do a disconnected install of OCP v4.7 and scale up two RHEL worker nodes
2. Create an ImageContentSourcePolicy for the upgrade
3.

Actual results:
The RHEL worker is SchedulingDisabled.

Expected results:
The RHEL workers should not be affected.

Additional info:
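For illustration of step 2 above, an ImageContentSourcePolicy for a disconnected upgrade has roughly the following shape; the mirror registry hostname below is a placeholder, not the value used in this cluster.

apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: example-mirror          # placeholder name
spec:
  repositoryDigestMirrors:
    - source: quay.io/openshift-release-dev/ocp-release
      mirrors:
        - mirror.registry.example.com/ocp4/openshift4   # placeholder mirror registry
    - source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
      mirrors:
        - mirror.registry.example.com/ocp4/openshift4   # placeholder mirror registry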