Description of problem:

Attempted upgrade of an AWS UPI cluster with RHEL 7.9 worker nodes from 4.6.9 to 4.7.0-0.nightly-2021-01-06-160219 "hangs" on a RHEL worker node with this error in the MCO:

"Node ip-10-0-60-30.us-east-2.compute.internal is reporting: \"error enabling unit: Failed to execute operation: File exists\\n\""

[mifiedle@mffiedler must-gather]$ oc get mcp worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-7b6e0c9731fa2a8f9c3db0057ab02f9e   False     True       True       2              0                   0                     1                      4h29m

[mifiedle@mffiedler must-gather]$ oc get nodes -o wide
NAME                                        STATUS                     ROLES    AGE     VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-48-203.us-east-2.compute.internal   Ready                      master   4h33m   v1.20.0+b1e9f0d   10.0.48.203   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101060443-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
ip-10-0-49-168.us-east-2.compute.internal   Ready                      worker   3h33m   v1.19.0+9c69bdc   10.0.49.168   <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.11.1.el7.x86_64    cri-o://1.19.0-118.rhaos4.6.gitf51f94a.el7
ip-10-0-60-30.us-east-2.compute.internal    Ready,SchedulingDisabled   worker   3h33m   v1.19.0+9c69bdc   10.0.60.30    <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.11.1.el7.x86_64    cri-o://1.19.0-118.rhaos4.6.gitf51f94a.el7
ip-10-0-63-227.us-east-2.compute.internal   Ready                      master   4h33m   v1.20.0+b1e9f0d   10.0.63.227   <none>        Red Hat Enterprise Linux CoreOS 47.83.202101060443-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
ip-10-0-67-30.us-east-2.compute.internal    Ready                      master   4h33m   v1.20.0+b1e9f0d   10.0.67.30    <none>        Red Hat Enterprise Linux CoreOS 47.83.202101060443-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39

[mifiedle@mffiedler must-gather]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.9     True        True          3h33m   Unable to apply 4.7.0-0.nightly-2021-01-06-160219: the cluster operator ingress is degraded

Version-Release number of selected component (if applicable):
4.6.9 -> 4.7.0-0.nightly-2021-01-06-160219

How reproducible:
Unknown

Steps to Reproduce:
1. Install a 4.6.9 UPI on AWS cluster with RHEL 7.9 workers
2. Attempt an upgrade to the latest 4.7 nightly build

Actual results:
Upgrade "hangs" for hours as described above

Additional info:
oc adm must-gather will be linked in a private comment.
From oc describe co machine-config:

Status:
  Conditions:
    Last Transition Time:  2021-01-06T23:47:03Z
    Message:               Working towards 4.7.0-0.nightly-2021-01-06-160219
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2021-01-07T00:08:53Z
    Message:               Unable to apply 4.7.0-0.nightly-2021-01-06-160219: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool worker is not ready, retrying. Status: (pool degraded: true total: 2, ready 0, updated: 0, unavailable: 1)
    Reason:                RequiredPoolsFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-01-06T23:45:03Z
    Message:               Cluster not available for 4.7.0-0.nightly-2021-01-06-160219
    Status:                False
    Type:                  Available
    Last Transition Time:  2021-01-06T23:56:11Z
    Reason:                One or more machine config pool is degraded, please see `oc get mcp` for further details and resolve before upgrading
    Status:                False
    Type:                  Upgradeable
  Extension:
    Master:  all 3 nodes are at latest configuration rendered-master-4c5dfacd24376a49a14b9ec419a1446f
    Worker:  pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node ip-10-0-60-30.us-east-2.compute.internal is reporting: \"error enabling unit: Failed to execute operation: File exists\\n\""
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  controllerconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  kubeletconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  containerruntimeconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigs
    Group:
    Name:
    Resource:  nodes
  Versions:
    Name:     operator
    Version:  4.6.9
Events:       <none>

From oc get mcp worker -o yaml:

status:
  conditions:
  - lastTransitionTime: "2021-01-06T22:07:32Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2021-01-06T23:50:13Z"
    message: ""
    reason: ""
    status: "False"
    type: Updated
  - lastTransitionTime: "2021-01-06T23:50:13Z"
    message: All nodes are updating to rendered-worker-6e086c9b284fc01f3f487c90c6274dd9
    reason: ""
    status: "True"
    type: Updating
  - lastTransitionTime: "2021-01-06T23:51:38Z"
    message: 'Node ip-10-0-60-30.us-east-2.compute.internal is reporting: "error enabling unit: Failed to execute operation: File exists\n"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
  - lastTransitionTime: "2021-01-06T23:51:38Z"
    message: ""
    reason: ""
    status: "True"
    type: Degraded
  configuration:
    name: rendered-worker-7b6e0c9731fa2a8f9c3db0057ab02f9e
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-fips
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
  degradedMachineCount: 1
  machineCount: 2
  observedGeneration: 4
  readyMachineCount: 0
  unavailableMachineCount: 1
  updatedMachineCount: 0
QE could reproduce this issue in an upgrade from 4.6.9 to 4.7.0-0.nightly-2021-01-06-222035 on a cluster with RHEL workers added. The update got stuck on one of the RHEL workers, so the MCO degraded. The following errors appear in the machine-config-daemon pod logs on that RHEL worker:

...
I0107 07:19:34.948170   18912 update.go:1462] Writing systemd unit dropin "10-mco-default-env.conf"
I0107 07:19:34.949429   18912 update.go:1497] Writing systemd unit "kubelet.service"
I0107 07:19:34.950666   18912 update.go:1497] Writing systemd unit "machine-config-daemon-firstboot.service"
I0107 07:19:34.952152   18912 update.go:1497] Writing systemd unit "machine-config-daemon-pull.service"
I0107 07:19:34.953412   18912 update.go:1497] Writing systemd unit "node-valid-hostname.service"
I0107 07:19:34.954597   18912 update.go:1497] Writing systemd unit "nodeip-configuration.service"
I0107 07:19:34.956064   18912 update.go:1497] Writing systemd unit "ovs-configuration.service"
I0107 07:19:34.957480   18912 update.go:1462] Writing systemd unit dropin "10-ovs-vswitchd-restart.conf"
I0107 07:19:35.090575   18912 update.go:1451] Preset systemd unit ovs-vswitchd.service
I0107 07:19:35.090624   18912 update.go:1462] Writing systemd unit dropin "10-ovsdb-restart.conf"
I0107 07:19:35.092592   18912 update.go:1462] Writing systemd unit dropin "10-mco-default-env.conf"
I0107 07:19:35.099961   18912 update.go:1534] Could not reset unit preset for pivot.service, skipping. (Error msg: error running preset on unit: Failed to execute operation: No such file or directory
)
I0107 07:19:35.099989   18912 update.go:1462] Writing systemd unit dropin "mco-disabled.conf"
I0107 07:19:35.104432   18912 update.go:1534] Could not reset unit preset for zincati.service, skipping. (Error msg: error running preset on unit: Failed to execute operation: No such file or directory
)
E0107 07:19:35.108108   18912 writer.go:135] Marking Degraded due to: error enabling unit: Failed to execute operation: File exists
Ok, I think I know what's happening: the update is failing because it cannot write openvswitch.service. That is because there is a dangling symlink at /etc/systemd/system/multi-user.target.wants/openvswitch.service. The symlink is dangling because the actual unit file is at /usr/lib/systemd/system/openvswitch.service and not at /etc/systemd/system/openvswitch.service, and this has been the case since the service was introduced.

Now you might have three questions:
1. Why doesn't this affect RHCOS?
2. Why was the symlink written wrong?
3. Why do we only see this error now?

I think they all basically have the same answer: the RHCOS node firstboot writes use Ignition, which does this correctly. Until I fixed it in https://github.com/openshift/machine-config-operator/pull/2145/files, the MCD would try to write the unit to the wrong location because it is hardcoded to do so, and RHEL nodes use this path during their onceFrom run to provision the files/units. Thus, when the MCD tries to enable the service during an upgrade, it can no longer do so, since there is a dangling symlink at that location and systemd returns an error (a sketch of the failure condition is below).

Hmm, which seems to imply that nodes that went through an extra upgrade before doing 4.6->4.7 might also hit this on RHCOS nodes. I will try to confirm this and come up with a fix.
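For reference, a "dangling" symlink is one where lstat(2) succeeds (the link itself exists) but stat(2), which follows the link, reports that the target does not exist. A small Go sketch of that check against the path from this bug (illustrative only, not MCO code):

package main

import (
	"fmt"
	"os"
)

// isDanglingSymlink reports whether path is a symlink whose target
// does not exist. Lstat inspects the link itself; Stat follows it.
func isDanglingSymlink(path string) (bool, error) {
	fi, err := os.Lstat(path)
	if err != nil {
		return false, err // the link itself is missing or unreadable
	}
	if fi.Mode()&os.ModeSymlink == 0 {
		return false, nil // a regular file or directory, not a symlink
	}
	if _, err := os.Stat(path); os.IsNotExist(err) {
		return true, nil // link exists but its target does not
	}
	return false, nil
}

func main() {
	// The enablement symlink from this bug: it points at
	// /etc/systemd/system/openvswitch.service, which was never written,
	// while the real unit lives in /usr/lib/systemd/system.
	link := "/etc/systemd/system/multi-user.target.wants/openvswitch.service"
	dangling, err := isDanglingSymlink(link)
	fmt.Println(dangling, err)
}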
So removing the dangling symlink should "fix" the problem during an upgrade. Funnily enough, there is a similar dangling ovsdb-server.service symlink there, but running enable on that one does NOT give an error. Weird...
Oh, I think it's because the other service is `PartOf=openvswitch.service`, so enabling it doesn't actually attempt to write anything. I can confirm that if you delete the dangling file with `rm /etc/systemd/system/multi-user.target.wants/openvswitch.service`, future enables in 4.7 should work fine.
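For context, `systemctl enable` only creates symlinks for the entries in a unit's [Install] section, so a unit with nothing to install gives enable nothing to write. A purely illustrative unit fragment of that shape (not the actual openvswitch packaging):

# Illustrative fragment, not the real ovsdb-server.service.
[Unit]
Description=Example server stopped/restarted together with openvswitch.service
PartOf=openvswitch.service

[Service]
ExecStart=/usr/bin/true

# No [Install] section: `systemctl enable` has no symlinks to create
# for this unit, so it cannot collide with a dangling link in
# multi-user.target.wants and fail with "File exists".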
So the systemd version in RHCOS has a fix for this issue: it first removes the dangling symlink. That is why only RHEL hits it, since RHEL 7 still ships a very old systemd. I will emulate this behavior in the MCD's systemctl enable call as the fix (sketched below). Disables are fine, since disabling already removes any dangling symlinks, and preset errors are not evaluated.
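A minimal sketch of that approach, assuming the unit is wanted by multi-user.target; the function name and the hardcoded wants directory are illustrative, not the actual MCD patch:

package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// enableUnit emulates newer systemd's behavior on older (RHEL 7)
// systemd: drop any dangling enablement symlink before running
// `systemctl enable`, so enable does not fail with "File exists".
func enableUnit(unit string) error {
	link := filepath.Join("/etc/systemd/system/multi-user.target.wants", unit)
	if fi, err := os.Lstat(link); err == nil && fi.Mode()&os.ModeSymlink != 0 {
		if _, err := os.Stat(link); os.IsNotExist(err) {
			// The symlink exists but its target does not: remove it
			// the way newer systemd would before (re)enabling.
			if err := os.Remove(link); err != nil {
				return fmt.Errorf("removing dangling symlink %s: %w", link, err)
			}
		}
	}
	out, err := exec.Command("systemctl", "enable", unit).CombinedOutput()
	if err != nil {
		return fmt.Errorf("enabling unit %s: %s: %w", unit, string(out), err)
	}
	return nil
}

func main() {
	if err := enableUnit("openvswitch.service"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}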
Encountered while upgrading from 4.6.0-0.nightly-2021-01-18-070340 -> 4.7.0-0.nightly-2021-01-19-051335. Just adding this to record the information and avoid any duplicate bug.
Adding the testblocker keyword, since this blocks upgrades from 4.6 to 4.7 with RHEL workers.
Verified with two RHEL 7.9 workers in the cluster: upgraded from 4.6.16 to 4.7.0-0.nightly-2021-02-06-084550 with no upgrade error.

# oc get clusterversion -oyaml
...
  history:
  - completionTime: "2021-02-09T07:35:31Z"
    image: registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-02-06-084550
    startedTime: "2021-02-09T06:31:29Z"
    state: Completed
    verified: false
    version: 4.7.0-0.nightly-2021-02-06-084550
  - completionTime: "2021-02-09T04:20:53Z"
    image: quay.io/openshift-release-dev/ocp-release@sha256:3e855ad88f46ad1b7f56c312f078ca6adaba623c5d4b360143f9f82d2f349741
    startedTime: "2021-02-09T03:49:43Z"
    state: Completed
    verified: false
    version: 4.6.16
  observedGeneration: 4
...

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0-0.nightly-2021-02-06-084550   True        False         False      161m
baremetal                                  4.7.0-0.nightly-2021-02-06-084550   True        False         False      3h17m
cloud-credential                           4.7.0-0.nightly-2021-02-06-084550   True        False         False      6h23m
cluster-autoscaler                         4.7.0-0.nightly-2021-02-06-084550   True        False         False      6h17m
config-operator                            4.7.0-0.nightly-2021-02-06-084550   True        False         False      6h18m
console                                    4.7.0-0.nightly-2021-02-06-084550   True        False         False      165m
csi-snapshot-controller                    4.7.0-0.nightly-2021-02-06-084550   True        False         False      160m
dns                                        4.7.0-0.nightly-2021-02-06-084550   True        False         False      6h17m
etcd                                       4.7.0-0.nightly-2021-02-06-084550   True        False         False      6h16m
image-registry                             4.7.0-0.nightly-2021-02-06-084550   True        False         False      174m
ingress                                    4.7.0-0.nightly-2021-02-06-084550   True        False         False      6h7m
insights                                   4.7.0-0.nightly-2021-02-06-084550   True        False         False      6h18m
kube-apiserver                             4.7.0-0.nightly-2021-02-06-084550   True        False         False      6h15m
kube-controller-manager                    4.7.0-0.nightly-2021-02-06-084550   True        False         False      6h15m
kube-scheduler                             4.7.0-0.nightly-2021-02-06-084550   True        False         False      6h15m
kube-storage-version-migrator              4.7.0-0.nightly-2021-02-06-084550   True        False         False      164m
machine-api                                4.7.0-0.nightly-2021-02-06-084550   True        False         False      6h7m
machine-approver                           4.7.0-0.nightly-2021-02-06-084550   True        False         False      6h17m
machine-config                             4.7.0-0.nightly-2021-02-06-084550   True        False         False      160m
marketplace                                4.7.0-0.nightly-2021-02-06-084550   True        False         False      165m
monitoring                                 4.7.0-0.nightly-2021-02-06-084550   True        False         False      166m
network                                    4.7.0-0.nightly-2021-02-06-084550   True        False         False      3h4m
node-tuning                                4.7.0-0.nightly-2021-02-06-084550   True        False         False      3h12m
openshift-apiserver                        4.7.0-0.nightly-2021-02-06-084550   True        False         False      160m
openshift-controller-manager               4.7.0-0.nightly-2021-02-06-084550   True        False         False      6h10m
openshift-samples                          4.7.0-0.nightly-2021-02-06-084550   True        False         False      3h12m
operator-lifecycle-manager                 4.7.0-0.nightly-2021-02-06-084550   True        False         False      6h17m
operator-lifecycle-manager-catalog         4.7.0-0.nightly-2021-02-06-084550   True        False         False      6h17m
operator-lifecycle-manager-packageserver   4.7.0-0.nightly-2021-02-06-084550   True        False         False      163m
service-ca                                 4.7.0-0.nightly-2021-02-06-084550   True        False         False      6h18m
storage                                    4.7.0-0.nightly-2021-02-06-084550   True        False         False      165m

# oc get no -o wide
NAME                                        STATUS   ROLES    AGE     VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-49-230.us-east-2.compute.internal   Ready    master   6h22m   v1.20.0+ba45583   10.0.49.230   <none>        Red Hat Enterprise Linux CoreOS 47.83.202102060438-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.git78527db.el8.49
ip-10-0-54-81.us-east-2.compute.internal    Ready    worker   4h27m   v1.19.0+e405995   10.0.54.81    <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.15.2.el7.x86_64    cri-o://1.19.1-7.rhaos4.6.git6377f68.el7
ip-10-0-58-160.us-east-2.compute.internal   Ready    worker   4h27m   v1.19.0+e405995   10.0.58.160   <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.15.2.el7.x86_64    cri-o://1.19.1-7.rhaos4.6.git6377f68.el7
ip-10-0-60-71.us-east-2.compute.internal    Ready    master   6h22m   v1.20.0+ba45583   10.0.60.71    <none>        Red Hat Enterprise Linux CoreOS 47.83.202102060438-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.git78527db.el8.49
ip-10-0-64-226.us-east-2.compute.internal   Ready    master   6h21m   v1.20.0+ba45583   10.0.64.226   <none>        Red Hat Enterprise Linux CoreOS 47.83.202102060438-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.git78527db.el8.49
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633