Bug 1913536
| Summary: | 4.6.9 -> 4.7 upgrade hangs. RHEL 7.9 worker stuck on "error enabling unit: Failed to execute operation: File exists\n" | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Mike Fiedler <mifiedle> |
| Component: | Machine Config Operator | Assignee: | Yu Qi Zhang <jerzhang> |
| Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.7 | CC: | gpei, jerzhang, jhou, juzhao, miyadav, schoudha, smilner, wking, wsun, yanyang |
| Target Milestone: | --- | Keywords: | TestBlocker |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-02-24 15:50:41 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Mike Fiedler
2021-01-07 02:43:44 UTC
From oc describe co machine-config:
Status:
Conditions:
Last Transition Time: 2021-01-06T23:47:03Z
Message: Working towards 4.7.0-0.nightly-2021-01-06-160219
Status: True
Type: Progressing
Last Transition Time: 2021-01-07T00:08:53Z
Message: Unable to apply 4.7.0-0.nightly-2021-01-06-160219: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool worker is not ready, retrying. Status: (pool degraded: true total: 2, ready 0, updated: 0, unavailable: 1)
Reason: RequiredPoolsFailed
Status: True
Type: Degraded
Last Transition Time: 2021-01-06T23:45:03Z
Message: Cluster not available for 4.7.0-0.nightly-2021-01-06-160219
Status: False
Type: Available
Last Transition Time: 2021-01-06T23:56:11Z
Reason: One or more machine config pool is degraded, please see `oc get mcp` for further details and resolve before upgrading
Status: False
Type: Upgradeable
Extension:
Master: all 3 nodes are at latest configuration rendered-master-4c5dfacd24376a49a14b9ec419a1446f
Worker: pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node ip-10-0-60-30.us-east-2.compute.internal is reporting: \"error enabling unit: Failed to execute operation: File exists\\n\""
Related Objects:
Group:
Name: openshift-machine-config-operator
Resource: namespaces
Group: machineconfiguration.openshift.io
Name:
Resource: machineconfigpools
Group: machineconfiguration.openshift.io
Name:
Resource: controllerconfigs
Group: machineconfiguration.openshift.io
Name:
Resource: kubeletconfigs
Group: machineconfiguration.openshift.io
Name:
Resource: containerruntimeconfigs
Group: machineconfiguration.openshift.io
Name:
Resource: machineconfigs
Group:
Name:
Resource: nodes
Versions:
Name: operator
Version: 4.6.9
Events: <none>
From oc get mcp worker -o yaml:
status:
conditions:
- lastTransitionTime: "2021-01-06T22:07:32Z"
message: ""
reason: ""
status: "False"
type: RenderDegraded
- lastTransitionTime: "2021-01-06T23:50:13Z"
message: ""
reason: ""
status: "False"
type: Updated
- lastTransitionTime: "2021-01-06T23:50:13Z"
message: All nodes are updating to rendered-worker-6e086c9b284fc01f3f487c90c6274dd9
reason: ""
status: "True"
type: Updating
- lastTransitionTime: "2021-01-06T23:51:38Z"
message: 'Node ip-10-0-60-30.us-east-2.compute.internal is reporting: "error enabling unit: Failed to execute operation: File exists\n"'
reason: 1 nodes are reporting degraded status on sync
status: "True"
type: NodeDegraded
- lastTransitionTime: "2021-01-06T23:51:38Z"
message: ""
reason: ""
status: "True"
type: Degraded
configuration:
name: rendered-worker-7b6e0c9731fa2a8f9c3db0057ab02f9e
source:
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 00-worker
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 01-worker-container-runtime
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 01-worker-kubelet
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-fips
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-generated-registries
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-ssh
degradedMachineCount: 1
machineCount: 2
observedGeneration: 4
readyMachineCount: 0
unavailableMachineCount: 1
updatedMachineCount: 0
QE could reproduce this issue in an upgrade from 4.6.9 to 4.7.0-0.nightly-2021-01-06-222035, with a RHEL worker added to the cluster. Updating one of the RHEL workers got stuck, so MCO degraded. The following error logs were found in the machine-config-daemon pod on the RHEL worker:
...
I0107 07:19:34.948170 18912 update.go:1462] Writing systemd unit dropin "10-mco-default-env.conf"
I0107 07:19:34.949429 18912 update.go:1497] Writing systemd unit "kubelet.service"
I0107 07:19:34.950666 18912 update.go:1497] Writing systemd unit "machine-config-daemon-firstboot.service"
I0107 07:19:34.952152 18912 update.go:1497] Writing systemd unit "machine-config-daemon-pull.service"
I0107 07:19:34.953412 18912 update.go:1497] Writing systemd unit "node-valid-hostname.service"
I0107 07:19:34.954597 18912 update.go:1497] Writing systemd unit "nodeip-configuration.service"
I0107 07:19:34.956064 18912 update.go:1497] Writing systemd unit "ovs-configuration.service"
I0107 07:19:34.957480 18912 update.go:1462] Writing systemd unit dropin "10-ovs-vswitchd-restart.conf"
I0107 07:19:35.090575 18912 update.go:1451] Preset systemd unit ovs-vswitchd.service
I0107 07:19:35.090624 18912 update.go:1462] Writing systemd unit dropin "10-ovsdb-restart.conf"
I0107 07:19:35.092592 18912 update.go:1462] Writing systemd unit dropin "10-mco-default-env.conf"
I0107 07:19:35.099961 18912 update.go:1534] Could not reset unit preset for pivot.service, skipping. (Error msg: error running preset on unit: Failed to execute operation: No such file or directory )
I0107 07:19:35.099989 18912 update.go:1462] Writing systemd unit dropin "mco-disabled.conf"
I0107 07:19:35.104432 18912 update.go:1534] Could not reset unit preset for zincati.service, skipping. (Error msg: error running preset on unit: Failed to execute operation: No such file or directory )
E0107 07:19:35.108108 18912 writer.go:135] Marking Degraded due to: error enabling unit: Failed to execute operation: File exists

OK, I think I know what's happening: the update is failing because it cannot write openvswitch.service. This is because there is a dangling symlink at /etc/systemd/system/multi-user.target.wants/openvswitch.service. The symlink is dangling because the actual file is at /usr/lib/systemd/system/openvswitch.service, not /etc/systemd/system/openvswitch.service, and this has been the case since the service was introduced. Now you might have three questions:

1. Why doesn't this affect RHCOS?
2. Why is the unit written to the wrong location?
3. Why do we only see this error now?

I think they basically all have the same answer: RHCOS node firstboot writes use Ignition, which does this correctly. Until I fixed it in https://github.com/openshift/machine-config-operator/pull/2145/files, the MCD would try to write the unit to the wrong location because it is hardcoded to do so, and RHEL nodes use this code path during their onceFrom run to provision the files/units. Thus, when the MCD tries to enable the service during an upgrade, it can no longer do so, since there is a dangling symlink at that location and systemd returns an error.

Hmm, which seems to imply that nodes that went through an extra upgrade before doing 4.6 -> 4.7 might also hit this on RHCOS nodes. I will try to confirm this and come up with a fix.

So removing the dangling symlink should "fix" the problem during an upgrade. Funnily enough, there is a similar dangling ovsdb-server.service symlink there, but running enable on that one does NOT give an error. Weird...
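To make the failure mode concrete, here is a minimal Go sketch (illustrative only, not code from the MCD) that detects the dangling wants-symlink described above; the path is the one from this bug, whose link target is absolute:

```go
// Minimal sketch: report whether the wants-symlink that blocks
// `systemctl enable openvswitch.service` is dangling.
package main

import (
	"fmt"
	"os"
)

func main() {
	link := "/etc/systemd/system/multi-user.target.wants/openvswitch.service"

	// Readlink fails if the path is missing or is not a symlink.
	target, err := os.Readlink(link)
	if err != nil {
		fmt.Println("no symlink to inspect:", err)
		return
	}

	// The link in this bug points at an absolute path, so Stat can
	// check the target directly; ENOENT means the link is dangling,
	// the state that makes RHEL 7's systemd fail with "File exists".
	if _, err := os.Stat(target); os.IsNotExist(err) {
		fmt.Printf("dangling symlink: %s -> %s\n", link, target)
	} else {
		fmt.Printf("symlink resolves: %s -> %s\n", link, target)
	}
}
```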
Oh, I think it's because the other service is `PartOf=openvswitch.service`, so enabling it doesn't actually attempt to write anything. I can confirm that if you delete the dangling file with `rm /etc/systemd/system/multi-user.target.wants/openvswitch.service`, future enables in 4.7 work fine.

So the systemd version in RHCOS has a fix for this issue, where it first removes the dangling symlink; that is why only RHEL hits it, since RHEL 7 still ships a very old systemd. I will emulate this in the systemctl enable call in the MCD as a fix. Disables are fine, since disabling removes any dangling symlinks, and preset errors are not evaluated.

Also encountered while upgrading from 4.6.0-0.nightly-2021-01-18-070340 to 4.7.0-0.nightly-2021-01-19-051335; just added to have the information on record and avoid any duplicate bug. Adding the TestBlocker keyword since this blocks upgrades with RHEL workers from 4.6 to 4.7.

Verified with two RHEL 7.9 workers in the cluster and an upgrade from 4.6.16 to 4.7.0-0.nightly-2021-02-06-084550: no upgrade error.
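The fix described above amounts to emulating newer systemd: clear a dangling wants-symlink before calling enable. A hedged Go sketch of that idea follows; the function names and exact cleanup behavior are assumptions for illustration, not the actual MCD change:

```go
// Hypothetical sketch: remove a dangling wants-symlink so that
// `systemctl enable` on RHEL 7's older systemd can recreate it
// instead of failing with "File exists".
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

const wantsDir = "/etc/systemd/system/multi-user.target.wants"

// removeDanglingWantsLink deletes the unit's wants-symlink only when
// it points at a path that no longer exists.
func removeDanglingWantsLink(unit string) error {
	link := filepath.Join(wantsDir, unit)
	target, err := os.Readlink(link)
	if err != nil {
		return nil // absent or not a symlink: nothing to clean up
	}
	if !filepath.IsAbs(target) {
		// Relative targets resolve against the link's directory.
		target = filepath.Join(filepath.Dir(link), target)
	}
	if _, err := os.Stat(target); os.IsNotExist(err) {
		return os.Remove(link) // dangling: clear it so enable succeeds
	}
	return nil
}

// enableUnit mirrors the "clean, then enable" ordering described in
// the comment above.
func enableUnit(unit string) error {
	if err := removeDanglingWantsLink(unit); err != nil {
		return err
	}
	return exec.Command("systemctl", "enable", unit).Run()
}

func main() {
	if err := enableUnit("openvswitch.service"); err != nil {
		fmt.Fprintln(os.Stderr, "enable failed:", err)
		os.Exit(1)
	}
}
```

Disabling does not need this guard, since (as noted above) `systemctl disable` already removes any dangling symlinks.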
# oc get clusterversion -oyaml
...
history:
- completionTime: "2021-02-09T07:35:31Z"
image: registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-02-06-084550
startedTime: "2021-02-09T06:31:29Z"
state: Completed
verified: false
version: 4.7.0-0.nightly-2021-02-06-084550
- completionTime: "2021-02-09T04:20:53Z"
image: quay.io/openshift-release-dev/ocp-release@sha256:3e855ad88f46ad1b7f56c312f078ca6adaba623c5d4b360143f9f82d2f349741
startedTime: "2021-02-09T03:49:43Z"
state: Completed
verified: false
version: 4.6.16
observedGeneration: 4
...
# oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.7.0-0.nightly-2021-02-06-084550 True False False 161m
baremetal 4.7.0-0.nightly-2021-02-06-084550 True False False 3h17m
cloud-credential 4.7.0-0.nightly-2021-02-06-084550 True False False 6h23m
cluster-autoscaler 4.7.0-0.nightly-2021-02-06-084550 True False False 6h17m
config-operator 4.7.0-0.nightly-2021-02-06-084550 True False False 6h18m
console 4.7.0-0.nightly-2021-02-06-084550 True False False 165m
csi-snapshot-controller 4.7.0-0.nightly-2021-02-06-084550 True False False 160m
dns 4.7.0-0.nightly-2021-02-06-084550 True False False 6h17m
etcd 4.7.0-0.nightly-2021-02-06-084550 True False False 6h16m
image-registry 4.7.0-0.nightly-2021-02-06-084550 True False False 174m
ingress 4.7.0-0.nightly-2021-02-06-084550 True False False 6h7m
insights 4.7.0-0.nightly-2021-02-06-084550 True False False 6h18m
kube-apiserver 4.7.0-0.nightly-2021-02-06-084550 True False False 6h15m
kube-controller-manager 4.7.0-0.nightly-2021-02-06-084550 True False False 6h15m
kube-scheduler 4.7.0-0.nightly-2021-02-06-084550 True False False 6h15m
kube-storage-version-migrator 4.7.0-0.nightly-2021-02-06-084550 True False False 164m
machine-api 4.7.0-0.nightly-2021-02-06-084550 True False False 6h7m
machine-approver 4.7.0-0.nightly-2021-02-06-084550 True False False 6h17m
machine-config 4.7.0-0.nightly-2021-02-06-084550 True False False 160m
marketplace 4.7.0-0.nightly-2021-02-06-084550 True False False 165m
monitoring 4.7.0-0.nightly-2021-02-06-084550 True False False 166m
network 4.7.0-0.nightly-2021-02-06-084550 True False False 3h4m
node-tuning 4.7.0-0.nightly-2021-02-06-084550 True False False 3h12m
openshift-apiserver 4.7.0-0.nightly-2021-02-06-084550 True False False 160m
openshift-controller-manager 4.7.0-0.nightly-2021-02-06-084550 True False False 6h10m
openshift-samples 4.7.0-0.nightly-2021-02-06-084550 True False False 3h12m
operator-lifecycle-manager 4.7.0-0.nightly-2021-02-06-084550 True False False 6h17m
operator-lifecycle-manager-catalog 4.7.0-0.nightly-2021-02-06-084550 True False False 6h17m
operator-lifecycle-manager-packageserver 4.7.0-0.nightly-2021-02-06-084550 True False False 163m
service-ca 4.7.0-0.nightly-2021-02-06-084550 True False False 6h18m
storage 4.7.0-0.nightly-2021-02-06-084550 True False False 165m
# oc get no -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-10-0-49-230.us-east-2.compute.internal Ready master 6h22m v1.20.0+ba45583 10.0.49.230 <none> Red Hat Enterprise Linux CoreOS 47.83.202102060438-0 (Ootpa) 4.18.0-240.10.1.el8_3.x86_64 cri-o://1.20.0-0.rhaos4.7.git78527db.el8.49
ip-10-0-54-81.us-east-2.compute.internal Ready worker 4h27m v1.19.0+e405995 10.0.54.81 <none> Red Hat Enterprise Linux Server 7.9 (Maipo) 3.10.0-1160.15.2.el7.x86_64 cri-o://1.19.1-7.rhaos4.6.git6377f68.el7
ip-10-0-58-160.us-east-2.compute.internal Ready worker 4h27m v1.19.0+e405995 10.0.58.160 <none> Red Hat Enterprise Linux Server 7.9 (Maipo) 3.10.0-1160.15.2.el7.x86_64 cri-o://1.19.1-7.rhaos4.6.git6377f68.el7
ip-10-0-60-71.us-east-2.compute.internal Ready master 6h22m v1.20.0+ba45583 10.0.60.71 <none> Red Hat Enterprise Linux CoreOS 47.83.202102060438-0 (Ootpa) 4.18.0-240.10.1.el8_3.x86_64 cri-o://1.20.0-0.rhaos4.7.git78527db.el8.49
ip-10-0-64-226.us-east-2.compute.internal Ready master 6h21m v1.20.0+ba45583 10.0.64.226 <none> Red Hat Enterprise Linux CoreOS 47.83.202102060438-0 (Ootpa) 4.18.0-240.10.1.el8_3.x86_64 cri-o://1.20.0-0.rhaos4.7.git78527db.el8.49
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633