Bug 1913536
| Summary: | 4.6.9 -> 4.7 upgrade hangs. RHEL 7.9 worker stuck on "error enabling unit: Failed to execute operation: File exists\n" | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Mike Fiedler <mifiedle> |
| Component: | Machine Config Operator | Assignee: | Yu Qi Zhang <jerzhang> |
| Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.7 | CC: | gpei, jerzhang, jhou, juzhao, miyadav, schoudha, smilner, wking, wsun, yanyang |
| Target Milestone: | --- | Keywords: | TestBlocker |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-02-24 15:50:41 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Mike Fiedler
2021-01-07 02:43:44 UTC
From oc describe co machine-config:
Status:
Conditions:
Last Transition Time: 2021-01-06T23:47:03Z
Message: Working towards 4.7.0-0.nightly-2021-01-06-160219
Status: True
Type: Progressing
Last Transition Time: 2021-01-07T00:08:53Z
Message: Unable to apply 4.7.0-0.nightly-2021-01-06-160219: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool worker is not ready, retrying. Status: (pool degraded: true total: 2, ready 0, updated: 0, unavailable: 1)
Reason: RequiredPoolsFailed
Status: True
Type: Degraded
Last Transition Time: 2021-01-06T23:45:03Z
Message: Cluster not available for 4.7.0-0.nightly-2021-01-06-160219
Status: False
Type: Available
Last Transition Time: 2021-01-06T23:56:11Z
Reason: One or more machine config pool is degraded, please see `oc get mcp` for further details and resolve before upgrading
Status: False
Type: Upgradeable
Extension:
Master: all 3 nodes are at latest configuration rendered-master-4c5dfacd24376a49a14b9ec419a1446f
Worker: pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node ip-10-0-60-30.us-east-2.compute.internal is reporting: \"error enabling unit: Failed to execute operation: File exists\\n\""
Related Objects:
Group:
Name: openshift-machine-config-operator
Resource: namespaces
Group: machineconfiguration.openshift.io
Name:
Resource: machineconfigpools
Group: machineconfiguration.openshift.io
Name:
Resource: controllerconfigs
Group: machineconfiguration.openshift.io
Name:
Resource: kubeletconfigs
Group: machineconfiguration.openshift.io
Name:
Resource: containerruntimeconfigs
Group: machineconfiguration.openshift.io
Name:
Resource: machineconfigs
Group:
Name:
Resource: nodes
Versions:
Name: operator
Version: 4.6.9
Events: <none>
From oc get mcp worker -o yaml:
status:
conditions:
- lastTransitionTime: "2021-01-06T22:07:32Z"
message: ""
reason: ""
status: "False"
type: RenderDegraded
- lastTransitionTime: "2021-01-06T23:50:13Z"
message: ""
reason: ""
status: "False"
type: Updated
- lastTransitionTime: "2021-01-06T23:50:13Z"
message: All nodes are updating to rendered-worker-6e086c9b284fc01f3f487c90c6274dd9
reason: ""
status: "True"
type: Updating
- lastTransitionTime: "2021-01-06T23:51:38Z"
message: 'Node ip-10-0-60-30.us-east-2.compute.internal is reporting: "error enabling unit: Failed to execute operation: File exists\n"'
reason: 1 nodes are reporting degraded status on sync
status: "True"
type: NodeDegraded
- lastTransitionTime: "2021-01-06T23:51:38Z"
message: ""
reason: ""
status: "True"
type: Degraded
configuration:
name: rendered-worker-7b6e0c9731fa2a8f9c3db0057ab02f9e
source:
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 00-worker
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 01-worker-container-runtime
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 01-worker-kubelet
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-fips
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-generated-registries
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-ssh
degradedMachineCount: 1
machineCount: 2
observedGeneration: 4
readyMachineCount: 0
unavailableMachineCount: 1
updatedMachineCount: 0
QE could reproduce this issue in an upgrade from 4.6.9 to 4.7.0-0.nightly-2021-01-06-222035, with a RHEL worker added to the cluster. Updating one of the RHEL workers got stuck, so MCO degraded. The following error logs were found in the machine-config-daemon pod on the RHEL worker:
...
I0107 07:19:34.948170 18912 update.go:1462] Writing systemd unit dropin "10-mco-default-env.conf"
I0107 07:19:34.949429 18912 update.go:1497] Writing systemd unit "kubelet.service"
I0107 07:19:34.950666 18912 update.go:1497] Writing systemd unit "machine-config-daemon-firstboot.service"
I0107 07:19:34.952152 18912 update.go:1497] Writing systemd unit "machine-config-daemon-pull.service"
I0107 07:19:34.953412 18912 update.go:1497] Writing systemd unit "node-valid-hostname.service"
I0107 07:19:34.954597 18912 update.go:1497] Writing systemd unit "nodeip-configuration.service"
I0107 07:19:34.956064 18912 update.go:1497] Writing systemd unit "ovs-configuration.service"
I0107 07:19:34.957480 18912 update.go:1462] Writing systemd unit dropin "10-ovs-vswitchd-restart.conf"
I0107 07:19:35.090575 18912 update.go:1451] Preset systemd unit ovs-vswitchd.service
I0107 07:19:35.090624 18912 update.go:1462] Writing systemd unit dropin "10-ovsdb-restart.conf"
I0107 07:19:35.092592 18912 update.go:1462] Writing systemd unit dropin "10-mco-default-env.conf"
I0107 07:19:35.099961 18912 update.go:1534] Could not reset unit preset for pivot.service, skipping. (Error msg: error running preset on unit: Failed to execute operation: No such file or directory )
I0107 07:19:35.099989 18912 update.go:1462] Writing systemd unit dropin "mco-disabled.conf"
I0107 07:19:35.104432 18912 update.go:1534] Could not reset unit preset for zincati.service, skipping. (Error msg: error running preset on unit: Failed to execute operation: No such file or directory )
E0107 07:19:35.108108 18912 writer.go:135] Marking Degraded due to: error enabling unit: Failed to execute operation: File exists

OK, I think I know what's happening: the update is failing because it cannot write openvswitch.service. This is because there is a dangling symlink at /etc/systemd/system/multi-user.target.wants/openvswitch.service. The symlink is dangling because the actual file is at /usr/lib/systemd/system/openvswitch.service, not /etc/systemd/system/openvswitch.service, and this has been the case since the service was introduced. Now you might have three questions:

1. Why doesn't this affect RHCOS?
2. Why is the unit written to the wrong location?
3. Why do we only see this error now?

I think they basically all have the same answer: RHCOS node firstboot writes use Ignition, which does this correctly. Until I fixed it in https://github.com/openshift/machine-config-operator/pull/2145/files, the MCD would try to write the unit to the wrong location because it is hardcoded to do so, and RHEL nodes use this code path during their onceFrom run to provision the files/units. Thus, when the MCD tries to enable the service during an upgrade, it can no longer do so, since there is a dangling symlink at that location and systemd returns an error.

Hmm, which seems to imply that nodes that went through an extra upgrade before doing 4.6 -> 4.7 might also hit this on RHCOS nodes. I will try to confirm this and come up with a fix.

So removing the dangling symlink should "fix" the problem during an upgrade. Funnily enough, there is a similar dangling ovsdb-server.service symlink there, but running enable on that one does NOT give an error. Weird...
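To make the failure mode concrete, here is a minimal Go sketch (illustrative only, not code from the MCD) that detects the dangling wants-symlink described above; the path is the one from this bug, whose link target is absolute:

```go
// Minimal sketch: report whether the wants-symlink that blocks
// `systemctl enable openvswitch.service` is dangling.
package main

import (
	"fmt"
	"os"
)

func main() {
	link := "/etc/systemd/system/multi-user.target.wants/openvswitch.service"

	// Readlink fails if the path is missing or is not a symlink.
	target, err := os.Readlink(link)
	if err != nil {
		fmt.Println("no symlink to inspect:", err)
		return
	}

	// The link in this bug points at an absolute path, so Stat can
	// check the target directly; ENOENT means the link is dangling,
	// the state that makes RHEL 7's systemd fail with "File exists".
	if _, err := os.Stat(target); os.IsNotExist(err) {
		fmt.Printf("dangling symlink: %s -> %s\n", link, target)
	} else {
		fmt.Printf("symlink resolves: %s -> %s\n", link, target)
	}
}
```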
Oh, I think it's because the other service is `PartOf=openvswitch.service`, so enabling it doesn't actually attempt to write anything. I can confirm that if you delete the dangling file with `rm /etc/systemd/system/multi-user.target.wants/openvswitch.service`, future enables in 4.7 work fine.

So the systemd version in RHCOS has a fix for this issue, where it first removes the dangling symlink; that is why only RHEL hits it, since RHEL 7 still ships a very old systemd. I will emulate this in the systemctl enable call in the MCD as a fix. Disables are fine, since disabling removes any dangling symlinks, and preset errors are not evaluated.

Also encountered while upgrading from 4.6.0-0.nightly-2021-01-18-070340 to 4.7.0-0.nightly-2021-01-19-051335; just added to have the information on record and avoid any duplicate bug. Adding the TestBlocker keyword since this blocks upgrades with RHEL workers from 4.6 to 4.7.

Verified with two RHEL 7.9 workers in the cluster and an upgrade from 4.6.16 to 4.7.0-0.nightly-2021-02-06-084550: no upgrade error.
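The fix described above amounts to emulating newer systemd: clear a dangling wants-symlink before calling enable. A hedged Go sketch of that idea follows; the function names and exact cleanup behavior are assumptions for illustration, not the actual MCD change:

```go
// Hypothetical sketch: remove a dangling wants-symlink so that
// `systemctl enable` on RHEL 7's older systemd can recreate it
// instead of failing with "File exists".
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

const wantsDir = "/etc/systemd/system/multi-user.target.wants"

// removeDanglingWantsLink deletes the unit's wants-symlink only when
// it points at a path that no longer exists.
func removeDanglingWantsLink(unit string) error {
	link := filepath.Join(wantsDir, unit)
	target, err := os.Readlink(link)
	if err != nil {
		return nil // absent or not a symlink: nothing to clean up
	}
	if !filepath.IsAbs(target) {
		// Relative targets resolve against the link's directory.
		target = filepath.Join(filepath.Dir(link), target)
	}
	if _, err := os.Stat(target); os.IsNotExist(err) {
		return os.Remove(link) // dangling: clear it so enable succeeds
	}
	return nil
}

// enableUnit mirrors the "clean, then enable" ordering described in
// the comment above.
func enableUnit(unit string) error {
	if err := removeDanglingWantsLink(unit); err != nil {
		return err
	}
	return exec.Command("systemctl", "enable", unit).Run()
}

func main() {
	if err := enableUnit("openvswitch.service"); err != nil {
		fmt.Fprintln(os.Stderr, "enable failed:", err)
		os.Exit(1)
	}
}
```

Disabling does not need this guard, since (as noted above) `systemctl disable` already removes any dangling symlinks.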
# oc get clusterversion -oyaml
...
history:
- completionTime: "2021-02-09T07:35:31Z"
image: registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-02-06-084550
startedTime: "2021-02-09T06:31:29Z"
state: Completed
verified: false
version: 4.7.0-0.nightly-2021-02-06-084550
- completionTime: "2021-02-09T04:20:53Z"
image: quay.io/openshift-release-dev/ocp-release@sha256:3e855ad88f46ad1b7f56c312f078ca6adaba623c5d4b360143f9f82d2f349741
startedTime: "2021-02-09T03:49:43Z"
state: Completed
verified: false
version: 4.6.16
observedGeneration: 4
...
# oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.7.0-0.nightly-2021-02-06-084550 True False False 161m
baremetal 4.7.0-0.nightly-2021-02-06-084550 True False False 3h17m
cloud-credential 4.7.0-0.nightly-2021-02-06-084550 True False False 6h23m
cluster-autoscaler 4.7.0-0.nightly-2021-02-06-084550 True False False 6h17m
config-operator 4.7.0-0.nightly-2021-02-06-084550 True False False 6h18m
console 4.7.0-0.nightly-2021-02-06-084550 True False False 165m
csi-snapshot-controller 4.7.0-0.nightly-2021-02-06-084550 True False False 160m
dns 4.7.0-0.nightly-2021-02-06-084550 True False False 6h17m
etcd 4.7.0-0.nightly-2021-02-06-084550 True False False 6h16m
image-registry 4.7.0-0.nightly-2021-02-06-084550 True False False 174m
ingress 4.7.0-0.nightly-2021-02-06-084550 True False False 6h7m
insights 4.7.0-0.nightly-2021-02-06-084550 True False False 6h18m
kube-apiserver 4.7.0-0.nightly-2021-02-06-084550 True False False 6h15m
kube-controller-manager 4.7.0-0.nightly-2021-02-06-084550 True False False 6h15m
kube-scheduler 4.7.0-0.nightly-2021-02-06-084550 True False False 6h15m
kube-storage-version-migrator 4.7.0-0.nightly-2021-02-06-084550 True False False 164m
machine-api 4.7.0-0.nightly-2021-02-06-084550 True False False 6h7m
machine-approver 4.7.0-0.nightly-2021-02-06-084550 True False False 6h17m
machine-config 4.7.0-0.nightly-2021-02-06-084550 True False False 160m
marketplace 4.7.0-0.nightly-2021-02-06-084550 True False False 165m
monitoring 4.7.0-0.nightly-2021-02-06-084550 True False False 166m
network 4.7.0-0.nightly-2021-02-06-084550 True False False 3h4m
node-tuning 4.7.0-0.nightly-2021-02-06-084550 True False False 3h12m
openshift-apiserver 4.7.0-0.nightly-2021-02-06-084550 True False False 160m
openshift-controller-manager 4.7.0-0.nightly-2021-02-06-084550 True False False 6h10m
openshift-samples 4.7.0-0.nightly-2021-02-06-084550 True False False 3h12m
operator-lifecycle-manager 4.7.0-0.nightly-2021-02-06-084550 True False False 6h17m
operator-lifecycle-manager-catalog 4.7.0-0.nightly-2021-02-06-084550 True False False 6h17m
operator-lifecycle-manager-packageserver 4.7.0-0.nightly-2021-02-06-084550 True False False 163m
service-ca 4.7.0-0.nightly-2021-02-06-084550 True False False 6h18m
storage 4.7.0-0.nightly-2021-02-06-084550 True False False 165m
# oc get no -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-10-0-49-230.us-east-2.compute.internal Ready master 6h22m v1.20.0+ba45583 10.0.49.230 <none> Red Hat Enterprise Linux CoreOS 47.83.202102060438-0 (Ootpa) 4.18.0-240.10.1.el8_3.x86_64 cri-o://1.20.0-0.rhaos4.7.git78527db.el8.49
ip-10-0-54-81.us-east-2.compute.internal Ready worker 4h27m v1.19.0+e405995 10.0.54.81 <none> Red Hat Enterprise Linux Server 7.9 (Maipo) 3.10.0-1160.15.2.el7.x86_64 cri-o://1.19.1-7.rhaos4.6.git6377f68.el7
ip-10-0-58-160.us-east-2.compute.internal Ready worker 4h27m v1.19.0+e405995 10.0.58.160 <none> Red Hat Enterprise Linux Server 7.9 (Maipo) 3.10.0-1160.15.2.el7.x86_64 cri-o://1.19.1-7.rhaos4.6.git6377f68.el7
ip-10-0-60-71.us-east-2.compute.internal Ready master 6h22m v1.20.0+ba45583 10.0.60.71 <none> Red Hat Enterprise Linux CoreOS 47.83.202102060438-0 (Ootpa) 4.18.0-240.10.1.el8_3.x86_64 cri-o://1.20.0-0.rhaos4.7.git78527db.el8.49
ip-10-0-64-226.us-east-2.compute.internal Ready master 6h21m v1.20.0+ba45583 10.0.64.226 <none> Red Hat Enterprise Linux CoreOS 47.83.202102060438-0 (Ootpa) 4.18.0-240.10.1.el8_3.x86_64 cri-o://1.20.0-0.rhaos4.7.git78527db.el8.49
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633