Description of problem:
Upgrade from 4.7 to 4.8 is stuck on machine-config.

Version-Release number of selected component (if applicable):
4.7.3 -> 4.8.0-0.nightly-2021-03-22-104536

How reproducible:
1 of 1 attempts so far.

Steps to Reproduce:
This issue happened in upgrade CI, on OSP16 with scaled-up RHEL workers. After the upgrade, the cluster was stuck on machine-config.

[2021-03-22T22:10:14.042Z] Post action: #oc get co:
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
[2021-03-22T22:10:14.042Z] authentication 4.8.0-0.nightly-2021-03-22-104536 True False False 3h27m
[2021-03-22T22:10:14.042Z] baremetal 4.8.0-0.nightly-2021-03-22-104536 True False False 5h2m
[2021-03-22T22:10:14.042Z] cloud-credential 4.8.0-0.nightly-2021-03-22-104536 True False False 5h9m
[2021-03-22T22:10:14.042Z] cluster-autoscaler 4.8.0-0.nightly-2021-03-22-104536 True False False 5h
[2021-03-22T22:10:14.042Z] config-operator 4.8.0-0.nightly-2021-03-22-104536 True False False 5h2m
[2021-03-22T22:10:14.042Z] console 4.8.0-0.nightly-2021-03-22-104536 True False False 140m
[2021-03-22T22:10:14.042Z] csi-snapshot-controller 4.8.0-0.nightly-2021-03-22-104536 True False False 3h8m
[2021-03-22T22:10:14.042Z] dns 4.8.0-0.nightly-2021-03-22-104536 True False False 5h
[2021-03-22T22:10:14.042Z] etcd 4.8.0-0.nightly-2021-03-22-104536 True False False 5h1m
[2021-03-22T22:10:14.042Z] image-registry 4.8.0-0.nightly-2021-03-22-104536 True False False 4h51m
[2021-03-22T22:10:14.042Z] ingress 4.8.0-0.nightly-2021-03-22-104536 True False False 4h50m
[2021-03-22T22:10:14.042Z] insights 4.8.0-0.nightly-2021-03-22-104536 True False False 4h55m
[2021-03-22T22:10:14.042Z] kube-apiserver 4.8.0-0.nightly-2021-03-22-104536 True False False 4h59m
[2021-03-22T22:10:14.042Z] kube-controller-manager 4.8.0-0.nightly-2021-03-22-104536 True False False 4h58m
[2021-03-22T22:10:14.042Z] kube-scheduler 4.8.0-0.nightly-2021-03-22-104536 True False False 4h58m
[2021-03-22T22:10:14.042Z] kube-storage-version-migrator 4.8.0-0.nightly-2021-03-22-104536 True False False 3h5m
[2021-03-22T22:10:14.042Z] machine-api 4.8.0-0.nightly-2021-03-22-104536 True False False 4h57m
[2021-03-22T22:10:14.042Z] machine-approver 4.8.0-0.nightly-2021-03-22-104536 True False False 5h1m
[2021-03-22T22:10:14.042Z] machine-config 4.7.3 False True True 91m
[2021-03-22T22:10:14.042Z] marketplace 4.8.0-0.nightly-2021-03-22-104536 True False False 141m
[2021-03-22T22:10:14.042Z] monitoring 4.8.0-0.nightly-2021-03-22-104536 True False False 6m46s
[2021-03-22T22:10:14.042Z] network 4.8.0-0.nightly-2021-03-22-104536 True False False 4h59m
[2021-03-22T22:10:14.042Z] node-tuning 4.8.0-0.nightly-2021-03-22-104536 True False False 142m
[2021-03-22T22:10:14.042Z] openshift-apiserver 4.8.0-0.nightly-2021-03-22-104536 True False False 175m
[2021-03-22T22:10:14.042Z] openshift-controller-manager 4.8.0-0.nightly-2021-03-22-104536 True False False 141m
[2021-03-22T22:10:14.042Z] openshift-samples 4.8.0-0.nightly-2021-03-22-104536 True False False 142m
[2021-03-22T22:10:14.042Z] operator-lifecycle-manager 4.8.0-0.nightly-2021-03-22-104536 True False False 5h1m
[2021-03-22T22:10:14.042Z] operator-lifecycle-manager-catalog 4.8.0-0.nightly-2021-03-22-104536 True False False 5h1m
[2021-03-22T22:10:14.042Z] operator-lifecycle-manager-packageserver 4.8.0-0.nightly-2021-03-22-104536 True False False 140m
[2021-03-22T22:10:14.043Z] service-ca 4.8.0-0.nightly-2021-03-22-104536 True False False 5h2m
[2021-03-22T22:10:14.043Z] storage 4.8.0-0.nightly-2021-03-22-104536 True False False 3h6m

#oc get node:
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
[2021-03-22T22:10:14.042Z] ugd-12708-7v9h4-master-0 Ready master 5h8m v1.20.0+551f7b2 192.168.1.56 <none> Red Hat Enterprise Linux CoreOS 47.83.202103140039-0 (Ootpa) 4.18.0-240.15.1.el8_3.x86_64 cri-o://1.20.1-5.rhaos4.7.git62f21aa.el8
[2021-03-22T22:10:14.042Z] ugd-12708-7v9h4-master-1 Ready master 5h5m v1.20.0+551f7b2 192.168.0.23 <none> Red Hat Enterprise Linux CoreOS 47.83.202103140039-0 (Ootpa) 4.18.0-240.15.1.el8_3.x86_64 cri-o://1.20.1-5.rhaos4.7.git62f21aa.el8
[2021-03-22T22:10:14.042Z] ugd-12708-7v9h4-master-2 Ready master 5h4m v1.20.0+551f7b2 192.168.0.157 <none> Red Hat Enterprise Linux CoreOS 47.83.202103140039-0 (Ootpa) 4.18.0-240.15.1.el8_3.x86_64 cri-o://1.20.1-5.rhaos4.7.git62f21aa.el8
[2021-03-22T22:10:14.042Z] ugd-12708-7v9h4-rhel-0 Ready worker 4h6m v1.20.0+bafe72f 192.168.0.165 10.0.97.72 Red Hat Enterprise Linux Server 7.9 (Maipo) 3.10.0-1160.21.1.el7.x86_64 cri-o://1.20.2-3.rhaos4.7.gitfecc319.el7
[2021-03-22T22:10:14.042Z] ugd-12708-7v9h4-rhel-1 Ready worker 4h6m v1.20.0+bafe72f 192.168.0.131 10.0.97.159 Red Hat Enterprise Linux Server 7.9 (Maipo) 3.10.0-1160.21.1.el7.x86_64 cri-o://1.20.2-3.rhaos4.7.gitfecc319.el7
[2021-03-22T22:10:14.042Z] ugd-12708-7v9h4-worker-0 Ready worker 4h54m v1.20.0+551f7b2 192.168.1.66 <none> Red Hat Enterprise Linux CoreOS 47.83.202103140039-0 (Ootpa) 4.18.0-240.15.1.el8_3.x86_64 cri-o://1.20.1-5.rhaos4.7.git62f21aa.el8
[2021-03-22T22:10:14.042Z] ugd-12708-7v9h4-worker-1 Ready worker 4h55m v1.20.0+551f7b2 192.168.3.147 <none> Red Hat Enterprise Linux CoreOS 47.83.202103140039-0 (Ootpa) 4.18.0-240.15.1.el8_3.x86_64 cri-o://1.20.1-5.rhaos4.7.git62f21aa.el8
[2021-03-22T22:10:14.042Z] ugd-12708-7v9h4-worker-2 Ready worker 4h55m v1.20.0+551f7b2 192.168.1.89 <none> Red Hat Enterprise Linux CoreOS 47.83.202103140039-0 (Ootpa) 4.18.0-240.15.1.el8_3.x86_64 cri-o://1.20.1-5.rhaos4.7.git62f21aa.el8

Checked the must-gather logs: the machine-config-daemon pods show errors.
-----------------------------------------
cat machine-config-daemon/machine-config-daemon/logs/current.log
2021-03-22T18:12:30.507057316-04:00 I0322 22:12:30.506621 150865 start.go:108] Version: v4.8.0-202103220526.p0-dirty (d6dabadeca05789363a2bbd56fbd3b16a3b21777)
2021-03-22T18:12:30.509752749-04:00 I0322 22:12:30.509683 150865 start.go:121] Calling chroot("/rootfs")
2021-03-22T18:12:30.562586722-04:00 I0322 22:12:30.562493 150865 start.go:97] Copied self to /run/bin/machine-config-daemon on host
2021-03-22T18:12:30.565547109-04:00 I0322 22:12:30.564695 150865 metrics.go:105] Registering Prometheus metrics
2021-03-22T18:12:30.565547109-04:00 I0322 22:12:30.564776 150865 metrics.go:110] Starting metrics listener on 127.0.0.1:8797
2021-03-22T18:12:30.567619353-04:00 I0322 22:12:30.567574 150865 update.go:1851] Starting to manage node: ugd-12708-7v9h4-rhel-1
2021-03-22T18:12:30.569712726-04:00 I0322 22:12:30.569621 150865 rpm-ostree.go:258] Running captured: journalctl --list-boots
2021-03-22T18:12:30.571688870-04:00 I0322 22:12:30.571533 150865 daemon.go:669] Detected a new login session: New session 1 of user cloud-user.
2021-03-22T18:12:30.571688870-04:00 I0322 22:12:30.571551 150865 daemon.go:670] Login access is discouraged! Applying annotation: machineconfiguration.openshift.io/ssh
2021-03-22T18:12:30.572547241-04:00 I0322 22:12:30.572377 150865 daemon.go:858] journalctl --list-boots:
2021-03-22T18:12:30.572547241-04:00 0 e2082f440ebd4603818fb63fe6a70f49 Mon 2021-03-22 14:03:39 EDT—Mon 2021-03-22 18:12:30 EDT
2021-03-22T18:12:30.572547241-04:00 I0322 22:12:30.572399 150865 rpm-ostree.go:258] Running captured: systemctl list-units --state=failed --no-legend
2021-03-22T18:12:30.577332744-04:00 I0322 22:12:30.577155 150865 daemon.go:871] systemctl --failed:
2021-03-22T18:12:30.577332744-04:00 afterburn-hostname.service loaded failed failed Afterburn Hostname
2021-03-22T18:12:30.577332744-04:00 ovirt-guest-agent.service loaded failed failed oVirt Guest Agent
2021-03-22T18:12:30.577332744-04:00 I0322 22:12:30.577178 150865 daemon.go:607] Starting MachineConfigDaemon
2021-03-22T18:12:30.577332744-04:00 I0322 22:12:30.577186 150865 daemon.go:577] Guarding against sigterm signal
2021-03-22T18:12:30.577332744-04:00 I0322 22:12:30.577205 150865 daemon.go:614] Enabling Kubelet Healthz Monitor
2021-03-22T18:12:30.677649628-04:00 W0322 22:12:30.677546 150865 daemon.go:635] Got an error from auxiliary tools: error: cannot apply annotation for SSH access due to: unable to update node "nil": node "ugd-12708-7v9h4-rhel-1" not found
2021-03-22T18:12:30.677649628-04:00 I0322 22:12:30.677614 150865 daemon.go:636] Shutting down MachineConfigDaemon
2021-03-22T18:12:30.677760643-04:00 F0322 22:12:30.677698 150865 helpers.go:147] error: cannot apply annotation for SSH access due to: unable to update node "nil": node "ugd-12708-7v9h4-rhel-1" not found

----------------------------
snippet of machine-config-daemon-vhv2g.yaml
- containerID: cri-o://d67ce5f38b2f540e7bdaf522697929beca1d84c07e13b8475fae175d5d09989a
  image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c183d99756e887d90846b0dd379157173646feb5fe2fe7379f26932a7a591ae4
  imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c183d99756e887d90846b0dd379157173646feb5fe2fe7379f26932a7a591ae4
  lastState:
    terminated:
      containerID: cri-o://d67ce5f38b2f540e7bdaf522697929beca1d84c07e13b8475fae175d5d09989a
      exitCode: 255
      finishedAt: "2021-03-22T22:12:30Z"
      message: |
        I0322 22:12:30.506621 150865 start.go:108] Version: v4.8.0-202103220526.p0-dirty (d6dabadeca05789363a2bbd56fbd3b16a3b21777)
        I0322 22:12:30.509683 150865 start.go:121] Calling chroot("/rootfs")
        I0322 22:12:30.562493 150865 start.go:97] Copied self to /run/bin/machine-config-daemon on host
        I0322 22:12:30.564695 150865 metrics.go:105] Registering Prometheus metrics
        I0322 22:12:30.564776 150865 metrics.go:110] Starting metrics listener on 127.0.0.1:8797
        I0322 22:12:30.567574 150865 update.go:1851] Starting to manage node: ugd-12708-7v9h4-rhel-1
        I0322 22:12:30.569621 150865 rpm-ostree.go:258] Running captured: journalctl --list-boots
        I0322 22:12:30.571533 150865 daemon.go:669] Detected a new login session: New session 1 of user cloud-user.
        I0322 22:12:30.571551 150865 daemon.go:670] Login access is discouraged! Applying annotation: machineconfiguration.openshift.io/ssh
        I0322 22:12:30.572377 150865 daemon.go:858] journalctl --list-boots:
        0 e2082f440ebd4603818fb63fe6a70f49 Mon 2021-03-22 14:03:39 EDT—Mon 2021-03-22 18:12:30 EDT
        I0322 22:12:30.572399 150865 rpm-ostree.go:258] Running captured: systemctl list-units --state=failed --no-legend
        I0322 22:12:30.577155 150865 daemon.go:871] systemctl --failed:
        afterburn-hostname.service loaded failed failed Afterburn Hostname
        ovirt-guest-agent.service loaded failed failed oVirt Guest Agent
        I0322 22:12:30.577178 150865 daemon.go:607] Starting MachineConfigDaemon
        I0322 22:12:30.577186 150865 daemon.go:577] Guarding against sigterm signal
        I0322 22:12:30.577205 150865 daemon.go:614] Enabling Kubelet Healthz Monitor
        W0322 22:12:30.677546 150865 daemon.go:635] Got an error from auxiliary tools: error: cannot apply annotation for SSH access due to: unable to update node "nil": node "ugd-12708-7v9h4-rhel-1" not found
        I0322 22:12:30.677614 150865 daemon.go:636] Shutting down MachineConfigDaemon
        F0322 22:12:30.677698 150865 helpers.go:147] error: cannot apply annotation for SSH access due to: unable to update node "nil": node "ugd-12708-7v9h4-rhel-1" not found
      reason: Error
      startedAt: "2021-03-22T22:12:30Z"
  name: machine-config-daemon
  ready: false
  restartCount: 7
  started: false
  state:
    waiting:
      message: back-off 5m0s restarting failed container=machine-config-daemon pod=machine-config-daemon-vhv2g_openshift-machine-config-operator(96b6e38f-b5f7-4b84-84fa-11a68cc5c095)
      reason: CrashLoopBackOff

---------------------------------
Abnormal co details
[2021-03-22T22:11:23.775Z] Conditions:
[2021-03-22T22:11:23.775Z]   Last Transition Time: 2021-03-22T20:40:47Z
[2021-03-22T22:11:23.775Z]   Message: Working towards 4.8.0-0.nightly-2021-03-22-104536
[2021-03-22T22:11:23.775Z]   Status: True
[2021-03-22T22:11:23.775Z]   Type: Progressing
[2021-03-22T22:11:23.775Z]   Last Transition Time: 2021-03-22T20:50:49Z
[2021-03-22T22:11:23.775Z]   Message: Unable to apply 4.8.0-0.nightly-2021-03-22-104536: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 8, updated: 5, ready: 6, unavailable: 2)
[2021-03-22T22:11:23.775Z]   Reason: MachineConfigDaemonFailed
[2021-03-22T22:11:23.775Z]   Status: True
[2021-03-22T22:11:23.775Z]   Type: Degraded
[2021-03-22T22:11:23.775Z]   Last Transition Time: 2021-03-22T20:38:13Z
[2021-03-22T22:11:23.775Z]   Message: Cluster not available for 4.8.0-0.nightly-2021-03-22-104536
[2021-03-22T22:11:23.775Z]   Status: False
[2021-03-22T22:11:23.775Z]   Type: Available
[2021-03-22T22:11:23.775Z]   Last Transition Time: 2021-03-22T17:09:10Z
[2021-03-22T22:11:23.775Z]   Reason: AsExpected
[2021-03-22T22:11:23.775Z]   Status: True
[2021-03-22T22:11:23.775Z]   Type: Upgradeable

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
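For triage, the `oc get co` output above can be filtered down to the operators that block the upgrade. The following is only a minimal sketch, not part of the CI tooling: the function name `unhealthy_operators`, the `ROW` pattern and the `SAMPLE` text are hypothetical, and the sketch assumes the six-column layout (NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE) and the optional bracketed CI timestamp prefix seen in this report.

```python
import re

# Assumed row layout, taken from the output pasted in this report:
#   [CI timestamp] NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
# The bracketed CI timestamp prefix is optional.
ROW = re.compile(
    r"^(?:\[[^\]]+\]\s+)?"
    r"(?P<name>\S+)\s+(?P<version>\S+)\s+"
    r"(?P<available>True|False)\s+(?P<progressing>True|False)\s+"
    r"(?P<degraded>True|False)\s+(?P<since>\S+)\s*$"
)


def unhealthy_operators(text, target_version):
    """Return rows that lag target_version or are not True/False/False."""
    flagged = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # header line or unrelated console output
        row = match.groupdict()
        healthy = (
            row["version"] == target_version
            and row["available"] == "True"
            and row["progressing"] == "False"
            and row["degraded"] == "False"
        )
        if not healthy:
            flagged.append(row)
    return flagged


# Three rows copied from the report, used here as sample input.
SAMPLE = """\
[2021-03-22T22:10:14.042Z] machine-approver 4.8.0-0.nightly-2021-03-22-104536 True False False 5h1m
[2021-03-22T22:10:14.042Z] machine-config 4.7.3 False True True 91m
[2021-03-22T22:10:14.042Z] marketplace 4.8.0-0.nightly-2021-03-22-104536 True False False 141m
"""

if __name__ == "__main__":
    target = "4.8.0-0.nightly-2021-03-22-104536"
    for row in unhealthy_operators(SAMPLE, target):
        print(row["name"], row["version"],
              row["available"], row["progressing"], row["degraded"])
```

On the sample rows this flags only machine-config, which matches the report: it is still at 4.7.3 with Available=False, Progressing=True and Degraded=True.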