Created attachment 1788204 [details]
must gather from cluster

Platform: libvirt

```
$ oc version
Client Version: 4.7.5
Server Version: 4.8.0-0.nightly-2021-05-26-021757
Kubernetes Version: v1.21.0-rc.0+936c9e2

$ oc describe node
[...]
System Info:
  Machine ID:                  a9a56b2ccc644e98b5c39e43ea33b80d
  System UUID:                 4cdef27a-2f86-4fe5-b861-e1470d46aff5
  Boot ID:                     f7948d7c-1734-4993-b44c-28d0d23c5552
  Kernel Version:              4.18.0-305.el8.x86_64
  OS Image:                    Red Hat Enterprise Linux CoreOS 48.84.202105251733-0 (Ootpa)
  Operating System:            linux
  Architecture:                amd64
  Container Runtime Version:   cri-o://1.21.0-98.rhaos4.8.git1f3c5cb.el8
  Kubelet Version:             v1.21.0-rc.0+9acd492
  Kube-Proxy Version:          v1.21.0-rc.0+9acd492
```

We are using https://github.com/code-ready/snc/blob/master/snc.sh to provision a single-node OpenShift cluster. Once it is provisioned, we gracefully shut down the node and create the disk images.

What happened? What went wrong or what did you expect?

On the 4.8 side we tried to enable the MCO as part of the CVO (which was not the case before, since 4.7 single-node clusters were not fully supported by the MCO). We observed that `virsh shutdown <node>` now takes around 6-7 minutes :(

What are the steps to reproduce your issue?

```
$ git clone https://github.com/code-ready/snc.git
$ cd snc
$ export OPENSHIFT_VERSION=4.8.0-0.nightly-2021-05-29-114625
$ export MIRROR=https://mirror.openshift.com/pub/openshift-v4/clients/ocp-dev-preview
$ export OPENSHIFT_PULL_SECRET_PATH=<pull-secret_path>
$ snc.sh
<wait until the cluster provisioning completes>
$ sudo virsh shutdown <node>
<keep an eye on `virsh domstate <node>`>
```

Attached are the must-gather logs and the output of `journalctl --no-pager -rb -1`. It looks like something on the `machine-config-` side; see the journal entries below (check the timestamps; the journal is in reverse order because of `-r`):

```
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: Stopped Dynamically sets the system reserved for the kubelet.
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: kubelet-auto-node-size.service: Succeeded.
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: crio.service: Consumed 3min 3.008s CPU time
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: Stopped Open Container Initiative Daemon.
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: crio.service: Succeeded.
May 31 09:17:17 crc-r6m8k-master-0 crio[2038]: time="2021-05-31 09:17:17.283482143Z" level=error msg="Failed to update container state for e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e: stdout: , stderr: "
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: Stopping Open Container Initiative Daemon...
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: crio-e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e.scope: Consumed 1.662s CPU time
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: Stopped libcontainer container e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e.
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: crio-e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e.scope: Failed with result 'timeout'.
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: crio-conmon-e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e.scope: Consumed 47ms CPU time
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: crio-conmon-e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e.scope: Succeeded.
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: crio-e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e.scope: Killing process 9525 (machine-config-) with signal SIGKILL.
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: crio-e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e.scope: Stopping timed out. Killing.
May 31 09:08:51 crc-r6m8k-master-0 systemd[1]: kubepods-burstable-podd22a725b_fc5c_4d26_a280_5ac941fc79f7.slice: Consumed 9min 14.777s CPU time
May 31 09:08:51 crc-r6m8k-master-0 systemd[1]: Removed slice libcontainer container kubepods-burstable-podd22a725b_fc5c_4d26_a280_5ac941fc79f7.slice.
May 31 09:08:51 crc-r6m8k-master-0 systemd[1]: crio-fa84338620b953f6a5981aa41bc8d3702b74efcb7a3b1c4f467ee0d528db97eb.scope: Consumed 8min 48.175s CPU time
May 31 09:08:51 crc-r6m8k-master-0 systemd[1]: Stopped libcontainer container fa84338620b953f6a5981aa41bc8d3702b74efcb7a3b1c4f467ee0d528db97eb.
May 31 09:08:51 crc-r6m8k-master-0 systemd[1]: crio-fa84338620b953f6a5981aa41bc8d3702b74efcb7a3b1c4f467ee0d528db97eb.scope: Succeeded.
May 31 09:08:47 crc-r6m8k-master-0 systemd[1]: crio-conmon-fa84338620b953f6a5981aa41bc8d3702b74efcb7a3b1c4f467ee0d528db97eb.scope: Consumed 707ms CPU time
May 31 09:08:47 crc-r6m8k-master-0 systemd[1]: Stopped crio-conmon-fa84338620b953f6a5981aa41bc8d3702b74efcb7a3b1c4f467ee0d528db97eb.scope.
May 31 09:08:47 crc-r6m8k-master-0 systemd[1]: crio-conmon-fa84338620b953f6a5981aa41bc8d3702b74efcb7a3b1c4f467ee0d528db97eb.scope: Failed with result 'timeout'.
May 31 09:08:47 crc-r6m8k-master-0 systemd[1]: crio-conmon-fa84338620b953f6a5981aa41bc8d3702b74efcb7a3b1c4f467ee0d528db97eb.scope: Killing process 34571 (conmon) with signal SIGKILL.
May 31 09:08:47 crc-r6m8k-master-0 systemd[1]: crio-conmon-fa84338620b953f6a5981aa41bc8d3702b74efcb7a3b1c4f467ee0d528db97eb.scope: Stopping timed out. Killing.
```
Created attachment 1788205 [details]
journal logs
From the same logs, it seems to be the machine-config-daemon:

```
May 31 08:36:44 crc-r6m8k-master-0 crio[2038]: time="2021-05-31 08:36:44.905811374Z" level=info msg="Created container e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e: openshift-machine-config-operator/machine-config-daemon-mznlz/machine-config-daemon" id=68ff21bb-cc5a-4abe-a100-b98f3aae6e92 name=/runtime.v1alpha2.RuntimeService/CreateContainer
```
Machine-config-daemon logs after it gets the SIGTERM request (not sure why it has to wait around 7 minutes checking cluster health):

```
$ cat /var/log/containers/machine-config-daemon-mznlz_openshift-machine-config-operator_machine-config-daemon-e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e.log
2021-05-31T09:07:17.147984780+00:00 stderr F I0531 09:07:17.145715    9525 daemon.go:588] Got SIGTERM, but actively updating
2021-05-31T09:07:45.754982656+00:00 stderr F W0531 09:07:45.754713    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 1 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:08:15.758405252+00:00 stderr F W0531 09:08:15.758132    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 2 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:08:45.760437892+00:00 stderr F W0531 09:08:45.760191    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 3 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:09:15.765011434+00:00 stderr F W0531 09:09:15.764051    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 4 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:09:45.767975905+00:00 stderr F W0531 09:09:45.767926    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 5 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:10:15.772725881+00:00 stderr F W0531 09:10:15.772608    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 6 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:10:45.775096522+00:00 stderr F W0531 09:10:45.774862    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 7 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:11:15.776137890+00:00 stderr F W0531 09:11:15.776012    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 8 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:11:45.778179989+00:00 stderr F W0531 09:11:45.778116    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 9 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:12:15.780463162+00:00 stderr F W0531 09:12:15.780395    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 10 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:12:45.783185069+00:00 stderr F W0531 09:12:45.783045    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 11 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:13:15.786308702+00:00 stderr F W0531 09:13:15.786164    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 12 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:13:45.789917291+00:00 stderr F W0531 09:13:45.789742    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 13 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:14:15.792876829+00:00 stderr F W0531 09:14:15.792743    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 14 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:14:45.795707517+00:00 stderr F W0531 09:14:45.795522    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 15 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
```
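For context on the cadence above, here is a minimal Go sketch (an assumption of the shape of the loop, not the MCD's actual monitor code) of a 30-second poll against the kubelet's healthz port 10248, which would produce warnings at exactly these intervals:

```
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	failures := 0
	// Poll every 30 seconds, matching the spacing of the log entries above.
	for range time.Tick(30 * time.Second) {
		resp, err := http.Get("http://localhost:10248/healthz") // kubelet healthz endpoint
		if err != nil {
			failures++
			fmt.Printf("kubelet health check has failed %d times: %v\n", failures, err)
			continue
		}
		resp.Body.Close()
		failures = 0 // healthy again; reset the counter
	}
}
```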
Created attachment 1788214 [details]
Machine daemon logs
Some initial thoughts:

1. "Got SIGTERM, but actively updating" signifies that the MCO is currently trying to perform an update of some sort. Since the MCO itself is not atomic in its operations, leaving it in a half-baked state would break the cluster, so the MCO does not terminate immediately. That said, I haven't looked at that code in a while, so maybe some part of it is not working as intended (see the sketch after this list).

2. "Got an error from auxiliary tools: kubelet health check has failed x times" is a bit of a red herring: the kubelet health monitor is a separate process in the MCO that doesn't really block anything (as far as I'm aware). In this case it hasn't been shut down yet but the kubelet is down, so it just spams the logs a bit.

3. If possible, could you check the MCD logs (prior to shutdown, if they are still available somehow) and provide them? Or if you could replicate this in a non-libvirt single-node cluster, or a multi-node/non-libvirt setup, that might be a good additional data point.
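To make point 1 concrete, a minimal Go sketch of the pattern, assuming a simple atomic flag rather than the MCD's real lock and using hypothetical names throughout: the daemon defers SIGTERM while an update is in flight, and if the flag is never cleared, the process keeps ignoring SIGTERM until systemd escalates to SIGKILL, exactly as in the journal above.

```
package main

import (
	"log"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

// updating is a hypothetical stand-in for the MCD's internal update lock.
var updating atomic.Bool

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	go func() {
		for range sigs {
			if updating.Load() {
				// Refuse to exit mid-update: a half-applied MachineConfig
				// could leave the node in a broken state.
				log.Print("Got SIGTERM, but actively updating")
				continue
			}
			os.Exit(0)
		}
	}()

	updating.Store(true)
	time.Sleep(30 * time.Second) // stand-in for applying an update
	updating.Store(false)        // if a code path skips this, SIGTERM is deferred forever

	select {} // idle; a SIGTERM arriving now exits cleanly
}
```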
Created attachment 1788412 [details]
mcd logs
> If possible, could you check the MCD logs (prior to shutdown, if they are still available somehow) and provide them? Or if you could replicate this in a non-libvirt single-node cluster, or a multi-node/non-libvirt setup, that might be a good additional data point.

@Zhang I attached the MCD logs; I will try to see if the same can be replicated on an AWS/GCP single-node cluster.
In the provided MCD logs, the MCD did attempt to perform an update (one that, fortunately, was rebootless). I wonder if that did something with the lock. The times don't really align with the virsh shutdown logs, though, and nothing specific jumps out. Generally speaking, manual shutdowns are not really tested, so maybe this does exist on cloud platforms too. Let me know if you are able to replicate it.
> Generally speaking, manual shutdowns are not really tested, so maybe this does exist on cloud platforms too. Let me know if you are able to replicate it.

@Zhang I did try to reproduce it on GCP by creating a single-node cluster (using cluster-bot) and then manually stopping that VM from the GCloud UI. When you try to stop a VM on GCP it shows the message below, so I am not sure whether it forces a shutdown in that case. I am now out of ideas for reproducing it in any cloud environment. On the libvirt side it is consistently happening, even with the latest nightly.

```
Stop shuts down the instance. If the shutdown doesn't complete within 90 seconds, the instance is forced to halt. This can lead to file-system corruption. Do you want to stop instance
```
I am able to reproduce this intermittently with snc.sh + recent nightlies (4.8.0-0.nightly-2021-06-11-053039 is one example). To be clear, I think they're all affected; it's just a function of timing, because the snc.sh process is complicated and non-deterministic.

Now that I understand what happens, I can also reproduce this on an AWS (non-libvirt) cluster of one or more nodes. It looks like the "ignore SIGTERM" handler doesn't get removed in the case of rebootless updates (so that "fortunately rebootless" update may not have been so fortunate). SSH key updates appear to make it happen reliably. An MCO-triggered reboot takes a different code path and removes the lock itself, which is why we don't see this when the MCO itself invokes the reboot.

Steps to reproduce:

1.) Build a cluster.

2.) Update an SSH key. I applied this manifest with pubkeys in it (for the paranoid: yes, they are trash public keys):

cat << EOF > 99-broken-ssh.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-broken-ssh
spec:
  config:
    ignition:
      version: 2.2.0
    passwd:
      users:
      - name: core
        sshAuthorizedKeys:
        - ecdsa-sha2-nistp521 AAAAE2VjZHNhLXNoYTItbmlzdHA1MjEAAAAIbmlzdHA1MjEAAACFBAHZYVq/odCCXlmc6PoAIXQbSvNyFOwLbxC9IvWBXeBZsEqFM/9JRV6fJ/fdje/ItnJLPT7opn/BkqM/RumyvQvANQDwqsXaiIFZbWqKuNywH6vc45BhrD7vSLpcLkHRmqDXa7YHAH14xRgyPptbqdMjE9zzThJf68U3+dzT3hkaRGaDNA==
        - ecdsa-sha2-nistp521 AAAAE2VjZHNhLXNoYTItbmlzdHA1MjEAAAAIbmlzdHA1MjEAAACFBACMeaFNRf39fvCSH76K/i9le65hZw69kNZWbMXPsrD+Pk33kEcM2jkpJmw1/3UGYD7iWcV9hA9pa9BIUZHbiXmHbQFGK2VqSlmdna5SI5zYMsjARoLY6Q5agBKpYOSt5UJOf5LvYm5+eReYVR9i6/EZ7MjCRX0stOFL3dTv9VlK55/oEQ==
EOF

oc apply -f ./99-broken-ssh.yaml

3.) Let the MCO finish on a node:

I0615 04:20:11.611842   11794 daemon.go:1167] In desired config rendered-master-c614aecdd2ffdc1798c6e730bc7b3b8b

4.) Debug into that node:

oc debug node/ip-10-0-134-142.ec2.internal

5.) Try to kill the pid:

chroot /host kill 11794   #pid of the MCO daemon

6.) Watch the MCO log:

oc logs -f machine-config-daemon-5m86r -n openshift-machine-config-operator machine-config-daemon
I0615 04:23:31.037170   11794 daemon.go:588] Got SIGTERM, but actively updating
I0615 04:23:35.817256   11794 daemon.go:588] Got SIGTERM, but actively updating

I do intend to fix this, but as a workaround in the interim you could probably just delete the "locked" machine-config-daemon pod (obviously waiting until it's actually done doing its work) with "--grace-period 0" (so it goes down on your terms, not after the 10-minute grace period), let it respawn clean (without the lock), and then shut down the machine:

oc delete pod machine-config-daemon-6kpnn -n openshift-machine-config-operator --grace-period 0
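For illustration only, a Go sketch of the shape of fix this diagnosis implies, with hypothetical names rather than the real update.go identifiers: pair the SIGTERM-protection add with a deferred remove, so the rebootless return path cannot skip dropping it.

```
package main

import (
	"fmt"
	"sync/atomic"
)

// updating stands in for the MCD's SIGTERM-protection flag.
var updating atomic.Bool

func addSIGTERMProtection() {
	updating.Store(true)
	fmt.Println("Adding SIGTERM protection")
}

func removeSIGTERMProtection() {
	updating.Store(false)
	fmt.Println("Removing SIGTERM protection")
}

// applyUpdate pairs the add with a deferred remove so every return path,
// including the rebootless one, drops the protection.
func applyUpdate(rebootless bool) error {
	addSIGTERMProtection()
	defer removeSIGTERMProtection()

	// ... write files, systemd units, SSH keys here ...

	if rebootless {
		return nil // the bug: this path previously left the protection set
	}
	// The MCO-triggered reboot path already tears the lock down itself.
	return fmt.Errorf("reboot requested")
}

func main() {
	_ = applyUpdate(true)
}
```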
*** Bug 1927041 has been marked as a duplicate of this bug. ***
Verified on 4.9.0-0.nightly-2021-06-30-034414. Used the reproducer in https://bugzilla.redhat.com/show_bug.cgi?id=1965992#c10 on AWS. Verified the log messages and reboot times are sane.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-06-30-034414   True        False         66m     Cluster version is 4.9.0-0.nightly-2021-06-30-034414

$ oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-131-173.us-west-2.compute.internal   Ready    worker   149m   v1.21.0-rc.0+120883f
ip-10-0-156-96.us-west-2.compute.internal    Ready    master   155m   v1.21.0-rc.0+120883f
ip-10-0-174-62.us-west-2.compute.internal    Ready    master   155m   v1.21.0-rc.0+120883f
ip-10-0-191-194.us-west-2.compute.internal   Ready    worker   149m   v1.21.0-rc.0+120883f
ip-10-0-194-80.us-west-2.compute.internal    Ready    worker   147m   v1.21.0-rc.0+120883f
ip-10-0-197-105.us-west-2.compute.internal   Ready    master   155m   v1.21.0-rc.0+120883f

$ oc create -f 99-broken-ssh.yaml
machineconfig.machineconfiguration.openshift.io/99-broken-ssh created

$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          68e602d88967f30eebf86086d68f7c6303064893   3.2.0             153m
00-worker                                          68e602d88967f30eebf86086d68f7c6303064893   3.2.0             153m
01-master-container-runtime                        68e602d88967f30eebf86086d68f7c6303064893   3.2.0             153m
01-master-kubelet                                  68e602d88967f30eebf86086d68f7c6303064893   3.2.0             153m
01-worker-container-runtime                        68e602d88967f30eebf86086d68f7c6303064893   3.2.0             153m
01-worker-kubelet                                  68e602d88967f30eebf86086d68f7c6303064893   3.2.0             153m
99-broken-ssh                                                                                 2.2.0             6s
99-master-generated-registries                     68e602d88967f30eebf86086d68f7c6303064893   3.2.0             153m
99-master-ssh                                                                                 3.2.0             158m
99-worker-generated-registries                     68e602d88967f30eebf86086d68f7c6303064893   3.2.0             153m
99-worker-kdump                                                                               3.2.0             158m
99-worker-ssh                                                                                 3.2.0             158m
rendered-master-64338423686061e4583e9e597bca20af   68e602d88967f30eebf86086d68f7c6303064893   3.2.0             55m
rendered-master-98b717ef3d573223f244a2e5785766ba   68e602d88967f30eebf86086d68f7c6303064893   3.2.0             153m
rendered-worker-cc49ba4b1d8fed46e59bd1033a96ec21   68e602d88967f30eebf86086d68f7c6303064893   3.2.0             153m

$ oc get mcp/master
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-98b717ef3d573223f244a2e5785766ba   False     True       False      3              0                   0                     0                      156m

$ oc get pods -A --field-selector spec.nodeName=ip-10-0-156-96.us-west-2.compute.internal
NAMESPACE                                 NAME                                                                  READY   STATUS      RESTARTS   AGE
openshift-apiserver                       apiserver-7db786b4bf-dqnxv                                            2/2     Running     0          148m
openshift-authentication                  oauth-openshift-599db8cc46-62mx9                                      1/1     Running     0          148m
openshift-cluster-csi-drivers             aws-ebs-csi-driver-controller-579c84cd8b-7dp5n                        11/11   Running     0          156m
openshift-cluster-csi-drivers             aws-ebs-csi-driver-node-xgnfs                                         3/3     Running     0          156m
openshift-cluster-csi-drivers             aws-ebs-csi-driver-operator-7ffb5c44cd-j5ctt                          1/1     Running     0          157m
openshift-cluster-node-tuning-operator    tuned-vjcm7                                                           1/1     Running     0          156m
openshift-cluster-samples-operator        cluster-samples-operator-8cf7cbcff-8ffrd                              2/2     Running     0          154m
openshift-cluster-storage-operator        csi-snapshot-controller-586cffd8f6-pnhcc                              1/1     Running     1          157m
openshift-cluster-storage-operator        csi-snapshot-webhook-fc88979cf-69s9r                                  1/1     Running     0          157m
openshift-console-operator                console-operator-56847b9495-9w4n6                                     1/1     Running     1          153m
openshift-console                         downloads-5c5df6fcd4-5pdrh                                            1/1     Running     0          153m
openshift-controller-manager              controller-manager-5bwjl                                              1/1     Running     0          150m
openshift-dns                             dns-default-bk589                                                     2/2     Running     0          156m
openshift-dns                             node-resolver-6wp2t                                                   1/1     Running     0          156m
openshift-etcd                            etcd-ip-10-0-156-96.us-west-2.compute.internal                        4/4     Running     0          149m
openshift-etcd                            etcd-quorum-guard-7c5b74d958-ntctp                                    1/1     Running     0          157m
openshift-etcd                            installer-2-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          154m
openshift-etcd                            installer-3-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          149m
openshift-etcd                            revision-pruner-2-ip-10-0-156-96.us-west-2.compute.internal           0/1     Completed   0          154m
openshift-etcd                            revision-pruner-3-ip-10-0-156-96.us-west-2.compute.internal           0/1     Completed   0          148m
openshift-image-registry                  node-ca-x9s9r                                                         1/1     Running     0          153m
openshift-kube-apiserver                  installer-3-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          154m
openshift-kube-apiserver                  installer-8-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          143m
openshift-kube-apiserver                  kube-apiserver-ip-10-0-156-96.us-west-2.compute.internal              5/5     Running     0          138m
openshift-kube-apiserver                  revision-pruner-8-ip-10-0-156-96.us-west-2.compute.internal           0/1     Completed   0          137m
openshift-kube-controller-manager         installer-4-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          154m
openshift-kube-controller-manager         installer-6-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          151m
openshift-kube-controller-manager         installer-7-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          149m
openshift-kube-controller-manager         installer-8-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          147m
openshift-kube-controller-manager         kube-controller-manager-ip-10-0-156-96.us-west-2.compute.internal    4/4     Running     0          146m
openshift-kube-controller-manager         revision-pruner-4-ip-10-0-156-96.us-west-2.compute.internal           0/1     Completed   0          154m
openshift-kube-controller-manager         revision-pruner-7-ip-10-0-156-96.us-west-2.compute.internal           0/1     Completed   0          149m
openshift-kube-controller-manager         revision-pruner-8-ip-10-0-156-96.us-west-2.compute.internal           0/1     Completed   0          146m
openshift-kube-scheduler                  installer-3-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          156m
openshift-kube-scheduler                  installer-5-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          154m
openshift-kube-scheduler                  installer-6-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          153m
openshift-kube-scheduler                  installer-7-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          152m
openshift-kube-scheduler                  installer-8-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          147m
openshift-kube-scheduler                  openshift-kube-scheduler-ip-10-0-156-96.us-west-2.compute.internal   3/3     Running     0          147m
openshift-kube-scheduler                  revision-pruner-7-ip-10-0-156-96.us-west-2.compute.internal           0/1     Completed   0          151m
openshift-kube-scheduler                  revision-pruner-8-ip-10-0-156-96.us-west-2.compute.internal           0/1     Completed   0          146m
openshift-kube-storage-version-migrator   migrator-d8994dcb5-nbnlw                                              1/1     Running     0          157m
openshift-machine-api                     machine-api-controllers-6c474f66c7-vpx87                              7/7     Running     0          156m
openshift-machine-config-operator         machine-config-controller-6f68d96758-d72z2                            1/1     Running     0          155m
openshift-machine-config-operator         machine-config-daemon-xzxp4                                           2/2     Running     0          157m
openshift-machine-config-operator         machine-config-server-dqmwl                                           1/1     Running     1          155m
openshift-monitoring                      node-exporter-9ztb4                                                   2/2     Running     0          157m
openshift-monitoring                      prometheus-operator-95f874f68-stp2d                                   2/2     Running     1          157m
openshift-multus                          multus-additional-cni-plugins-cfz4b                                   1/1     Running     0          158m
openshift-multus                          multus-admission-controller-5k97j                                     2/2     Running     0          157m
openshift-multus                          multus-sw87x                                                          1/1     Running     0          158m
openshift-multus                          network-metrics-daemon-4584q                                          2/2     Running     0          158m
openshift-network-diagnostics             network-check-target-hmd9t                                            1/1     Running     0          157m
openshift-oauth-apiserver                 apiserver-595d66f4d8-tgd8r                                            1/1     Running     0          155m
openshift-operator-lifecycle-manager      packageserver-77fb8c75b6-x28mm                                        1/1     Running     0          156m
openshift-sdn                             sdn-5prsb                                                             2/2     Running     0          157m
openshift-sdn                             sdn-controller-k496j                                                  1/1     Running     0          157m
openshift-service-ca                      service-ca-85f5b4fc7-wm98z                                            1/1     Running     0          157m

$ oc -n openshift-machine-config-operator logs machine-config-daemon-xzxp4 -c machine-config-daemon | grep 'In desired config'
I0630 12:51:34.780835    6835 daemon.go:1173] In desired config rendered-master-98b717ef3d573223f244a2e5785766ba
I0630 14:29:18.824652    6835 daemon.go:1173] In desired config rendered-master-64338423686061e4583e9e597bca20af
I0630 15:23:57.679354    6835 daemon.go:1173] In desired config rendered-master-98b717ef3d573223f244a2e5785766ba
I0630 15:24:58.365064    6835 daemon.go:1173] In desired config rendered-master-64338423686061e4583e9e597bca20af

$ oc get mcp/master
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-64338423686061e4583e9e597bca20af   True      False      False      3              3                   3                     0                      157m

$ oc debug node/ip-10-0-156-96.us-west-2.compute.internal
Starting pod/ip-10-0-156-96us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# ps -ef | grep dqmwl | grep -v daemon | grep -v controller
root      518609       1  0 15:21 ?        00:00:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/4fef6999aa6c8590b29382b68067851d434a411bdde27d0043ed52340e8950e0/userdata -c 4fef6999aa6c8590b29382b68067851d434a411bdde27d0043ed52340e8950e0 --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-machine-config-operator_machine-config-server-dqmwl_0439100c-6b90-422a-a6d1-23b88012e888/4fef6999aa6c8590b29382b68067851d434a411bdde27d0043ed52340e8950e0.log --log-level info -n k8s_POD_machine-config-server-dqmwl_openshift-machine-config-operator_0439100c-6b90-422a-a6d1-23b88012e888_1 -P /var/run/containers/storage/overlay-containers/4fef6999aa6c8590b29382b68067851d434a411bdde27d0043ed52340e8950e0/userdata/conmon-pidfile -p /var/run/containers/storage/overlay-containers/4fef6999aa6c8590b29382b68067851d434a411bdde27d0043ed52340e8950e0/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/4fef6999aa6c8590b29382b68067851d434a411bdde27d0043ed52340e8950e0/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio -u 4fef6999aa6c8590b29382b68067851d434a411bdde27d0043ed52340e8950e0 -s
root      518641       1  0 15:21 ?        00:00:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/1c02cb20449bfbcff9fbe4dc4d981780ead0fadbc57256ca95b7f01a46d5a391/userdata -c 1c02cb20449bfbcff9fbe4dc4d981780ead0fadbc57256ca95b7f01a46d5a391 --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-machine-config-operator_machine-config-server-dqmwl_0439100c-6b90-422a-a6d1-23b88012e888/machine-config-server/1.log --log-level info -n k8s_machine-config-server_machine-config-server-dqmwl_openshift-machine-config-operator_0439100c-6b90-422a-a6d1-23b88012e888_1 -P /var/run/containers/storage/overlay-containers/1c02cb20449bfbcff9fbe4dc4d981780ead0fadbc57256ca95b7f01a46d5a391/userdata/conmon-pidfile -p /var/run/containers/storage/overlay-containers/1c02cb20449bfbcff9fbe4dc4d981780ead0fadbc57256ca95b7f01a46d5a391/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/1c02cb20449bfbcff9fbe4dc4d981780ead0fadbc57256ca95b7f01a46d5a391/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio -u 1c02cb20449bfbcff9fbe4dc4d981780ead0fadbc57256ca95b7f01a46d5a391 -s
root      538271  534844  0 15:27 ?        00:00:00 grep dqmwl
sh-4.4# kill 518609
sh-4.4# exit
exit
sh-4.2# exit
exit
Removing debug pod ...

$ oc -n openshift-machine-config-operator logs machine-config-daemon-xzxp4 -c machine-config-daemon | grep SIGTERM
I0630 14:29:15.892202    6835 update.go:1896] Adding SIGTERM protection
I0630 14:29:18.824726    6835 update.go:1904] Removing SIGTERM protection
I0630 15:23:54.199978    6835 update.go:1896] Adding SIGTERM protection
I0630 15:23:57.679378    6835 update.go:1904] Removing SIGTERM protection
I0630 15:24:54.831041    6835 update.go:1896] Adding SIGTERM protection
I0630 15:24:58.365166    6835 update.go:1904] Removing SIGTERM protection

$ oc debug node/ip-10-0-156-96.us-west-2.compute.internal
Starting pod/ip-10-0-156-96us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# sudo systemctl reboot
sh: sudo: command not found
sh-4.2# chroot /host
sh-4.4# sudo systemctl reboot
sh-4.4# exit
Removing debug pod ...

$ oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-131-173.us-west-2.compute.internal   Ready    worker   155m   v1.21.0-rc.0+120883f
ip-10-0-156-96.us-west-2.compute.internal    Ready    master   162m   v1.21.0-rc.0+120883f
ip-10-0-174-62.us-west-2.compute.internal    Ready    master   162m   v1.21.0-rc.0+120883f
ip-10-0-191-194.us-west-2.compute.internal   Ready    worker   155m   v1.21.0-rc.0+120883f
ip-10-0-194-80.us-west-2.compute.internal    Ready    worker   154m   v1.21.0-rc.0+120883f
ip-10-0-197-105.us-west-2.compute.internal   Ready    master   162m   v1.21.0-rc.0+120883f

$ watch console for
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759