Created attachment 1788204 [details]
must gather from cluster

Platform: libvirt

```
$ oc version
Client Version: 4.7.5
Server Version: 4.8.0-0.nightly-2021-05-26-021757
Kubernetes Version: v1.21.0-rc.0+936c9e2

$ oc describe node
[...]
System Info:
  Machine ID:                  a9a56b2ccc644e98b5c39e43ea33b80d
  System UUID:                 4cdef27a-2f86-4fe5-b861-e1470d46aff5
  Boot ID:                     f7948d7c-1734-4993-b44c-28d0d23c5552
  Kernel Version:              4.18.0-305.el8.x86_64
  OS Image:                    Red Hat Enterprise Linux CoreOS 48.84.202105251733-0 (Ootpa)
  Operating System:            linux
  Architecture:                amd64
  Container Runtime Version:   cri-o://1.21.0-98.rhaos4.8.git1f3c5cb.el8
  Kubelet Version:             v1.21.0-rc.0+9acd492
  Kube-Proxy Version:          v1.21.0-rc.0+9acd492
```

We are using https://github.com/code-ready/snc/blob/master/snc.sh to provision a single-node OpenShift cluster. Once it is provisioned, we gracefully shut down the node and create the disk images.

What happened? What went wrong or what did you expect?

On the 4.8 side we tried to enable the MCO as part of the CVO (which was not the case before, since 4.7 single-node clusters were not fully supported by the MCO). We observed that `virsh shutdown <node>` now takes around 6-7 minutes :(

What are the steps to reproduce your issue?

```
$ git clone https://github.com/code-ready/snc.git
$ cd snc
$ export OPENSHIFT_VERSION=4.8.0-0.nightly-2021-05-29-114625
$ export MIRROR=https://mirror.openshift.com/pub/openshift-v4/clients/ocp-dev-preview
$ export OPENSHIFT_PULL_SECRET_PATH=<pull-secret_path>
$ snc.sh
<wait until the cluster provisioning completes>
$ sudo virsh shutdown <node>
<keep an eye on `virsh domstate <node>`>
```

Attached are the must-gather logs and the output of `journalctl --no-pager -rb -1`. It looks like something on the `machine-config-` side; see the journal entries below (check the timestamps; the journal is in reverse order because of `-r`):

```
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: Stopped Dynamically sets the system reserved for the kubelet.
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: kubelet-auto-node-size.service: Succeeded.
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: crio.service: Consumed 3min 3.008s CPU time
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: Stopped Open Container Initiative Daemon.
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: crio.service: Succeeded.
May 31 09:17:17 crc-r6m8k-master-0 crio[2038]: time="2021-05-31 09:17:17.283482143Z" level=error msg="Failed to update container state for e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e: stdout: , stderr: "
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: Stopping Open Container Initiative Daemon...
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: crio-e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e.scope: Consumed 1.662s CPU time
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: Stopped libcontainer container e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e.
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: crio-e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e.scope: Failed with result 'timeout'.
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: crio-conmon-e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e.scope: Consumed 47ms CPU time
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: crio-conmon-e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e.scope: Succeeded.
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: crio-e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e.scope: Killing process 9525 (machine-config-) with signal SIGKILL.
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: crio-e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e.scope: Stopping timed out. Killing.
May 31 09:08:51 crc-r6m8k-master-0 systemd[1]: kubepods-burstable-podd22a725b_fc5c_4d26_a280_5ac941fc79f7.slice: Consumed 9min 14.777s CPU time
May 31 09:08:51 crc-r6m8k-master-0 systemd[1]: Removed slice libcontainer container kubepods-burstable-podd22a725b_fc5c_4d26_a280_5ac941fc79f7.slice.
May 31 09:08:51 crc-r6m8k-master-0 systemd[1]: crio-fa84338620b953f6a5981aa41bc8d3702b74efcb7a3b1c4f467ee0d528db97eb.scope: Consumed 8min 48.175s CPU time
May 31 09:08:51 crc-r6m8k-master-0 systemd[1]: Stopped libcontainer container fa84338620b953f6a5981aa41bc8d3702b74efcb7a3b1c4f467ee0d528db97eb.
May 31 09:08:51 crc-r6m8k-master-0 systemd[1]: crio-fa84338620b953f6a5981aa41bc8d3702b74efcb7a3b1c4f467ee0d528db97eb.scope: Succeeded.
May 31 09:08:47 crc-r6m8k-master-0 systemd[1]: crio-conmon-fa84338620b953f6a5981aa41bc8d3702b74efcb7a3b1c4f467ee0d528db97eb.scope: Consumed 707ms CPU time
May 31 09:08:47 crc-r6m8k-master-0 systemd[1]: Stopped crio-conmon-fa84338620b953f6a5981aa41bc8d3702b74efcb7a3b1c4f467ee0d528db97eb.scope.
May 31 09:08:47 crc-r6m8k-master-0 systemd[1]: crio-conmon-fa84338620b953f6a5981aa41bc8d3702b74efcb7a3b1c4f467ee0d528db97eb.scope: Failed with result 'timeout'.
May 31 09:08:47 crc-r6m8k-master-0 systemd[1]: crio-conmon-fa84338620b953f6a5981aa41bc8d3702b74efcb7a3b1c4f467ee0d528db97eb.scope: Killing process 34571 (conmon) with signal SIGKILL.
May 31 09:08:47 crc-r6m8k-master-0 systemd[1]: crio-conmon-fa84338620b953f6a5981aa41bc8d3702b74efcb7a3b1c4f467ee0d528db97eb.scope: Stopping timed out. Killing.
```
Created attachment 1788205 [details]
journal logs
From the same logs, it seems to be the machine-config-daemon:

```
May 31 08:36:44 crc-r6m8k-master-0 crio[2038]: time="2021-05-31 08:36:44.905811374Z" level=info msg="Created container e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e: openshift-machine-config-operator/machine-config-daemon-mznlz/machine-config-daemon" id=68ff21bb-cc5a-4abe-a100-b98f3aae6e92 name=/runtime.v1alpha2.RuntimeService/CreateContainer
```
Machine-config-daemon logs after it gets the SIGTERM request (not sure why it has to wait around 7 minutes checking cluster health):

```
$ cat /var/log/containers/machine-config-daemon-mznlz_openshift-machine-config-operator_machine-config-daemon-e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e.log
2021-05-31T09:07:17.147984780+00:00 stderr F I0531 09:07:17.145715    9525 daemon.go:588] Got SIGTERM, but actively updating
2021-05-31T09:07:45.754982656+00:00 stderr F W0531 09:07:45.754713    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 1 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:08:15.758405252+00:00 stderr F W0531 09:08:15.758132    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 2 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:08:45.760437892+00:00 stderr F W0531 09:08:45.760191    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 3 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:09:15.765011434+00:00 stderr F W0531 09:09:15.764051    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 4 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:09:45.767975905+00:00 stderr F W0531 09:09:45.767926    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 5 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:10:15.772725881+00:00 stderr F W0531 09:10:15.772608    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 6 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:10:45.775096522+00:00 stderr F W0531 09:10:45.774862    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 7 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:11:15.776137890+00:00 stderr F W0531 09:11:15.776012    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 8 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:11:45.778179989+00:00 stderr F W0531 09:11:45.778116    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 9 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:12:15.780463162+00:00 stderr F W0531 09:12:15.780395    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 10 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:12:45.783185069+00:00 stderr F W0531 09:12:45.783045    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 11 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:13:15.786308702+00:00 stderr F W0531 09:13:15.786164    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 12 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:13:45.789917291+00:00 stderr F W0531 09:13:45.789742    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 13 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:14:15.792876829+00:00 stderr F W0531 09:14:15.792743    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 14 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:14:45.795707517+00:00 stderr F W0531 09:14:45.795522    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 15 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
```
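For context on the cadence above, here is a minimal Go sketch (an assumption of the shape of the loop, not the MCD's actual monitor code) of a 30-second poll against the kubelet's healthz port 10248, which would produce warnings at exactly these intervals:

```
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	failures := 0
	// Poll every 30 seconds, matching the spacing of the log entries above.
	for range time.Tick(30 * time.Second) {
		resp, err := http.Get("http://localhost:10248/healthz") // kubelet healthz endpoint
		if err != nil {
			failures++
			fmt.Printf("kubelet health check has failed %d times: %v\n", failures, err)
			continue
		}
		resp.Body.Close()
		failures = 0 // healthy again; reset the counter
	}
}
```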
Created attachment 1788214 [details]
Machine daemon logs
Some initial thoughts:

1. "Got SIGTERM, but actively updating" signifies that the MCO is currently trying to perform an update of some sort. Since the MCO itself is not atomic in its operations, leaving it in a half-baked state would break the cluster, so the MCO does not terminate immediately. That said, I haven't looked at that code in a while, so maybe some part of it is not working as intended (see the sketch after this list).

2. "Got an error from auxiliary tools: kubelet health check has failed x times" is a bit of a red herring: the kubelet health monitor is a separate process in the MCO that doesn't really block anything (as far as I'm aware). In this case it hasn't been shut down yet but the kubelet is down, so it just spams the logs a bit.

3. If possible, could you check the MCD logs (prior to shutdown, if they are still available somehow) and provide them? Or if you could replicate this in a non-libvirt single-node cluster, or a multi-node/non-libvirt setup, that might be a good additional data point.
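To make point 1 concrete, a minimal Go sketch of the pattern, assuming a simple atomic flag rather than the MCD's real lock and using hypothetical names throughout: the daemon defers SIGTERM while an update is in flight, and if the flag is never cleared, the process keeps ignoring SIGTERM until systemd escalates to SIGKILL, exactly as in the journal above.

```
package main

import (
	"log"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

// updating is a hypothetical stand-in for the MCD's internal update lock.
var updating atomic.Bool

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	go func() {
		for range sigs {
			if updating.Load() {
				// Refuse to exit mid-update: a half-applied MachineConfig
				// could leave the node in a broken state.
				log.Print("Got SIGTERM, but actively updating")
				continue
			}
			os.Exit(0)
		}
	}()

	updating.Store(true)
	time.Sleep(30 * time.Second) // stand-in for applying an update
	updating.Store(false)        // if a code path skips this, SIGTERM is deferred forever

	select {} // idle; a SIGTERM arriving now exits cleanly
}
```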
Created attachment 1788412 [details]
mcd logs
> If possible, could you check the MCD logs (prior to shutdown, if they are still available somehow) and provide them? Or if you could replicate this in a non-libvirt single-node cluster, or a multi-node/non-libvirt setup, that might be a good additional data point.

@Zhang I attached the MCD logs; I will try to see if the same can be replicated on an AWS/GCP single-node cluster.
In the provided MCD logs, the MCD did attempt to perform an update (one that, fortunately, was rebootless). I wonder if that did something with the lock. The times don't really align with the virsh shutdown logs, though, and nothing specific jumps out. Generally speaking, manual shutdowns are not really tested, so maybe this does exist on cloud platforms too. Let me know if you are able to replicate it.
> Generally speaking, manual shutdowns are not really tested, so maybe this does exist on cloud platforms too. Let me know if you are able to replicate it.

@Zhang I did try to reproduce it on GCP by creating a single-node cluster (using cluster-bot) and then manually stopping that VM from the GCloud UI. When you try to stop a VM on GCP it shows the message below, so I am not sure whether it forces a shutdown in that case. I am now out of ideas for reproducing it in any cloud environment. On the libvirt side it is consistently happening, even with the latest nightly.

```
Stop shuts down the instance. If the shutdown doesn't complete within 90 seconds, the instance is forced to halt. This can lead to file-system corruption. Do you want to stop instance
```
I am able to reproduce this intermittently with snc.sh + recent nightlies (4.8.0-0.nightly-2021-06-11-053039 is one example). To be clear, I think they're all affected; it's just a function of timing, because the snc.sh process is complicated and non-deterministic.

Now that I understand what happens, I can also reproduce this on an AWS (non-libvirt) cluster of one or more nodes. It looks like the "ignore SIGTERM" handler doesn't get removed in the case of rebootless updates (so that "fortunately rebootless" update may not have been so fortunate). SSH key updates appear to make it happen reliably. An MCO-triggered reboot takes a different code path and removes the lock itself, which is why we don't see this when the MCO itself invokes the reboot.

Steps to reproduce:

1.) Build a cluster.

2.) Update an SSH key. I applied this manifest with pubkeys in it (for the paranoid: yes, they are trash public keys):

cat << EOF > 99-broken-ssh.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-broken-ssh
spec:
  config:
    ignition:
      version: 2.2.0
    passwd:
      users:
      - name: core
        sshAuthorizedKeys:
        - ecdsa-sha2-nistp521 AAAAE2VjZHNhLXNoYTItbmlzdHA1MjEAAAAIbmlzdHA1MjEAAACFBAHZYVq/odCCXlmc6PoAIXQbSvNyFOwLbxC9IvWBXeBZsEqFM/9JRV6fJ/fdje/ItnJLPT7opn/BkqM/RumyvQvANQDwqsXaiIFZbWqKuNywH6vc45BhrD7vSLpcLkHRmqDXa7YHAH14xRgyPptbqdMjE9zzThJf68U3+dzT3hkaRGaDNA==
        - ecdsa-sha2-nistp521 AAAAE2VjZHNhLXNoYTItbmlzdHA1MjEAAAAIbmlzdHA1MjEAAACFBACMeaFNRf39fvCSH76K/i9le65hZw69kNZWbMXPsrD+Pk33kEcM2jkpJmw1/3UGYD7iWcV9hA9pa9BIUZHbiXmHbQFGK2VqSlmdna5SI5zYMsjARoLY6Q5agBKpYOSt5UJOf5LvYm5+eReYVR9i6/EZ7MjCRX0stOFL3dTv9VlK55/oEQ==
EOF

oc apply -f ./99-broken-ssh.yaml

3.) Let the MCO finish on a node:

I0615 04:20:11.611842   11794 daemon.go:1167] In desired config rendered-master-c614aecdd2ffdc1798c6e730bc7b3b8b

4.) Debug into that node:

oc debug node/ip-10-0-134-142.ec2.internal

5.) Try to kill the pid:

chroot /host kill 11794   #pid of the MCO daemon

6.) Watch the MCO log:

oc logs -f machine-config-daemon-5m86r -n openshift-machine-config-operator machine-config-daemon
I0615 04:23:31.037170   11794 daemon.go:588] Got SIGTERM, but actively updating
I0615 04:23:35.817256   11794 daemon.go:588] Got SIGTERM, but actively updating

I do intend to fix this, but as a workaround in the interim you could probably just delete the "locked" machine-config-daemon pod (obviously waiting until it's actually done doing its work) with "--grace-period 0" (so it goes down on your terms, not after the 10-minute grace period), let it respawn clean (without the lock), and then shut down the machine:

oc delete pod machine-config-daemon-6kpnn -n openshift-machine-config-operator --grace-period 0
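For illustration only, a Go sketch of the shape of fix this diagnosis implies, with hypothetical names rather than the real update.go identifiers: pair the SIGTERM-protection add with a deferred remove, so the rebootless return path cannot skip dropping it.

```
package main

import (
	"fmt"
	"sync/atomic"
)

// updating stands in for the MCD's SIGTERM-protection flag.
var updating atomic.Bool

func addSIGTERMProtection() {
	updating.Store(true)
	fmt.Println("Adding SIGTERM protection")
}

func removeSIGTERMProtection() {
	updating.Store(false)
	fmt.Println("Removing SIGTERM protection")
}

// applyUpdate pairs the add with a deferred remove so every return path,
// including the rebootless one, drops the protection.
func applyUpdate(rebootless bool) error {
	addSIGTERMProtection()
	defer removeSIGTERMProtection()

	// ... write files, systemd units, SSH keys here ...

	if rebootless {
		return nil // the bug: this path previously left the protection set
	}
	// The MCO-triggered reboot path already tears the lock down itself.
	return fmt.Errorf("reboot requested")
}

func main() {
	_ = applyUpdate(true)
}
```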
*** Bug 1927041 has been marked as a duplicate of this bug. ***
Verified on 4.9.0-0.nightly-2021-06-30-034414. Used the reproducer in https://bugzilla.redhat.com/show_bug.cgi?id=1965992#c10 on AWS. Verified the log messages and reboot times are sane.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-06-30-034414   True        False         66m     Cluster version is 4.9.0-0.nightly-2021-06-30-034414

$ oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-131-173.us-west-2.compute.internal   Ready    worker   149m   v1.21.0-rc.0+120883f
ip-10-0-156-96.us-west-2.compute.internal    Ready    master   155m   v1.21.0-rc.0+120883f
ip-10-0-174-62.us-west-2.compute.internal    Ready    master   155m   v1.21.0-rc.0+120883f
ip-10-0-191-194.us-west-2.compute.internal   Ready    worker   149m   v1.21.0-rc.0+120883f
ip-10-0-194-80.us-west-2.compute.internal    Ready    worker   147m   v1.21.0-rc.0+120883f
ip-10-0-197-105.us-west-2.compute.internal   Ready    master   155m   v1.21.0-rc.0+120883f

$ oc create -f 99-broken-ssh.yaml
machineconfig.machineconfiguration.openshift.io/99-broken-ssh created

$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          68e602d88967f30eebf86086d68f7c6303064893   3.2.0             153m
00-worker                                          68e602d88967f30eebf86086d68f7c6303064893   3.2.0             153m
01-master-container-runtime                        68e602d88967f30eebf86086d68f7c6303064893   3.2.0             153m
01-master-kubelet                                  68e602d88967f30eebf86086d68f7c6303064893   3.2.0             153m
01-worker-container-runtime                        68e602d88967f30eebf86086d68f7c6303064893   3.2.0             153m
01-worker-kubelet                                  68e602d88967f30eebf86086d68f7c6303064893   3.2.0             153m
99-broken-ssh                                                                                 2.2.0             6s
99-master-generated-registries                     68e602d88967f30eebf86086d68f7c6303064893   3.2.0             153m
99-master-ssh                                                                                 3.2.0             158m
99-worker-generated-registries                     68e602d88967f30eebf86086d68f7c6303064893   3.2.0             153m
99-worker-kdump                                                                               3.2.0             158m
99-worker-ssh                                                                                 3.2.0             158m
rendered-master-64338423686061e4583e9e597bca20af   68e602d88967f30eebf86086d68f7c6303064893   3.2.0             55m
rendered-master-98b717ef3d573223f244a2e5785766ba   68e602d88967f30eebf86086d68f7c6303064893   3.2.0             153m
rendered-worker-cc49ba4b1d8fed46e59bd1033a96ec21   68e602d88967f30eebf86086d68f7c6303064893   3.2.0             153m

$ oc get mcp/master
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-98b717ef3d573223f244a2e5785766ba   False     True       False      3              0                   0                     0                      156m

$ oc get pods -A --field-selector spec.nodeName=ip-10-0-156-96.us-west-2.compute.internal
NAMESPACE                                 NAME                                                                  READY   STATUS      RESTARTS   AGE
openshift-apiserver                       apiserver-7db786b4bf-dqnxv                                            2/2     Running     0          148m
openshift-authentication                  oauth-openshift-599db8cc46-62mx9                                      1/1     Running     0          148m
openshift-cluster-csi-drivers             aws-ebs-csi-driver-controller-579c84cd8b-7dp5n                        11/11   Running     0          156m
openshift-cluster-csi-drivers             aws-ebs-csi-driver-node-xgnfs                                         3/3     Running     0          156m
openshift-cluster-csi-drivers             aws-ebs-csi-driver-operator-7ffb5c44cd-j5ctt                          1/1     Running     0          157m
openshift-cluster-node-tuning-operator    tuned-vjcm7                                                           1/1     Running     0          156m
openshift-cluster-samples-operator        cluster-samples-operator-8cf7cbcff-8ffrd                              2/2     Running     0          154m
openshift-cluster-storage-operator        csi-snapshot-controller-586cffd8f6-pnhcc                              1/1     Running     1          157m
openshift-cluster-storage-operator        csi-snapshot-webhook-fc88979cf-69s9r                                  1/1     Running     0          157m
openshift-console-operator                console-operator-56847b9495-9w4n6                                     1/1     Running     1          153m
openshift-console                         downloads-5c5df6fcd4-5pdrh                                            1/1     Running     0          153m
openshift-controller-manager              controller-manager-5bwjl                                              1/1     Running     0          150m
openshift-dns                             dns-default-bk589                                                     2/2     Running     0          156m
openshift-dns                             node-resolver-6wp2t                                                   1/1     Running     0          156m
openshift-etcd                            etcd-ip-10-0-156-96.us-west-2.compute.internal                        4/4     Running     0          149m
openshift-etcd                            etcd-quorum-guard-7c5b74d958-ntctp                                    1/1     Running     0          157m
openshift-etcd                            installer-2-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          154m
openshift-etcd                            installer-3-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          149m
openshift-etcd                            revision-pruner-2-ip-10-0-156-96.us-west-2.compute.internal           0/1     Completed   0          154m
openshift-etcd                            revision-pruner-3-ip-10-0-156-96.us-west-2.compute.internal           0/1     Completed   0          148m
openshift-image-registry                  node-ca-x9s9r                                                         1/1     Running     0          153m
openshift-kube-apiserver                  installer-3-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          154m
openshift-kube-apiserver                  installer-8-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          143m
openshift-kube-apiserver                  kube-apiserver-ip-10-0-156-96.us-west-2.compute.internal              5/5     Running     0          138m
openshift-kube-apiserver                  revision-pruner-8-ip-10-0-156-96.us-west-2.compute.internal           0/1     Completed   0          137m
openshift-kube-controller-manager         installer-4-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          154m
openshift-kube-controller-manager         installer-6-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          151m
openshift-kube-controller-manager         installer-7-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          149m
openshift-kube-controller-manager         installer-8-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          147m
openshift-kube-controller-manager         kube-controller-manager-ip-10-0-156-96.us-west-2.compute.internal    4/4     Running     0          146m
openshift-kube-controller-manager         revision-pruner-4-ip-10-0-156-96.us-west-2.compute.internal           0/1     Completed   0          154m
openshift-kube-controller-manager         revision-pruner-7-ip-10-0-156-96.us-west-2.compute.internal           0/1     Completed   0          149m
openshift-kube-controller-manager         revision-pruner-8-ip-10-0-156-96.us-west-2.compute.internal           0/1     Completed   0          146m
openshift-kube-scheduler                  installer-3-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          156m
openshift-kube-scheduler                  installer-5-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          154m
openshift-kube-scheduler                  installer-6-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          153m
openshift-kube-scheduler                  installer-7-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          152m
openshift-kube-scheduler                  installer-8-ip-10-0-156-96.us-west-2.compute.internal                 0/1     Completed   0          147m
openshift-kube-scheduler                  openshift-kube-scheduler-ip-10-0-156-96.us-west-2.compute.internal   3/3     Running     0          147m
openshift-kube-scheduler                  revision-pruner-7-ip-10-0-156-96.us-west-2.compute.internal           0/1     Completed   0          151m
openshift-kube-scheduler                  revision-pruner-8-ip-10-0-156-96.us-west-2.compute.internal           0/1     Completed   0          146m
openshift-kube-storage-version-migrator   migrator-d8994dcb5-nbnlw                                              1/1     Running     0          157m
openshift-machine-api                     machine-api-controllers-6c474f66c7-vpx87                              7/7     Running     0          156m
openshift-machine-config-operator         machine-config-controller-6f68d96758-d72z2                            1/1     Running     0          155m
openshift-machine-config-operator         machine-config-daemon-xzxp4                                           2/2     Running     0          157m
openshift-machine-config-operator         machine-config-server-dqmwl                                           1/1     Running     1          155m
openshift-monitoring                      node-exporter-9ztb4                                                   2/2     Running     0          157m
openshift-monitoring                      prometheus-operator-95f874f68-stp2d                                   2/2     Running     1          157m
openshift-multus                          multus-additional-cni-plugins-cfz4b                                   1/1     Running     0          158m
openshift-multus                          multus-admission-controller-5k97j                                     2/2     Running     0          157m
openshift-multus                          multus-sw87x                                                          1/1     Running     0          158m
openshift-multus                          network-metrics-daemon-4584q                                          2/2     Running     0          158m
openshift-network-diagnostics             network-check-target-hmd9t                                            1/1     Running     0          157m
openshift-oauth-apiserver                 apiserver-595d66f4d8-tgd8r                                            1/1     Running     0          155m
openshift-operator-lifecycle-manager      packageserver-77fb8c75b6-x28mm                                        1/1     Running     0          156m
openshift-sdn                             sdn-5prsb                                                             2/2     Running     0          157m
openshift-sdn                             sdn-controller-k496j                                                  1/1     Running     0          157m
openshift-service-ca                      service-ca-85f5b4fc7-wm98z                                            1/1     Running     0          157m

$ oc -n openshift-machine-config-operator logs machine-config-daemon-xzxp4 -c machine-config-daemon | grep 'In desired config'
I0630 12:51:34.780835    6835 daemon.go:1173] In desired config rendered-master-98b717ef3d573223f244a2e5785766ba
I0630 14:29:18.824652    6835 daemon.go:1173] In desired config rendered-master-64338423686061e4583e9e597bca20af
I0630 15:23:57.679354    6835 daemon.go:1173] In desired config rendered-master-98b717ef3d573223f244a2e5785766ba
I0630 15:24:58.365064    6835 daemon.go:1173] In desired config rendered-master-64338423686061e4583e9e597bca20af

$ oc get mcp/master
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-64338423686061e4583e9e597bca20af   True      False      False      3              3                   3                     0                      157m

$ oc debug node/ip-10-0-156-96.us-west-2.compute.internal
Starting pod/ip-10-0-156-96us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# ps -ef | grep dqmwl | grep -v daemon | grep -v controller
root      518609       1  0 15:21 ?        00:00:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/4fef6999aa6c8590b29382b68067851d434a411bdde27d0043ed52340e8950e0/userdata -c 4fef6999aa6c8590b29382b68067851d434a411bdde27d0043ed52340e8950e0 --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-machine-config-operator_machine-config-server-dqmwl_0439100c-6b90-422a-a6d1-23b88012e888/4fef6999aa6c8590b29382b68067851d434a411bdde27d0043ed52340e8950e0.log --log-level info -n k8s_POD_machine-config-server-dqmwl_openshift-machine-config-operator_0439100c-6b90-422a-a6d1-23b88012e888_1 -P /var/run/containers/storage/overlay-containers/4fef6999aa6c8590b29382b68067851d434a411bdde27d0043ed52340e8950e0/userdata/conmon-pidfile -p /var/run/containers/storage/overlay-containers/4fef6999aa6c8590b29382b68067851d434a411bdde27d0043ed52340e8950e0/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/4fef6999aa6c8590b29382b68067851d434a411bdde27d0043ed52340e8950e0/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio -u 4fef6999aa6c8590b29382b68067851d434a411bdde27d0043ed52340e8950e0 -s
root      518641       1  0 15:21 ?        00:00:00 /usr/libexec/crio/conmon -b /var/run/containers/storage/overlay-containers/1c02cb20449bfbcff9fbe4dc4d981780ead0fadbc57256ca95b7f01a46d5a391/userdata -c 1c02cb20449bfbcff9fbe4dc4d981780ead0fadbc57256ca95b7f01a46d5a391 --exit-dir /var/run/crio/exits -l /var/log/pods/openshift-machine-config-operator_machine-config-server-dqmwl_0439100c-6b90-422a-a6d1-23b88012e888/machine-config-server/1.log --log-level info -n k8s_machine-config-server_machine-config-server-dqmwl_openshift-machine-config-operator_0439100c-6b90-422a-a6d1-23b88012e888_1 -P /var/run/containers/storage/overlay-containers/1c02cb20449bfbcff9fbe4dc4d981780ead0fadbc57256ca95b7f01a46d5a391/userdata/conmon-pidfile -p /var/run/containers/storage/overlay-containers/1c02cb20449bfbcff9fbe4dc4d981780ead0fadbc57256ca95b7f01a46d5a391/userdata/pidfile --persist-dir /var/lib/containers/storage/overlay-containers/1c02cb20449bfbcff9fbe4dc4d981780ead0fadbc57256ca95b7f01a46d5a391/userdata -r /usr/bin/runc --runtime-arg --root=/run/runc --socket-dir-path /var/run/crio -u 1c02cb20449bfbcff9fbe4dc4d981780ead0fadbc57256ca95b7f01a46d5a391 -s
root      538271  534844  0 15:27 ?        00:00:00 grep dqmwl
sh-4.4# kill 518609
sh-4.4# exit
exit
sh-4.2# exit
exit
Removing debug pod ...

$ oc -n openshift-machine-config-operator logs machine-config-daemon-xzxp4 -c machine-config-daemon | grep SIGTERM
I0630 14:29:15.892202    6835 update.go:1896] Adding SIGTERM protection
I0630 14:29:18.824726    6835 update.go:1904] Removing SIGTERM protection
I0630 15:23:54.199978    6835 update.go:1896] Adding SIGTERM protection
I0630 15:23:57.679378    6835 update.go:1904] Removing SIGTERM protection
I0630 15:24:54.831041    6835 update.go:1896] Adding SIGTERM protection
I0630 15:24:58.365166    6835 update.go:1904] Removing SIGTERM protection

$ oc debug node/ip-10-0-156-96.us-west-2.compute.internal
Starting pod/ip-10-0-156-96us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# sudo systemctl reboot
sh: sudo: command not found
sh-4.2# chroot /host
sh-4.4# sudo systemctl reboot
sh-4.4# exit
Removing debug pod ...

$ oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-131-173.us-west-2.compute.internal   Ready    worker   155m   v1.21.0-rc.0+120883f
ip-10-0-156-96.us-west-2.compute.internal    Ready    master   162m   v1.21.0-rc.0+120883f
ip-10-0-174-62.us-west-2.compute.internal    Ready    master   162m   v1.21.0-rc.0+120883f
ip-10-0-191-194.us-west-2.compute.internal   Ready    worker   155m   v1.21.0-rc.0+120883f
ip-10-0-194-80.us-west-2.compute.internal    Ready    worker   154m   v1.21.0-rc.0+120883f
ip-10-0-197-105.us-west-2.compute.internal   Ready    master   162m   v1.21.0-rc.0+120883f

$ watch console for
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759