Bug 1975078 - Gracefully shutdown taking around 6-7 mins (libvirt provider)
Summary: Gracefully shutdown taking around 6-7 mins (libvirt provider)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.8.z
Assignee: John Kyros
QA Contact: Rio Liu
URL:
Whiteboard:
Depends On: 1965992
Blocks: 1995853
 
Reported: 2021-06-23 05:25 UTC by OpenShift BugZilla Robot
Modified: 2024-12-20 20:18 UTC
CC List: 12 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-31 16:17:10 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 2636 0 None open [release-4.8] Bug 1975078: Gracefully shutdown taking around 6-7 mins (libvirt provider) 2021-06-23 05:25:46 UTC
Red Hat Product Errata RHBA-2021:3247 0 None None None 2021-08-31 16:17:35 UTC

Description OpenShift BugZilla Robot 2021-06-23 05:25:35 UTC
+++ This bug was initially created as a clone of Bug #1965992 +++

Created attachment 1788204 [details]
must gather from cluster

Platform: libvirt


```
$ oc version
Client Version: 4.7.5
Server Version: 4.8.0-0.nightly-2021-05-26-021757
Kubernetes Version: v1.21.0-rc.0+936c9e2

$ oc describe node
[...]
System Info:
  Machine ID:                                 a9a56b2ccc644e98b5c39e43ea33b80d
  System UUID:                                4cdef27a-2f86-4fe5-b861-e1470d46aff5
  Boot ID:                                    f7948d7c-1734-4993-b44c-28d0d23c5552
  Kernel Version:                             4.18.0-305.el8.x86_64
  OS Image:                                   Red Hat Enterprise Linux CoreOS 48.84.202105251733-0 (Ootpa)
  Operating System:                           linux
  Architecture:                               amd64
  Container Runtime Version:                  cri-o://1.21.0-98.rhaos4.8.git1f3c5cb.el8
  Kubelet Version:                            v1.21.0-rc.0+9acd492
  Kube-Proxy Version:                         v1.21.0-rc.0+9acd492

```

We are using https://github.com/code-ready/snc/blob/master/snc.sh to provision a single-node OpenShift cluster. Once it is provisioned, we gracefully shut down the node and create the disk images.


What happened? What went wrong or what did you expect?

On the 4.8 side we tried to enable the MCO as part of the CVO (which was not the case before, since single-node clusters were not fully supported by the MCO in 4.7). We observed that during `virsh shutdown <node>` the shutdown takes around 6-7 minutes. :(


What are the steps to reproduce your issue? 
```
$ git clone https://github.com/code-ready/snc.git
$ cd snc
$ export OPENSHIFT_VERSION=4.8.0-0.nightly-2021-05-29-114625
$ export MIRROR=https://mirror.openshift.com/pub/openshift-v4/clients/ocp-dev-preview
$ export OPENSHIFT_PULL_SECRET_PATH=<pull-secret_path>
$ snc.sh
< Wait till the cluster provisioning completes >
$ sudo virsh shutdown <node>
< Keep an eye on `virsh domstate <node>` >
```

Attached are the must-gather logs and the output of `journalctl --no-pager -rb -1`.

Looks like something on the `machine-config-` side; I saw the following in the journal logs (check the timestamps; the journal is shown newest first):
```
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: Stopped Dynamically sets the system reserved for the kubelet.
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: kubelet-auto-node-size.service: Succeeded.
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: crio.service: Consumed 3min 3.008s CPU time
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: Stopped Open Container Initiative Daemon.
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: crio.service: Succeeded.
May 31 09:17:17 crc-r6m8k-master-0 crio[2038]: time="2021-05-31 09:17:17.283482143Z" level=error msg="Failed to update container state for e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e: stdout: , stderr: "
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: Stopping Open Container Initiative Daemon...
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: crio-e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e.scope: Consumed 1.662s CPU time
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: Stopped libcontainer container e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e.
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: crio-e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e.scope: Failed with result 'timeout'.
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: crio-conmon-e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e.scope: Consumed 47ms CPU time
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: crio-conmon-e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e.scope: Succeeded.
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: crio-e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e.scope: Killing process 9525 (machine-config-) with signal SIGKILL.
May 31 09:17:17 crc-r6m8k-master-0 systemd[1]: crio-e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e.scope: Stopping timed out. Killing.
May 31 09:08:51 crc-r6m8k-master-0 systemd[1]: kubepods-burstable-podd22a725b_fc5c_4d26_a280_5ac941fc79f7.slice: Consumed 9min 14.777s CPU time
May 31 09:08:51 crc-r6m8k-master-0 systemd[1]: Removed slice libcontainer container kubepods-burstable-podd22a725b_fc5c_4d26_a280_5ac941fc79f7.slice.
May 31 09:08:51 crc-r6m8k-master-0 systemd[1]: crio-fa84338620b953f6a5981aa41bc8d3702b74efcb7a3b1c4f467ee0d528db97eb.scope: Consumed 8min 48.175s CPU time
May 31 09:08:51 crc-r6m8k-master-0 systemd[1]: Stopped libcontainer container fa84338620b953f6a5981aa41bc8d3702b74efcb7a3b1c4f467ee0d528db97eb.
May 31 09:08:51 crc-r6m8k-master-0 systemd[1]: crio-fa84338620b953f6a5981aa41bc8d3702b74efcb7a3b1c4f467ee0d528db97eb.scope: Succeeded.
May 31 09:08:47 crc-r6m8k-master-0 systemd[1]: crio-conmon-fa84338620b953f6a5981aa41bc8d3702b74efcb7a3b1c4f467ee0d528db97eb.scope: Consumed 707ms CPU time
May 31 09:08:47 crc-r6m8k-master-0 systemd[1]: Stopped crio-conmon-fa84338620b953f6a5981aa41bc8d3702b74efcb7a3b1c4f467ee0d528db97eb.scope.
May 31 09:08:47 crc-r6m8k-master-0 systemd[1]: crio-conmon-fa84338620b953f6a5981aa41bc8d3702b74efcb7a3b1c4f467ee0d528db97eb.scope: Failed with result 'timeout'.
May 31 09:08:47 crc-r6m8k-master-0 systemd[1]: crio-conmon-fa84338620b953f6a5981aa41bc8d3702b74efcb7a3b1c4f467ee0d528db97eb.scope: Killing process 34571 (conmon) with signal SIGKILL.
May 31 09:08:47 crc-r6m8k-master-0 systemd[1]: crio-conmon-fa84338620b953f6a5981aa41bc8d3702b74efcb7a3b1c4f467ee0d528db97eb.scope: Stopping timed out. Killing.
```

--- Additional comment from prkumar on 2021-05-31 10:26:07 UTC ---

Created attachment 1788205 [details]
journal logs

--- Additional comment from lucab on 2021-05-31 10:44:35 UTC ---

From the same logs, it seems to be the machine-config-daemon:
```
May 31 08:36:44 crc-r6m8k-master-0 crio[2038]: time="2021-05-31 08:36:44.905811374Z" level=info msg="Created container e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e: openshift-machine-config-operator/machine-config-daemon-mznlz/machine-config-daemon" id=68ff21bb-cc5a-4abe-a100-b98f3aae6e92 name=/runtime.v1alpha2.RuntimeService/CreateContainer
```

--- Additional comment from prkumar on 2021-05-31 11:00:04 UTC ---

Machine-config-daemon logs after it gets the SIGTERM request (not sure why it has to wait around 7 minutes checking cluster health); a sketch of that health-check loop follows the log:

```
$ cat /var/log/containers/machine-config-daemon-mznlz_openshift-machine-config-operator_machine-config-daemon-e445bcd0ecf41026578b2e3dda0fcb6232791f5854e8719a8fbc503c3866128e.log
2021-05-31T09:07:17.147984780+00:00 stderr F I0531 09:07:17.145715    9525 daemon.go:588] Got SIGTERM, but actively updating
2021-05-31T09:07:45.754982656+00:00 stderr F W0531 09:07:45.754713    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 1 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:08:15.758405252+00:00 stderr F W0531 09:08:15.758132    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 2 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:08:45.760437892+00:00 stderr F W0531 09:08:45.760191    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 3 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:09:15.765011434+00:00 stderr F W0531 09:09:15.764051    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 4 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:09:45.767975905+00:00 stderr F W0531 09:09:45.767926    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 5 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:10:15.772725881+00:00 stderr F W0531 09:10:15.772608    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 6 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:10:45.775096522+00:00 stderr F W0531 09:10:45.774862    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 7 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:11:15.776137890+00:00 stderr F W0531 09:11:15.776012    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 8 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:11:45.778179989+00:00 stderr F W0531 09:11:45.778116    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 9 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:12:15.780463162+00:00 stderr F W0531 09:12:15.780395    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 10 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:12:45.783185069+00:00 stderr F W0531 09:12:45.783045    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 11 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:13:15.786308702+00:00 stderr F W0531 09:13:15.786164    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 12 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:13:45.789917291+00:00 stderr F W0531 09:13:45.789742    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 13 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:14:15.792876829+00:00 stderr F W0531 09:14:15.792743    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 14 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
2021-05-31T09:14:45.795707517+00:00 stderr F W0531 09:14:45.795522    9525 daemon.go:633] Got an error from auxiliary tools: kubelet health check has failed 15 times: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused
```
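
For context, each warning above corresponds to one failed probe of the kubelet's healthz endpoint, roughly every 30 seconds. Below is a minimal sketch of such a polling loop, assuming only what the log shows (the `http://localhost:10248/healthz` endpoint and the ~30s interval); the structure and names are hypothetical, not the actual MCD code:
```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	const healthzURL = "http://localhost:10248/healthz" // kubelet healthz endpoint seen in the log
	failures := 0

	ticker := time.NewTicker(30 * time.Second) // matches the ~30s spacing of the warnings above
	defer ticker.Stop()

	for range ticker.C {
		resp, err := http.Get(healthzURL)
		if err != nil {
			failures++
			fmt.Printf("kubelet health check has failed %d times: %v\n", failures, err)
			continue
		}
		resp.Body.Close()
		failures = 0 // healthy again, reset the failure counter
	}
}
```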

--- Additional comment from prkumar on 2021-05-31 11:05:25 UTC ---

Created attachment 1788214 [details]
Machine daemon logs

--- Additional comment from jerzhang on 2021-06-01 01:15:29 UTC ---

Some initial thoughts:

1. "Got SIGTERM, but actively updating" signifies that the MCO is currently trying to perform an update of some sort, and since the MCO itself is not atomic in terms of operations, leaving it in a half-baked state would break the cluster, thus the MCO does not terminate immediately. That said I haven't looked at that code in awhile so maybe some part of it is not working as intended.

2. "Got an error from auxiliary tools: kubelet health check has failed x times" is a bit of a red herring, the kubelet health monitor is a separate process in the MCO that doesn't really block anything (as far as I'm aware), and in this case it hasn't been shut down yet but the kubelet is down, so it just spams the logs a bit.

3. If possible, could you check the MCD logs (prior to shutdown, if they are still available somehow) and provide them? Or, if you could replicate this on a non-libvirt single-node or multi-node cluster, that would be another good data point.
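
To illustrate the mechanism described in point 1, here is a minimal sketch of the general "defer SIGTERM while actively updating" pattern. All names are hypothetical and this is not the actual machine-config-daemon code; it only shows the shape of the behavior seen in the logs:
```go
package main

import (
	"log"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

// updating is set while a configuration update is in progress.
var updating atomic.Bool

func main() {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)

	go func() {
		for range sigCh {
			if updating.Load() {
				// The update is not atomic; dying now would leave the node
				// half-configured, so the signal is deferred instead.
				log.Println("Got SIGTERM, but actively updating")
				continue
			}
			os.Exit(0)
		}
	}()

	updating.Store(true)
	time.Sleep(10 * time.Second) // stand-in for the update work
	updating.Store(false)        // if this step is ever skipped, SIGTERM is deferred forever

	select {} // keep the process alive so the handler can be exercised
}
```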

--- Additional comment from prkumar on 2021-06-01 04:08:23 UTC ---

Created attachment 1788412 [details]
mcd logs

--- Additional comment from prkumar on 2021-06-01 04:09:22 UTC ---

> If possible, could you check the MCD logs (prior to shutdown, if they are still available somehow) and provide it? Or if you could replicate this in a non-libvirt singlenode, or multi-node cluster/non-libvirt setup somehow, that might be a good other data point.

@Zhang I attached the MCD logs. I will try to see if the same can be replicated on an AWS or GCP single-node cluster.

--- Additional comment from jerzhang on 2021-06-02 17:30:37 UTC ---

In the provided MCD logs, the MCD did attempt to perform an update (that fortunately was rebootless). I wonder if that did something with the lock. The times don't really align with the virsh shutdown logs though. Nothing specific jumps out.

Generally speaking, manual shutdowns are not really tested, so maybe this exists on cloud platforms too. Let me know if you are able to replicate it.

--- Additional comment from prkumar on 2021-06-04 09:46:09 UTC ---

>Generally speaking manual shutdowns are not really tested, so maybe this does exist on cloud platforms too. Let me know if you are able to replicate

@Zhang I did try to reproduce it on GCP by creating a single-node cluster (using cluster-bot) and then manually stopping that VM from the gcloud UI. When you stop a VM on GCP it shows the message below, so I am not sure whether it forces the shutdown in that case. I am now out of ideas for reproducing it in any cloud environment. On the libvirt side it happens consistently, even with the latest nightly.

```
Stop shuts down the instance. If the shutdown doesn't complete within 90 seconds, the instance is forced to halt. This can lead to file-system corruption. Do you want to stop instance
```

--- Additional comment from jkyros on 2021-06-15 15:51:13 UTC ---

I am able to reproduce this intermittently with snc.sh + recent nightlies (4.8.0-0.nightly-2021-06-11-053039 is one example). To be clear, I think they're all affected; it's just a function of timing, because the snc.sh process is complicated and non-deterministic.

Now that I understand what happens, I can also reproduce this on an AWS (non-libvirt) cluster of one or more nodes. 

It looks like the "ignore SIGTERM" handler doesn't get removed in the case of rebootless updates (so that "fortunately rebootless" update may not have been so fortunate after all). SSH key updates appear to make it happen reliably.

An MCO-triggered reboot takes a different code path and removes the lock itself, so that's why we don't see this when the MCO itself invokes the reboot. 
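
To make the asymmetry concrete, here is a hypothetical sketch (invented names, not the actual MCD code, reusing the `updating` flag idea from the earlier sketch): the reboot path never needs to clear the flag because the process goes down with the reboot, while the rebootless path has to clear it explicitly; if that step is missed, SIGTERM keeps being deferred until systemd gives up and sends SIGKILL.
```go
package main

import (
	"fmt"
	"sync/atomic"
)

// updating mirrors the flag from the previous sketch: while true, SIGTERM is deferred.
var updating atomic.Bool

func initiateReboot() {
	fmt.Println("rebooting: the process ends, so the flag dies with it")
}

func applyUpdate(needsReboot bool) {
	updating.Store(true)
	// ... write files, update SSH keys, etc. ...
	if needsReboot {
		initiateReboot() // MCO-triggered reboot path: no explicit unlock needed
		return
	}
	updating.Store(false) // rebootless path: this is the step that must not be skipped
}

func main() {
	applyUpdate(false)
	fmt.Println("still updating?", updating.Load()) // prints false once the flag is cleared
}
```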


Steps to reproduce: 

1.) Build a cluster 
2.) Update an SSH key. I applied this manifest with pubkeys in it (for the paranoid, yes they are trash public keys): 

cat << EOF > 99-broken-ssh.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-broken-ssh
spec:
  config:
    ignition:
      version: 2.2.0
    passwd:
      users:
      - name: core
        sshAuthorizedKeys:
        - ecdsa-sha2-nistp521 AAAAE2VjZHNhLXNoYTItbmlzdHA1MjEAAAAIbmlzdHA1MjEAAACFBAHZYVq/odCCXlmc6PoAIXQbSvNyFOwLbxC9IvWBXeBZsEqFM/9JRV6fJ/fdje/ItnJLPT7opn/BkqM/RumyvQvANQDwqsXaiIFZbWqKuNywH6vc45BhrD7vSLpcLkHRmqDXa7YHAH14xRgyPptbqdMjE9zzThJf68U3+dzT3hkaRGaDNA==
        - ecdsa-sha2-nistp521 AAAAE2VjZHNhLXNoYTItbmlzdHA1MjEAAAAIbmlzdHA1MjEAAACFBACMeaFNRf39fvCSH76K/i9le65hZw69kNZWbMXPsrD+Pk33kEcM2jkpJmw1/3UGYD7iWcV9hA9pa9BIUZHbiXmHbQFGK2VqSlmdna5SI5zYMsjARoLY6Q5agBKpYOSt5UJOf5LvYm5+eReYVR9i6/EZ7MjCRX0stOFL3dTv9VlK55/oEQ==
EOF
oc apply -f ./99-broken-ssh.yaml 

3.) Let the MCO finish on a node:
I0615 04:20:11.611842   11794 daemon.go:1167] In desired config rendered-master-c614aecdd2ffdc1798c6e730bc7b3b8b

4.) Debug into that node: 
oc debug node/ip-10-0-134-142.ec2.internal

5.) Try to kill the pid: 
chroot /host
kill 11794 #pid of the MCO daemon 

6.) Watch MCO log: 
oc logs -f machine-config-daemon-5m86r -n openshift-machine-config-operator machine-config-daemon
I0615 04:23:31.037170   11794 daemon.go:588] Got SIGTERM, but actively updating
I0615 04:23:35.817256   11794 daemon.go:588] Got SIGTERM, but actively updating




I do intend to fix this, but as a workaround in the interim you could probably just delete the "locked" machine-config-daemon pod (obviously waiting until it's actually done doing its work) with "--grace-period 0" (so it goes down on your terms, not after the 10-minute grace period), let it respawn clean (without the lock), and then shut down the machine:

oc delete pod machine-config-daemon-6kpnn -n openshift-machine-config-operator --grace-period 0

Comment 1 Sinny Kumari 2021-07-01 16:36:48 UTC
Removing TestBlockerForLayeredProduct keyword. The fix will go into a 4.8.z stream release.

Comment 5 ximhan 2021-08-20 07:26:57 UTC
OpenShift engineering has decided NOT to ship 4.8.6 on 8/23 due to the following issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1995785
All of its fixes will now be included in 4.8.7 on 8/30.

Comment 9 errata-xmlrpc 2021-08-31 16:17:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.9 bug fix), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3247

