The quorum guard pod doesn't respond to TERM: sleep runs as PID 1 and doesn't register a signal handler for TERM, so the signal is never delivered to it. As a result the pod takes 30s to shut down, because it is only killed once the termination grace period expires.
This will need to be backported to 4.1.z.
Does https://github.com/openshift/machine-config-operator/pull/789 address this?
email@example.com -- yes, the pull request referenced does address this. How should I handle this bug (close it, POST, whatnot)?
After deleting the etcd quorum guard pod, it restarts within a few seconds.
Sending a TERM signal to the PID of the etcd quorum guard container from the node also kills the pod, and it restarts in around 3-5 seconds.
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.1.0-0.nightly-2019-06-27-030910 True False 21h Cluster version is 4.1.0-0.nightly-2019-06-27-030910
NAME READY STATUS RESTARTS AGE
etcd-quorum-guard-7f577fc654-dc8gk 1/1 Running 0 16s
etcd-quorum-guard-7f577fc654-p58p4 1/1 Running 0 22h
etcd-quorum-guard-7f577fc654-tgg2j 1/1 Running 1 22h
$ oc describe pod etcd-quorum-guard-7f577fc654-g8dw5
Container ID: cri-o://259a10908a30b07400098b21b29382727a3f33750de8f00536918272cbc17fb2
Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9a3e0f24b20754f73c9f2a939ff16aebff879d4c74e82faccb56230a1274cac9
Host Port: <none>
# properly handle TERM and exit as soon as it is signaled
set -euo pipefail
trap 'jobs -p | xargs -r kill; exit 0' TERM
sleep infinity & wait
Started: Fri, 28 Jun 2019 16:36:02 +0530
Last State: Terminated
Exit Code: 0
Started: Thu, 27 Jun 2019 19:47:03 +0530
Finished: Fri, 28 Jun 2019 16:36:01 +0530
Restart Count: 1
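The trap-based command shown in the pod description above can also be exercised outside the cluster. The following is only a rough local sketch, not the verification procedure used for this bug: it runs the same three lines in a child bash process, sends it TERM, and measures how long the process takes to exit. The variable names (guard_pid, status, elapsed) are made up for this illustration, and it assumes GNU sleep (for "sleep infinity") and an xargs that supports -r.

```shell
#!/bin/bash
# Local sketch (assumption: GNU coreutils sleep and xargs -r are available).
# Runs the guard's trap-based command as a background child, sends TERM,
# and reports the exit status and how many seconds the exit took.

bash -c '
  set -euo pipefail
  # On TERM, kill any background jobs (the sleep) and exit cleanly.
  trap "jobs -p | xargs -r kill; exit 0" TERM
  # Run sleep in the background so the shell sits in wait, which is
  # interrupted as soon as a trapped signal arrives.
  sleep infinity & wait
' &
guard_pid=$!            # hypothetical name: PID of the child bash process

sleep 1                 # give the child time to install its trap

start=$(date +%s)
kill -TERM "$guard_pid"

status=0
wait "$guard_pid" || status=$?
elapsed=$(( $(date +%s) - start ))

echo "exit status: $status, seconds to exit: $elapsed"
```

If the trap works as intended, the child should exit with status 0 almost immediately rather than waiting out a 30s grace period. Running sleep in the foreground instead would delay the trap until sleep returned, which is why the command backgrounds it and waits.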
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.