Bug 1712507 - etcdquorumguard should handle TERM correctly and shut down gracefully
Summary: etcdquorumguard should handle TERM correctly and shut down gracefully
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.1.z
Assignee: Robert Krawitz
QA Contact: Sunil Choudhary
URL:
Whiteboard: 4.1.4
Depends On:
Blocks:
Reported: 2019-05-21 16:12 UTC by Clayton Coleman
Modified: 2019-10-22 13:03 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-07-04 09:01:22 UTC
Target Upstream Version:
erich: needinfo-



Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:1635 None None None 2019-07-04 09:01:33 UTC

Description Clayton Coleman 2019-05-21 16:12:32 UTC
The quorum guard pod doesn't respond to TERM: sleep runs as PID 1 and doesn't register a TERM handler, so the signal is never delivered to it. As a result, the pod takes 30s to shut down.

This will need to be backported to 4.1.z.

Comment 1 Eric Rich 2019-05-28 13:12:37 UTC
Does https://github.com/openshift/machine-config-operator/pull/789 address this?

Comment 2 Robert Krawitz 2019-06-18 15:02:56 UTC
erich@redhat.com -- yes, the pull request referenced does address this.  How should I handle this bug (close it, POST, whatnot)?

Comment 4 Sunil Choudhary 2019-06-28 12:35:15 UTC
After deleting an etcd quorum guard pod, it restarts within a few seconds.
Sending a TERM signal to the PID of the etcd quorum guard container from the node also kills the pod, and it restarts in around 3-5 seconds.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-06-27-030910   True        False         21h     Cluster version is 4.1.0-0.nightly-2019-06-27-030910

NAME                                         READY   STATUS    RESTARTS   AGE
etcd-quorum-guard-7f577fc654-dc8gk           1/1     Running   0          16s
etcd-quorum-guard-7f577fc654-p58p4           1/1     Running   0          22h
etcd-quorum-guard-7f577fc654-tgg2j           1/1     Running   1          22h


$ oc describe pod etcd-quorum-guard-7f577fc654-g8dw5
...
Containers:
  guard:
    Container ID:  cri-o://259a10908a30b07400098b21b29382727a3f33750de8f00536918272cbc17fb2
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9a3e0f24b20754f73c9f2a939ff16aebff879d4c74e82faccb56230a1274cac9
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9a3e0f24b20754f73c9f2a939ff16aebff879d4c74e82faccb56230a1274cac9
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
    Args:
      -c
      # properly handle TERM and exit as soon as it is signaled
      set -euo pipefail
      trap 'jobs -p | xargs -r kill; exit 0' TERM
      sleep infinity & wait
      
    State:          Running
      Started:      Fri, 28 Jun 2019 16:36:02 +0530
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 27 Jun 2019 19:47:03 +0530
      Finished:     Fri, 28 Jun 2019 16:36:01 +0530
    Ready:          True
    Restart Count:  1
    Requests:
      cpu:      10m
      memory:   5Mi
...
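
The trap pattern in the Args above can also be tried outside a cluster. The following is a minimal sketch, not part of the fix itself: the three-line guard script is copied from the pod spec, while the harness around it (the names guard_script, guard_pid, status and elapsed, and the one-second settle delay) is hypothetical and only for illustration. It assumes GNU sleep and xargs. It does not reproduce the PID 1 behaviour from the description; it only shows how quickly bash reacts to TERM when sleep is a background job and the shell is blocked in wait.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Guard script as shown in the container Args. sleep runs as a background
# job, so bash sits in `wait` and can run the TERM trap right away.
guard_script=$(cat <<'EOF'
set -euo pipefail
trap 'jobs -p | xargs -r kill; exit 0' TERM
sleep infinity & wait
EOF
)

# Hypothetical harness: start the guard the same way the container does
# (bash -c), give it a moment to install the trap, then send TERM.
bash -c "$guard_script" &
guard_pid=$!
sleep 1

start=$SECONDS
kill -TERM "$guard_pid"

# Collect the guard's exit status without tripping `set -e`.
status=0
wait "$guard_pid" || status=$?
elapsed=$(( SECONDS - start ))

echo "exit=$status elapsed=${elapsed}s"
```

If the trap works as intended, the guard should exit well inside the 30s mentioned in the description, and the trap's `exit 0` is what would be reported as the exit status, in line with the "Exit Code: 0" in the describe output.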

Comment 6 errata-xmlrpc 2019-07-04 09:01:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1635

