Bug 1713039 - etcd quorum guard test does not correctly make nodes unschedulable
Summary: etcd quorum guard test does not correctly make nodes unschedulable
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.1.0
Hardware: All
OS: All
Target Milestone: ---
Target Release: 4.1.z
Assignee: Robert Krawitz
QA Contact: Micah Abbott
Whiteboard: 4.1.3
Depends On:
Reported: 2019-05-22 18:00 UTC by Robert Krawitz
Modified: 2019-06-26 08:50 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Last Closed: 2019-06-26 08:50:22 UTC
Target Upstream Version:

System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2019:1589 | 0 | None | None | None | 2019-06-26 08:50:29 UTC

Description Robert Krawitz 2019-05-22 18:00:43 UTC
The etcd quorum guard test does not correctly make nodes unschedulable. This results in occasional test failures now that the quorum guard exits more quickly, following the fix for bug 1712507.

Comment 2 Antonio Murdaca 2019-05-27 11:02:52 UTC
PR merged.

Comment 11 W. Trevor King 2019-06-05 23:15:27 UTC
https://github.com/openshift/machine-config-operator/pull/822 is still open.

Comment 14 Micah Abbott 2019-06-19 20:04:37 UTC
I searched through the last 14d of CI results for log messages that were removed/changed in the PR (https://github.com/openshift/machine-config-operator/pull/822):

- "etcdQuotaGard deployment not present"
- "Node object was modified and not up to date; retrying"
- "Failed to make node %s %sschedulable"

I was unable to find any evidence of those messages.
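
A search along these lines can be sketched in shell, assuming the CI logs have already been downloaded as plain-text files into a local directory. The function name `search_removed_messages` and the directory layout are hypothetical and not part of the original verification. The third message contains printf-style `%s` placeholders, so it is matched with a regular expression rather than as a literal string.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count log lines that match any of the messages
# removed or changed by the PR. Assumes plain-text logs under "$1".
search_removed_messages() {
    local log_dir=$1
    # -r: recurse, -h: omit file names, -E: extended regular expressions.
    # The %s placeholders in the third message are replaced with ".*".
    grep -rhE \
        -e 'etcdQuotaGard deployment not present' \
        -e 'Node object was modified and not up to date; retrying' \
        -e 'Failed to make node .* .*schedulable' \
        "$log_dir" | wc -l | tr -d '[:space:]'
}
```

Calling `search_removed_messages ./ci-logs` (a hypothetical path) prints the number of matching lines; a count of 0 corresponds to finding no evidence of the messages.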

Additionally, I pulled the machine-config-operator image included in the 4.1.0-0.nightly-2019-06-19-033215 release and inspected the contents of the changed manifest:

$ ./oc image info -a ../all-the-pull-secrets.json $(./oc adm release info -a ../all-the-pull-secrets.json --image-for=machine-config-operator registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-06-19-033215) | grep Name
Name:        quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:976cd21a9b96fa2e4e1bed568e3f34b9087703f4d18c914beb0379e05b43aeaf

$ sudo podman pull --authfile ../all-the-pull-secrets.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:976cd21a9b96fa

$ ctr=$(sudo podman create quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:976cd21a9b96fa2e4e1bed568e3f34b9087703f4d18c91
$ mnt=$(sudo podman mount $ctr)

$ sudo grep -C 10 TERM  $mnt/manifests/0000_80_machine-config-operator_07_etcdquorumguard_deployment.yaml
        imagePullPolicy: IfNotPresent
        name: guard
        - mountPath: /mnt/kube
          name: kubecerts
        - /bin/bash
        - -c
        - |
          # properly handle TERM and exit as soon as it is signaled
          set -euo pipefail
          trap 'jobs -p | xargs -r kill; exit 0' TERM
          sleep infinity & wait
            - /bin/sh
            - -c
            - |
                declare -r croot=/mnt/kube
                declare -r health_endpoint=""
                declare -r cert="$(find $croot -name 'system:etcd-peer*.crt' -print -quit)"

This confirms that the manifest includes the changes from https://github.com/openshift/machine-config-operator/pull/822.
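
The TERM handling shown in the manifest can be tried outside a cluster with a small sketch like the following. It assumes bash and GNU coreutils (`sleep infinity`, `xargs -r`); the function name `run_guard_and_terminate` and the one-second startup delay are hypothetical choices for illustration, not part of the PR.

```shell
#!/usr/bin/env bash
# Hypothetical demo: run the same trap/sleep/wait pattern in a child shell,
# send it TERM, and report the exit status the child returned.
run_guard_and_terminate() {
    bash -c '
        set -euo pipefail
        # On TERM, kill background jobs (the sleep) and exit cleanly.
        trap "jobs -p | xargs -r kill; exit 0" TERM
        sleep infinity & wait
    ' >/dev/null &
    local guard_pid=$!
    sleep 1                      # give the child time to install its trap
    kill -TERM "$guard_pid"
    local status=0
    wait "$guard_pid" || status=$?
    echo "guard exited with status $status"
}
```

Because `wait` is interruptible, the trap runs as soon as TERM arrives instead of waiting for the foreground command to finish, so the function is expected to report status 0 almost immediately after the signal is sent.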

Moving to VERIFIED

Comment 17 errata-xmlrpc 2019-06-26 08:50:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, see Red Hat Product Errata RHBA-2019:1589, listed above.

If the solution does not work for you, open a new bug report.

