Bug 1713039

Summary: etcd quorum guard test does not correctly make nodes unschedulable
Product: OpenShift Container Platform
Reporter: Robert Krawitz <rkrawitz>
Component: Machine Config Operator
Assignee: Robert Krawitz <rkrawitz>
Status: CLOSED ERRATA
QA Contact: Micah Abbott <miabbott>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.1.0
CC: amurdaca, ccoleman, erich, sponnaga, wking
Target Milestone: ---
Keywords: OSE41z_next
Target Release: 4.1.z
Hardware: All
OS: All
Whiteboard: 4.1.3
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-06-26 08:50:22 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Robert Krawitz 2019-05-22 18:00:43 UTC
The etcd quorum guard test does not correctly make nodes unschedulable. This results in occasional failures now that the quorum guard exits more quickly, per the fix for bug 1712507.

Comment 2 Antonio Murdaca 2019-05-27 11:02:52 UTC
PR merged.

Comment 11 W. Trevor King 2019-06-05 23:15:27 UTC
https://github.com/openshift/machine-config-operator/pull/822 is still open.

Comment 14 Micah Abbott 2019-06-19 20:04:37 UTC
I searched through the last 14d of CI results for log messages that were removed/changed in the PR (https://github.com/openshift/machine-config-operator/pull/822):

- "etcdQuotaGard deployment not present"
- "Node object was modified and not up to date; retrying"
- "Failed to make node %s %sschedulable"


I was unable to find any evidence of those messages.
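For reference, a search of this kind could be sketched roughly as follows, assuming the CI logs have already been downloaded to a local directory. The helper name `search_removed_messages` and the placeholder directory are hypothetical and only illustrate the approach; they are not the tooling actually used here. The third message is matched by its fixed prefix, since the `%s` placeholders are filled in at runtime.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: list files under a log directory that contain
# any of the messages removed or changed by the PR.
search_removed_messages() {
  local log_dir=$1
  # -r recurse, -l print matching file names only, -F fixed strings (no regex).
  # grep exits 1 when nothing matches, so "|| true" keeps set -e from aborting.
  grep -rlF \
    -e 'etcdQuotaGard deployment not present' \
    -e 'Node object was modified and not up to date; retrying' \
    -e 'Failed to make node' \
    -- "$log_dir" || true
}

# Demonstration on a placeholder directory (an assumption, not real CI data).
log_dir=$(mktemp -d)
printf 'placeholder line with no matching message\n' > "$log_dir/example.log"
matches=$(search_removed_messages "$log_dir")
echo "files containing removed messages: ${matches:-none}"
rm -rf "$log_dir"
```

An empty result only shows that the messages are absent from the logs that were searched; it says nothing about runs whose logs were not collected.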

Additionally, I pulled the machine-config-operator image included in the 4.1.0-0.nightly-2019-06-19-033215 release and inspected the contents of the changed manifest:

```
$ ./oc image info -a ../all-the-pull-secrets.json $(./oc adm release info -a ../all-the-pull-secrets.json --image-for=machine-config-operator registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-06-19-033215) | grep Name
Name:        quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:976cd21a9b96fa2e4e1bed568e3f34b9087703f4d18c914beb0379e05b43aeaf

$ sudo podman pull --authfile ../all-the-pull-secrets.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:976cd21a9b96fa2e4e1bed568e3f34b9087703f4d18c914beb0379e05b43aeaf

$ ctr=$(sudo podman create quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:976cd21a9b96fa2e4e1bed568e3f34b9087703f4d18c914beb0379e05b43aeaf)
$ mnt=$(sudo podman mount $ctr)

$ sudo grep -C 10 TERM  $mnt/manifests/0000_80_machine-config-operator_07_etcdquorumguard_deployment.yaml
        imagePullPolicy: IfNotPresent
        name: guard
        volumeMounts:
        - mountPath: /mnt/kube
          name: kubecerts
        command:
        - /bin/bash
        args:
        - -c
        - |
          # properly handle TERM and exit as soon as it is signaled
          set -euo pipefail
          trap 'jobs -p | xargs -r kill; exit 0' TERM
          sleep infinity & wait
        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - |
                declare -r croot=/mnt/kube
                declare -r health_endpoint="https://127.0.0.1:2379/health"
                declare -r cert="$(find $croot -name 'system:etcd-peer*.crt' -print -quit)"
```

This confirms that the manifest includes the changes from the PR (https://github.com/openshift/machine-config-operator/pull/822).

Moving to VERIFIED
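
The TERM handling shown in the manifest excerpt can also be exercised outside a cluster. The following is a minimal sketch, assuming a Linux host with bash and GNU coreutils; the temporary script file, variable names, and one-second startup delay are illustrative assumptions and not part of the MCO test suite. It runs the same three-line guard command in the background, sends it TERM, and reports the exit status and how long the exit took.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write the guard command from the manifest to a temporary script (path is arbitrary).
guard_script=$(mktemp)
cat > "$guard_script" <<'EOF'
set -euo pipefail
trap 'jobs -p | xargs -r kill; exit 0' TERM
sleep infinity & wait
EOF

# Start the guard in the background and give it time to install its trap.
bash "$guard_script" &
guard_pid=$!
sleep 1

# Send TERM and measure how long the guard takes to exit.
start=$(date +%s)
kill -TERM "$guard_pid"
guard_status=0
wait "$guard_pid" || guard_status=$?
elapsed=$(( $(date +%s) - start ))

rm -f "$guard_script"
echo "guard exited with status $guard_status after ${elapsed}s"
```

Running `sleep infinity` in the background and blocking in `wait` is what lets the shell act on TERM immediately: bash defers trap handlers while a foreground command is running, but `wait` returns as soon as a trapped signal arrives.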

Comment 17 errata-xmlrpc 2019-06-26 08:50:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1589