The etcd quorum guard test does not correctly make nodes unschedulable, resulting in occasional failures when the quorum guard exits more quickly, as it does with the fix for bug 1712507.
https://github.com/openshift/machine-config-operator/pull/789
PR merged.
https://github.com/openshift/machine-config-operator/pull/822 is still open.
I searched through the last 14d of CI results for log messages that were removed/changed in the PR (https://github.com/openshift/machine-config-operator/pull/822):

- "etcdQuotaGard deployment not present"
- "Node object was modified and not up to date; retrying"
- "Failed to make node %s %sschedulable"

I was unable to find any evidence of those messages.

Additionally, I pulled the machine-config-operator image included in the 4.1.0-0.nightly-2019-06-19-033215 release and inspected the contents of the changed manifest:

```
$ ./oc image info -a ../all-the-pull-secrets.json $(./oc adm release info -a ../all-the-pull-secrets.json --image-for=machine-config-operator registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-06-19-033215) | grep Name
Name: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:976cd21a9b96fa2e4e1bed568e3f34b9087703f4d18c914beb0379e05b43aeaf
$ sudo podman pull --authfile ../all-the-pull-secrets.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:976cd21a9b96fa2e4e1bed568e3f34b9087703f4d18c914beb0379e05b43aeaf
$ ctr=$(sudo podman create quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:976cd21a9b96fa2e4e1bed568e3f34b9087703f4d18c914beb0379e05b43aeaf)
$ mnt=$(sudo podman mount $ctr)
$ sudo grep -C 10 TERM $mnt/manifests/0000_80_machine-config-operator_07_etcdquorumguard_deployment.yaml
        imagePullPolicy: IfNotPresent
        name: guard
        volumeMounts:
        - mountPath: /mnt/kube
          name: kubecerts
        command:
        - /bin/bash
        args:
        - -c
        - |
          # properly handle TERM and exit as soon as it is signaled
          set -euo pipefail
          trap 'jobs -p | xargs -r kill; exit 0' TERM
          sleep infinity & wait
        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - |
              declare -r croot=/mnt/kube
              declare -r health_endpoint="https://127.0.0.1:2379/health"
              declare -r cert="$(find $croot -name 'system:etcd-peer*.crt' -print -quit)"
```

This confirms the manifest has the changes included in https://github.com/openshift/machine-config-operator/pull/822

Moving to VERIFIED.
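For anyone who wants to reproduce the TERM handling outside a cluster, here is a minimal local sketch. It is not part of the operator or its test suite. It copies the three guard lines from the manifest into a shell variable, runs them in a background bash, sends TERM, and records the exit status. The variable names (guard_script, guard_pid, rc) and the one-second settle delay are my own assumptions, and it needs GNU sleep and xargs.

```shell
#!/usr/bin/env bash
# Hypothetical stand-alone check of the guard container's TERM handling.
set -euo pipefail

# The same three lines the manifest passes to "/bin/bash -c".
guard_script='
set -euo pipefail
trap "jobs -p | xargs -r kill; exit 0" TERM
sleep infinity & wait
'

# Start the guard in the background and remember its PID.
bash -c "$guard_script" &
guard_pid=$!

# Assumption: one second is enough for the trap to be installed.
sleep 1

# Signal the guard the way the kubelet does on pod termination.
kill -TERM "$guard_pid"

# Collect the exit status without tripping errexit in this script.
rc=0
wait "$guard_pid" || rc=$?
echo "guard exit status: $rc"
```

If the trap behaves as the manifest comment describes, the guard returns promptly after the signal rather than waiting out the termination grace period, and the status reflects the `exit 0` in the trap.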
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1589