Description of problem:

Upgrade from 4.5.22 to 4.6.8 is stuck trying to evacuate etcd-quorum-guard-f8fb588f8-j9crv from master-1:

Unable to apply 4.6.8: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-3c2d8a855f039d576a09cec356b48683 expected c470febe19e3b004fb99baa6679e7597f50554c5 has bc4ece5c0409f288eed8aa74b11fb646fc02226e: pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node ocp-lab06-rcjfk-master-1 is reporting: \"failed to drain node (5 tries): timed out waiting for the condition: error when evicting pod \\\"etcd-quorum-guard-f8fb588f8-j9crv\\\": global timeout reached: 1m30s\"", retrying

etcd-quorum-guard-f8fb588f8-j9crv no longer exists in the cluster.

The customer hit this bug during the upgrade when the IP address of master-1 changed after the node was rebooted: https://bugzilla.redhat.com/show_bug.cgi?id=1899316

This caused the master-1 etcd member to drop out of the cluster and stopped the upgrade at etcd-quorum-guard. The IP was fixed, the node was rebooted, and etcd reformed without any issues. After that, the node was still stuck evicting etcd-quorum-guard-f8fb588f8-j9crv.

In reality, you can see that etcd-quorum-guard-f8fb588f8-j9crv does not exist any more:

etcd-quorum-guard-f8fb588f8-6wd4w   1/1   Running   0   5h4m   10.17.24.128   ocp-lab06-rcjfk-master-1
etcd-quorum-guard-f8fb588f8-f96gf   1/1   Running   0   11d    10.17.24.127   ocp-lab06-rcjfk-master-2
etcd-quorum-guard-f8fb588f8-tb4nb   1/1   Running   0   11d    10.17.24.126   ocp-lab06-rcjfk-master-0

Checking a little deeper, it is also confirmed to not exist in etcd:

[user@host ~]$ oc exec etcd-ocp-lab06-rcjfk-master-0 -- etcdctl get --keys-only --prefix /kubernetes.io/pods/openshift-etcd/
/kubernetes.io/pods/openshift-etcd/etcd-quorum-guard-f8fb588f8-6wd4w
/kubernetes.io/pods/openshift-etcd/etcd-quorum-guard-f8fb588f8-f96gf
/kubernetes.io/pods/openshift-etcd/etcd-quorum-guard-f8fb588f8-tb4nb

[user@host ~]$ oc exec etcd-ocp-lab06-rcjfk-master-1 -- etcdctl get --keys-only --prefix /kubernetes.io/pods/openshift-etcd/
/kubernetes.io/pods/openshift-etcd/etcd-quorum-guard-f8fb588f8-6wd4w
/kubernetes.io/pods/openshift-etcd/etcd-quorum-guard-f8fb588f8-f96gf
/kubernetes.io/pods/openshift-etcd/etcd-quorum-guard-f8fb588f8-tb4nb

[user@host ~]$ oc exec etcd-ocp-lab06-rcjfk-master-2 -- etcdctl get --keys-only --prefix /kubernetes.io/pods/openshift-etcd/
/kubernetes.io/pods/openshift-etcd/etcd-quorum-guard-f8fb588f8-6wd4w
/kubernetes.io/pods/openshift-etcd/etcd-quorum-guard-f8fb588f8-f96gf
/kubernetes.io/pods/openshift-etcd/etcd-quorum-guard-f8fb588f8-tb4nb

A second cluster also hit the IP issue that led to this, but it did not get stuck on the quorum guard pod after the issue was corrected, and that cluster upgraded successfully.

Version-Release number of selected component (if applicable):
4.6.8

How reproducible:
Once during upgrade

Steps to Reproduce:
1. Upgrade from 4.5.20 to 4.6.8
2.
3.

Actual results:
Cluster is stuck and cannot complete the upgrade

Expected results:
Successful upgrade

Additional info:
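For reference, a minimal sketch of how to read the stale drain status from the MCO side on a live cluster. The node name is the one from this case; the namespace and the machineconfiguration.openshift.io node annotations are assumed to be the right places to look for a 4.6-level cluster, not output taken from this must-gather:

# Pool status: shows DEGRADED and which rendered config the pool is stuck on
oc get machineconfigpool master

# Confirm which quorum guard pods actually exist and where they run
oc get pods -n openshift-etcd -o wide | grep etcd-quorum-guard

# Per-node MCD state/reason annotations on the stuck master
oc describe node ocp-lab06-rcjfk-master-1 | grep machineconfiguration.openshift.io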
Eviction is done by the kubelet. Moving over to the node team to investigate why it fails.
There is no evidence in the kubelet log showing it is stuck evicting etcd-quorum-guard-f8fb588f8-j9crv (no logs for the pod at all). The pod etcd-quorum-guard-f8fb588f8-j9crv does not exist in the namespace or in etcd; it is totally gone. To me it appears the MCD is stuck on some old, out-of-sync view of the cluster. We have tried restarting the MCO / MCD with no change.

For example, we do see kubelet logs for all of the actually running quorum guard pods:

Jan 11 14:17:49.614217 ocp-wdc-lab06-rcjfk-master-2 hyperkube[1506]: I0111 14:17:49.614167 1506 kubelet.go:1990] SyncLoop (SYNC): 2 pods; etcd-quorum-guard-f8fb588f8-f96gf_openshift-etcd(664c5726-dc37-4a3c-96bb-a8a883bb9f60), recyler-pod-ocp-wdc-lab06-rcjfk-master-2_openshift-infra(673f75a5afe7b1868f4dfe45420bc256)
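For context, "restarting the MCO / MCD" here means deleting their pods and letting the controllers recreate them. A sketch of the commands, assuming the standard k8s-app label selectors shipped by the machine-config-operator (not verified against this cluster):

# Restart the machine-config-operator pod
oc delete pod -n openshift-machine-config-operator -l k8s-app=machine-config-operator

# Restart the machine-config-daemon pods (one per node)
oc delete pod -n openshift-machine-config-operator -l k8s-app=machine-config-daemon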
The etcd quorum guard was migrated from the MCO to the etcd operator. If it is not running at all, then perhaps there is an issue with the upgrade path. Moving this over to the MCO, but it is likely an issue with the migration of that component.
I saw another bug in my queue that may be related: https://bugzilla.redhat.com/show_bug.cgi?id=1915235#c10 There were changes in the PDB code... I'm going to create a Slack room.
The must-gather attached to the associated case doesn't have any string match for 'j9crv' in logs or otherwise. Also, looking at the cluster operator info, components are all reporting version 4.5.15. I'm not confident we have the must-gather from the correct cluster.
(In reply to Michael Gugino from comment #8)
> The must-gather attached to the associated case doesn't have any string
> match for 'j9crv' in logs or otherwise. Also, looking at the cluster
> operator info, components are all reporting version 4.5.15.
>
> I'm not confident we have the must-gather from the correct cluster.

Disregard this, I think I'm looking at the wrong one.
We can see that the MCD did in fact drain the pod in question. The synchronization between the MCD and the MCO must not be robust.

namespaces/openshift-machine-config-operator/pods/machine-config-daemon-69sx9/machine-config-daemon/machine-config-daemon/logs/current.log

2020-12-18T14:45:10.5451798Z I1218 14:45:10.545119 2172117 daemon.go:330] Evicted pod openshift-etcd/etcd-quorum-guard-f8fb588f8-jr9hg
2020-12-18T14:45:24.951966081Z I1218 14:45:24.951919 2172117 daemon.go:330] Evicted pod openshift-console/console-566c79b5bc-xzbp4
2020-12-18T14:45:39.75229594Z I1218 14:45:39.752243 2172117 daemon.go:330] Evicted pod openshift-authentication/oauth-openshift-5997c6f6ff-cnjfc
2020-12-18T14:46:00.158513195Z I1218 14:46:00.158472 2172117 daemon.go:330] Evicted pod openshift-apiserver/apiserver-f6f8fd7cc-tcthf
2020-12-18T14:46:00.551062949Z I1218 14:46:00.551009 2172117 daemon.go:330] Evicted pod openshift-oauth-apiserver/apiserver-844b7fb64c-kwkv4
2020-12-18T14:46:00.551062949Z I1218 14:46:00.551048 2172117 update.go:1653] drain complete
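For anyone repeating this check against a live cluster instead of a must-gather, a sketch of how to pull the same eviction/drain messages from the MCD. The pod name machine-config-daemon-69sx9 is the one from this case; substitute the daemon pod running on the affected node:

# Find the MCD pod running on the affected node
oc get pods -n openshift-machine-config-operator -o wide | grep machine-config-daemon

# Grep its logs for eviction / drain activity
oc logs -n openshift-machine-config-operator machine-config-daemon-69sx9 -c machine-config-daemon | grep -E 'Evicted pod|drain'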
As discussed on Slack, the real error is on `master-1`:

Marking Degraded due to: unexpected on-disk state validating against rendered-master-54ea053f55c5a36400062a7679388bde: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4c5b83a192734ad6aa33da798f51b4b7ebe0f633ed63d53867a0c3fb73993024", have "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7063a4a169c39fe3daa5d7aff6befa4c151cc28d3afdc3e0d18ea6b2a3015ebb"

The pool never resynced the node status, so the MCO never updated the reported error.

Regarding the comment above, "Evicted pod openshift-etcd/etcd-quorum-guard-f8fb588f8-jr9hg" is for a different node entirely and is not related to this error. The initial etcd-quorum-guard drain was blocking the upgrade, and in the newest cluster state the update did not complete correctly for that master. It is currently unclear whether this is due to the aforementioned manual reboot, but in any case the cluster should progress once the node is on the correct osImageURL.
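A sketch of how to compare the expected and actual osImageURL for the degraded master. The rendered config and node names are the ones from this case; rpm-ostree status on the host is assumed here as the way to see which image the node actually booted:

# Expected osImageURL from the rendered MachineConfig
oc get machineconfig rendered-master-54ea053f55c5a36400062a7679388bde -o jsonpath='{.spec.osImageURL}{"\n"}'

# Actual image the node is running
oc debug node/ocp-lab06-rcjfk-master-1 -- chroot /host rpm-ostree status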
(In reply to Yu Qi Zhang from comment #13)
> As discussed on slack, the real error is in `master-1`:
>
> Marking Degraded due to: unexpected on-disk state validating against
> rendered-master-54ea053f55c5a36400062a7679388bde: expected target osImageURL
> "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4c5b83a192734ad6aa33da798f51b4b7ebe0f633ed63d53867a0c3fb73993024", have
> "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7063a4a169c39fe3daa5d7aff6befa4c151cc28d3afdc3e0d18ea6b2a3015ebb"
>
> The pool never resynced the node status, causing the MCO never to have
> updated the error.
>
> For the above comment, "Evicted pod
> openshift-etcd/etcd-quorum-guard-f8fb588f8-jr9hg" is for a different node
> entirely and not related to the error. The initial etcd-quorum drain was
> blocking the upgrade, and in the newest cluster state, the update did not
> complete correctly for that master. Currently uncertain whether this is due
> to the aforementioned manual reboot or not, but in any case the cluster
> should progress once the node is on the correct osimageurl

I believe etcd-quorum-guard was working as designed. If another control-plane host had to be manually fixed to rejoin the cluster, then etcd-quorum-guard on all the other masters will block drain until the PDB is satisfied. After the broken host was restored, etcd-quorum-guard appeared to be evicted as normal, but the MCD subsequently failed for a different reason and never updated the operator status to reflect this.
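To see the guard doing its job, the PodDisruptionBudget status shows whether an eviction would currently be allowed. A sketch; the exact PDB name and namespace changed between 4.5 and 4.6, so the first command just locates it rather than assuming a name:

# Locate the quorum guard PDB
oc get pdb -A | grep -i quorum

# Inspect how many disruptions are currently allowed (substitute the namespace/name found above)
oc get pdb -n openshift-etcd etcd-quorum-guard -o yaml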
Agreed - the quorum guard did its job. master-1 ultimately hit this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1902963, which was hidden by the outdated status related to the old quorum guard pod.
Sorry for the delay. Since the other issues are covered, I am going to rephrase this bug as:

MachineConfigPools sometimes do not resync error status, causing the MCO operator status to not report updated errors

Also targeting 4.8 and setting priority to medium, as this should be relatively rare. Will try to find exact reproduction conditions.
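For anyone trying to reproduce the stale status, the two places to compare are the degraded message on the pool itself and what the operator surfaces. A sketch, assuming the standard NodeDegraded and Degraded condition types on the MachineConfigPool and ClusterOperator respectively:

# Degraded message reported on the pool
oc get machineconfigpool master -o jsonpath='{.status.conditions[?(@.type=="NodeDegraded")].message}{"\n"}'

# What the machine-config ClusterOperator is currently reporting
oc get clusteroperator machine-config -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}{"\n"}'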
This is a rare issue that is hard to reproduce. We are closing for now as a won't fix.