Bug 1915499 - MachineConfigPools sometimes do not resync error status, causing the MCO operator status to not report updated errors
Summary: MachineConfigPools sometimes do not resync error status, causing the MCO operator status to not report updated errors
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.6.z
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: MCO Team
QA Contact: Rio Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-12 18:23 UTC by Matthew Robson
Modified: 2021-10-25 15:58 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-25 15:58:51 UTC
Target Upstream Version:
Embargoed:



Description Matthew Robson 2021-01-12 18:23:34 UTC
Description of problem:

Upgrade from 4.5.22 to 4.6.8 is stuck trying to evacuate etcd-quorum-guard-f8fb588f8-j9crv from master-1.

Unable to apply 4.6.8: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-3c2d8a855f039d576a09cec356b48683 expected c470febe19e3b004fb99baa6679e7597f50554c5 has bc4ece5c0409f288eed8aa74b11fb646fc02226e: pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node ocp-lab06-rcjfk-master-1 is reporting: \"failed to drain node (5 tries): timed out waiting for the condition: error when evicting pod \\\"etcd-quorum-guard-f8fb588f8-j9crv\\\": global timeout reached: 1m30s\"", retrying

etcd-quorum-guard-f8fb588f8-j9crv no longer exists in the cluster.

The customer hit this bug during the upgrade when the IP address of master-1 changed after the node was rebooted: https://bugzilla.redhat.com/show_bug.cgi?id=1899316

This caused the master-1 etcd member to drop out of the cluster and stop the upgrade on etcd-quorum-guard.

The IP was fixed, the node was rebooted, and etcd re-formed without any issues.

After that happened, the node was still stuck on evicting etcd-quorum-guard-f8fb588f8-j9crv.

In reality, you can see etcd-quorum-guard-f8fb588f8-j9crv does not exist any more:
etcd-quorum-guard-f8fb588f8-6wd4w               1/1    Running    0         5h4m  10.17.24.128  ocp-lab06-rcjfk-master-1
etcd-quorum-guard-f8fb588f8-f96gf               1/1    Running    0         11d   10.17.24.127  ocp-lab06-rcjfk-master-2
etcd-quorum-guard-f8fb588f8-tb4nb               1/1    Running    0         11d   10.17.24.126  ocp-lab06-rcjfk-master-0
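
Output like the above is just a wide pod listing; a direct lookup of the specific pod, which should come back NotFound here, is a quick double check. Commands are illustrative:

[user@host ~]$ oc get pods -n openshift-etcd -o wide | grep etcd-quorum-guard
[user@host ~]$ oc get pod etcd-quorum-guard-f8fb588f8-j9crv -n openshift-etcd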

Checking a little deeper, it is also confirmed not to exist in etcd:

[user@host ~]$ oc exec etcd-ocp-lab06-rcjfk-master-0 -- etcdctl  get --keys-only --prefix /kubernetes.io/pods/openshift-etcd/
/kubernetes.io/pods/openshift-etcd/etcd-quorum-guard-f8fb588f8-6wd4w
/kubernetes.io/pods/openshift-etcd/etcd-quorum-guard-f8fb588f8-f96gf
/kubernetes.io/pods/openshift-etcd/etcd-quorum-guard-f8fb588f8-tb4nb

[user@host ~]$ oc exec etcd-ocp-lab06-rcjfk-master-1 -- etcdctl  get --keys-only --prefix /kubernetes.io/pods/openshift-etcd/
/kubernetes.io/pods/openshift-etcd/etcd-quorum-guard-f8fb588f8-6wd4w
/kubernetes.io/pods/openshift-etcd/etcd-quorum-guard-f8fb588f8-f96gf
/kubernetes.io/pods/openshift-etcd/etcd-quorum-guard-f8fb588f8-tb4nb

[user@host ~]$ oc exec etcd-ocp-lab06-rcjfk-master-2 -- etcdctl  get --keys-only --prefix /kubernetes.io/pods/openshift-etcd/
/kubernetes.io/pods/openshift-etcd/etcd-quorum-guard-f8fb588f8-6wd4w
/kubernetes.io/pods/openshift-etcd/etcd-quorum-guard-f8fb588f8-f96gf
/kubernetes.io/pods/openshift-etcd/etcd-quorum-guard-f8fb588f8-tb4nb

A second cluster also hit the IP issue that led to this, but it did not get stuck on the quorum guard pod after the issue was corrected, and that cluster upgraded successfully.
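
For anyone triaging a similar "pool is degraded" message, the node-level reason can usually be surfaced directly from the pool conditions and the MCO annotations on the node. A rough sketch (the machineconfiguration.openshift.io annotation names are an assumption, check against your version):

[user@host ~]$ oc get mcp
[user@host ~]$ oc get mcp master -o yaml | grep -A 10 'conditions:'
# per-node state/reason that the pool status summarizes
[user@host ~]$ oc get node ocp-lab06-rcjfk-master-1 -o yaml | grep 'machineconfiguration.openshift.io'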


Version-Release number of selected component (if applicable):
4.6.8

How reproducible:
Once during upgrade

Steps to Reproduce:
1. Upgrade from 4.5.20 to 4.6.8 

Actual results:
Cluster is stuck and cannot complete the upgrade


Expected results:
Successful upgrade


Additional info:

Comment 3 Stefan Schimanski 2021-01-14 10:25:12 UTC
Eviction is done by the kubelet. Moving over to node team to investigate why it fails.

Comment 4 Matthew Robson 2021-01-14 15:25:53 UTC
There is no evidence in the kubelet log showing it is stuck evicting etcd-quorum-guard-f8fb588f8-j9crv (no logs for the pod at all). The pod etcd-quorum-guard-f8fb588f8-j9crv does not exist in the namespace or in etcd; it is totally gone.

To me it appears as if the MCD is stuck on some old, out-of-sync view of the cluster.

We have tried to restart MCO / MCD with no change.
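
For the record, "restart MCO / MCD" here means roughly the following; the deployment name and daemon pod label are assumptions on my part:

[user@host ~]$ oc -n openshift-machine-config-operator rollout restart deployment/machine-config-operator
[user@host ~]$ oc -n openshift-machine-config-operator delete pod -l k8s-app=machine-config-daemon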


For example, we see kubelet logs for all of the quorum guard pods that are actually running:

Jan 11 14:17:49.614217 ocp-wdc-lab06-rcjfk-master-2 hyperkube[1506]: I0111 14:17:49.614167    1506 kubelet.go:1990] SyncLoop (SYNC): 2 pods; etcd-quorum-guard-f8fb588f8-f96gf_openshift-etcd(664c5726-dc37-4a3c-96bb-a8a883bb9f60), recyler-pod-ocp-wdc-lab06-rcjfk-master-2_openshift-infra(673f75a5afe7b1868f4dfe45420bc256)

Comment 5 Ryan Phillips 2021-01-21 14:07:02 UTC
etcd quorum guard was migrated from the MCO to the Etcd operator. If it is not running at all then perhaps there is an issue with the upgrade path.

Moving over to MCO, but this is likely an issue with the migration of that component.

Comment 6 Ryan Phillips 2021-01-21 14:57:34 UTC
I saw another bug in my queue that may be related...

https://bugzilla.redhat.com/show_bug.cgi?id=1915235#c10

There were changes in the PDB code... I'm going to create a slack room.

Comment 8 Michael Gugino 2021-01-21 15:57:21 UTC
The must-gather attached to the associated case doesn't have any string match for 'j9crv' in logs or otherwise.  Also, looking at the cluster operator info, components are all reporting version 4.5.15.

I'm not confident we have the must-gather from the correct cluster.

Comment 11 Michael Gugino 2021-01-21 16:02:47 UTC
(In reply to Michael Gugino from comment #8)
> The must-gather attached to the associated case doesn't have any string
> match for 'j9crv' in logs or otherwise.  Also, looking at the cluster
> operator info, components are all reporting version 4.5.15.
> 
> I'm not confident we have the must-gather from the correct cluster.

Disregard this, I think I'm looking at the wrong one.

Comment 12 Michael Gugino 2021-01-21 16:25:02 UTC
We can see that the MCD did in fact drain the pod in question.  The synchronization between the MCD and the MCO must not be robust.

namespaces/openshift-machine-config-operator/pods/machine-config-daemon-69sx9/machine-config-daemon/machine-config-daemon/logs/current.log

2020-12-18T14:45:10.5451798Z I1218 14:45:10.545119 2172117 daemon.go:330] Evicted pod openshift-etcd/etcd-quorum-guard-f8fb588f8-jr9hg
2020-12-18T14:45:24.951966081Z I1218 14:45:24.951919 2172117 daemon.go:330] Evicted pod openshift-console/console-566c79b5bc-xzbp4
2020-12-18T14:45:39.75229594Z I1218 14:45:39.752243 2172117 daemon.go:330] Evicted pod openshift-authentication/oauth-openshift-5997c6f6ff-cnjfc
2020-12-18T14:46:00.158513195Z I1218 14:46:00.158472 2172117 daemon.go:330] Evicted pod openshift-apiserver/apiserver-f6f8fd7cc-tcthf
2020-12-18T14:46:00.551062949Z I1218 14:46:00.551009 2172117 daemon.go:330] Evicted pod openshift-oauth-apiserver/apiserver-844b7fb64c-kwkv4
2020-12-18T14:46:00.551062949Z I1218 14:46:00.551048 2172117 update.go:1653] drain complete
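
For anyone else digging through the must-gather, the same drain history can be pulled out of the MCD logs with a simple grep along these lines:

$ grep -h -e 'Evicted pod' -e 'drain complete' namespaces/openshift-machine-config-operator/pods/machine-config-daemon-*/machine-config-daemon/machine-config-daemon/logs/current.log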

Comment 13 Yu Qi Zhang 2021-01-21 17:14:02 UTC
As discussed on slack, the real error is in `master-1`:

Marking Degraded due to: unexpected on-disk state validating against rendered-master-54ea053f55c5a36400062a7679388bde: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4c5b83a192734ad6aa33da798f51b4b7ebe0f633ed63d53867a0c3fb73993024", have "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7063a4a169c39fe3daa5d7aff6befa4c151cc28d3afdc3e0d18ea6b2a3015ebb"

The pool never resynced the node status, so the MCO never updated the error.

For the above comment, "Evicted pod openshift-etcd/etcd-quorum-guard-f8fb588f8-jr9hg" is for a different node entirely and is not related to the error. The initial etcd-quorum drain was blocking the upgrade, and in the newest cluster state the update did not complete correctly for that master. It is currently uncertain whether this is due to the aforementioned manual reboot, but in any case the cluster should progress once the node is on the correct osImageURL.
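
To verify where the node actually landed, the expected image from the rendered config can be compared against what rpm-ostree has deployed on the host. A minimal check, assuming the rendered config name above is still the desired one:

[user@host ~]$ oc get machineconfig rendered-master-54ea053f55c5a36400062a7679388bde -o jsonpath='{.spec.osImageURL}{"\n"}'
[user@host ~]$ oc debug node/ocp-lab06-rcjfk-master-1 -- chroot /host rpm-ostree status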

Comment 14 Michael Gugino 2021-01-21 18:25:42 UTC
(In reply to Yu Qi Zhang from comment #13)
> As discussed on slack, the real error is in `master-1`:
> 
> Marking Degraded due to: unexpected on-disk state validating against
> rendered-master-54ea053f55c5a36400062a7679388bde: expected target osImageURL
> "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:
> 4c5b83a192734ad6aa33da798f51b4b7ebe0f633ed63d53867a0c3fb73993024", have
> "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:
> 7063a4a169c39fe3daa5d7aff6befa4c151cc28d3afdc3e0d18ea6b2a3015ebb"
> 
> The pool never resynced the node status, causing the MCO never to have
> updated the error.
> 
> For the above comment, "Evicted pod
> openshift-etcd/etcd-quorum-guard-f8fb588f8-jr9hg" is for a different node
> entirely and not related to the error. The initial etcd-quorum drain was
> blocking the upgrade, and in the newest cluster state, the update did not
> complete correctly for that master. Currently uncertain whether this is due
> to the aforementioned manual reboot or not, but in any case the cluster
> should progress once the node is on the correct osimageurl

I believe etcd-quorum-guard was working as designed.  If another control-plane host had to be manually fixed to rejoin the cluster, then etcd-quorum-guard on all the others will block drain until the PDB is satisfied.  After the broken host was restored, etcd-quorum-guard appeared to be evicted as normal, but subsequently the MCD failed for a different reason and never updated the operator status to reflect this.

Comment 15 Matthew Robson 2021-01-21 23:08:56 UTC
Agreed - the quorum guard did do its job.

master-1 ultimately hit this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1902963

Which was hidden by the outdated status related to the old quorum guard pod.

Comment 16 Yu Qi Zhang 2021-02-24 03:14:10 UTC
Sorry for the delay. Since the other issues are covered, I am going to go ahead and rephrase this bug as:

MachineConfigPools sometimes do not resync error status, causing the MCO operator status to not report updated errors

Also targeting 4.8 and setting priority to medium, as this should be relatively rare. Will try to find exact reproduction conditions.
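
As a rough way to spot the stale-status symptom this bug describes, the pool's condition messages can be compared against the per-node MCD annotations they are supposed to summarize. A sketch (annotation names assumed):

[user@host ~]$ oc get mcp master -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.message}{"\n"}{end}'
# per-node view the pool conditions should reflect
[user@host ~]$ oc get nodes -l node-role.kubernetes.io/master= -o yaml | grep -E 'machineconfiguration.openshift.io/(state|reason)'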

Comment 22 Michelle Krejci 2021-10-25 15:58:51 UTC
This is a rare issue that is hard to reproduce. We are closing for now as a won't fix.

