Related to BZ 2063831:

Once the above BZ is fixed [1] with the controller that deploys the static guard pods, we would need a controller that handles clean up of those guard pods in event of a downgrade from 4.11->4.10.

[1]: https://github.com/openshift/cluster-etcd-operator/pull/763

+++ This bug was initially created as a clone of Bug #2063831 +++

TRT recently added a test to monitor for this and it exposed that etcd quorum pods are actually landing on the same node for periods of time:

https://sippy.ci.openshift.org/sippy-ng/tests/4.11/analysis?test=openshift-tests-upgrade.[sig-scheduling][Early]%20The%20openshift-etcd%20pods%20should%20be%20scheduled%20on%20different%20nodes%20[Suite:openshift/conformance/parallel]

Sample job:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1503258288765014016

This seems to be happening alarmingly often:

https://search.ci.openshift.org/?search=The+openshift-etcd+pods+should+be+scheduled+on+different+nodes&maxAge=48h&context=0&type=junit&name=4.11&excludeName=quorum&maxMatches=5&maxBytes=20971520&groupBy=job

Marking sev high as this has potential to cause loss of quorum. Backporting to 4.10 should probably be discussed.

Jan Chaloupka did some work to allow force assign PDB pods to nodes instead of relying on scheduler, may be a good idea to make use of this for etcd.

--- Additional comment from Devan Goodwin on 2022-03-14 13:37:19 UTC ---

TRT is double checking the results to make absolutely sure the test is catching something real.

--- Additional comment from Ken Zhang on 2022-03-14 15:10:50 UTC ---

I confirmed that for both HAProxy and ETCD cases, the test is catching real problems. There is a bug with image-registry that is being fixed.

--- Additional comment from Haseeb Tariq on 2022-03-14 21:20:55 UTC ---

Working on an update to replace the etcd-operator's quorum guard controller with the staticpod quorum guard controller.
This would also include a new readyz server sidecar on the etcd-pods for the guard controller to be able to check for pod readiness.

--- Additional comment from W. Trevor King on 2022-03-21 22:17:28 UTC ---
Closing this, as we don't officially support minor version downgrades, e.g. 4.11 -> 4.10.