Description of problem:

cluster-version-operator is crashing/restarting and going through leader elections during the kube-apiserver rollout, which currently takes around 60 seconds now that shutdown-delay-duration and gracefulTerminationDuration are set to 0 and 15 seconds respectively (https://github.com/openshift/cluster-kube-apiserver-operator/pull/1168 and https://github.com/openshift/library-go/pull/1104). The cluster-version-operator leader election timeout should be set to > 60 seconds to handle the downtime gracefully in SNO.

Recommended lease duration values to consider, as noted in https://github.com/openshift/enhancements/pull/832/files#diff-2e28754e69aa417e5b6d89e99e42f05bfb6330800fa823753383db1d170fbc2fR183: LeaseDuration=137s, RenewDeadline=107s, RetryPeriod=26s. These are the configurable values in k8s.io/client-go based leases, and controller-runtime exposes them. This gives us:

1. clock skew tolerance == 30s
2. kube-apiserver downtime tolerance == 78s
3. worst non-graceful lease reacquisition == 163s
4. worst graceful lease reacquisition == 26s

(A sketch of how these values map onto a client-go leader-election configuration follows at the end of this description.)

Here is the trace of the events during the rollout: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/cluster-version-operator/cerberus_cluster_state.log

Alternatively, the leader election could be disabled entirely, given that there is no HA in SNO.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-07-24-113438

How reproducible:
Always

Steps to Reproduce:
1. Install a SNO cluster using the latest nightly payload.
2. Trigger a kube-apiserver rollout or outage that lasts for at least 60 seconds (a kube-apiserver rollout on a cluster built from a payload that includes https://github.com/openshift/cluster-kube-apiserver-operator/pull/1168 should take ~60 seconds):
   $ oc patch kubeapiserver/cluster --type merge -p '{"spec":{"forceRedeploymentReason":"ITERATIONX"}}'
   where X can be 1,2...n
3. Observe the state of cluster-version-operator.

Actual results:
cluster-version-operator is crashing/restarting and going through leader elections.

Expected results:
cluster-version-operator should handle the API rollout/outage gracefully.

Additional info:
Logs including must-gather: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/cluster-version-operator/
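For illustration only, here is a minimal sketch of how the recommended lease values above map onto a k8s.io/client-go leader-election configuration. This is not the cluster-version-operator's actual code; the namespace, lease name, identity, and callbacks are placeholder assumptions.

~~~
// Sketch only: recommended SNO-tolerant lease settings applied to a
// k8s.io/client-go leader election. The namespace, lease name, and
// callbacks below are placeholders, not the CVO's real values.
package example

import (
	"context"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func runWithLeaderElection(ctx context.Context, cfg *rest.Config, id string) error {
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	// coordination.k8s.io/v1 Lease lock (placeholder namespace/name).
	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock,
		"openshift-cluster-version", // assumed namespace
		"version",                   // assumed lease name
		client.CoreV1(),
		client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: id},
	)
	if err != nil {
		return err
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock: lock,
		// Values recommended for SNO: tolerate ~78s of kube-apiserver
		// downtime and ~30s of clock skew without losing the lease.
		LeaseDuration: 137 * time.Second,
		RenewDeadline: 107 * time.Second,
		RetryPeriod:   26 * time.Second,
		// Releasing the lease on shutdown keeps graceful reacquisition
		// at roughly one RetryPeriod (~26s).
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* start controllers */ },
			OnStoppedLeading: func() { /* stop controllers / exit */ },
		},
	})
	return nil
}
~~~

controller-runtime exposes the same three knobs (LeaseDuration, RenewDeadline, RetryPeriod) on its manager options, so the same values apply there.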
[Pre-Merge QA Testing]

- Custom version that includes this PR:

~~~
$ oc get clusterversion
NAME   VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   68m   Cluster version is 4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest
~~~

- Status of the cluster after fresh installation:

~~~
$ oc get nodes
NAME   STATUS   ROLES   AGE   VERSION
master-00.pamoedo-snotest3.qe.devcluster.openshift.com   Ready   master,worker   86m   v1.21.1+38b3ecc

$ oc get co
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   67m
baremetal   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   79m
cloud-controller-manager   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   80m
cloud-credential   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   105m
cluster-autoscaler   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   78m
config-operator   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   80m
console   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   69m
csi-snapshot-controller   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   80m
dns   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   79m
etcd   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   79m
image-registry   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   72m
ingress   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   73m
insights   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   73m
kube-apiserver   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   73m
kube-controller-manager   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   78m
kube-scheduler   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   78m
kube-storage-version-migrator   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   80m
machine-api   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   79m
machine-approver   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   80m
machine-config   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   78m
marketplace   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   78m
monitoring   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   69m
network   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   81m
node-tuning   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   80m
openshift-apiserver   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   69m
openshift-controller-manager   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   79m
openshift-samples   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   74m
operator-lifecycle-manager   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   80m
operator-lifecycle-manager-catalog   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   80m
operator-lifecycle-manager-packageserver   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   75m
service-ca   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   80m
storage   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   80m
~~~

- Forced a kubeapiserver redeploy:

~~~
$ oc patch kubeapiserver/cluster --type merge -p "{\"spec\":{\"forceRedeploymentReason\":\"Forcing new revision with random number $RANDOM to make message unique\"}}"
kubeapiserver.operator.openshift.io/cluster patched

$ oc describe kubeapiserver/cluster | grep Redeployment
      f:forceRedeploymentReason:
  Force Redeployment Reason:  Forcing new revision with random number 14640 to make message unique
~~~

- After some minutes, the clusteroperators finished progressing and all of them are running properly, as expected:

~~~
$ oc get nodes
NAME   STATUS   ROLES   AGE   VERSION
master-00.pamoedo-snotest3.qe.devcluster.openshift.com   Ready   master,worker   124m   v1.21.1+38b3ecc

$ oc get co
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   115m
baremetal   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   127m
cloud-controller-manager   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   128m
cloud-credential   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   153m
cluster-autoscaler   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   126m
config-operator   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   128m
console   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   117m
csi-snapshot-controller   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   128m
dns   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   127m
etcd   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   127m
image-registry   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   120m
ingress   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   122m
insights   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   122m
kube-apiserver   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   121m
kube-controller-manager   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   126m
kube-scheduler   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   126m
kube-storage-version-migrator   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   128m
machine-api   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   127m
machine-approver   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   128m
machine-config   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   126m
marketplace   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   126m
monitoring   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   117m
network   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   129m
node-tuning   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   128m
openshift-apiserver   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   117m
openshift-controller-manager   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   127m
openshift-samples   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   123m
operator-lifecycle-manager   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   128m
operator-lifecycle-manager-catalog   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   128m
operator-lifecycle-manager-packageserver   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   123m
service-ca   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   128m
storage   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   128m
~~~

Best Regards.
[Pre-Merge QA Testing] - Extension

Taking advantage of the testing cluster, I have also forced a redeploy of "kubecontrollermanager/cluster" and "kubescheduler/cluster" with the following commands:

~~~
$ oc patch kubecontrollermanager/cluster --type merge -p "{\"spec\":{\"forceRedeploymentReason\":\"Forcing new revision with random number $RANDOM to make message unique\"}}"
kubecontrollermanager.operator.openshift.io/cluster patched

$ oc patch kubescheduler/cluster --type merge -p "{\"spec\":{\"forceRedeploymentReason\":\"Forcing new revision with random number $RANDOM to make message unique\"}}"
kubescheduler.operator.openshift.io/cluster patched
~~~

Both operations progressed quickly and without errors; all clusteroperators are working as expected:

~~~
$ oc get co | grep "kube-apiserver\|kube-controller-manager\|kube-scheduler"
kube-apiserver   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   151m
kube-controller-manager   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   156m
kube-scheduler   4.8.0-0.ci.test-2021-07-30-095611-ci-ln-dyxsvsb-latest   True   False   False   155m

$ oc get pods -A | grep "kube-apiserver-master\|kube-controller-manager-master\|kube-scheduler-master"
openshift-kube-apiserver   kube-apiserver-master-00.pamoedo-snotest3.qe.devcluster.openshift.com   5/5   Running   0   52m
openshift-kube-controller-manager   kube-controller-manager-master-00.pamoedo-snotest3.qe.devcluster.openshift.com   4/4   Running   0   6m46s
openshift-kube-scheduler   openshift-kube-scheduler-master-00.pamoedo-snotest3.qe.devcluster.openshift.com   3/3   Running   0   4m51s
~~~

Regards.
[QA Summary]

[Version]
~~~
$ oc get clusterversion
NAME   VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-08-07-175228   True   False   9m   Cluster version is 4.9.0-0.nightly-2021-08-07-175228

$ oc adm release info --commits registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-08-07-175228 | grep cluster-version-operator
  cluster-version-operator   https://github.com/openshift/cluster-version-operator   0ec39d9b2ab1feee8815d7b6b4bbe2db23daf847

[pamoedo@p50 cluster-version-operator] $ git --no-pager log --oneline --first-parent origin/master -3
0ec39d9b (HEAD -> master, origin/release-4.9, origin/release-4.10, origin/master, origin/HEAD) Merge pull request #634 from LalatenduMohanty/BZ_1985802
6e9ea6f5 Merge pull request #636 from sdodson/approvers_emeritus
bd36a2e1 Merge pull request #635 from jan--f/patch-1
~~~

[Parameters]
Bare-metal SNO installation with default values.

[Results]
As expected, the installation succeeded with the latest 4.9 nightly and all operators were ready:

~~~
$ oc get nodes
NAME   STATUS   ROLES   AGE   VERSION
master-00.pamoedo-bz1985802.qe.devcluster.openshift.com   Ready   master,worker   26m   v1.21.1+8268f88

$ oc get co
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.9.0-0.nightly-2021-08-07-175228   True   False   False   6m16s
baremetal   4.9.0-0.nightly-2021-08-07-175228   True   False   False   22m
cloud-controller-manager   4.9.0-0.nightly-2021-08-07-175228   True   False   False   25m
cloud-credential   4.9.0-0.nightly-2021-08-07-175228   True   False   False   36m
cluster-autoscaler   4.9.0-0.nightly-2021-08-07-175228   True   False   False   22m
config-operator   4.9.0-0.nightly-2021-08-07-175228   True   False   False   23m
console   4.9.0-0.nightly-2021-08-07-175228   True   False   False   12m
csi-snapshot-controller   4.9.0-0.nightly-2021-08-07-175228   True   False   False   12m
dns   4.9.0-0.nightly-2021-08-07-175228   True   False   False   22m
etcd   4.9.0-0.nightly-2021-08-07-175228   True   False   False   21m
image-registry   4.9.0-0.nightly-2021-08-07-175228   True   False   False   11m
ingress   4.9.0-0.nightly-2021-08-07-175228   True   False   False   17m
insights   4.9.0-0.nightly-2021-08-07-175228   True   False   False   16m
kube-apiserver   4.9.0-0.nightly-2021-08-07-175228   True   False   False   20m
kube-controller-manager   4.9.0-0.nightly-2021-08-07-175228   True   False   False   20m
kube-scheduler   4.9.0-0.nightly-2021-08-07-175228   True   False   False   20m
kube-storage-version-migrator   4.9.0-0.nightly-2021-08-07-175228   True   False   False   23m
machine-api   4.9.0-0.nightly-2021-08-07-175228   True   False   False   22m
machine-approver   4.9.0-0.nightly-2021-08-07-175228   True   False   False   22m
machine-config   4.9.0-0.nightly-2021-08-07-175228   True   False   False   22m
marketplace   4.9.0-0.nightly-2021-08-07-175228   True   False   False   22m
monitoring   4.9.0-0.nightly-2021-08-07-175228   True   False   False   12m
network   4.9.0-0.nightly-2021-08-07-175228   True   False   False   23m
node-tuning   4.9.0-0.nightly-2021-08-07-175228   True   False   False   22m
openshift-apiserver   4.9.0-0.nightly-2021-08-07-175228   True   False   False   12m
openshift-controller-manager   4.9.0-0.nightly-2021-08-07-175228   True   False   False   20m
openshift-samples   4.9.0-0.nightly-2021-08-07-175228   True   False   False   15m
operator-lifecycle-manager   4.9.0-0.nightly-2021-08-07-175228   True   False   False   22m
operator-lifecycle-manager-catalog   4.9.0-0.nightly-2021-08-07-175228   True   False   False   22m
operator-lifecycle-manager-packageserver   4.9.0-0.nightly-2021-08-07-175228   True   False   False   19m
service-ca   4.9.0-0.nightly-2021-08-07-175228   True   False   False   23m
storage   4.9.0-0.nightly-2021-08-07-175228   True   False   False   23m
~~~

After forcing a redeployment of "kubeapiserver/cluster", "kubescheduler/cluster" and "kubecontrollermanager/cluster", all operators recovered successfully and the pods are running as expected:

~~~
$ oc patch kubeapiserver/cluster --type merge -p "{\"spec\":{\"forceRedeploymentReason\":\"Forcing new revision with random number $RANDOM to make message unique\"}}"
$ oc patch kubescheduler/cluster --type merge -p "{\"spec\":{\"forceRedeploymentReason\":\"Forcing new revision with random number $RANDOM to make message unique\"}}"
$ oc patch kubecontrollermanager/cluster --type merge -p "{\"spec\":{\"forceRedeploymentReason\":\"Forcing new revision with random number $RANDOM to make message unique\"}}"

$ oc get co | grep kube-
kube-apiserver   4.9.0-0.nightly-2021-08-07-175228   True   True   False   21m   NodeInstallerProgressing: 1 nodes are at revision 6; 0 nodes have achieved new revision 7
kube-controller-manager   4.9.0-0.nightly-2021-08-07-175228   True   True   False   22m   NodeInstallerProgressing: 1 nodes are at revision 9; 0 nodes have achieved new revision 10
kube-scheduler   4.9.0-0.nightly-2021-08-07-175228   True   True   False   22m   NodeInstallerProgressing: 1 nodes are at revision 8; 0 nodes have achieved new revision 9
kube-storage-version-migrator   4.9.0-0.nightly-2021-08-07-175228   True   False   False   24m

$ oc get co | grep kube-
kube-apiserver   4.9.0-0.nightly-2021-08-07-175228   True   False   False   26m
kube-controller-manager   4.9.0-0.nightly-2021-08-07-175228   True   False   False   26m
kube-scheduler   4.9.0-0.nightly-2021-08-07-175228   True   False   False   26m
kube-storage-version-migrator   4.9.0-0.nightly-2021-08-07-175228   True   False   False   28m

$ oc get pods -A | grep "kube-apiserver-master\|kube-controller-manager-master\|kube-scheduler-master"
openshift-kube-apiserver   kube-apiserver-master-00.pamoedo-bz1985802.qe.devcluster.openshift.com   5/5   Running   0   3m48s
openshift-kube-controller-manager   kube-controller-manager-master-00.pamoedo-bz1985802.qe.devcluster.openshift.com   4/4   Running   1   3m49s
openshift-kube-scheduler   openshift-kube-scheduler-master-00.pamoedo-bz1985802.qe.devcluster.openshift.com   3/3   Running   1   4m38s
~~~

Best Regards.
*** Bug 1969257 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759