Description of problem:
sdn-controller goes through leader elections and restarts during the kube-apiserver rollout, which currently takes around 60 seconds now that shutdown-delay-duration and gracefulTerminationDuration are set to 0 and 15 seconds respectively ( https://github.com/openshift/cluster-kube-apiserver-operator/pull/1168 and https://github.com/openshift/library-go/pull/1104 ). The sdn-controller leader election lease duration needs to be set to > 60 seconds to handle the downtime gracefully in SNO.

Recommended lease values, for reference, as noted in https://github.com/openshift/enhancements/pull/832/files#diff-2e28754e69aa417e5b6d89e99e42f05bfb6330800fa823753383db1d170fbc2fR183
LeaseDuration=137s, RenewDeadline=107s, RetryPeriod=26s.
These are the configurable values in k8s.io/client-go based leases, and controller-runtime exposes them. This gives us:
1. clock skew tolerance == 30s
2. kube-apiserver downtime tolerance == 78s
3. worst non-graceful lease reacquisition == 163s
4. worst graceful lease reacquisition == 26s

Here is the trace of the events during the rollout: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/sdn-controller/cerberus_api_rollout_trace.json
We can see the leader lease failures in the log: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/sdn-controller/sdn-controller.log
Leader election could also be disabled, given that there's no HA in SNO.

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-07-19-192457

How reproducible:
Always

Steps to Reproduce:
1. Install a SNO cluster using the latest nightly payload.
2. Trigger a kube-apiserver rollout or an outage that lasts for at least 60 seconds (a kube-apiserver rollout on a cluster built using a payload with https://github.com/openshift/cluster-kube-apiserver-operator/pull/1168 should take ~60 seconds):
   $ oc patch kubeapiserver/cluster --type merge -p '{"spec":{"forceRedeploymentReason":"ITERATIONX"}}'
   where X can be 1,2...n
3. Observe the state of sdn-controller.

Actual results:
sdn-controller goes through leader election and restarts.

Expected results:
sdn-controller should handle the API rollout/outage gracefully.

Additional info:
Logs including must-gather: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/sdn-controller/
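To make the four tolerance figures in the description easier to check, here is a small Python sketch that derives them from the three lease settings. The formulas are an assumption about how those figures were obtained (they are not spelled out in this report), though they do reproduce all four numbers; the function name `lease_tolerances` is made up for illustration.

```python
def lease_tolerances(lease_duration, renew_deadline, retry_period):
    """Derive tolerance figures (in seconds) from leader election lease settings.

    Assumed formulas, not taken verbatim from this bug report:
      - clock skew tolerance: lease_duration - renew_deadline
      - downtime tolerance: the last retry that fits inside renew_deadline,
        minus one retry period
      - worst non-graceful reacquisition: lease_duration + retry_period
      - worst graceful reacquisition: retry_period
    """
    # Time of the last renewal attempt that still fits before the deadline.
    last_retry = (renew_deadline // retry_period) * retry_period
    return {
        "clock_skew_tolerance": lease_duration - renew_deadline,
        "apiserver_downtime_tolerance": last_retry - retry_period,
        "worst_non_graceful_reacquisition": lease_duration + retry_period,
        "worst_graceful_reacquisition": retry_period,
    }


if __name__ == "__main__":
    # Values recommended in the description: 137s / 107s / 26s.
    for name, seconds in lease_tolerances(137, 107, 26).items():
        print(f"{name} == {seconds}s")
```

Under these assumptions, a ~60 second kube-apiserver rollout fits inside the 78s downtime tolerance, while a crashed leader could take up to 163s to be replaced.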
If we change the timeouts then won't that make it take longer to recover in "real" clusters? It seems like we should just make it not do leader election at all in SNO / if there's only a single master.
Right, leader elections are not needed in SNO given that there's no HA. We can take the route of detecting an SNO deployment and flipping the leader-elect flag if it's exposed as an option and is a feasible change for the 4.9 time frame. Thoughts?
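The two options discussed above (longer leases only on SNO, or no leader election at all on SNO) could be sketched roughly as follows. This is only an illustration of the decision logic, not the change that shipped: `LeaseConfig`, `choose_leader_election`, and the `disable_on_single_node` switch are hypothetical names, and the sketch assumes the caller has already read the cluster's control plane topology (OpenShift reports `SingleReplica` for single-node clusters in the Infrastructure status).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LeaseConfig:
    """Leader election timings, in seconds (hypothetical container type)."""
    lease_duration: int
    renew_deadline: int
    retry_period: int


# Values recommended in the description for tolerating ~60s of API downtime.
SNO_LEASE = LeaseConfig(lease_duration=137, renew_deadline=107, retry_period=26)


def choose_leader_election(
    control_plane_topology: str,
    default_lease: LeaseConfig,
    disable_on_single_node: bool = False,
) -> Optional[LeaseConfig]:
    """Pick lease settings based on topology.

    Returns None when leader election should be skipped entirely.
    Multi-master clusters keep their existing (faster-recovering) defaults.
    """
    if control_plane_topology == "SingleReplica":
        if disable_on_single_node:
            return None  # no HA, so there is nobody to elect against
        return SNO_LEASE
    return default_lease


if __name__ == "__main__":
    # Placeholder defaults for illustration; not sdn-controller's real values.
    defaults = LeaseConfig(lease_duration=60, renew_deadline=35, retry_period=10)
    print(choose_leader_election("SingleReplica", defaults))
    print(choose_leader_election("SingleReplica", defaults, disable_on_single_node=True))
    print(choose_leader_election("HighlyAvailable", defaults))
```

Either branch leaves "real" clusters untouched, which addresses the concern about slower recovery there.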
Moving this to VERIFIED according to comments 8, 9, and 10.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days