Description of problem:
On a Single Node OpenShift (SNO) cluster, the machine-api-operator restarts while trying to reacquire its leader lease, even though its lease duration is set to 90 seconds and a kube-apiserver rollout takes around 60 seconds (shutdown-delay-duration and gracefulTerminationDuration are now set to 0 and 15 seconds respectively to reduce the downtime).

Recommended lease duration values for reference, as noted in https://github.com/openshift/enhancements/pull/832/files#diff-2e28754e69aa417e5b6d89e99e42f05bfb6330800fa823753383db1d170fbc2fR183: LeaseDuration=137s, RenewDeadline=107s, RetryPeriod=26s. These are the configurable values in k8s.io/client-go based leases, and controller-runtime exposes them. This gives us:
1. clock skew tolerance == 30s
2. kube-apiserver downtime tolerance == 78s
3. worst non-graceful lease reacquisition == 163s
4. worst graceful lease reacquisition == 26s

Here is the cluster state during the 60-second API server downtime: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/machine-api-operator-leader-election/cerberus_cluster_state.log. We can see that the machine-api-operator restarted at 18:57:37 UTC and was trying to acquire the lease, judging by the exited container on the node during that time: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/machine-api-operator-leader-election/machine-api-operator-container.log. Leader election could also be disabled, given that there is no HA in SNO.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-07-07-021823

How reproducible:
Always

Steps to Reproduce:
1. Install a SNO cluster using the latest nightly payload.
2. Trigger a kube-apiserver rollout or outage lasting at least 60 seconds (a kube-apiserver rollout on a cluster built from a payload that includes https://github.com/openshift/cluster-kube-apiserver-operator/pull/1168 should take ~60 seconds):
   $ oc patch kubeapiserver/cluster --type merge -p '{"spec":{"forceRedeploymentReason":"ITERATIONX"}}'
   where X can be 1, 2, ..., n.
3. Observe the state of the machine-api-operator.

Actual results:
The machine-api-operator goes through leader election and restarts.

Expected results:
The machine-api-operator should handle the API rollout/outage gracefully, as the lease duration is set to 90 seconds.

Additional information:
Logs, including must-gather, are at http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/machine-api-operator-leader-election/.
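The four tolerance figures quoted in the description can be reproduced from the three settings. The formulas below are my reading of the derivation in the enhancement, not an authoritative statement of how client-go computes them; sketched in Go:

```go
package main

import "fmt"

// Recommended leader-election settings (seconds) from
// openshift/enhancements#832. The derivations below are an
// interpretation of that enhancement, not library-defined formulas.
const (
	leaseDuration = 137
	renewDeadline = 107
	retryPeriod   = 26
)

// clockSkewTolerance: slack between the renew deadline passing and the
// lease actually expiring for other observers.
func clockSkewTolerance(lease, renew int) int { return lease - renew }

// apiDowntimeTolerance: renew attempts fire every retryPeriod, so the
// last retry that can still succeed before the renew deadline lands at
// (floor(renew/retry)-1)*retry after the last successful renewal.
func apiDowntimeTolerance(renew, retry int) int { return (renew/retry - 1) * retry }

// worstNonGraceful: a crashed leader's lease must fully expire, plus up
// to one retry period before the next candidate attempts acquisition.
func worstNonGraceful(lease, retry int) int { return lease + retry }

// worstGraceful: a leader that releases the lease on shutdown is
// replaced within one retry period.
func worstGraceful(retry int) int { return retry }

func main() {
	fmt.Println("clock skew tolerance:", clockSkewTolerance(leaseDuration, renewDeadline), "s")          // 30 s
	fmt.Println("kube-apiserver downtime tolerance:", apiDowntimeTolerance(renewDeadline, retryPeriod), "s") // 78 s
	fmt.Println("worst non-graceful reacquisition:", worstNonGraceful(leaseDuration, retryPeriod), "s")  // 163 s
	fmt.Println("worst graceful reacquisition:", worstGraceful(retryPeriod), "s")                        // 26 s
}
```

Running this reproduces the 30s / 78s / 163s / 26s figures listed above.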
> We can see that the machine-api-operator restarted at 18:57:37 UTC and it was trying to acquire the lease looking at the exited container on the node during that time: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/machine-api-operator-leader-election/machine-api-operator-container.log.

Looking at this log, I can see that leader election was lost 60 seconds after the first error logged. I think this is entirely reasonable behaviour: the lease may already have been renewed 30 seconds before the outage started, so this looks very much like a timing issue. Once we lose the election, the only safe thing to do is panic; we have to ensure the controller shuts down to prevent duplicate controllers running at the same time.

The numbers for the lease duration/refresh were all chosen based on other components within OpenShift, plus some advice from Clayton after we initially set them too aggressively (we were refreshing every 15 seconds, IIRC). That caused too much load on the API server, so we had to back off to what we have now.

I don't think there's actually an issue here, or at least not one we can safely protect against without extending the lease considerably and ensuring renewal happens while more than 60 seconds remain on the lease.
Thanks for taking a look, Joel. One of the goals for SNO is to bring the API rollout time down to 60 seconds and have the various cluster operators and their components handle the API downtime gracefully: https://github.com/openshift/library-go/pull/1104. Would it be possible to wait for 90 seconds after the first error talking to the API before triggering the election and restarting, so the outage is handled gracefully instead? Thoughts?
> Is it possible to wait for 90 seconds after the error/hitting an issue talking to the API before triggering the election and restarting to handle it gracefully instead? Thoughts?

No, this is not safe. If we can't access the API, we have to assume either that it's our fault or that the node we are running on has become partitioned. In that scenario, another pod would be scheduled elsewhere and would take the lease from us as soon as it expired. Waiting beyond the lease is not safe.

The only possible option here is extending the lease duration and refresh periods, but I believe the values we are using are standard across OCP. Are you not seeing similar issues in other components? Also, given that our component comes back and works again after the restart, is this really an issue?
@jspeed you can find more info here: https://github.com/openshift/enhancements/pull/832. For SNO, our goal is for all the operators to be able to tolerate 60s of API downtime.
I've reviewed the enhancement and I think the proposed changes are acceptable to our team. We will make sure that our components get updated to the recommended values from the enhancement. For reference, these are LeaseDuration=137s, RenewDeadline=107s, RetryPeriod=26s.
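As noted in the description, controller-runtime exposes these three knobs directly on the manager options. A minimal sketch of the wiring follows; the election ID and namespace are illustrative placeholders, not the actual machine-api-operator configuration:

```go
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Recommended values from openshift/enhancements#832.
	leaseDuration := 137 * time.Second
	renewDeadline := 107 * time.Second
	retryPeriod := 26 * time.Second

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "example-operator-leader",  // illustrative
		LeaderElectionNamespace: "openshift-machine-api",    // illustrative
		LeaseDuration:           &leaseDuration,
		RenewDeadline:           &renewDeadline,
		RetryPeriod:             &retryPeriod,
	})
	if err != nil {
		panic(err)
	}
	_ = mgr // register controllers and call mgr.Start(...) as usual
}
```

With these values the manager keeps renewing every RetryPeriod and only surrenders leadership (and exits) once the RenewDeadline passes without a successful renewal.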
As per the discussions in the comments and the validations performed, we can confirm that no panics or leader elections occur after the fix merged. Moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759