Bug 1980930 - Machine-api-operator is going through leader election even when API rollout takes ~60 sec in SNO
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.9
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Mike Fedosin
QA Contact: Milind Yadav
URL:
Whiteboard: chaos
Depends On:
Blocks: 1984730
 
Reported: 2021-07-09 22:52 UTC by Naga Ravi Chaitanya Elluri
Modified: 2021-10-18 17:39 UTC (History)
3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:39:27 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift machine-api-operator pull 890 (open): Bug 1980930: Update the default durations for MAO leader election operations (last updated 2021-07-16 14:36:02 UTC)
Red Hat Product Errata RHSA-2021:3759 (last updated 2021-10-18 17:39:29 UTC)

Description Naga Ravi Chaitanya Elluri 2021-07-09 22:52:56 UTC
Description of problem:
On a Single Node OpenShift (SNO) cluster, machine-api-operator restarts and goes through leader election even though its lease duration is set to 90 seconds and a kube-apiserver rollout takes around 60 seconds, given that shutdown-delay-duration and gracefulTerminationDuration are now set to 0 and 15 seconds respectively to reduce the downtime.

Recommended lease duration values to consider, for reference, as noted in https://github.com/openshift/enhancements/pull/832/files#diff-2e28754e69aa417e5b6d89e99e42f05bfb6330800fa823753383db1d170fbc2fR183:

LeaseDuration=137s, RenewDeadline=107s, RetryPeriod=26s.
These are the configurable values in k8s.io/client-go based leases, and controller-runtime exposes them (see the configuration sketch after the list below).
This gives us
   1. clock skew tolerance == 30s
   2. kube-apiserver downtime tolerance == 78s
   3. worst non-graceful lease reacquisition == 163s
   4. worst graceful lease reacquisition == 26s
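
For illustration only, here is a minimal sketch of how these three durations map onto controller-runtime's manager options. This is not the machine-api-operator's actual code; the leader-election ID and the bare manager setup below are placeholder assumptions:

package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Recommended values from the enhancement proposal.
	leaseDuration := 137 * time.Second
	renewDeadline := 107 * time.Second
	retryPeriod := 26 * time.Second

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "example-leader-election-lock", // placeholder, not MAO's real lock name
		LeaseDuration:    &leaseDuration,
		RenewDeadline:    &renewDeadline,
		RetryPeriod:      &retryPeriod,
	})
	if err != nil {
		panic(err)
	}

	// Controllers would be registered with mgr here before starting it.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}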

Here is the cluster state during the ~60 second API server downtime: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/machine-api-operator-leader-election/cerberus_cluster_state.log. We can see that the machine-api-operator restarted at 18:57:37 UTC and, judging by the exited container on the node during that time, it was trying to acquire the lease: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/machine-api-operator-leader-election/machine-api-operator-container.log. Leader election could also be disabled, given that there is no HA in SNO.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-07-07-021823

How reproducible:
Always

Steps to Reproduce:
1. Install a SNO cluster using the latest nightly payload.
2. Trigger a kube-apiserver rollout or an outage that lasts at least 60 seconds (a kube-apiserver rollout on a cluster built from a payload containing https://github.com/openshift/cluster-kube-apiserver-operator/pull/1168 should take ~60 seconds): $ oc patch kubeapiserver/cluster --type merge -p '{"spec":{"forceRedeploymentReason":"ITERATIONX"}}', where X can be 1, 2, ..., n.
3. Observe the state of Machine-api-operator. 

Actual results:
Machine-api-operator goes through leader election and restarts.

Expected results:
Machine-api-operator should handle the API rollout/outage gracefully as the lease duration is set to 90 seconds.

Additional information:
Logs including must-gather are at http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/machine-api-operator-leader-election/.

Comment 1 Joel Speed 2021-07-12 13:34:45 UTC
>  We can see that the machine-api-operator restarted at 18:57:37 UTC and it was trying to acquire the lease looking at the exited container on the node during that time: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/machine-api-operator-leader-election/machine-api-operator-container.log. 

Looking at this log, I can see that the leader election was lost 60 seconds after the first error that is logged.
I think this is entirely reasonable behaviour: the lease may have already been renewed 30 seconds before, so this seems very much like a timing issue.
After we lose the election, the only safe thing to do is to panic; we have to ensure the controller shuts down to prevent duplicate controllers running at the same time.
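
For illustration, a minimal sketch of the standard client-go leader-election pattern (not MAO's actual code; the wrapper function and names below are assumptions): once OnStoppedLeading fires, the process exits, which is why a lost lease shows up as a container restart.

package leaderexample

import (
	"context"
	"time"

	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

// runWithLeaderElection is a hypothetical wrapper showing the usual pattern:
// controllers only run while we hold the lease, and losing it terminates the
// process so a replacement pod can become the single active controller.
func runWithLeaderElection(ctx context.Context, lock resourcelock.Interface, run func(context.Context)) {
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 137 * time.Second,
		RenewDeadline: 107 * time.Second,
		RetryPeriod:   26 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run, // start the controllers once the lease is held
			OnStoppedLeading: func() {
				// Lease lost: shut down immediately rather than keep reconciling.
				klog.Fatal("leader election lost")
			},
		},
	})
}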

The numbers for the lease duration/refresh were all chosen based on other components within OpenShift and some advice from Clayton, after we initially set these too aggressively (we were refreshing every 15 seconds IIRC).
We were causing too much load on the API server, so we had to back this off to what we have now.

I don't think there's actually an issue here, or at least not one we can safely protect against without extending the lease considerably and ensuring that renewal happens while at least 60 seconds remain on the lease.

Comment 2 Naga Ravi Chaitanya Elluri 2021-07-13 03:42:31 UTC
Thanks for taking a look, Joel. One of the goals for SNO is to bring the API rollout time down to 60 seconds and have the various cluster operators and their components handle the API downtime gracefully: https://github.com/openshift/library-go/pull/1104. Is it possible to wait for 90 seconds after an error/issue talking to the API before triggering the election and restarting, so it's handled gracefully instead? Thoughts?

Comment 3 Joel Speed 2021-07-13 13:06:43 UTC
> Is it possible to wait for 90 seconds after an error/issue talking to the API before triggering the election and restarting, so it's handled gracefully instead? Thoughts?


No, this is not safe. If we can't access the API, we have to assume it's our fault or that the node we are running on has become partitioned. In this scenario, another pod would be scheduled elsewhere, start up, and take the lease from us as soon as it has expired. Waiting beyond the lease is not safe.

The only possible option here is extending the lease durations and refresh periods, but the values we are using are, I believe, standard across OCP. Are you not seeing similar issues in other components?

Also, given that our component will come back and work again after the restart, is this really an issue?

Comment 4 Rom Freiman 2021-07-13 13:12:11 UTC
@jspeed you can find more info here: https://github.com/openshift/enhancements/pull/832
For SNO, our goal is for all the operators to be able to tolerate 60s of API downtime.

Comment 5 Joel Speed 2021-07-14 15:32:39 UTC
I've reviewed the enhancement; I think the proposed changes are acceptable to our team.
We will make sure that our components get updated to the recommended values from the enhancement.

For reference, these are LeaseDuration=137s, RenewDeadline=107s, RetryPeriod=26s.

Comment 15 Milind Yadav 2021-08-16 02:16:06 UTC
As per the discussions in the comments and the validations done, we are able to confirm that no panics or leader elections happen after the fix merged. Moving to VERIFIED.

Comment 18 errata-xmlrpc 2021-10-18 17:39:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

