Bug 1980930
Summary: | Machine-api-operator is going through leader election even when API rollout takes ~60 sec in SNO | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Naga Ravi Chaitanya Elluri <nelluri>
Component: | Cloud Compute | Assignee: | Mike Fedosin <mfedosin>
Cloud Compute sub component: | Other Providers | QA Contact: | Milind Yadav <miyadav>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | high | |
Priority: | high | CC: | mfedosin, nelluri, rfreiman
Version: | 4.9 | |
Target Milestone: | --- | |
Target Release: | 4.9.0 | |
Hardware: | Unspecified | |
OS: | Linux | |
Whiteboard: | chaos | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-10-18 17:39:27 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1984730 | |
Description (Naga Ravi Chaitanya Elluri, 2021-07-09 22:52:56 UTC):
> We can see that the machine-api-operator restarted at 18:57:37 UTC, and the log of the exited container on the node shows it was trying to acquire the lease during that time: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/machine-api-operator-leader-election/machine-api-operator-container.log.
Looking at this log, I can see that the leader election was lost 60 seconds after the first error that is logged.
I think this is entirely reasonable behaviour, as the lease may have already been renewed 30 seconds before; this seems very much like a timing issue.
After we lose the election, the only safe thing to do is to panic: we have to ensure the controller shuts down to prevent duplicate controllers running at the same time.
The numbers for the lease duration/refresh were all chosen based on other components within OpenShift and some advice from Clayton after we initially set these too aggressively (we were refreshing every 15 seconds IIRC).
We were causing too much load on the API server, so we had to back this off to what we have now.
I don't think there's actually an issue here, or at least, I don't think there's one we can safely protect against without extending the lease considerably and ensuring that renewal happens while more than 60 seconds remain on the lease.
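
For context, this kind of operator leader election is built on client-go's leaderelection package, which surrenders control through the OnStoppedLeading callback exactly as described above. Below is a minimal sketch of such a setup; the lease name, namespace, and timing values are illustrative assumptions, not the operator's actual configuration:

```go
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

// runWithLeaderElection blocks, running `run` only while this process
// holds the lease, and exits the process if the lease is ever lost.
func runWithLeaderElection(ctx context.Context, client kubernetes.Interface, id string, run func(context.Context)) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "machine-api-operator", // hypothetical lease name
			Namespace: "openshift-machine-api",
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 90 * time.Second, // how long the lease stays valid without renewal
		RenewDeadline: 60 * time.Second, // give up renewing this long after the last success
		RetryPeriod:   30 * time.Second, // how often acquire/renew attempts are retried
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run,
			OnStoppedLeading: func() {
				// Once the lease is lost, another instance may already be
				// active, so the only safe option is to stop immediately.
				klog.Fatal("leader election lost")
			},
		},
	})
}
```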
Thanks for taking a look, Joel. One of the goals for SNO is to bring the API rollout time down to 60 seconds and have the various cluster operators and their components handle the API downtime gracefully: https://github.com/openshift/library-go/pull/1104. Is it possible to wait for 90 seconds after the error/hitting an issue talking to the API before triggering the election and restarting, to handle it gracefully instead? Thoughts?

> Is it possible to wait for 90 seconds after the error/hitting an issue talking to the API before triggering the election and restarting to handle it gracefully instead? Thoughts?
No, this is not safe. If we can't access the API, we have to assume it's our fault or that the node we are running on has become partitioned. In that scenario, another pod would be scheduled elsewhere, start up, and take the lease from us as soon as it expires. Waiting beyond the lease is not safe.
The only possible option here is extending the lease durations and refresh periods, but I believe the values we are using are standard across OCP. Are you not seeing similar issues in other components?
Also, given that our component will come back and work again after the restart, is this really an issue?
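
To make the timing argument concrete: client-go retries renewal every RetryPeriod and gives up once RenewDeadline has elapsed since the last successful renewal, so as a rule of thumb the longest API outage a leader can ride out is roughly RenewDeadline minus one RetryPeriod. A back-of-the-envelope check under that assumption (the "tight" values below are hypothetical, chosen only to illustrate the failure mode):

```go
package main

import (
	"fmt"
	"time"
)

// toleratesOutage assumes the last renewal attempt that can still succeed
// starts about one RetryPeriod before the RenewDeadline expires.
func toleratesOutage(renewDeadline, retryPeriod, outage time.Duration) bool {
	return renewDeadline-retryPeriod >= outage
}

func main() {
	outage := 60 * time.Second // SNO's target API rollout time

	// Hypothetical tight values: not enough headroom for a 60s outage.
	fmt.Println(toleratesOutage(60*time.Second, 30*time.Second, outage)) // false

	// The values recommended later in this bug: comfortable headroom.
	fmt.Println(toleratesOutage(107*time.Second, 26*time.Second, outage)) // true
}
```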
@jspeed you can find more info here: https://github.com/openshift/enhancements/pull/832. For SNO, our goal is for all the operators to be able to tolerate 60s of API downtime.

I've reviewed the enhancement, and I think the proposed changes are acceptable to our team. We will make sure that our components get updated to the recommended values from the enhancement. For reference, these are LeaseDuration=137s, RenewDeadline=107s, RetryPeriod=26s.

As per the discussions in the comments and the validation done, we can confirm that no panics or leader elections happen after the fix merged. Moving to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
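
For reference, a sketch of plugging the recommended values into a client-go LeaderElectionConfig; the lock and callbacks are caller-supplied assumptions (see the earlier sketch), not the operator's actual code:

```go
package main

import (
	"time"

	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// recommendedElectionConfig applies the timeouts recommended in
// openshift/enhancements#832: LeaseDuration=137s, RenewDeadline=107s,
// RetryPeriod=26s.
func recommendedElectionConfig(lock resourcelock.Interface, cb leaderelection.LeaderCallbacks) leaderelection.LeaderElectionConfig {
	return leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 137 * time.Second,
		RenewDeadline: 107 * time.Second,
		RetryPeriod:   26 * time.Second,
		Callbacks:     cb,
	}
}
```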