Bug 1984683

Summary: sdn-controller needs to handle 60 seconds downtime of API server gracefully in SNO
Product: OpenShift Container Platform
Reporter: Naga Ravi Chaitanya Elluri <nelluri>
Component: Networking
Assignee: Christoph Stäbler <cstabler>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: unspecified
CC: anbhat, astoycos, cstabler, danw, mcurry, nelluri, rfreiman, zzhao
Version: 4.9
Target Milestone: ---
Target Release: 4.9.0
Hardware: Unspecified
OS: Linux
Whiteboard: chaos
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-10-18 17:40:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1984730

Description Naga Ravi Chaitanya Elluri 2021-07-21 20:36:21 UTC
Description of problem:
sdn-controller is going through leader elections and restarting during the kube-apiserver rollout, which currently takes around 60 seconds now that shutdown-delay-duration and gracefulTerminationDuration are set to 0 and 15 seconds respectively (https://github.com/openshift/cluster-kube-apiserver-operator/pull/1168 and https://github.com/openshift/library-go/pull/1104). The sdn-controller leader election lease duration needs to be set to more than 60 seconds to handle this downtime gracefully in SNO.

Recommended lease duration values to consider, as noted in https://github.com/openshift/enhancements/pull/832/files#diff-2e28754e69aa417e5b6d89e99e42f05bfb6330800fa823753383db1d170fbc2fR183 (a minimal client-go sketch using these values follows the list below):

LeaseDuration=137s, RenewDeadline=107s, RetryPeriod=26s.
These are the configurable values in k8s.io/client-go based leases, and controller-runtime exposes them.
This gives us
   1. clock skew tolerance == 30s
   2. kube-apiserver downtime tolerance == 78s
   3. worst non-graceful lease reacquisition == 163s
   4. worst graceful lease reacquisition == 26s
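
For illustration, here is a minimal sketch of wiring these values into a standard k8s.io/client-go leader election config. The lock namespace ("openshift-sdn"), lock name ("openshift-sdn-controller"), and function names are placeholders for this example, not taken from the actual sdn-controller code.

// Minimal sketch, assuming a standard k8s.io/client-go leader election setup.
// The lock namespace and name below are illustrative placeholders.
package main

import (
	"context"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func runWithLeaderElection(ctx context.Context, cfg *rest.Config, id string, run func(context.Context)) error {
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	// Lease-based lock; namespace and name are placeholders.
	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock,
		"openshift-sdn",            // namespace (illustrative)
		"openshift-sdn-controller", // lock name (illustrative)
		client.CoreV1(),
		client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: id},
	)
	if err != nil {
		return err
	}

	// Values recommended in the enhancement linked above: ~78s of
	// kube-apiserver downtime tolerance with ~30s of clock-skew margin.
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 137 * time.Second,
		RenewDeadline: 107 * time.Second,
		RetryPeriod:   26 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run,
			OnStoppedLeading: func() {
				// The real controller would exit here so it can restart and re-acquire.
			},
		},
	})
	return nil
}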

Here is the trace of the events during the rollout: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/sdn-controller/cerberus_api_rollout_trace.json. The leader lease failures can be seen in the log: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/sdn-controller/sdn-controller.log. Alternatively, leader election could be disabled entirely, given that there's no HA in SNO.



Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-07-19-192457

How reproducible:
Always

Steps to Reproduce:
1. Install a SNO cluster using the latest nightly payload.
2. Trigger a kube-apiserver rollout or outage that lasts for at least 60 seconds (a kube-apiserver rollout on a cluster built from a payload that includes https://github.com/openshift/cluster-kube-apiserver-operator/pull/1168 should take ~60 seconds): $ oc patch kubeapiserver/cluster --type merge -p '{"spec":{"forceRedeploymentReason":"ITERATIONX"}}' where X can be 1,2...n
3. Observe the state of sdn-controller.

Actual results:
sdn-controller goes through leader election and restarts.

Expected results:
sdn-controller should handle the API rollout/outage gracefully.

Additional info:
Logs including must-gather: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/sdn-controller/

Comment 2 Dan Winship 2021-07-22 19:10:05 UTC
If we change the timeouts then won't that make it take longer to recover in "real" clusters?

It seems like we should just make it not do leader election at all in SNO / if there's only a single master.

Comment 3 Naga Ravi Chaitanya Elluri 2021-07-23 00:39:26 UTC
Right, leader elections are not needed in SNO given that there's no HA. We can take the route of detecting an SNO deployment and flipping the leader-elect flag if it's exposed as an option and is a feasible change for the 4.9 time frame. Thoughts?
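
One possible shape for that detection, sketched here as an assumption (the actual fix may use a different mechanism): query the cluster Infrastructure config's control-plane topology via the OpenShift config API and skip leader election when it reports SingleReplica.

// Sketch only: detect a single-node control plane via the Infrastructure
// config object. Whether the real fix uses this check is an assumption
// of this example.
package main

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"
)

// isSingleReplicaControlPlane reports whether the cluster runs a
// single-node (SNO) control plane, where leader election adds no value.
func isSingleReplicaControlPlane(ctx context.Context, cfg *rest.Config) (bool, error) {
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		return false, err
	}
	infra, err := client.ConfigV1().Infrastructures().Get(ctx, "cluster", metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	return infra.Status.ControlPlaneTopology == configv1.SingleReplicaTopologyMode, nil
}

The controller could then run without a lock on SNO while keeping leader election (with longer timeouts) on multi-node clusters.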

Comment 13 zhaozhanqi 2021-08-04 07:21:49 UTC
Moving this to verified according to comments 8, 9, and 10.

Comment 16 errata-xmlrpc 2021-10-18 17:40:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

Comment 17 Red Hat Bugzilla 2023-09-15 01:11:50 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days