Bug 1984683 - sdn-controller needs to handle 60 seconds downtime of API server gracefully in SNO
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: Unspecified
OS: Linux
Target Milestone: ---
Target Release: 4.9.0
Assignee: Christoph Stäbler
QA Contact: zhaozhanqi
Whiteboard: chaos
Depends On:
Blocks: 1984730
Reported: 2021-07-21 20:36 UTC by Naga Ravi Chaitanya Elluri
Modified: 2023-09-15 01:11 UTC (History)
8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2021-10-18 17:40:27 UTC
Target Upstream Version:

Attachments

System ID Private Priority Status Summary Last Updated
Github openshift sdn pull 328 0 None open Bug 1984683: use new default leader election values to handle apiserver rollout on SNO 2021-07-30 06:46:42 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:40:57 UTC

Description Naga Ravi Chaitanya Elluri 2021-07-21 20:36:21 UTC
Description of problem:
sdn-controller is going through leader elections and restarting during the kube-apiserver rollout, which currently takes around 60 seconds now that shutdown-delay-duration and gracefulTerminationDuration are set to 0 and 15 seconds respectively ( https://github.com/openshift/cluster-kube-apiserver-operator/pull/1168 and https://github.com/openshift/library-go/pull/1104 ). The sdn-controller leader election lease duration needs to be set to > 60 seconds to handle the downtime gracefully in SNO.

Recommended lease duration values to be considered for reference as noted in https://github.com/openshift/enhancements/pull/832/files#diff-2e28754e69aa417e5b6d89e99e42f05bfb6330800fa823753383db1d170fbc2fR183:

LeaseDuration=137s, RenewDeadline=107s, RetryPeriod=26s.
These are the configurable values in k8s.io/client-go based leases, and controller-runtime exposes them.
This gives us
   1. clock skew tolerance == 30s
   2. kube-apiserver downtime tolerance == 78s
   3. worst non-graceful lease reacquisition == 163s
   4. worst graceful lease reacquisition == 26s

Here is the trace of the events during the rollout: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/sdn-controller/cerberus_api_rollout_trace.json. The leader lease failures can be seen in the log: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/sdn-controller/sdn-controller.log. Leader election could also be disabled entirely, given that there's no HA in SNO.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Install a SNO cluster using the latest nightly payload.
2. Trigger a kube-apiserver rollout or outage lasting at least 60 seconds ( a kube-apiserver rollout on a cluster built from a payload that includes https://github.com/openshift/cluster-kube-apiserver-operator/pull/1168 should take ~60 seconds ): $ oc patch kubeapiserver/cluster --type merge -p '{"spec":{"forceRedeploymentReason":"ITERATIONX"}}' where X can be 1,2...n
3. Observe the state of sdn-controller.

Actual results:
sdn-controller goes through leader election and restarts.

Expected results:
sdn-controller should handle the API rollout/outage gracefully.

Additional info:
Logs including must-gather: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/sdn-controller/

Comment 2 Dan Winship 2021-07-22 19:10:05 UTC
If we change the timeouts then won't that make it take longer to recover in "real" clusters?

It seems like we should just make it not do leader election at all in SNO / if there's only a single master.

Comment 3 Naga Ravi Chaitanya Elluri 2021-07-23 00:39:26 UTC
Right, leader elections are not needed in SNO given that there's no HA. We could take the route of detecting an SNO deployment and flipping the leader-elect flag, if it's exposed as an option and is a feasible change for the 4.9 time frame. Thoughts?
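A minimal sketch of the detection idea, assuming the decision is keyed off the Infrastructure config's status.controlPlaneTopology field (reported as "SingleReplica" on SNO); fetching that field with an OpenShift client is omitted, and this helper is hypothetical, not the actual sdn-controller change:

```go
package main

import "fmt"

// leaderElectionNeeded decides whether to run leader election at all, given
// the cluster's control-plane topology string as exposed by the
// config.openshift.io/v1 Infrastructure status ("SingleReplica" on SNO,
// "HighlyAvailable" on multi-master clusters).
func leaderElectionNeeded(controlPlaneTopology string) bool {
	// With a single control-plane replica there is nothing to contend with,
	// so leader election only adds apiserver-outage sensitivity.
	return controlPlaneTopology != "SingleReplica"
}

func main() {
	for _, topo := range []string{"HighlyAvailable", "SingleReplica"} {
		fmt.Printf("%s: leader election = %v\n", topo, leaderElectionNeeded(topo))
	}
}
```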

Comment 13 zhaozhanqi 2021-08-04 07:21:49 UTC
Moving this to verified according to comments 8, 9 and 10.

Comment 16 errata-xmlrpc 2021-10-18 17:40:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.


Comment 17 Red Hat Bugzilla 2023-09-15 01:11:50 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days
