Bug 1981055 - ovn-kubernetes-master needs to handle 60 seconds of API server downtime gracefully in SNO
Summary: ovn-kubernetes-master needs to handle 60 seconds of API server downtime gracefully in SNO
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Christoph Stäbler
QA Contact: Anurag saxena
URL:
Whiteboard: chaos
Depends On:
Blocks: 1984730
 
Reported: 2021-07-11 01:35 UTC by Naga Ravi Chaitanya Elluri
Modified: 2023-09-15 01:11 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:39:33 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 1154 0 None open Bug 1981055: ovnkube-master handle 60 seconds downtime of API server gracefully 2021-07-15 14:59:36 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:39:37 UTC

Description Naga Ravi Chaitanya Elluri 2021-07-11 01:35:58 UTC
Description of problem:
The ovn-kube-master leader-election lease duration is set to 60 seconds, which causes it to go through leader election and restart during the kube-apiserver rollout. The rollout currently takes around 60 seconds now that shutdown-delay-duration and gracefulTerminationDuration are set to 0 and 15 seconds respectively ( https://github.com/openshift/cluster-kube-apiserver-operator/pull/1168 and https://github.com/openshift/library-go/pull/1104 ). The ovn-kube-master leader-election timeout should be set to > 60 seconds ( 90 seconds is what we are thinking for the rest of the components ) to handle the downtime gracefully in SNO.

Recommended lease-duration values to consider, as noted in https://github.com/openshift/enhancements/pull/832/files#diff-2e28754e69aa417e5b6d89e99e42f05bfb6330800fa823753383db1d170fbc2fR183:

LeaseDuration=137s, RenewDeadline=107s, RetryPeriod=26s.
These are the configurable values in k8s.io/client-go-based leases, and controller-runtime exposes them; a minimal Go sketch of how they map onto client-go leader election follows the list below.
This gives us
   1. clock skew tolerance == 30s
   2. kube-apiserver downtime tolerance == 78s
   3. worst non-graceful lease reacquisition == 163s
   4. worst graceful lease reacquisition == 26s
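
For illustration, here is a minimal Go sketch of how these values map onto k8s.io/client-go leader election. The namespace, lock name, and identity below are placeholder assumptions, not the actual ovn-kubernetes wiring:

package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runWithLeaderElection wires the recommended timings into client-go leader
// election and blocks until the context is cancelled or the lease is lost.
func runWithLeaderElection(ctx context.Context, cfg *rest.Config, identity string, run func(context.Context)) error {
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	// Lease-based lock; namespace and name are illustrative placeholders.
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Namespace: "openshift-ovn-kubernetes",
			Name:      "ovn-kubernetes-master",
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: identity},
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		// Values recommended in the enhancement referenced above:
		// ~30s clock-skew tolerance and ~78s kube-apiserver downtime tolerance.
		LeaseDuration: 137 * time.Second,
		RenewDeadline: 107 * time.Second,
		RetryPeriod:   26 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run, // start the master controllers here
			OnStoppedLeading: func() {
				// In SNO, losing the lease typically means the process restarts.
			},
		},
	})
	return nil
}

func main() {
	cfg, err := rest.InClusterConfig() // assumes the process runs inside the cluster
	if err != nil {
		panic(err)
	}
	_ = runWithLeaderElection(context.Background(), cfg, "ovnkube-master-example",
		func(ctx context.Context) { <-ctx.Done() })
}

With the current 60-second lease, a ~60-second kube-apiserver outage can cause the renew to fail and the process to restart; the 137s/107s/26s combination is intended to ride through that rollout.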

Here is the trace of the events during the rollout: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/ovn-kube-master-leader-election/. We can see that ovn-kube-master restarted at 2021-07-11 00:37:11, which maps to the leader election seen in the ovn-kube-master logs: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/ovn-kube-master-leader-election/ovn-kube-master.log. The leader election could also be disabled entirely, given that there is no HA in SNO (see the controller-runtime sketch below).
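
As a hedged sketch only (ovn-kubernetes does not necessarily use controller-runtime for this election), the same knobs are exposed through controller-runtime's manager options, which also make it easy to turn election off for a single-node deployment:

package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
	leaseDuration := 137 * time.Second
	renewDeadline := 107 * time.Second
	retryPeriod := 26 * time.Second

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
		// Setting LeaderElection to false skips election entirely, an option
		// for SNO where only one instance ever runs.
		LeaderElection:          true,
		LeaderElectionID:        "ovn-kubernetes-master",    // assumed ID
		LeaderElectionNamespace: "openshift-ovn-kubernetes", // assumed namespace
		LeaseDuration:           &leaseDuration,
		RenewDeadline:           &renewDeadline,
		RetryPeriod:             &retryPeriod,
	})
	if err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}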

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-07-07-021823

How reproducible:
Always

Steps to Reproduce:
1. Install a SNO cluster using the latest nightly payload.
2. Trigger a kube-apiserver rollout or an outage lasting at least 60 seconds (a kube-apiserver rollout on a cluster built from a payload that includes https://github.com/openshift/cluster-kube-apiserver-operator/pull/1168 should take ~60 seconds): $ oc patch kubeapiserver/cluster --type merge -p '{"spec":{"forceRedeploymentReason":"ITERATIONX"}}' where X can be 1, 2, ..., n.
3. Observe the state of ovn-kube-master.

Actual results:
ovn-kube-master goes through leader election and restarts.

Expected results:
ovn-kube-master should handle the API rollout/outage gracefully.

Additional info:
Logs including must-gather: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/ovn-kube-master-leader-election/

Comment 4 zhaozhanqi 2021-08-04 08:15:38 UTC
Hi, Naga

Could you help verify this bug?

Comment 5 zhaozhanqi 2021-08-05 01:08:28 UTC
Verified this bug on 4.9.0-0.nightly-2021-08-03-200806

After forcing a redeployment of the kube-apiserver, ovnkube-master was not restarted:

$ oc patch kubeapiserver/cluster --type merge -p '{"spec":{"forceRedeploymentReason":"something1"}}'
kubeapiserver.operator.openshift.io/cluster patched
$ oc get pod -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-lmtrw   6/6     Running   0          34m
ovnkube-node-422vg     4/4     Running   1          34m
$ oc get pod -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-lmtrw   6/6     Running   0          35m
ovnkube-node-422vg     4/4     Running   1          35m
$ oc get pod -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-lmtrw   6/6     Running   0          35m
ovnkube-node-422vg     4/4     Running   1          35m
$ oc get pod -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-lmtrw   6/6     Running   0          35m
ovnkube-node-422vg     4/4     Running   1          35m
$ oc get pod -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-lmtrw   6/6     Running   0          35m
ovnkube-node-422vg     4/4     Running   1          35m
$ oc get pod -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-lmtrw   6/6     Running   0          36m
ovnkube-node-422vg     4/4     Running   1          36m

Comment 8 errata-xmlrpc 2021-10-18 17:39:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

Comment 9 Red Hat Bugzilla 2023-09-15 01:11:14 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

