Bug 1981055 - ovn-kubernetes-master needs to handle 60 seconds of API server downtime gracefully in SNO
Summary: ovn-kubernetes-master needs to handle 60 seconds of API server downtime gracefully in SNO
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Christoph Stäbler
QA Contact: Anurag saxena
URL:
Whiteboard: chaos
Depends On:
Blocks: 1984730
 
Reported: 2021-07-11 01:35 UTC by Naga Ravi Chaitanya Elluri
Modified: 2023-09-15 01:11 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:39:33 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 1154 0 None open Bug 1981055: ovnkube-master handle 60 seconds downtime of API server gracefully 2021-07-15 14:59:36 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:39:37 UTC

Description Naga Ravi Chaitanya Elluri 2021-07-11 01:35:58 UTC
Description of problem:
The ovn-kube-master leader-election lease duration is set to 60 seconds, which causes it to go through leader election and restart during the kube-apiserver rollout. The rollout currently takes around 60 seconds now that shutdown-delay-duration and gracefulTerminationDuration are set to 0 and 15 seconds respectively ( https://github.com/openshift/cluster-kube-apiserver-operator/pull/1168 and https://github.com/openshift/library-go/pull/1104 ). The ovn-kube-master leader-election timeout should be set to > 60 seconds ( 90 seconds is what we are thinking for the rest of the components ) to handle the downtime gracefully in SNO.

Recommended lease-duration values to consider, as noted in https://github.com/openshift/enhancements/pull/832/files#diff-2e28754e69aa417e5b6d89e99e42f05bfb6330800fa823753383db1d170fbc2fR183:

LeaseDuration=137s, RenewDeadline=107s, RetryPeriod=26s.
These are the configurable values in k8s.io/client-go-based leases, and controller-runtime exposes them; a minimal Go sketch of how they map onto client-go leader election follows the list below.
This gives us
   1. clock skew tolerance == 30s
   2. kube-apiserver downtime tolerance == 78s
   3. worst non-graceful lease reacquisition == 163s
   4. worst graceful lease reacquisition == 26s
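
For illustration, here is a minimal Go sketch of how these values map onto k8s.io/client-go leader election. The namespace, lock name, and identity below are placeholder assumptions, not the actual ovn-kubernetes wiring:

package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runWithLeaderElection wires the recommended timings into client-go leader
// election and blocks until the context is cancelled or the lease is lost.
func runWithLeaderElection(ctx context.Context, cfg *rest.Config, identity string, run func(context.Context)) error {
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	// Lease-based lock; namespace and name are illustrative placeholders.
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Namespace: "openshift-ovn-kubernetes",
			Name:      "ovn-kubernetes-master",
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: identity},
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		// Values recommended in the enhancement referenced above:
		// ~30s clock-skew tolerance and ~78s kube-apiserver downtime tolerance.
		LeaseDuration: 137 * time.Second,
		RenewDeadline: 107 * time.Second,
		RetryPeriod:   26 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run, // start the master controllers here
			OnStoppedLeading: func() {
				// In SNO, losing the lease typically means the process restarts.
			},
		},
	})
	return nil
}

func main() {
	cfg, err := rest.InClusterConfig() // assumes the process runs inside the cluster
	if err != nil {
		panic(err)
	}
	_ = runWithLeaderElection(context.Background(), cfg, "ovnkube-master-example",
		func(ctx context.Context) { <-ctx.Done() })
}

With the current 60-second lease, a ~60-second kube-apiserver outage can cause the renew to fail and the process to restart; the 137s/107s/26s combination is intended to ride through that rollout.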

Here is the trace of the events during the rollout: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/ovn-kube-master-leader-election/. We can see that ovn-kube-master restarted at 2021-07-11 00:37:11, which maps to the leader election seen in the ovn-kube-master logs: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/ovn-kube-master-leader-election/ovn-kube-master.log. The leader election could also be disabled entirely, given that there is no HA in SNO (see the controller-runtime sketch below).
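
As a hedged sketch only (ovn-kubernetes does not necessarily use controller-runtime for this election), the same knobs are exposed through controller-runtime's manager options, which also make it easy to turn election off for a single-node deployment:

package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
	leaseDuration := 137 * time.Second
	renewDeadline := 107 * time.Second
	retryPeriod := 26 * time.Second

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
		// Setting LeaderElection to false skips election entirely, an option
		// for SNO where only one instance ever runs.
		LeaderElection:          true,
		LeaderElectionID:        "ovn-kubernetes-master",    // assumed ID
		LeaderElectionNamespace: "openshift-ovn-kubernetes", // assumed namespace
		LeaseDuration:           &leaseDuration,
		RenewDeadline:           &renewDeadline,
		RetryPeriod:             &retryPeriod,
	})
	if err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}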

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-07-07-021823

How reproducible:
Always

Steps to Reproduce:
1. Install a SNO cluster using the latest nightly payload.
2. Trigger a kube-apiserver rollout or an outage lasting at least 60 seconds (a kube-apiserver rollout on a cluster built from a payload that includes https://github.com/openshift/cluster-kube-apiserver-operator/pull/1168 should take ~60 seconds): $ oc patch kubeapiserver/cluster --type merge -p '{"spec":{"forceRedeploymentReason":"ITERATIONX"}}' where X can be 1, 2, ..., n.
3. Observe the state of ovn-kube-master.

Actual results:
ovn-kube-master goes through leader election and restarts.

Expected results:
ovn-kube-master should handle the API rollout/outage gracefully.

Additional info:
Logs including must-gather: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/ovn-kube-master-leader-election/

Comment 4 zhaozhanqi 2021-08-04 08:15:38 UTC
Hi, Naga

Could you help verify this bug?

Comment 5 zhaozhanqi 2021-08-05 01:08:28 UTC
Verified this bug on 4.9.0-0.nightly-2021-08-03-200806

After forcing a redeployment of the kube-apiserver, ovnkube-master was not restarted:

$ oc patch kubeapiserver/cluster --type merge -p '{"spec":{"forceRedeploymentReason":"something1"}}'
kubeapiserver.operator.openshift.io/cluster patched
$ oc get pod -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-lmtrw   6/6     Running   0          34m
ovnkube-node-422vg     4/4     Running   1          34m
$ oc get pod -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-lmtrw   6/6     Running   0          35m
ovnkube-node-422vg     4/4     Running   1          35m
$ oc get pod -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-lmtrw   6/6     Running   0          35m
ovnkube-node-422vg     4/4     Running   1          35m
$ oc get pod -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-lmtrw   6/6     Running   0          35m
ovnkube-node-422vg     4/4     Running   1          35m
$ oc get pod -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-lmtrw   6/6     Running   0          35m
ovnkube-node-422vg     4/4     Running   1          35m
$ oc get pod -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-lmtrw   6/6     Running   0          36m
ovnkube-node-422vg     4/4     Running   1          36m

Comment 8 errata-xmlrpc 2021-10-18 17:39:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

Comment 9 Red Hat Bugzilla 2023-09-15 01:11:14 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

