Bug 1775224

Summary:	kube-apiserver-operator doesn't release lock when being gracefully terminated
Product:	OpenShift Container Platform	Reporter:	Tomáš Nožička <tnozicka>
Component:	kube-apiserver	Assignee:	Michal Fojtik <mfojtik>
Status:	CLOSED ERRATA	QA Contact:	Ke Wang <kewang>
Severity:	low	Docs Contact:
Priority:	medium
Version:	4.3.0	CC:	aos-bugs, mfojtik, sttts, vareti, xxia
Target Milestone:	---
Target Release:	4.5.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: The leader election setup in operators did not use the "ReleaseOnCancel" option, which releases the lock when the operator receives a UNIX signal to shut down. Consequence: When rolling out a new version of an operator, it could take a minute or two until the lock was released and the new version of the operator could continue. Fix: Graceful shutdown was refactored for control plane operators to respect the graceful termination period, so the operators are now guaranteed to shut down cleanly. This allowed us to enable the "ReleaseOnCancel" option. Result: The operators no longer wait for the lock to be released on startup, and operator rollout time improved significantly.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-07-13 17:12:14 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Tomáš Nožička 2019-11-21 15:26:34 UTC

Description of problem:
kube-apiserver-operator doesn't release lock when being gracefully terminated

On updating the deployment, the new pod logs:

I1121 15:12:45.877442       1 leaderelection.go:241] attempting to acquire leader lease  openshift-kube-apiserver-operator/kube-apiserver-operator-lock...
I1121 15:12:45.877726       1 secure_serving.go:123] Serving securely on 0.0.0.0:8443
I1121 15:14:02.635392       1 leaderelection.go:251] successfully acquired lease openshift-kube-apiserver-operator/kube-apiserver-operator-lock
I1121 15:14:02.636117       1 event.go:255] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator-lock", UID:"fa08f764-098c-4011-908e-cbec13df30aa", APIVersion:"v1", ResourceVersion:"549614", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' 8ee103ba-6331-48a3-968e-997965147c16 became leader

With the Recreate strategy, the old pod is gone before the new one starts, so the new pod is likely waiting for the leader election lease to time out.
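To illustrate why an unreleased lock causes this wait, here is a minimal toy model in Python. It is not the client-go implementation: `LeaseRecord`, `try_acquire`, `release`, and `seconds_until_acquired` are hypothetical names, and the 60-second lease duration and 1-second retry period are made-up values, not the operator's actual settings. The sketch assumes a candidate can take the lock only when no holder is recorded or the last renewal is older than the lease duration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaseRecord:
    """Toy stand-in for the lock record stored in the cluster."""
    holder: Optional[str] = None
    renew_time: float = 0.0       # seconds; last time the holder renewed
    lease_duration: float = 60.0  # hypothetical value, not the operator's setting


def try_acquire(lock: LeaseRecord, candidate: str, now: float) -> bool:
    """Take the lock if it is free, already ours, or its lease has expired."""
    expired = now >= lock.renew_time + lock.lease_duration
    if lock.holder in (None, candidate) or expired:
        lock.holder = candidate
        lock.renew_time = now
        return True
    return False


def release(lock: LeaseRecord, holder: str) -> None:
    """Clear the holder on shutdown, which is what releasing on cancel is meant to achieve."""
    if lock.holder == holder:
        lock.holder = None


def seconds_until_acquired(lock: LeaseRecord, candidate: str, start: float,
                           retry_period: float = 1.0) -> float:
    """Retry every retry_period seconds and return how long acquisition took."""
    now = start
    while not try_acquire(lock, candidate, now):
        now += retry_period
    return now - start


# Old pod renewed at t=100 and was terminated at t=105 without releasing.
stale = LeaseRecord(holder="old-pod", renew_time=100.0)
wait_without_release = seconds_until_acquired(stale, "new-pod", start=105.0)

# Same timeline, but the old pod releases the lock while shutting down.
released = LeaseRecord(holder="old-pod", renew_time=100.0)
release(released, "old-pod")
wait_with_release = seconds_until_acquired(released, "new-pod", start=105.0)

print(wait_without_release, wait_with_release)
```

In this model the new pod waits out the remainder of the stale lease when the old pod exits without releasing, and acquires the lock on its first attempt when the old pod clears the holder before exiting.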

Version-Release number of selected component (if applicable):
oc version
Client Version: v4.2.0-alpha.0-274-g876ed13
Server Version: 4.3.0-0.ci-2019-11-20-022156
Kubernetes Version: v1.16.2

How reproducible:
always

Steps to Reproduce:
1. Trigger a rollout of KAO (kube-apiserver-operator).

Actual results:
The new pod waits on leader election (about 77 seconds in the log above) before acquiring the lock.


Expected results:
Acquires the lock immediately since the terminating pod will release it first.


Additional info:
Fixing this will speed up cluster creation, upgrades, and the dev workflow.

Comment 3 Michal Fojtik 2020-05-12 10:32:45 UTC

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

As such, we're marking this bug as "LifecycleStale" and decreasing severity from "medium" to "low".

If you have further information on the current state of the bug, please update it, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 4 Michal Fojtik 2020-05-12 12:30:44 UTC

This has been fixed in factory.

Comment 8 Ke Wang 2020-05-14 07:26:56 UTC

Verified with OCP 4.5.0-0.nightly-2020-05-13-221558:

$ oc delete -n openshift-kube-apiserver-operator pod kube-apiserver-operator-745f6658c8-jn4d5 --force --grace-period=0

$ oc -n openshift-kube-apiserver-operator get pods
NAME                                       READY   STATUS    RESTARTS   AGE
kube-apiserver-operator-745f6658c8-c9ggm   1/1     Running   0          12m
                                                                       
$ oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-745f6658c8-c9ggm | grep -n -A2 'attempting to acquire leader'
11:I0514 07:11:05.508187       1 leaderelection.go:242] attempting to acquire leader lease  openshift-kube-apiserver-operator/kube-apiserver-operator-lock...
12-I0514 07:11:05.517100       1 leaderelection.go:252] successfully acquired lease openshift-kube-apiserver-operator/kube-apiserver-operator-lock
13-I0514 07:11:05.517355       1 event.go:278] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator-lock", UID:"048bbd57-0c2f-4b87-b86b-233a0b1c7ff5", APIVersion:"v1", ResourceVersion:"100923", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' 04c1434e-949c-4963-99ed-44a4ef6bfd40 became leader

The logs of the newly rolled-out kube-apiserver-operator pod show that it acquires the lock immediately after the old pod is terminated, as expected.
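To put a number on the improvement, the wait can be computed from the klog timestamps of the "attempting to acquire" and "successfully acquired" lines. The following is a rough Python sketch. The helper names `klog_time` and `acquire_wait_seconds` are hypothetical, the log lines are shortened copies of the ones in this report (without the `grep -n` line-number prefixes), and it assumes both lines of a pair were logged on the same day.

```python
from datetime import datetime


def klog_time(line: str) -> datetime:
    """Parse the HH:MM:SS.ffffff field of a klog line such as 'I1121 15:12:45.877442 ...'."""
    return datetime.strptime(line.split()[1], "%H:%M:%S.%f")


def acquire_wait_seconds(attempt_line: str, acquired_line: str) -> float:
    """Seconds between the 'attempting to acquire' and 'successfully acquired' lines."""
    return (klog_time(acquired_line) - klog_time(attempt_line)).total_seconds()


# Pair from the original description (4.3.0 CI build).
before = acquire_wait_seconds(
    "I1121 15:12:45.877442       1 leaderelection.go:241] attempting to acquire leader lease",
    "I1121 15:14:02.635392       1 leaderelection.go:251] successfully acquired lease",
)

# Pair from the verification run (4.5.0 nightly).
after = acquire_wait_seconds(
    "I0514 07:11:05.508187       1 leaderelection.go:242] attempting to acquire leader lease",
    "I0514 07:11:05.517100       1 leaderelection.go:252] successfully acquired lease",
)

print(f"before fix: {before:.3f}s, after fix: {after:.3f}s")
# prints "before fix: 76.758s, after fix: 0.009s"
```

On these two samples the wait drops from roughly 77 seconds to about 9 milliseconds.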

Comment 10 errata-xmlrpc 2020-07-13 17:12:14 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409