Bug 1775224 - kube-apiserver-operator doesn't release lock when being gracefully terminated
Summary: kube-apiserver-operator doesn't release lock when being gracefully terminated
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: 4.5.0
Assignee: Michal Fojtik
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-11-21 15:26 UTC by Tomáš Nožička
Modified: 2020-07-13 17:12 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The leader election setup in operators was not using the "ReleaseOnCancel" option, which releases the lock when the operator receives a UNIX signal to shut down.
Consequence: When rolling out a new version of an operator, it might take a minute or two until the lock was released and the new version of the operator could continue.
Fix: The graceful shutdown was refactored for control plane operators to respect the graceful termination period, and the operators are now guaranteed to shut down in a clean way. This allowed us to enable the "ReleaseOnCancel" option.
Result: The operators no longer wait for the lock to be released on startup, and operator rollout time improved significantly.
Clone Of:
Environment:
Last Closed: 2020-07-13 17:12:14 UTC
Target Upstream Version:



Links
System                  ID              Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHBA-2020:2409  None      None    None     2020-07-13 17:12:31 UTC

Description Tomáš Nožička 2019-11-21 15:26:34 UTC
Description of problem:
kube-apiserver-operator doesn't release lock when being gracefully terminated

Log output from the new pod when the deployment is updated:

I1121 15:12:45.877442       1 leaderelection.go:241] attempting to acquire leader lease  openshift-kube-apiserver-operator/kube-apiserver-operator-lock...
I1121 15:12:45.877726       1 secure_serving.go:123] Serving securely on 0.0.0.0:8443
I1121 15:14:02.635392       1 leaderelection.go:251] successfully acquired lease openshift-kube-apiserver-operator/kube-apiserver-operator-lock
I1121 15:14:02.636117       1 event.go:255] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator-lock", UID:"fa08f764-098c-4011-908e-cbec13df30aa", APIVersion:"v1", ResourceVersion:"549614", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' 8ee103ba-6331-48a3-968e-997965147c16 became leader

With the Recreate strategy, the old pod is gone before the new one starts, so the new pod is most likely waiting for the old leader's lease to time out.
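
The size of the wait can be read directly from the two leaderelection.go log lines above. The following is a minimal Go sketch that subtracts the "attempting to acquire" timestamp from the "successfully acquired" timestamp. The helper names klogTime and acquireDelay are hypothetical and exist only for this illustration; only the time-of-day part of the klog header is parsed, on the assumption that both lines are from the same day.

```go
package main

import (
	"fmt"
	"time"
)

// klogTime parses the time-of-day part of a klog header, for example
// "15:12:45.877442". The date part ("I1121") is ignored here because
// both log lines are assumed to be from the same day.
func klogTime(s string) (time.Time, error) {
	return time.Parse("15:04:05.000000", s)
}

// acquireDelay returns the time between "attempting to acquire leader
// lease" and "successfully acquired lease".
func acquireDelay(attempt, acquired string) (time.Duration, error) {
	t0, err := klogTime(attempt)
	if err != nil {
		return 0, err
	}
	t1, err := klogTime(acquired)
	if err != nil {
		return 0, err
	}
	return t1.Sub(t0), nil
}

func main() {
	// Timestamps copied from the log lines in this report.
	d, err := acquireDelay("15:12:45.877442", "15:14:02.635392")
	if err != nil {
		panic(err)
	}
	fmt.Println(d) // prints 1m16.75795s
}
```

The result, roughly 77 seconds, is in line with the new pod sitting out a lease timeout rather than finding the lock already released.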

Version-Release number of selected component (if applicable):
oc version
Client Version: v4.2.0-alpha.0-274-g876ed13
Server Version: 4.3.0-0.ci-2019-11-20-022156
Kubernetes Version: v1.16.2

How reproducible:
always

Steps to Reproduce:
1. Trigger a rollout of the kube-apiserver-operator (KAO).

Actual results:
Waits on leader election


Expected results:
Acquires the lock immediately since the terminating pod will release it first.


Additional info:
Fixing this will speed up cluster creation, upgrades, and the development workflow.
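
The difference between the actual and expected results can be illustrated with a toy model. The sketch below is not the client-go implementation: the lease type, the simulate function, the 60-second lease duration, and the 5-second arrival time are all hypothetical values chosen for illustration. It only shows why a lock that is explicitly released on shutdown can be taken immediately, while a lock that is left behind must first expire.

```go
package main

import (
	"fmt"
	"time"
)

// lease is a toy, in-memory stand-in for the lock record.
// It is NOT the client-go type.
type lease struct {
	holder  string
	expires time.Time
}

// release models what the old leader does on shutdown. With
// releaseOnCancel the record is cleared; without it the record is
// left in place until it expires on its own.
func (l *lease) release(releaseOnCancel bool) {
	if releaseOnCancel {
		l.holder = ""
	}
}

// waitToAcquire returns how long a candidate arriving at now must
// wait before it may take the lease.
func (l *lease) waitToAcquire(now time.Time) time.Duration {
	if l.holder == "" || !now.Before(l.expires) {
		return 0
	}
	return l.expires.Sub(now)
}

// simulate runs one rollout: the old pod shuts down right after
// renewing, and the new pod arrives 5 seconds later (hypothetical).
func simulate(releaseOnCancel bool) time.Duration {
	start := time.Date(2019, 11, 21, 15, 12, 0, 0, time.UTC)
	leaseDuration := 60 * time.Second // hypothetical lease duration
	l := &lease{holder: "old-pod", expires: start.Add(leaseDuration)}
	l.release(releaseOnCancel)
	return l.waitToAcquire(start.Add(5 * time.Second))
}

func main() {
	fmt.Println("without release:", simulate(false)) // prints without release: 55s
	fmt.Println("with release:   ", simulate(true))  // prints with release:    0s
}
```

In client-go, the corresponding behavior is controlled by the ReleaseOnCancel field of the leader election configuration, which is the option this report asks the operator to use.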

Comment 3 Michal Fojtik 2020-05-12 10:32:45 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

As such, we're marking this bug as "LifecycleStale" and decreasing severity from "medium" to "low".

If you have further information on the current state of the bug, please update it, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 4 Michal Fojtik 2020-05-12 12:30:44 UTC
This has been fixed in factory.

Comment 8 Ke Wang 2020-05-14 07:26:56 UTC
Verified with OCP 4.5.0-0.nightly-2020-05-13-221558:

$ oc delete -n openshift-kube-apiserver-operator pod kube-apiserver-operator-745f6658c8-jn4d5 --force --grace-period=0

$ oc -n openshift-kube-apiserver-operator get pods
NAME                                       READY   STATUS    RESTARTS   AGE
kube-apiserver-operator-745f6658c8-c9ggm   1/1     Running   0          12m
                                                                       
$ oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-745f6658c8-c9ggm | grep -n -A2 'attempting to acquire leader'
11:I0514 07:11:05.508187       1 leaderelection.go:242] attempting to acquire leader lease  openshift-kube-apiserver-operator/kube-apiserver-operator-lock...
12-I0514 07:11:05.517100       1 leaderelection.go:252] successfully acquired lease openshift-kube-apiserver-operator/kube-apiserver-operator-lock
13-I0514 07:11:05.517355       1 event.go:278] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator-lock", UID:"048bbd57-0c2f-4b87-b86b-233a0b1c7ff5", APIVersion:"v1", ResourceVersion:"100923", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' 04c1434e-949c-4963-99ed-44a4ef6bfd40 became leader

The logs of the newly rolled-out kube-apiserver-operator pod show that it acquires the lock immediately (about 9 ms after the first attempt) once the previous pod has been terminated, as expected.

Comment 10 errata-xmlrpc 2020-07-13 17:12:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

