Bug 1775224
| Summary: | kube-apiserver-operator doesn't release lock when being gracefully terminated | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Tomáš Nožička <tnozicka> |
| Component: | kube-apiserver | Assignee: | Michal Fojtik <mfojtik> |
| Status: | CLOSED ERRATA | QA Contact: | Ke Wang <kewang> |
| Severity: | low | Docs Contact: | |
| Priority: | medium | ||
| Version: | 4.3.0 | CC: | aos-bugs, mfojtik, sttts, vareti, xxia |
| Target Milestone: | --- | ||
| Target Release: | 4.5.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: |
Cause: The leader election setup in operators was not using the "ReleaseOnCancel" option, which releases the lock when the operator receives a UNIX signal to shut down.
Consequence: When rolling out a new version of an operator, it might take a minute or two until the lock was released and the new version of the operator could continue.
Fix: The graceful shutdown was refactored for control plane operators to respect the graceful termination period, and the operators are now guaranteed to shut down in a clean way. This allowed us to enable the "ReleaseOnCancel" option.
Result: The operators no longer wait for the lock to be released on startup, and the operator rollout time improved significantly.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-07-13 17:12:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing severity from "medium" to "low". If you have further information on the current state of the bug, please update it, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

This has been fixed in factory.

Verified with OCP 4.5.0-0.nightly-2020-05-13-221558:
$ oc delete -n openshift-kube-apiserver-operator pod kube-apiserver-operator-745f6658c8-jn4d5 --force --grace-period=0
$ oc -n openshift-kube-apiserver-operator get pods
NAME READY STATUS RESTARTS AGE
kube-apiserver-operator-745f6658c8-c9ggm 1/1 Running 0 12m
$ oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-745f6658c8-c9ggm | grep -n -A2 'attempting to acquire leader'
11:I0514 07:11:05.508187 1 leaderelection.go:242] attempting to acquire leader lease openshift-kube-apiserver-operator/kube-apiserver-operator-lock...
12-I0514 07:11:05.517100 1 leaderelection.go:252] successfully acquired lease openshift-kube-apiserver-operator/kube-apiserver-operator-lock
13-I0514 07:11:05.517355 1 event.go:278] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator-lock", UID:"048bbd57-0c2f-4b87-b86b-233a0b1c7ff5", APIVersion:"v1", ResourceVersion:"100923", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' 04c1434e-949c-4963-99ed-44a4ef6bfd40 became leader
In the logs of the newly rolled-out kube-apiserver-operator pod, we can see that the operator acquires the lock immediately (within about 9 ms of the first attempt), as expected when the previous pod is gracefully terminated.
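The delay between the attempt and the acquisition can also be computed from the log lines rather than read off by eye. The following is a minimal sketch, not part of the original verification: the helper names `klogTime` and `acquireDelay` are hypothetical, and it assumes raw klog lines (without the `11:` / `12-` prefixes that `grep -n` adds) containing the two `leaderelection.go` messages seen in this report.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// klogLayout matches the "MMDD hh:mm:ss.uuuuuu" part of a klog header.
// klog does not print the year, so parsed times land in year 0; that is
// fine here because only the difference between two times is used.
const klogLayout = "0102 15:04:05.000000"

// klogTime extracts the timestamp from a klog line such as
// "I0514 07:11:05.508187 1 leaderelection.go:242] ...".
func klogTime(line string) (time.Time, error) {
	fields := strings.Fields(line)
	if len(fields) < 2 || len(fields[0]) != 5 {
		return time.Time{}, fmt.Errorf("not a klog line: %q", line)
	}
	// fields[0] is the severity letter followed by MMDD; fields[1] is the time.
	return time.Parse(klogLayout, fields[0][1:]+" "+fields[1])
}

// acquireDelay returns the time between the first "attempting to acquire"
// message and the first "successfully acquired" message that follows it.
func acquireDelay(lines []string) (time.Duration, error) {
	var start time.Time
	started := false
	for _, line := range lines {
		switch {
		case !started && strings.Contains(line, "attempting to acquire leader lease"):
			t, err := klogTime(line)
			if err != nil {
				return 0, err
			}
			start, started = t, true
		case started && strings.Contains(line, "successfully acquired lease"):
			t, err := klogTime(line)
			if err != nil {
				return 0, err
			}
			return t.Sub(start), nil
		}
	}
	return 0, fmt.Errorf("no acquire attempt followed by a success message")
}

func main() {
	// Log lines copied from the verification output in this report.
	fixed := []string{
		"I0514 07:11:05.508187 1 leaderelection.go:242] attempting to acquire leader lease openshift-kube-apiserver-operator/kube-apiserver-operator-lock...",
		"I0514 07:11:05.517100 1 leaderelection.go:252] successfully acquired lease openshift-kube-apiserver-operator/kube-apiserver-operator-lock",
	}
	d, err := acquireDelay(fixed)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("delay after fix:", d) // prints "delay after fix: 8.913ms"
}
```

Running the same helper on the two `leaderelection.go` lines from the original description (15:12:45.877442 and 15:14:02.635392) gives roughly 1m17s, which is the wait this bug was filed about.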
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409
Description of problem:
kube-apiserver-operator doesn't release lock when being gracefully terminated

On updating the deployment:
I1121 15:12:45.877442 1 leaderelection.go:241] attempting to acquire leader lease openshift-kube-apiserver-operator/kube-apiserver-operator-lock...
I1121 15:12:45.877726 1 secure_serving.go:123] Serving securely on 0.0.0.0:8443
I1121 15:14:02.635392 1 leaderelection.go:251] successfully acquired lease openshift-kube-apiserver-operator/kube-apiserver-operator-lock
I1121 15:14:02.636117 1 event.go:255] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator-lock", UID:"fa08f764-098c-4011-908e-cbec13df30aa", APIVersion:"v1", ResourceVersion:"549614", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' 8ee103ba-6331-48a3-968e-997965147c16 became leader

With the Recreate strategy, the old pod is gone before the new one starts, so the new pod is likely waiting for the leader election timeout.

Version-Release number of selected component (if applicable):
oc version
Client Version: v4.2.0-alpha.0-274-g876ed13
Server Version: 4.3.0-0.ci-2019-11-20-022156
Kubernetes Version: v1.16.2

How reproducible:
always

Steps to Reproduce:
1. Trigger a rollout of KAO

Actual results:
Waits on leader election

Expected results:
Acquires the lock immediately since the terminating pod will release it first.

Additional info:
This will speed up cluster creation, upgrades, and the dev flow.
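To illustrate why the new pod waits, here is a toy, in-memory model of a lease-based lock. It is only a sketch under stated assumptions and is not the client-go implementation: the `lease` type, the `simulate` helper, and the 60s lease duration, 5s retry period, and 10s shutdown offset are all hypothetical values chosen for illustration. The `release` step stands in for what the "ReleaseOnCancel" option mentioned in the Doc Text enables, namely giving up the lock on shutdown instead of letting it expire.

```go
package main

import (
	"fmt"
	"time"
)

// lease is a toy stand-in for the lock record (hypothetical type).
type lease struct {
	holder   string        // current holder; empty means the lock is free
	renewed  time.Time     // last time the holder renewed the lease
	duration time.Duration // how long a renewal stays valid
}

// tryAcquire takes the lease for candidate if it is free, expired,
// or already held by candidate, and reports whether that succeeded.
func (l *lease) tryAcquire(candidate string, now time.Time) bool {
	expired := now.Sub(l.renewed) >= l.duration
	if l.holder != "" && l.holder != candidate && !expired {
		return false
	}
	l.holder = candidate
	l.renewed = now
	return true
}

// release clears the holder so the next candidate does not have to
// wait for the lease to expire.
func (l *lease) release(holder string) {
	if l.holder == holder {
		l.holder = ""
	}
}

// simulate returns how long a new pod waits for the lock after the old
// pod shuts down, with or without an explicit release on shutdown.
func simulate(releaseOnShutdown bool) time.Duration {
	t0 := time.Date(2019, 11, 21, 15, 12, 0, 0, time.UTC)
	l := &lease{duration: 60 * time.Second} // assumed lease duration
	l.tryAcquire("old-pod", t0)

	// The old pod is terminated 10s after its last renewal.
	shutdown := t0.Add(10 * time.Second)
	if releaseOnShutdown {
		l.release("old-pod")
	}

	// The new pod retries every 5s (assumed retry period) until it wins.
	now := shutdown
	for !l.tryAcquire("new-pod", now) {
		now = now.Add(5 * time.Second)
	}
	return now.Sub(shutdown)
}

func main() {
	fmt.Println("wait without release:", simulate(false)) // prints "wait without release: 50s"
	fmt.Println("wait with release:   ", simulate(true))  // prints "wait with release:    0s"
}
```

In this model the wait without a release is simply the remainder of the lease duration, which matches the "waits on leader election" behaviour in the actual results; with a release, the new pod wins the lock on its first attempt.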