Bug 1809719 - During rolling update, OCM doesn't release its lease
Summary: During rolling update, OCM doesn't release its lease
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-controller-manager
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Gabe Montero
QA Contact: wewang
URL:
Whiteboard: devex
Depends On:
Blocks:
Reported: 2020-03-03 18:18 UTC by Clayton Coleman
Modified: 2020-07-13 17:18 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The openshift controller manager was not using the ReleaseOnCancel option in its kube leader election configuration. Consequence: Leader establishment during a rolling update could be delayed, because the old leader did not proactively release its lease before shutting down, so the new leader could take longer to obtain it. Fix: ReleaseOnCancel is now set. Result: Leader establishment during a rolling update of the openshift controller manager deployment should proceed more consistently.
Clone Of:
Environment:
Last Closed: 2020-07-13 17:17:45 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift openshift-controller-manager pull 107 | 0 | None | closed | Bug 1809719: use ReleaseOnCancel for leader election config | 2020-06-23 08:58:08 UTC
Red Hat Product Errata | RHBA-2020:2409 | 0 | None | None | None | 2020-07-13 17:18:07 UTC

Description Clayton Coleman 2020-03-03 18:18:06 UTC
If you update the OCM pods, they don't release their lease on shutdown.  The kube election library has been updated to make this possible (ReleaseOnCancel, see k8s.io/client-go/examples/leader-election/main.go), but it requires changes to your controllers so that they shut down gracefully before the lock is released. By releasing the lease you minimize the time during which no controller is running.

If it is possible to fix this easily (to ensure the client is shut down), we should implement it, because it reduces how long a failure lasts before we recover.  If it is complex or requires rewiring the controller, our current logic is fine.
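
To illustrate why releasing the lease shortens the gap, here is a minimal Python sketch of the timing only. It is a simplified model, not the client-go implementation: the `LeaseRecord` type and `earliest_takeover` function are hypothetical names, and the model ignores the candidates' retry period and the time the old leader needs to shut down gracefully. The 60-second lease duration matches the `leaseDurationSeconds` value shown in the verification comment further down; the other numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class LeaseRecord:
    """Hypothetical, simplified view of a leader election record."""
    holder: str            # identity of the current leader
    renew_time: float      # time of the last successful renewal, in seconds
    lease_duration: float  # how long a renewal stays valid, in seconds


def earliest_takeover(record: LeaseRecord, shutdown_time: float,
                      release_on_shutdown: bool) -> float:
    """Earliest time at which another candidate could acquire the lock.

    Assumption: a released lock is free immediately, while an unreleased
    lock is only free once the last renewal has expired.
    """
    if release_on_shutdown:
        return shutdown_time
    return max(shutdown_time, record.renew_time + record.lease_duration)


if __name__ == "__main__":
    record = LeaseRecord(holder="old-leader", renew_time=100.0, lease_duration=60.0)
    shutdown = 105.0  # old leader exits 5 seconds after its last renewal
    for release in (False, True):
        gap = earliest_takeover(record, shutdown, release) - shutdown
        print(f"release_on_shutdown={release}: no leader for at least {gap:.0f}s")
```

In this model the unreleased lease leaves a gap of up to one full lease duration, while the released lease leaves none beyond the candidates' own polling delay.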

Comment 4 wewang 2020-05-26 08:36:14 UTC
@gabe is it OK to verify the bug with the following steps? It took about 51 seconds for the new pod to be running.

Steps:
[wewang@wangwen work]$ oc get configmap openshift-master-controllers -oyaml -n openshift-controller-manager
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"controller-manager-srqzr","leaseDurationSeconds":60,"acquireTime":"2020-05-26T07:58:09Z","renewTime":"2020-05-26T08:03:09Z","leaderTransitions":3}'
  creationTimestamp: "2020-05-26T06:37:10Z"
  name: openshift-master-controllers
  namespace: openshift-controller-manager
  resourceVersion: "56364"
  selfLink: /api/v1/namespaces/openshift-controller-manager/configmaps/openshift-master-controllers
  uid: 34654c77-4b7d-4004-a61c-84bc584d0024

[wewang@wangwen work]$ date ; oc delete pod controller-manager-srqzr -n openshift-controller-manager ; date; oc get pods -n openshift-controller-manager
Tue May 26 16:08:50 CST 2020
pod "controller-manager-srqzr" deleted
Tue May 26 16:09:41 CST 2020
NAME                       READY   STATUS    RESTARTS   AGE
controller-manager-k5ckk   1/1     Running   0          91m
controller-manager-lxsj8   1/1     Running   0          91m
controller-manager-xgt6h   1/1     Running   0          7s
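
The numbers in the transcript above can be checked with a short Python sketch. It parses the leader annotation copied from the ConfigMap output and computes the interval between the two `date` outputs. The helper names `parse_leader` and `elapsed_seconds` are hypothetical, and the sketch assumes both timestamps fall on the same day.

```python
import json
from datetime import datetime, timedelta

# Annotation value copied verbatim from the ConfigMap output above.
ANNOTATION = (
    '{"holderIdentity":"controller-manager-srqzr","leaseDurationSeconds":60,'
    '"acquireTime":"2020-05-26T07:58:09Z","renewTime":"2020-05-26T08:03:09Z",'
    '"leaderTransitions":3}'
)


def parse_leader(annotation: str):
    """Return (holder identity, time at which the last renewal expires, UTC)."""
    record = json.loads(annotation)
    renew = datetime.strptime(record["renewTime"], "%Y-%m-%dT%H:%M:%SZ")
    expiry = renew + timedelta(seconds=record["leaseDurationSeconds"])
    return record["holderIdentity"], expiry


def elapsed_seconds(start: str, end: str) -> int:
    """Seconds between two HH:MM:SS wall-clock times on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


if __name__ == "__main__":
    holder, expiry = parse_leader(ANNOTATION)
    print("leader:", holder)
    print("last renewal expires (UTC):", expiry.isoformat())
    print("seconds between the two date outputs:",
          elapsed_seconds("16:08:50", "16:09:41"))
```

The pod deleted in the transcript is the one named in `holderIdentity`, so the test exercises a shutdown of the current leader.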

Comment 5 Gabe Montero 2020-05-26 12:51:00 UTC
Perfect @Wen ... looks good

marking verified

Comment 7 errata-xmlrpc 2020-07-13 17:17:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

