Bug 1809719

Summary: During rolling update, OCM doesn't release its lease
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: openshift-controller-manager
Assignee: Gabe Montero <gmontero>
Status: CLOSED ERRATA
QA Contact: wewang <wewang>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.3.z
CC: aos-bugs, gmontero, mfojtik
Target Milestone: ---
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: devex
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The OpenShift controller manager was not using the ReleaseOnCancel option in its kube leader election configuration. Consequence: Leader establishment during a rolling update could be delayed, because the new leader had to wait longer to obtain the lease; the old leader did not proactively release it before shutting down. Fix: ReleaseOnCancel is now set. Result: Leader establishment during a rolling update of the openshift-controller-manager deployment should proceed more consistently.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-07-13 17:17:45 UTC
Type: Bug
Regression: ---
Mount Type: ---

Description Clayton Coleman 2020-03-03 18:18:06 UTC
If you update the OCM pods, they don't release their lease on shutdown.  The kube election library has been updated to make this possible (ReleaseOnCancel, see k8s.io/client-go/examples/leader-election/main.go), but it requires changes to your controllers so that they shut down gracefully before the lock is released. By releasing the lease you minimize the time no controller is running.

If this is easy to fix (ensuring the controllers are shut down first), we should implement it, because it reduces the window during a failure or rollout before a new leader takes over.  If it is complex or requires rewiring the controllers, our current logic is fine.
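For reference, a minimal sketch of what enabling this looks like, adapted from the upstream k8s.io/client-go/examples/leader-election/main.go that the description points at. The lock name/namespace match the configmap shown in comment 4; the identity, clientset wiring, and callback bodies are placeholders, not the actual OCM code:

```go
// Sketch: client-go leader election with ReleaseOnCancel enabled.
// Adapted from k8s.io/client-go/examples/leader-election/main.go;
// id, client construction, and callbacks are illustrative placeholders.
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func runLeaderElection(ctx context.Context, client kubernetes.Interface, id string) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "openshift-master-controllers",
			Namespace: "openshift-controller-manager",
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock: lock,
		// ReleaseOnCancel makes the leader proactively release the lease
		// when ctx is cancelled, so the next leader does not have to wait
		// out the remaining LeaseDuration. Controllers must be fully
		// stopped before ctx is cancelled, or two leaders could overlap.
		ReleaseOnCancel: true,
		LeaseDuration:   60 * time.Second,
		RenewDeadline:   15 * time.Second,
		RetryPeriod:     5 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// start the controllers; they must return before ctx is cancelled
			},
			OnStoppedLeading: func() {
				// leadership lost or voluntarily released
			},
		},
	})
}
```

The caveat in the comment is the "changes to your controllers" part of the description: releasing the lease early is only safe if all controller loops have already stopped.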

Comment 4 wewang 2020-05-26 08:36:14 UTC
@gabe is it OK to verify the bug with the following steps? It took about 51 seconds for the new pod to be running.

Steps:
[wewang@wangwen work]$ oc get configmap openshift-master-controllers -oyaml -n openshift-controller-manager
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"controller-manager-srqzr","leaseDurationSeconds":60,"acquireTime":"2020-05-26T07:58:09Z","renewTime":"2020-05-26T08:03:09Z","leaderTransitions":3}'
  creationTimestamp: "2020-05-26T06:37:10Z"
  name: openshift-master-controllers
  namespace: openshift-controller-manager
  resourceVersion: "56364"
  selfLink: /api/v1/namespaces/openshift-controller-manager/configmaps/openshift-master-controllers
  uid: 34654c77-4b7d-4004-a61c-84bc584d0024

[wewang@wangwen work]$ date ; oc delete pod controller-manager-srqzr -n openshift-controller-manager ; date; oc get pods -n openshift-controller-manager
Tue May 26 16:08:50 CST 2020
pod "controller-manager-srqzr" deleted
Tue May 26 16:09:41 CST 2020
NAME                       READY   STATUS    RESTARTS   AGE
controller-manager-k5ckk   1/1     Running   0          91m
controller-manager-lxsj8   1/1     Running   0          91m
controller-manager-xgt6h   1/1     Running   0          7s
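To watch the lease actually change hands during the deletion above, the leader annotation on the same configmap can be polled directly (a sketch; assumes an oc client logged in to the cluster):

```shell
# Print just the leader annotation (holderIdentity, acquireTime, etc.)
# from the configmap shown above; dots in the key must be escaped in jsonpath.
oc get configmap openshift-master-controllers -n openshift-controller-manager \
  -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'
```

With the fix, holderIdentity should move to a surviving pod shortly after the old leader pod terminates, rather than only after the 60-second leaseDurationSeconds expires.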

Comment 5 Gabe Montero 2020-05-26 12:51:00 UTC
Perfect @Wen ... looks good

marking verified

Comment 7 errata-xmlrpc 2020-07-13 17:17:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409