1809719 – During rolling update, OCM doesn't release its lease

Summary: During rolling update, OCM doesn't release its lease

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	openshift-controller-manager
Sub Component:
Version:	4.3.z
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Gabe Montero
QA Contact:	wewang
Docs Contact:
URL:
Whiteboard:	devex
Depends On:
Blocks:

Reported:	2020-03-03 18:18 UTC by Clayton Coleman
Modified:	2020-07-13 17:18 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The openshift controller manager was not using the ReleaseOnCancel option in its kube leader election configuration. Consequence: Leader establishment during a rolling update could be delayed, because the old leader did not proactively release its lease before shutting down and the new leader had to wait longer to obtain it. Fix: ReleaseOnCancel is now set. Result: Leader establishment during a rolling update of the openshift controller manager deployment should proceed more consistently.
Clone Of:
Environment:
Last Closed:	2020-07-13 17:17:45 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift openshift-controller-manager pull 107	0	None	closed	Bug 1809719: use ReleaseOnCancel for leader election config	2020-06-23 08:58:08 UTC
Red Hat Product Errata	RHBA-2020:2409	0	None	None	None	2020-07-13 17:18:07 UTC

Description Clayton Coleman 2020-03-03 18:18:06 UTC

If you update the OCM pods, they don't release their lease on shutdown.  The kube election library has been updated to make this possible (ReleaseOnCancel, see k8s.io/client-go/examples/leader-election/main.go), but it requires changes to your controllers so that they shut down gracefully before the lock is released. By releasing the lease you minimize the time during which no controller is running.

If it is possible to fix this easily (by ensuring the client is shut down), we should implement it, because it reduces how long it takes to recover after a failure.  If it is complex or requires rewiring the controller, our current logic is fine.

Comment 4 wewang 2020-05-26 08:36:14 UTC

@gabe Is it OK to verify the bug with the following steps? It took about 51 seconds for the new pod to be running.

Steps:
[wewang@wangwen work]$ oc get configmap openshift-master-controllers -oyaml -n openshift-controller-manager
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"controller-manager-srqzr","leaseDurationSeconds":60,"acquireTime":"2020-05-26T07:58:09Z","renewTime":"2020-05-26T08:03:09Z","leaderTransitions":3}'
  creationTimestamp: "2020-05-26T06:37:10Z"
  name: openshift-master-controllers
  namespace: openshift-controller-manager
  resourceVersion: "56364"
  selfLink: /api/v1/namespaces/openshift-controller-manager/configmaps/openshift-master-controllers
  uid: 34654c77-4b7d-4004-a61c-84bc584d0024

[wewang@wangwen work]$ date ; oc delete pod controller-manager-srqzr -n openshift-controller-manager ; date; oc get pods -n openshift-controller-manager
Tue May 26 16:08:50 CST 2020
pod "controller-manager-srqzr" deleted
Tue May 26 16:09:41 CST 2020
NAME                       READY   STATUS    RESTARTS   AGE
controller-manager-k5ckk   1/1     Running   0          91m
controller-manager-lxsj8   1/1     Running   0          91m
controller-manager-xgt6h   1/1     Running   0          7s

Comment 5 Gabe Montero 2020-05-26 12:51:00 UTC

Perfect @Wen ... looks good

marking verified

Comment 7 errata-xmlrpc 2020-07-13 17:17:45 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409