Bug 1877791

Summary: KCM doesn't gracefully terminate when rolling out
Product: OpenShift Container Platform
Component: kube-controller-manager
Version: 4.6
Target Release: 4.6.0
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Tomáš Nožička <tnozicka>
Assignee: Tomáš Nožička <tnozicka>
QA Contact: RamaKasturi <knarra>
CC: aos-bugs, knarra, mfojtik
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2020-10-27 16:39:32 UTC
Bug Blocks: 1881351

Description Tomáš Nožička 2020-09-10 12:58:05 UTC
KCM needs to terminate gracefully so that the next replica can take over during a rollout. Graceful termination matters because it gives up the lease, letting another replica become the leader without waiting 60s for the lease to expire.

It is especially important to keep KCM available as much as possible because it runs the endpoints controller, which has to notice pods going down or rolling out and update the corresponding Services as soon as possible so traffic stops being sent to them.
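
For background, upstream client-go already supports handing the lease over on shutdown: when the leader elector is created with ReleaseOnCancel and its context is cancelled during termination, it releases the lease instead of letting it expire. The sketch below is a minimal, generic illustration of that mechanism, assuming a standard client-go setup; it is not the actual KCM code, and the lock name "my-controller" and the timing values are made-up examples.

// Minimal sketch: release the leader-election lease on SIGTERM using client-go.
// Illustrative only; not the OpenShift kube-controller-manager implementation.
package main

import (
    "context"
    "os"
    "os/signal"
    "syscall"
    "time"

    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // Cancel the context on SIGTERM/SIGINT so the lease can be released before exit.
    ctx, cancel := context.WithCancel(context.Background())
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
    go func() {
        <-sigs
        cancel()
    }()

    id, _ := os.Hostname()
    // "my-controller" is a made-up lock name for this example.
    lock, err := resourcelock.New(
        resourcelock.LeasesResourceLock,
        "kube-system", "my-controller",
        client.CoreV1(), client.CoordinationV1(),
        resourcelock.ResourceLockConfig{Identity: id},
    )
    if err != nil {
        panic(err)
    }

    leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
        Lock:          lock,
        LeaseDuration: 15 * time.Second, // example timings, not KCM's settings
        RenewDeadline: 10 * time.Second,
        RetryPeriod:   2 * time.Second,
        // ReleaseOnCancel makes the elector give up the lease when ctx is
        // cancelled, so a standby replica does not have to wait for the full
        // LeaseDuration to expire before taking over.
        ReleaseOnCancel: true,
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: func(ctx context.Context) {
                // Run the controllers until the context is cancelled.
                <-ctx.Done()
            },
            OnStoppedLeading: func() {
                // Lost or released the lease; exit so the pod can be restarted.
                os.Exit(0)
            },
        },
    })
}

With the lease released on shutdown, a standby replica only has to win the next acquire attempt rather than wait out the full lease duration, which is what allows the ~2 second handover observed in the verification below.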

Comment 2 RamaKasturi 2020-09-22 12:30:44 UTC
Moving the bug to verified state as I see that when KCM terminates, another replica becomes the leader in ~2 seconds.

[ramakasturinarra@dhcp35-60 verification-tests]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-22-011738   True        False         5h34m   Cluster version is 4.6.0-0.nightly-2020-09-22-011738

steps followed to verify the bug:
===================================
1) check the KCM leader
2) move kube-controller-manager-config.yaml from /etc/kubernetes/manifests to some other directory
3) check the logs of all KCM pods

On the node where the file was moved you will see a message like the one below, and within 10 seconds you should see another replica becoming the KCM leader. In my case another replica became the leader within 2 seconds.

I0922 09:54:26.950366       1 controllermanager.go:320] Requested to terminate. Exiting. (First KCM leader getting terminated)

I0922 09:54:28.625305       1 leaderelection.go:253] successfully acquired lease kube-system/kube-controller-manager (Another replica acquiring the lease and becoming leader)
I0922 09:54:28.625405       1 controllermanager.go:245] using legacy client builder
I0922 09:54:28.625407       1 event.go:291] "Event occurred" object="kube-system/kube-controller-manager" kind="ConfigMap" apiVersion="v1" type="Normal" reason="LeaderElection" message="ip-10-0-172-204_ff2b0d0f-d145-4cc3-b89a-1bc049f0371a became leader"

Comment 5 errata-xmlrpc 2020-10-27 16:39:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196