1881351 – KCM and KS don't gracefully terminate

Summary: KCM and KS don't gracefully terminate

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	kube-controller-manager
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.5.z
Assignee:	Tomáš Nožička
QA Contact:	zhou ying
Docs Contact:
URL:
Whiteboard:
Depends On:	1877791 1877793
Blocks:

Reported:	2020-09-22 08:51 UTC by Tomáš Nožička
Modified:	2020-11-10 14:54 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-11-10 14:53:52 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift origin pull 25536	0	None	closed	[release-4.5] Bug 1881351: KCM and KS graceful termination	2021-01-29 08:17:45 UTC
Red Hat Product Errata	RHBA-2020:4425	0	None	None	None	2020-11-10 14:54:10 UTC

Description Tomáš Nožička 2020-09-22 08:51:43 UTC

KCM (kube-controller-manager) and KS (kube-scheduler) need to terminate gracefully so that the next replica can take over during a rollout. Graceful termination matters because it gives up the lease, so another replica can become the leader without waiting for the lease to expire.
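
To illustrate why releasing the lease matters, here is a toy in-memory model of a leader-election lease. It is only a sketch: the `Lease` class, the `takeover_delay` helper, the 15s lease duration, and the 2s retry period are all hypothetical and are not taken from the actual KCM/KS code or configuration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    """Toy leader-election lease (hypothetical, not the real implementation)."""
    holder: Optional[str] = None
    renewed_at: float = 0.0
    duration: float = 15.0  # assumed lease duration in seconds

    def try_acquire(self, candidate: str, now: float) -> bool:
        # A candidate wins if the lease is free, already its own, or expired.
        expired = now >= self.renewed_at + self.duration
        if self.holder is None or self.holder == candidate or expired:
            self.holder = candidate
            self.renewed_at = now
            return True
        return False

    def release(self, candidate: str) -> None:
        # Graceful termination: the current holder gives the lease up explicitly.
        if self.holder == candidate:
            self.holder = None


def takeover_delay(graceful: bool, retry_period: float = 2.0) -> float:
    """Seconds between replica-a stopping and replica-b becoming leader."""
    lease = Lease()
    lease.try_acquire("replica-a", now=0.0)
    stop_time = 1.0  # replica-a is asked to terminate at t=1s
    if graceful:
        lease.release("replica-a")
    now = stop_time
    # replica-b retries at a fixed interval until it gets the lease.
    while not lease.try_acquire("replica-b", now):
        now += retry_period
    return now - stop_time


print("graceful:", takeover_delay(graceful=True))
print("not graceful:", takeover_delay(graceful=False))
```

In this model the graceful path lets the other replica take over on its very next attempt, while the non-graceful path leaves it retrying until the assumed lease duration has elapsed.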

Keeping KCM available as much as possible is especially important because it runs the endpoints controller, which has to notice pods going down or rolling out and update the service ASAP so that traffic stops being sent to them.

Comment 1 Maciej Szulik 2020-10-01 14:07:57 UTC

The PR is already in the queue.

Comment 2 Tomáš Nožička 2020-10-22 17:20:09 UTC

PR is awaiting QA pre-verification https://github.com/openshift/origin/pull/25536#issuecomment-714639556

Comment 3 zhou ying 2020-10-23 08:56:24 UTC

Checked with 4.5.0-0.ci.test-2020-10-23-075611-ci-ln-j1qrj4k: after the KS leader terminates, a new KS leader is elected within 10s.

I1023 08:51:01.440871       1 server.go:253] Requested to terminate. Exiting.
I1023 08:51:04.103558       1 leaderelection.go:252] successfully acquired lease openshift-kube-scheduler/kube-scheduler

I1023 08:55:35.733465       1 server.go:253] Requested to terminate. Exiting.
I1023 08:55:37.539562       1 leaderelection.go:252] successfully acquired lease openshift-kube-scheduler/kube-scheduler

Comment 4 zhou ying 2020-10-23 09:07:18 UTC

Checked with 4.5.0-0.ci.test-2020-10-23-075611-ci-ln-j1qrj4k: after the KCM leader terminates, a new KCM leader is elected within 10s.

Steps followed to verify the bug:
===================================
1) Check which replica is the KCM leader.
2) Move kube-controller-manager-config.yaml from /etc/kubernetes/manifests to some other directory.
3) Check the logs of all KCM pods.

On the node where the file was moved, you will see a message like the one below, and within 10 seconds you should see another replica become the KCM leader.

I1023 09:00:38.120509       1 controllermanager.go:301] Requested to terminate. Exiting.
I1023 09:00:42.292918       1 leaderelection.go:252] successfully acquired lease kube-system/kube-controller-manager


I1023 09:04:15.560848       1 controllermanager.go:301] Requested to terminate. Exiting.
I1023 09:04:16.609500       1 leaderelection.go:252] successfully acquired lease kube-system/kube-controller-manager
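
The gap between each "Requested to terminate" line and the following "successfully acquired lease" line can be checked with a small script. This is a minimal sketch: `parse_klog` and `takeover_gaps` are hypothetical helper names, the year is assumed to be 2020 because klog headers do not carry one, and the two lines of each pair come from different pods, so clock skew between nodes may affect the result.

```python
import re
from datetime import datetime

# klog header: severity letter, MMDD, HH:MM:SS.microseconds, PID, file:line]
KLOG_RE = re.compile(
    r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ \S+\] (.*)$"
)


def parse_klog(line, year=2020):
    """Return (timestamp, message) for a klog line, or None if it does not match."""
    m = KLOG_RE.match(line.strip())
    if m is None:
        return None
    mmdd, clock, message = m.groups()
    ts = datetime.strptime(f"{year}{mmdd} {clock}", "%Y%m%d %H:%M:%S.%f")
    return ts, message


def takeover_gaps(lines):
    """Seconds from each termination request to the next lease acquisition."""
    gaps = []
    terminated_at = None
    for line in lines:
        parsed = parse_klog(line)
        if parsed is None:
            continue  # skip blank or non-klog lines
        ts, message = parsed
        if message.startswith("Requested to terminate"):
            terminated_at = ts
        elif message.startswith("successfully acquired lease") and terminated_at is not None:
            gaps.append((ts - terminated_at).total_seconds())
            terminated_at = None
    return gaps


kcm_log = """
I1023 09:00:38.120509       1 controllermanager.go:301] Requested to terminate. Exiting.
I1023 09:00:42.292918       1 leaderelection.go:252] successfully acquired lease kube-system/kube-controller-manager
I1023 09:04:15.560848       1 controllermanager.go:301] Requested to terminate. Exiting.
I1023 09:04:16.609500       1 leaderelection.go:252] successfully acquired lease kube-system/kube-controller-manager
""".splitlines()

for gap in takeover_gaps(kcm_log):
    print(f"takeover after {gap:.3f}s")
```

For the KCM excerpts above this gives gaps of roughly 4.2s and 1.0s, both under the 10s threshold.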

Comment 8 errata-xmlrpc 2020-11-10 14:53:52 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.18 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4425
