Bug 1817075 - MCC & MCO don't free leader leases during shut down -> 10 minutes of leader election timeouts
Summary: MCC & MCO don't free leader leases during shut down -> 10 minutes of leader e...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: John Kyros
QA Contact: Rio Liu
URL:
Whiteboard:
Depends On:
Blocks: 1943566
 
Reported: 2020-03-25 14:17 UTC by Scott Dodson
Modified: 2022-08-10 10:36 UTC
CC: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 10:35:34 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 2603 0 None Merged Bug 1817075: operator/controller pods: faster leader elections 2022-05-09 17:30:22 UTC
Github openshift machine-config-operator pull 2604 0 None Merged Bug 1817075: Revert "operator/controller pods: faster leader elections" 2022-05-09 17:30:24 UTC
Github openshift machine-config-operator pull 3185 0 None Merged Bug 1817075: MCC & MCO don't free leader leases during shut down -> 10 minutes of leader election timeouts 2022-06-23 16:10:52 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:36:06 UTC

Description Scott Dodson 2020-03-25 14:17:16 UTC
Description of problem:
I was paying close attention to a 4.3.8 to 4.3.9 upgrade for the first time and noticed that the MCO was quite slow in making progress. When I observed the MCC pods I saw at least 5 leader election cycles where it took 2 minutes each to acquire the leader lease. This is aggravated by the fact that the controller will potentially be restarted several times while hosts are rebooted.

There are other components with similar bugs and proposed solutions. See https://bugzilla.redhat.com/show_bug.cgi?id=1775224

I only looked at MCC, perhaps other components of MCO exhibit the same behavior.
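For background, the usual way to avoid this delay is the graceful-release pattern in client-go's leader election: cancel the election context on SIGTERM and set ReleaseOnCancel so the outgoing pod frees the lock instead of letting the lease expire underneath its replacement. Below is a minimal Go sketch of that pattern; it is not the MCO's actual code, the timing values are illustrative only, and the POD_NAME identity source is a hypothetical choice (the lock namespace and name match the ones in the logs).

package main

import (
    "context"
    "os"
    "os/signal"
    "syscall"
    "time"

    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // Cancel the election context on SIGTERM/SIGINT so the leader loop can
    // release the lock before the pod exits.
    ctx, cancel := context.WithCancel(context.Background())
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
    go func() {
        <-sigs
        cancel()
    }()

    // Lock namespace and name match the ones in the logs; POD_NAME as the
    // identity source is hypothetical, purely for this sketch.
    lock, err := resourcelock.New(
        resourcelock.LeasesResourceLock,
        "openshift-machine-config-operator",
        "machine-config-controller",
        client.CoreV1(),
        client.CoordinationV1(),
        resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
    )
    if err != nil {
        panic(err)
    }

    leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
        Lock: lock,
        // ReleaseOnCancel is the key piece: when ctx is cancelled the lock is
        // released immediately, so the replacement pod does not have to wait
        // for the full LeaseDuration to elapse.
        ReleaseOnCancel: true,
        LeaseDuration:   137 * time.Second, // illustrative timings only
        RenewDeadline:   107 * time.Second,
        RetryPeriod:     26 * time.Second,
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: func(ctx context.Context) {
                // Start the controllers here and run until ctx is cancelled.
                <-ctx.Done()
            },
            OnStoppedLeading: func() {
                // Exit once we are no longer the leader.
                os.Exit(0)
            },
        },
    })
}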

Comment 3 Antonio Murdaca 2020-06-16 12:58:00 UTC
Adding UpcomingSprint as this won't make the current sprint. We'll try to work on this bug in the next sprint.

Comment 4 Antonio Murdaca 2020-07-10 09:40:40 UTC
Adding UpcomingSprint as I won't be able to finish this by the current sprint. I'll revisit from next week.

Comment 5 Antonio Murdaca 2020-08-02 20:30:21 UTC
Adding UpcomingSprint as I've been busy with 4.6 features delivery. We'll attempt this next sprint.

Comment 6 Michelle Krejci 2020-10-05 20:32:10 UTC
This issue has been moved over several releases. I am closing it for now. Please reopen it if this remains an issue that needs our attention.

Comment 7 Scott Dodson 2020-10-21 16:09:56 UTC
I've just confirmed that this is present in 4.5 to 4.6 upgrades, and I suspect nothing has changed in 4.7. This accounts for at least eight minutes of lease timeouts during which little or no forward progress is made in the upgrade.

Here's 2 minutes spent acquiring the lease prior to pivoting to the 4.6 MCC; I'm uncertain what triggered the pod to restart at this phase.

$ oc logs -n openshift-machine-config-operator pod/machine-config-controller-586b58dcdc-z79mn -f
I1021 15:45:45.142412       1 start.go:50] Version: v4.5.0-202010160047.p0-dirty (cdce2822a6b3bff31b5aafc23b773f7dcbea2caa)
I1021 15:45:45.146977       1 leaderelection.go:242] attempting to acquire leader lease  openshift-machine-config-operator/machine-config-controller...
E1021 15:47:43.143360       1 event.go:316] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"machine-config-controller", GenerateName:"", Namespace:"openshift-machine-config-operator", SelfLink:"/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config-controller", UID:"6004c3b4-3386-441c-a282-c8c68c8b1ee7", ResourceVersion:"64565", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63738886455, loc:(*time.Location)(0x274ef00)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"machine-config-controller-586b58dcdc-z79mn_5720d5d5-3b59-43f5-98b5-992c218c86a6\",\"leaseDurationSeconds\":90,\"acquireTime\":\"2020-10-21T15:47:43Z\",\"renewTime\":\"2020-10-21T15:47:43Z\",\"leaderTransitions\":2}"}, OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:"machine-config-controller", Operation:"Update", APIVersion:"v1", Time:(*v1.Time)(0xc0001544a0), FieldsType:"FieldsV1", FieldsV1:(*v1.FieldsV1)(0xc0001544c0)}}}, Immutable:(*bool)(nil), Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' due to: 'no kind is registered for the type v1.ConfigMap in scheme "github.com/openshift/machine-config-operator/cmd/common/helpers.go:30"'. Will not report event: 'Normal' 'LeaderElection' 'machine-config-controller-586b58dcdc-z79mn_5720d5d5-3b59-43f5-98b5-992c218c86a6 became leader'
I1021 15:47:43.143515       1 leaderelection.go:252] successfully acquired lease openshift-machine-config-operator/machine-config-controller

And another 2 minutes acquiring the lease after pivoting to 4.6 MCC
$ oc logs -n openshift-machine-config-operator pod/machine-config-controller-6dcfbcb884-ppjx5
I1021 15:47:43.826406       1 start.go:50] Version: machine-config-daemon-4.6.0-202006240615.p0-330-gfb9bf08c-dirty (fb9bf08c0021b614f6bb56ea3839295382f949b3)
I1021 15:47:43.843313       1 leaderelection.go:243] attempting to acquire leader lease  openshift-machine-config-operator/machine-config-controller...
I1021 15:49:39.525173       1 leaderelection.go:253] successfully acquired lease openshift-machine-config-operator/machine-config-controller

Another 2 minutes after a pod restart. At this point nodes have started to cycle, so there may be a bit of slack time here where the MCC couldn't actually make progress regardless of whether or not it held the lease, but it's still another lease timeout.

$ oc logs -n openshift-machine-config-operator pod/machine-config-controller-6dcfbcb884-p7cmx -f
I1021 15:54:34.350374       1 start.go:50] Version: machine-config-daemon-4.6.0-202006240615.p0-330-gfb9bf08c-dirty (fb9bf08c0021b614f6bb56ea3839295382f949b3)
I1021 15:54:34.358107       1 leaderelection.go:243] attempting to acquire leader lease  openshift-machine-config-operator/machine-config-controller...
E1021 15:56:30.043987       1 leaderelection.go:321] error retrieving resource lock openshift-machine-config-operator/machine-config-controller: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config-controller": dial tcp 172.30.0.1:443: connect: connection refused
I1021 15:57:23.991212       1 leaderelection.go:253] successfully acquired lease openshift-machine-config-operator/machine-config-controller

Another restart and 2 minutes

$ oc logs -n openshift-machine-config-operator pod/machine-config-controller-6dcfbcb884-sqnr4 -f
I1021 15:59:08.179732       1 start.go:50] Version: machine-config-daemon-4.6.0-202006240615.p0-330-gfb9bf08c-dirty (fb9bf08c0021b614f6bb56ea3839295382f949b3)
I1021 15:59:08.188700       1 leaderelection.go:243] attempting to acquire leader lease  openshift-machine-config-operator/machine-config-controller...
I1021 16:01:03.867513       1 leaderelection.go:253] successfully acquired lease openshift-machine-config-operator/machine-config-controller

After this point all masters have been updated and the restarting of the MCC ends.
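As a rough check on why each of these gaps is about two minutes: the ConfigMap annotation quoted above shows leaseDurationSeconds: 90, and because the old pod never releases the lock, the replacement has to wait for that lease to expire and then notice on one of its periodic retries. A small illustrative Go calculation follows; the retry period is a guess, it is not visible in the logs.

package main

import (
    "fmt"
    "time"
)

func main() {
    // leaseDurationSeconds: 90 comes from the ConfigMap annotation quoted above.
    leaseDuration := 90 * time.Second
    // The retry period is hypothetical; the logs do not show it.
    retryPeriod := 30 * time.Second

    // Without a graceful release, the replacement pod can only take the lock
    // after the previous lease has expired, and it only notices on a retry
    // tick, so the worst case is roughly:
    fmt.Println(leaseDuration + retryPeriod) // 2m0s, in line with the ~2 minute gaps above
}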

Comment 8 Colin Walters 2020-10-21 16:59:15 UTC
Yeah, I'm doing a 4.5 -> 4.6 upgrade now on an empty cluster and it's like watching paint dry. There are multiple factors here even before getting to the MCO, but this is definitely noticeable.

Comment 9 Scott Dodson 2021-02-17 16:30:41 UTC
Confirmed in 4.6.16 to 4.7.0-rc.2 upgrade on both MCO and MCC.

Trevor, would this perhaps be something that you could look at fixing as part of the effort to analyze and improve overall upgrade duration? I'd like to consider this a blocker for 4.8 GA, and we should backport this to 4.6 once fixes have been verified, so I'm marking blocker+ based on that.

$ oc logs -f -n openshift-machine-config-operator pod/machine-config-operator-68fc64f774-zxfk5
I0217 16:00:17.340344       1 start.go:43] Version: 4.7.0-rc.2 (Raw: v4.7.0-202102110027.p0-dirty, Hash: 0023e696058bbdf6e14504117bfc31f208125c47)
I0217 16:00:17.343612       1 leaderelection.go:243] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
I0217 16:02:13.000168       1 leaderelection.go:253] successfully acquired lease openshift-machine-config-operator/machine-config

$ oc logs -f -n openshift-machine-config-operator pod/machine-config-controller-6d959d5c89-kg6vb
I0217 16:04:19.323410       1 start.go:50] Version: v4.7.0-202102110027.p0-dirty (0023e696058bbdf6e14504117bfc31f208125c47)
I0217 16:04:19.326795       1 leaderelection.go:243] attempting to acquire leader lease openshift-machine-config-operator/machine-config-controller...
I0217 16:06:14.978778       1 leaderelection.go:253] successfully acquired lease openshift-machine-config-operator/machine-config-controller

$ oc logs -f -n openshift-machine-config-operator pod/machine-config-controller-6d959d5c89-69k2w
I0217 16:19:19.132940       1 start.go:50] Version: v4.7.0-202102110027.p0-dirty (0023e696058bbdf6e14504117bfc31f208125c47)
I0217 16:19:19.135465       1 leaderelection.go:243] attempting to acquire leader lease openshift-machine-config-operator/machine-config-controller...
I0217 16:21:14.787067       1 leaderelection.go:253] successfully acquired lease openshift-machine-config-operator/machine-config-controller

Comment 10 W. Trevor King 2021-02-19 06:35:48 UTC
I'm happy to help, since I've been fighting with graceful leader releases in the CVO since 4.6. Hopefully we're squashing out the last few bugs there now. But given the fiddliness with the CVO, I may end up needing more hand-holding from folks who understand the MCO code better. Assigning to myself for now, and we'll see how far I get ;).

Comment 11 Yu Qi Zhang 2021-03-08 21:55:39 UTC
Had a chat with Trevor today, will take over this BZ for now. Thanks!

Comment 16 Sinny Kumari 2021-06-08 14:56:25 UTC
Moving to assigned because there are some concerns with the merged fix: https://github.com/openshift/machine-config-operator/pull/2603 .
From David's comment on the PR:
In order to handle short-term network (or other) disruption, all clients (including those using leases) should be able to make it at least 60 seconds without the kube-apiserver available. To have rapid hand-off in normal cases, clients using leases need to make use of graceful release.

Since https://github.com/openshift/machine-config-operator/pull/2603 has already been merged, we are putting up a revert PR to undo the applied changes: https://github.com/openshift/machine-config-operator/pull/2604.
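Put differently, the two requirements pull in opposite directions: surviving roughly 60 seconds without the kube-apiserver means the renew deadline has to stay above 60s and the lease duration above that, so rapid hand-off has to come from releasing the lease on shutdown rather than from shortening the timings, hence the concern with the merged PR and the revert. A small Go sketch of timings that satisfy the ordering; the values are illustrative and not taken from the MCO source.

package main

import (
    "fmt"
    "time"
)

// Illustrative leader election timings; the point is the ordering of the
// constraints, not the exact numbers.
const (
    // The leader must keep retrying renewal for longer than a ~60s
    // kube-apiserver outage, so RenewDeadline stays above 60 seconds.
    RenewDeadline = 107 * time.Second
    // LeaseDuration must exceed RenewDeadline; it is also how long a crashed,
    // non-releasing leader blocks its replacement, which is why fast hand-off
    // has to come from graceful release rather than from shrinking it.
    LeaseDuration = 137 * time.Second
    // RetryPeriod is the interval between acquire/renew attempts.
    RetryPeriod = 26 * time.Second
)

func main() {
    fmt.Println(LeaseDuration > RenewDeadline && RenewDeadline > 60*time.Second) // true
}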

Comment 17 Yu Qi Zhang 2021-06-08 23:59:37 UTC
Moving back to NEW due to the PR being reverted.

Comment 22 Sergio 2022-06-24 10:08:06 UTC
Verified using IPI on AWS. Version:
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-24-041539   True        False         14m     Cluster version is 4.11.0-0.nightly-2022-06-24-041539


Verification steps:

1. When an operator pod or a controller pod is deleted, we can see in the logs that the pod releases the leader lease.

In controller:
I0624 08:17:01.964417       1 helpers.go:86] Shutting down due to: terminated
I0624 08:17:01.964534       1 helpers.go:89] Context cancelled
I0624 08:17:01.965078       1 kubelet_config_controller.go:204] Shutting down MachineConfigController-KubeletConfigController
I0624 08:17:01.965104       1 template_controller.go:246] Shutting down MachineConfigController-TemplateController
I0624 08:17:01.965114       1 container_runtime_config_controller.go:195] Shutting down MachineConfigController-ContainerRuntimeConfigController
I0624 08:17:01.965125       1 render_controller.go:132] Shutting down MachineConfigController-RenderController
I0624 08:17:01.965134       1 node_controller.go:176] Shutting down MachineConfigController-NodeController
I0624 08:17:01.965144       1 drain_controller.go:142] Shutting down MachineConfigController-DrainController
I0624 08:17:01.966579       1 metrics.go:70] Metrics listener successfully stopped
I0624 08:17:01.980918       1 start.go:107] Stopped leading. Terminating.
rpc error: code = NotFound desc = could not find container "99a9bcb2df8fbffb05fbc622b6f675127b97e659d6b1f83def35cc17893e9fcc": container with ID starting with 99a9bcb2df8fbffb05fbc622b6f675127b97e659d6b1f83def35cc17893e9fcc not found: ID does not exist


In operator:
E0624 07:48:58.806320       1 status.go:323] Error checking version skew: kube-apiserver does not yet have a version, kubelet skew status: KubeletSkewUnchecked, status reason: KubeletSkewUnchecked, status message: An error occurred when checking kubelet version skew: kube-apiserver does not yet have a version
I0624 08:13:57.772153       1 helpers.go:86] Shutting down due to: terminated
I0624 08:13:57.772289       1 helpers.go:89] Context cancelled
I0624 08:13:57.772579       1 operator.go:286] Shutting down MachineConfigOperator
I0624 08:13:57.777221       1 start.go:115] Stopped leading. Terminating.


2. When the new operator and controller pods are created, they acquire the leader lease immediately.

In controller:
I0624 08:17:04.594148       1 start.go:57] Version: v4.11.0-202206232036.p0.g4aa21da.assembly.stream-dirty (4aa21da3efc9b49ff4d8fec0dfe5a8598a19332a)
I0624 08:17:05.612863       1 leaderelection.go:248] attempting to acquire leader lease openshift-machine-config-operator/machine-config-controller...
I0624 08:17:05.620895       1 leaderelection.go:258] successfully acquired lease openshift-machine-config-operator/machine-config-controller
I0624 08:17:05.621557       1 metrics.go:47] Registering Prometheus metrics


In operator:
I0624 08:14:00.164480       1 start.go:47] Version: 4.11.0-0.nightly-2022-06-24-041539 (Raw: v4.11.0-202206232036.p0.g4aa21da.assembly.stream-dirty, Hash: 4aa21da3efc9b49ff4d8fec0dfe5a8598a19332a)
I0624 08:14:01.229006       1 leaderelection.go:248] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
I0624 08:14:01.236637       1 leaderelection.go:258] successfully acquired lease openshift-machine-config-operator/machine-config
I0624 08:14:01.663584       1 operator.go:276] Starting MachineConfigOperator


3. When we upgrade from 4.10.20 to 4.11.0-0.nightly-2022-06-24-041539, the leader lease is acquired immediately on every pod restart.
$ oc logs machine-config-controller-7475bfd755-sz7nk |grep lease
Defaulted container "machine-config-controller" out of: machine-config-controller, oauth-proxy
I0624 09:39:25.290558       1 leaderelection.go:248] attempting to acquire leader lease openshift-machine-config-operator/machine-config-controller...
I0624 09:39:25.303325       1 leaderelection.go:258] successfully acquired lease openshift-machine-config-operator/machine-config-controller

$ oc logs machine-config-operator-694fd88f4f-llqvs |grep lease
I0624 09:28:22.223462       1 leaderelection.go:248] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
I0624 09:28:22.230751       1 leaderelection.go:258] successfully acquired lease openshift-machine-config-operator/machine-config


We move the status to VERIFIED.

Comment 24 errata-xmlrpc 2022-08-10 10:35:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

