Description of problem: While watching a 4.3.8 to 4.3.9 upgrade closely for the first time, I noticed that the MCO was quite slow to make progress. Observing the MCC pods, I saw at least 5 leader election cycles where it took 2 minutes each to acquire the leader lease. This is aggravated by the fact that the controller will potentially be restarted several times while hosts are rebooted. Other components have similar bugs and proposed solutions; see https://bugzilla.redhat.com/show_bug.cgi?id=1775224. I only looked at the MCC; perhaps other components of the MCO exhibit the same behavior.
Adding UpcomingSprint as this won't make the current sprint. We'll try to work on this bug in the next sprint.
Adding UpcomingSprint as I won't be able to finish this by the current sprint. I'll revisit next week.
Adding UpcomingSprint as I've been busy with 4.6 feature delivery. We'll attempt this next sprint.
This issue has been moved over several releases. I am closing it for now. Please reopen if this remains an issue that needs our attention.
I've just confirmed that this is present in 4.5 to 4.6 upgrades, and I suspect nothing has changed in 4.7. This accounts for at least eight minutes of lease timeouts during which the upgrade makes little or no forward progress.

Here's 2 minutes acquiring the lease prior to pivoting to the 4.6 MCC; I'm uncertain what triggered the pod to restart at this phase.

$ oc logs -n openshift-machine-config-operator pod/machine-config-controller-586b58dcdc-z79mn -f
I1021 15:45:45.142412 1 start.go:50] Version: v4.5.0-202010160047.p0-dirty (cdce2822a6b3bff31b5aafc23b773f7dcbea2caa)
I1021 15:45:45.146977 1 leaderelection.go:242] attempting to acquire leader lease openshift-machine-config-operator/machine-config-controller...
E1021 15:47:43.143360 1 event.go:316] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"machine-config-controller", GenerateName:"", Namespace:"openshift-machine-config-operator", SelfLink:"/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config-controller", UID:"6004c3b4-3386-441c-a282-c8c68c8b1ee7", ResourceVersion:"64565", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63738886455, loc:(*time.Location)(0x274ef00)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"machine-config-controller-586b58dcdc-z79mn_5720d5d5-3b59-43f5-98b5-992c218c86a6\",\"leaseDurationSeconds\":90,\"acquireTime\":\"2020-10-21T15:47:43Z\",\"renewTime\":\"2020-10-21T15:47:43Z\",\"leaderTransitions\":2}"}, OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:"machine-config-controller", Operation:"Update", APIVersion:"v1", Time:(*v1.Time)(0xc0001544a0), FieldsType:"FieldsV1", FieldsV1:(*v1.FieldsV1)(0xc0001544c0)}}}, Immutable:(*bool)(nil), Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' due to: 'no kind is registered for the type v1.ConfigMap in scheme "github.com/openshift/machine-config-operator/cmd/common/helpers.go:30"'. Will not report event: 'Normal' 'LeaderElection' 'machine-config-controller-586b58dcdc-z79mn_5720d5d5-3b59-43f5-98b5-992c218c86a6 became leader'
I1021 15:47:43.143515 1 leaderelection.go:252] successfully acquired lease openshift-machine-config-operator/machine-config-controller

And another 2 minutes acquiring the lease after pivoting to the 4.6 MCC:

$ oc logs -n openshift-machine-config-operator pod/machine-config-controller-6dcfbcb884-ppjx5
I1021 15:47:43.826406 1 start.go:50] Version: machine-config-daemon-4.6.0-202006240615.p0-330-gfb9bf08c-dirty (fb9bf08c0021b614f6bb56ea3839295382f949b3)
I1021 15:47:43.843313 1 leaderelection.go:243] attempting to acquire leader lease openshift-machine-config-operator/machine-config-controller...
I1021 15:49:39.525173 1 leaderelection.go:253] successfully acquired lease openshift-machine-config-operator/machine-config-controller

Another 2 minutes after a pod restart. At this point nodes have started to cycle, so there may be some slack time here where the MCC couldn't actually make progress regardless of whether it held the lease, but it's still another lease timeout.

$ oc logs -n openshift-machine-config-operator pod/machine-config-controller-6dcfbcb884-p7cmx -f
I1021 15:54:34.350374 1 start.go:50] Version: machine-config-daemon-4.6.0-202006240615.p0-330-gfb9bf08c-dirty (fb9bf08c0021b614f6bb56ea3839295382f949b3)
I1021 15:54:34.358107 1 leaderelection.go:243] attempting to acquire leader lease openshift-machine-config-operator/machine-config-controller...
E1021 15:56:30.043987 1 leaderelection.go:321] error retrieving resource lock openshift-machine-config-operator/machine-config-controller: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config-controller": dial tcp 172.30.0.1:443: connect: connection refused
I1021 15:57:23.991212 1 leaderelection.go:253] successfully acquired lease openshift-machine-config-operator/machine-config-controller

Another restart and another 2 minutes:

$ oc logs -n openshift-machine-config-operator pod/machine-config-controller-6dcfbcb884-sqnr4 -f
I1021 15:59:08.179732 1 start.go:50] Version: machine-config-daemon-4.6.0-202006240615.p0-330-gfb9bf08c-dirty (fb9bf08c0021b614f6bb56ea3839295382f949b3)
I1021 15:59:08.188700 1 leaderelection.go:243] attempting to acquire leader lease openshift-machine-config-operator/machine-config-controller...
I1021 16:01:03.867513 1 leaderelection.go:253] successfully acquired lease openshift-machine-config-operator/machine-config-controller

After this point all masters have been updated and the restarting of the MCC ends.
Yeah, I'm doing a 4.5 -> 4.6 upgrade now on an empty cluster and it's like watching paint dry. There are multiple factors here even before getting to the MCO but this is definitely noticeable.
Confirmed in a 4.6.16 to 4.7.0-rc.2 upgrade, for both the MCO and the MCC. Trevor, would this perhaps be something you could look at fixing as part of the effort to analyze and improve overall upgrade duration? I'd like to consider this a blocker for 4.8 GA, and we should backport the fix to 4.6 once it has been verified, so marking blocker+ on that basis.

$ oc logs -f -n openshift-machine-config-operator pod/machine-config-operator-68fc64f774-zxfk5
I0217 16:00:17.340344 1 start.go:43] Version: 4.7.0-rc.2 (Raw: v4.7.0-202102110027.p0-dirty, Hash: 0023e696058bbdf6e14504117bfc31f208125c47)
I0217 16:00:17.343612 1 leaderelection.go:243] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
I0217 16:02:13.000168 1 leaderelection.go:253] successfully acquired lease openshift-machine-config-operator/machine-config

$ oc logs -f -n openshift-machine-config-operator pod/machine-config-controller-6d959d5c89-kg6vb
I0217 16:04:19.323410 1 start.go:50] Version: v4.7.0-202102110027.p0-dirty (0023e696058bbdf6e14504117bfc31f208125c47)
I0217 16:04:19.326795 1 leaderelection.go:243] attempting to acquire leader lease openshift-machine-config-operator/machine-config-controller...
I0217 16:06:14.978778 1 leaderelection.go:253] successfully acquired lease openshift-machine-config-operator/machine-config-controller

$ oc logs -f -n openshift-machine-config-operator pod/machine-config-controller-6d959d5c89-69k2w
I0217 16:19:19.132940 1 start.go:50] Version: v4.7.0-202102110027.p0-dirty (0023e696058bbdf6e14504117bfc31f208125c47)
I0217 16:19:19.135465 1 leaderelection.go:243] attempting to acquire leader lease openshift-machine-config-operator/machine-config-controller...
I0217 16:21:14.787067 1 leaderelection.go:253] successfully acquired lease openshift-machine-config-operator/machine-config-controller
I'm happy to help, since I've been fighting with graceful leader release in the CVO since 4.6. Hopefully we're squashing the last few bugs there now. But given the fiddliness in the CVO, I may end up needing more hand-holding from folks who understand the MCO code better. Assigning to myself for now, and we'll see how far I get ;).
Had a chat with Trevor today, will take over this BZ for now. Thanks!
Moving back to ASSIGNED because there are some concerns with the merged fix, https://github.com/openshift/machine-config-operator/pull/2603. From David's comment on the PR: in order to handle short-term network (or other) disruption, all clients (including those using leases) should be able to go at least 60 seconds without the kube-apiserver being available. To get rapid hand-off in the normal case, clients using leases need to make use of graceful release. Since https://github.com/openshift/machine-config-operator/pull/2603 has already merged, I'm putting up a revert PR to undo the applied changes: https://github.com/openshift/machine-config-operator/pull/2604.
Moving back to NEW since the PR was reverted.
Verified using IPI on AWS.

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-24-041539   True        False         14m     Cluster version is 4.11.0-0.nightly-2022-06-24-041539

Verification steps:

1. When an operator pod or a controller pod is deleted, we can see in the logs that the pod releases the leader lease.

In the controller:

I0624 08:17:01.964417 1 helpers.go:86] Shutting down due to: terminated
I0624 08:17:01.964534 1 helpers.go:89] Context cancelled
I0624 08:17:01.965078 1 kubelet_config_controller.go:204] Shutting down MachineConfigController-KubeletConfigController
I0624 08:17:01.965104 1 template_controller.go:246] Shutting down MachineConfigController-TemplateController
I0624 08:17:01.965114 1 container_runtime_config_controller.go:195] Shutting down MachineConfigController-ContainerRuntimeConfigController
I0624 08:17:01.965125 1 render_controller.go:132] Shutting down MachineConfigController-RenderController
I0624 08:17:01.965134 1 node_controller.go:176] Shutting down MachineConfigController-NodeController
I0624 08:17:01.965144 1 drain_controller.go:142] Shutting down MachineConfigController-DrainController
I0624 08:17:01.966579 1 metrics.go:70] Metrics listener successfully stopped
I0624 08:17:01.980918 1 start.go:107] Stopped leading. Terminating.
rpc error: code = NotFound desc = could not find container "99a9bcb2df8fbffb05fbc622b6f675127b97e659d6b1f83def35cc17893e9fcc": container with ID starting with 99a9bcb2df8fbffb05fbc622b6f675127b97e659d6b1f83def35cc17893e9fcc not found: ID does not exist

In the operator:

E0624 07:48:58.806320 1 status.go:323] Error checking version skew: kube-apiserver does not yet have a version, kubelet skew status: KubeletSkewUnchecked, status reason: KubeletSkewUnchecked, status message: An error occurred when checking kubelet version skew: kube-apiserver does not yet have a version
I0624 08:13:57.772153 1 helpers.go:86] Shutting down due to: terminated
I0624 08:13:57.772289 1 helpers.go:89] Context cancelled
I0624 08:13:57.772579 1 operator.go:286] Shutting down MachineConfigOperator
I0624 08:13:57.777221 1 start.go:115] Stopped leading. Terminating.

2. When the new operator and controller pods are created, they acquire the leader lease immediately.

In the controller:

I0624 08:17:04.594148 1 start.go:57] Version: v4.11.0-202206232036.p0.g4aa21da.assembly.stream-dirty (4aa21da3efc9b49ff4d8fec0dfe5a8598a19332a)
I0624 08:17:05.612863 1 leaderelection.go:248] attempting to acquire leader lease openshift-machine-config-operator/machine-config-controller...
I0624 08:17:05.620895 1 leaderelection.go:258] successfully acquired lease openshift-machine-config-operator/machine-config-controller
I0624 08:17:05.621557 1 metrics.go:47] Registering Prometheus metrics

In the operator:

I0624 08:14:00.164480 1 start.go:47] Version: 4.11.0-0.nightly-2022-06-24-041539 (Raw: v4.11.0-202206232036.p0.g4aa21da.assembly.stream-dirty, Hash: 4aa21da3efc9b49ff4d8fec0dfe5a8598a19332a)
I0624 08:14:01.229006 1 leaderelection.go:248] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
I0624 08:14:01.236637 1 leaderelection.go:258] successfully acquired lease openshift-machine-config-operator/machine-config
I0624 08:14:01.663584 1 operator.go:276] Starting MachineConfigOperator

3. When we upgrade from 4.10.20 to 4.11.0-0.nightly-2022-06-24-041539, on every pod restart the leader lease is acquired immediately.

$ oc logs machine-config-controller-7475bfd755-sz7nk | grep lease
Defaulted container "machine-config-controller" out of: machine-config-controller, oauth-proxy
I0624 09:39:25.290558 1 leaderelection.go:248] attempting to acquire leader lease openshift-machine-config-operator/machine-config-controller...
I0624 09:39:25.303325 1 leaderelection.go:258] successfully acquired lease openshift-machine-config-operator/machine-config-controller

$ oc logs machine-config-operator-694fd88f4f-llqvs | grep lease
I0624 09:28:22.223462 1 leaderelection.go:248] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
I0624 09:28:22.230751 1 leaderelection.go:258] successfully acquired lease openshift-machine-config-operator/machine-config

We move the status to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069