Bug 1934216 - machineset-controller stuck in CrashLoopBackOff after upgrade to 4.7.0
Summary: machineset-controller stuck in CrashLoopBackOff after upgrade to 4.7.0
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.7
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Michael Gugino
QA Contact: Milind Yadav
URL:
Whiteboard:
Depends On:
Blocks: 1934798
Reported: 2021-03-02 17:39 UTC by David Hernández Fernández
Modified: 2022-01-11 03:58 UTC
CC: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Multiple clientsets created inside the machineset controller caused slow startup times, resulting in the pod failing liveness checks on certain large clusters. Consequence: The upgrade cannot complete and this pod cannot restart successfully in 4.7.0. Fix: The machineset controller uses a single clientset. Result: The machineset controller behaves as expected.
Clone Of:
: 1934555
Environment:
Last Closed: 2021-07-27 22:49:01 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-api-operator pull 817 0 None closed Bug 1934216: reuse mgr client in webhooks 2021-03-04 05:14:28 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:51:31 UTC

Comment 5 Michael Gugino 2021-03-02 21:43:43 UTC
Redacted some URLs from the lab cluster.

4.6 has a 14 second delay between the two "Registering Components" log lines, whereas 4.7 has a 42 second delay.  That delay is just above the 3 x 10 second (30 second) liveness probe budget on the pod, which causes endless failure and restart.

A potential workaround for the stuck cluster is to copy the machine-controllers deployment, remove all containers except the machineset controller, adjust the liveness probe to 6 x 30, rename, and apply.  This should restore the machineset capabilities while we work on a fix.
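
A rough sketch of those steps (assuming "6 x 30" means failureThreshold 6 and periodSeconds 30; the file and deployment names below are illustrative, not an agreed procedure):

  $ oc -n openshift-machine-api get deployment machine-api-controllers -o yaml > machineset-only.yaml
  # Edit machineset-only.yaml by hand:
  #  - rename metadata.name (e.g. machineset-controller-standalone)
  #  - drop status, metadata.uid, metadata.resourceVersion, and metadata.ownerReferences
  #  - delete every entry under spec.template.spec.containers except machineset-controller
  #  - in the remaining livenessProbe, set failureThreshold: 6 and periodSeconds: 30
  $ oc -n openshift-machine-api apply -f machineset-only.yaml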


I added some debug statements to the binaries in between those "Registering Components" calls.


$ ./machineset-debug-47
2021/03/02 16:30:19 Registering Components.
2021/03/02 16:30:19 after get config
I0302 16:30:21.360936 4015643 request.go:655] Throttling request took 1.043307661s, request: GET:https://my-cluster.example.redhat.com:6443/apis/snapshot.storage.k8s.io/v1beta1?timeout=32s
I0302 16:30:31.410138 4015643 request.go:655] Throttling request took 4.949716425s, request: GET:https://my-cluster.example.redhat.com:6443/apis/authorization.openshift.io/v1?timeout=32s
2021/03/02 16:30:33 After new manager call
2021/03/02 16:30:33 After new defaulter call
I0302 16:30:41.577767 4015643 request.go:655] Throttling request took 1.001682443s, request: GET:https://my-cluster.example.redhat.com:6443/apis/apiregistration.k8s.io/v1beta1?timeout=32s
2021/03/02 16:30:47 After new validator call
2021/03/02 16:30:47 After new msdefaulter call
I0302 16:30:51.625198 4015643 request.go:655] Throttling request took 3.149199313s, request: GET:https://my-cluster.example.redhat.com:6443/apis/coordination.k8s.io/v1?timeout=32s
I0302 16:31:01.652524 4015643 request.go:655] Throttling request took 6.905982449s, request: GET:https://my-cluster.example.redhat.com:6443/apis/operator.openshift.io/v1?timeout=32s
2021/03/02 16:31:01 After new msvalidator call
2021/03/02 16:31:01 Registering Components.
2021/03/02 16:31:01 Starting the Cmd.

$ ./machineset-debug-46
2021/03/02 16:34:23 Registering Components.
I0302 16:34:25.690275 4016198 request.go:645] Throttling request took 1.046962653s, request: GET:https://my-cluster.example.redhat.com:6443/apis/hco.kubevirt.io/v1alpha1?timeout=32s
I0302 16:34:35.740404 4016198 request.go:645] Throttling request took 4.851744443s, request: GET:https://my-cluster.example.redhat.com:6443/apis/operator.cert-manager.io/v1alpha1?timeout=32s
2021/03/02 16:34:37 after manager
2021/03/02 16:34:38 after defaulter
2021/03/02 16:34:38 after validator
2021/03/02 16:34:38 after msdefaulter
2021/03/02 16:34:38 after msvalidator
2021/03/02 16:34:38 Registering Components.
2021/03/02 16:34:38 Starting the Cmd.


So, either something changed in our webhooks to require the extra calls to the API, or something changed in upstream controller-runtime.  Here's the diff with the logging statements for future reference: https://github.com/openshift/machine-api-operator/pull/816/files

Comment 6 Michael Gugino 2021-03-02 22:37:24 UTC
The commit that introduced the slowdown: 755c811ec6196a474501679e427ef67b8ec7f16c

https://github.com/openshift/machine-api-operator/pull/673

Comment 7 Michael Gugino 2021-03-02 22:55:07 UTC
Fix for master: https://github.com/openshift/machine-api-operator/pull/817
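
The PR title is "reuse mgr client in webhooks".  As a rough sketch of what that shape looks like in a controller-runtime setup (the webhook wiring below is paraphrased, not the actual machine-api-operator code):

  package sketch

  import (
      "sigs.k8s.io/controller-runtime/pkg/client/config"
      "sigs.k8s.io/controller-runtime/pkg/manager"
  )

  func setupManager() (manager.Manager, error) {
      cfg, err := config.GetConfig()
      if err != nil {
          return nil, err
      }
      mgr, err := manager.New(cfg, manager.Options{})
      if err != nil {
          return nil, err
      }

      // Before: each webhook (defaulter/validator for Machines and MachineSets)
      // built its own client from cfg, so each one repeated API discovery and
      // tripped client-side throttling on clusters with many API groups.
      //
      // After: every webhook is handed the manager's shared client instead,
      // so discovery happens once.
      c := mgr.GetClient()
      _ = c // passed into the webhook constructors (names omitted here)

      return mgr, nil
  }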

Comment 8 Stephen Cuppett 2021-03-02 23:43:54 UTC
Asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from the previous 4.6.z releases. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way.

Who is impacted?
What is the impact?
How involved is remediation?
Is this a regression?

Depending on the answers to the above questions, we can decide whether to add the UpgradeBlocker keyword.

Comment 9 Michael Gugino 2021-03-03 00:09:44 UTC
Who is impacted?

Potentially any cluster upgrading to any currently released version of 4.7.  The two (known) affected clusters have been on vSphere, but the code path is not specific to that platform.


What is the impact?

The impact is the unavailability of the machineset controller due to failing liveness probes.  This prevents machinesets from scaling up and down in order to add machines to and remove machines from the cluster.  Because the operator goes Degraded in this state, the upgrade hangs and does not progress.


How involved is remediation?

A workaround exists to restore machineset-controller functionality.  Unless there is a way to exclude the machine-api from being watched/upgraded during upgrades, there is no way to unblock upgrades on clusters with this condition.


Is this a regression?

Yes, new code was added in 4.7 that triggered this bug.

Comment 10 Lalatendu Mohanty 2021-03-03 01:18:07 UTC
> A workaround exists to restore machineset-controller functionality.  Unless there is a way to exclude the machine-api from being watched/upgraded during upgrades, there is no way to unblock upgrades on clusters with this condition.

Can you explain the steps to restore the machineset-controller functionality?

Comment 11 David Hernández Fernández 2021-03-03 10:24:08 UTC
Could the following be tested by QA before we suggest it to customers who hit this situation? I tried it and it seems simple, but I do not have a similar environment with the same issue, so I am not able to confirm "if this should restore it" as per your notes:
"A potential workaround for the stuck cluster is to copy the machine-controllers deployment, remove all containers except the machineset controller, adjust the liveness probe to 6 x 30, rename, and apply.  This should restore the machineset capabilities while we work on a fix."

Once it is tested, I could create a KCS article for people who try to upgrade to 4.7.0.

Comment 13 W. Trevor King 2021-03-03 19:19:15 UTC
Working up mitigation steps:

  $ oc get clusterversion -o jsonpath='{.status.desired.version}{"\n"}' version
  4.7.0

Add an override, so we can touch the machine-API operator without the cluster-version operator stomping us:

  $ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/overrides", "value": [{"kind": "Deployment", "group": "apps/v1", "namespace": "openshift-machine-api", "name": "machine-api-operator", "unmanaged": true}]}]'

If you already have overrides in place, you'd want:

  $ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/overrides/-", "value": {"kind": "Deployment", "group": "apps/v1", "namespace": "openshift-machine-api", "name": "machine-api-operator", "unmanaged": true}}]'

Scale down the machine-API operator, so we can touch machine-api-controllers without getting stomped (this exposes us to unwanted drift, but should be relatively safe in the short term):

  $ oc -n openshift-machine-api scale --replicas=0 deployment/machine-api-operator

The probes we need to bump are:

  $ oc -n openshift-machine-api get -o json deployment machine-api-controllers | jq -r '.spec.template.spec.containers[] | select(.livenessProbe != null) | .name + " " + (.livenessProbe | tostring)'
  machineset-controller {"failureThreshold":3,"httpGet": {"path":"/readyz","port":"healthz","scheme":"HTTP"},"periodSeconds":10,"successThreshold":1,"timeoutSeconds":1}
  machine-controller {"failureThreshold":3,"httpGet": {"path":"/readyz","port":"healthz","scheme":"HTTP"},"periodSeconds":10,"successThreshold":1,"timeoutSeconds":1}
  machine-healthcheck-controller {"failureThreshold":3,"httpGet": {"path":"/readyz","port":"healthz","scheme":"HTTP"},"periodSeconds":10,"successThreshold":1,"timeoutSeconds":1}

We could probably work up a one-liner 'oc patch' for those, but for now just:

  $ oc -n openshift-machine-api edit deployment machine-api-controllers

and set 'failureThreshold: 9' for the three livenessProbes manually.  That should be enough to recover from the crashlooping.
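
For the record, a sketch of that one-liner as a strategic merge patch (strategic merge matches containers by name; untested, so treat it as illustrative):

  $ oc -n openshift-machine-api patch deployment machine-api-controllers --type strategic -p '
    {"spec": {"template": {"spec": {"containers": [
      {"name": "machineset-controller", "livenessProbe": {"failureThreshold": 9}},
      {"name": "machine-controller", "livenessProbe": {"failureThreshold": 9}},
      {"name": "machine-healthcheck-controller", "livenessProbe": {"failureThreshold": 9}}]}}}}'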

If you happen to be blocked mid-update, you may also need to tell the cluster-version operator to ignore the machine-api ClusterOperator:

  $ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/overrides/-", "value": {"kind": "ClusterOperator", "group": "config.openshift.io/v1", "namespace": "", "name": "machine-api", "unmanaged": true}}]'

Once you update to a patched version, you can remove the overrides.  If you have no other overrides that need preserving, you can do that with:

  $ oc patch clusterversion version --type json -p '[{"op": "remove", "path": "/spec/overrides"}]'

Comment 16 Michael Gugino 2021-03-03 20:51:32 UTC
Steps to QA:

Prior to the fix, grep the machineset-controller logs for 'Registering Components.' and note the timestamp difference between the first two entries.  For example:

oc logs machine-api-controllers-659d5c8494-zxq6x -n openshift-machine-api -c machineset-controller | grep 'Registering Components.'
2021/03/03 15:25:32 Registering Components.
2021/03/03 15:25:40 Registering Components.


(Remember to use your cluster's pod name instead of machine-api-controllers-659d5c8494-zxq6x)

After upgrading to the patch in question, this should be reduced to approximately 1/3 of the original amount of time.
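
If a number is handy, a rough one-liner for turning those two log lines into a seconds delta (assumes bash and GNU date; substitute your pod name):

  $ oc -n openshift-machine-api logs <machineset-controller-pod> -c machineset-controller \
      | grep 'Registering Components.' | head -2 | awk '{print $1, $2}' \
      | while read d t; do date -d "${d//\//-} $t" +%s; done \
      | awk 'NR==1 {start=$1} NR==2 {print $1 - start, "seconds"}'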

Comment 19 W. Trevor King 2021-03-04 05:24:37 UTC
One angle at ~verification is the hosted-Loki run from the PR's update job [1,2].  Running:

  {invoker="openshift-internal-ci/pull-ci-openshift-machine-api-operator-master-e2e-aws-upgrade/1366935850604040192",pod_name=~"machine-api-controllers-.*"} |= "Registering Components"

gives:

  2021-03-03 02:37:02	2021/03/03 02:37:02 Registering Components.
  2021-03-03 02:37:13	I0303 02:37:13.698984       1 main.go:89] Registering Components.
  2021-03-03 02:37:13	I0303 02:37:13.864228       1 main.go:104] Registering Components.
  2021-03-03 02:37:18	2021/03/03 02:37:18 Registering Components.
  2021-03-03 03:53:00	2021/03/03 03:53:00 Registering Components.
  2021-03-03 03:53:03	2021/03/03 03:53:03 Registering Components.
  2021-03-03 03:53:09	I0303 03:53:09.461317       1 main.go:89] Registering Components.
  2021-03-03 03:53:09	I0303 03:53:09.764114       1 main.go:104] Registering Components.

So that's ~16s in the initial version, which is before the update, and ~3s after the update, which sounds good.  I dunno what the logs that start with I0303 are about, but they don't match the comment 16 pattern.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/817/pull-ci-openshift-machine-api-operator-master-e2e-aws-upgrade/1366935850604040192
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/817/pull-ci-openshift-machine-api-operator-master-e2e-aws-upgrade/1366935850604040192/artifacts/e2e-aws-upgrade/ipi-install-hosted-loki/build-log.txt

Comment 20 Milind Yadav 2021-03-04 09:43:42 UTC
Thanks, Trevor.

Here are the results:

[4.8.0-0.nightly-2021-03-03-192757]
[miyadav@miyadav vsphere]$ oc get pods
NAME                                           READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-84d9bc49d5-jmdnh   2/2     Running   0          109m
cluster-baremetal-operator-7545d7c54f-f89hc    2/2     Running   0          109m
machine-api-controllers-5dfd4f885b-md6f4       7/7     Running   0          109m
machine-api-operator-67965db89d-zp64l          2/2     Running   0          109m
[miyadav@miyadav vsphere]$ oc logs machine-api-controllers-5dfd4f885b-md6f4 -c machineset-controller | grep 'Registering Components'
2021/03/04 07:11:45 Registering Components.
2021/03/04 07:11:53 Registering Components.

After upgrade :

[miyadav@miyadav vsphere]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-03-03-220107   True        False         39s     Cluster version is 4.8.0-0.nightly-2021-03-03-220107
[miyadav@miyadav vsphere]$ oc get pods
NAME                                          READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-9958dcb58-9hcj9   2/2     Running   0          6m22s
cluster-baremetal-operator-869f7994b-hf92j    2/2     Running   0          8m1s
machine-api-controllers-75fc98cc4-6mt75       7/7     Running   0          8m19s
machine-api-operator-5d6c9dfb94-sjb5d         2/2     Running   0          10m
[miyadav@miyadav vsphere]$ oc logs machine-api-controllers-75fc98cc4-6mt75 -c machineset-controller | grep 'Registering Components'
2021/03/04 09:23:05 Registering Components.
2021/03/04 09:23:08 Registering Components.
[miyadav@miyadav vsphere]$ 


Additional Info:
[miyadav@miyadav vsphere]$ oc describe pod machine-api-controllers-75fc98cc4-6mt75 
.
.

Events:
  Type    Reason          Age   From               Message
  ----    ------          ----  ----               -------
  Normal  Scheduled       15m   default-scheduler  Successfully assigned openshift-machine-api/machine-api-controllers-75fc98cc4-6mt75 to control-plane-2
  Normal  AddedInterface  15m   multus             Add eth0 [10.128.0.74/23]
  Normal  Pulled          15m   kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9fe75f5a7f10e14a65b4b4d7654d4af91c21d0841b7c717cb988640394b4d3bb" already present on machine
  Normal  Created         15m   kubelet            Created container machineset-controller
  Normal  Started         15m   kubelet            Started container machineset-controller
  Normal  Pulled          15m   kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9fe75f5a7f10e14a65b4b4d7654d4af91c21d0841b7c717cb988640394b4d3bb" already present on machine
  Normal  Pulled          15m   kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:47ac7d69fd6a0daf57a3ca3ebf4d40b30a4b77abbec2fd138767abdd195f88bc" already present on machine
  Normal  Started         15m   kubelet            Started container machine-controller
  Normal  Pulled          15m   kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9fe75f5a7f10e14a65b4b4d7654d4af91c21d0841b7c717cb988640394b4d3bb" already present on machine
  Normal  Created         15m   kubelet            Created container nodelink-controller
  Normal  Started         15m   kubelet            Started container nodelink-controller
  Normal  Created         15m   kubelet            Created container machine-controller
  Normal  Created         15m   kubelet            Created container machine-healthcheck-controller
  Normal  Started         15m   kubelet            Started container machine-healthcheck-controller
  Normal  Pulled          15m   kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:47ac7d69fd6a0daf57a3ca3ebf4d40b30a4b77abbec2fd138767abdd195f88bc" already present on machine
  Normal  Created         15m   kubelet            Created container kube-rbac-proxy-machineset-mtrc
  Normal  Started         15m   kubelet            Started container kube-rbac-proxy-machineset-mtrc
  Normal  Pulled          15m   kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9fe75f5a7f10e14a65b4b4d7654d4af91c21d0841b7c717cb988640394b4d3bb" already present on machine
  Normal  Created         15m   kubelet            Created container kube-rbac-proxy-machine-mtrc
  Normal  Started         15m   kubelet            Started container kube-rbac-proxy-machine-mtrc
  Normal  Pulled          15m   kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:47ac7d69fd6a0daf57a3ca3ebf4d40b30a4b77abbec2fd138767abdd195f88bc" already present on machine
  Normal  Created         15m   kubelet            Created container kube-rbac-proxy-mhc-mtrc
  Normal  Started         15m   kubelet            Started container kube-rbac-proxy-mhc-mtrc


Moving it to VERIFIED

Comment 21 Michael Gugino 2021-03-09 15:03:45 UTC
Seeing:

    - lastTransitionTime: "2021-03-09T14:54:45Z"
      message: Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.
      reason: ClusterVersionOverridesSet
      status: "False"
      type: Upgradeable

Overrides disable upgrades, so the procedure is not unsticking the upgrades.

Comment 22 W. Trevor King 2021-03-10 15:09:13 UTC
Hrm.  Easiest solution is probably updating to 4.7.2 once that comes out with the fix (you can use 'oc adm upgrade --to 4.7.2' to retarget your existing update).  You'll have to drop your overrides to get that retarget to 4.7.2 accepted, but I don't think the Upgradeable=False should be blocking you mid-update.  Checking with a cluster-bot cluster:

  $ oc get -o 'jsonpath={.status.desired.version}{"\n"}' clusterversion version
  4.7.0

Kick off an update:

  $ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "candidate-4.7"}]'
  $ oc adm upgrade --to 4.7.1

Give it a minute or so to get into the new CVO pod:

  $ oc adm upgrade
  info: An upgrade is in progress. Working towards 4.7.1: 70 of 668 done (10% complete)

  No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and result in downtime or data loss.

Override the machine API operator:

  $ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/overrides", "value": [{"kind": "Deployment", "group": "apps/v1", "namespace": "openshift-machine-api", "name": "machine-api-operator", "unmanaged": true}]}]'

Confirm we got it pre-update:

  $ oc get clusteroperator machine-api
  NAME          VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
  machine-api   4.7.0     True        False         False      16m

Tell the CVO to not care about the machine-api ClusterOperator either, now that nobody will be bumping its version:

  $ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/overrides/-", "value": {"kind": "ClusterOperator", "group": "config.openshift.io/v1", "namespace": "", "name": "machine-api", "unmanaged": true}}]'

After another minute or two, the ClusterVersion Upgradeable=False condition gets set:

  $ oc adm upgrade
  info: An upgrade is in progress. Working towards 4.7.1: 70 of 668 done (10% complete)

  Upgradeable=False

    Reason: ClusterVersionOverridesSet
    Message: Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.

  No updates available. You may force an upgrade to a specific release image, but 
  doing so may not be supported and result in downtime or data loss.

But the current update carries on:

  $ oc adm upgrade
  info: An upgrade is in progress. Working towards 4.7.1: 96 of 668 done (14% complete)

  Upgradeable=False

    Reason: ClusterVersionOverridesSet
    Message: Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.

  No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and result in downtime or data loss.

And the update eventually completes.  Logs in [1].

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1369526868608290816

Comment 23 Andrea Cervesato 2021-03-15 14:25:58 UTC
Just to clarify: at the moment, if a customer is mid-upgrade, there is no workaround that can fix it.
Am I right?

Comment 24 W. Trevor King 2021-03-16 17:18:00 UTC
Comment 13 was my recommended workaround.  Comment 21 claimed it did not unstick the update.  Comment 22 was me failing to reproduce the stick, and having my update complete after the comment 13 procedure.  So I'm still not clear on what was causing the stick in comment 21.

But 4.7.2 is supported since a few hours ago [1], and the release is in fast-4.7 [2], so folks who are stuck should be able to recover by dropping their spec.overrides from ClusterVersion, setting their channel to fast-4.7, and updating to 4.7.2.  Fast edges are fully supported [3].

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1934798#c5
[2]: https://github.com/openshift/cincinnati-graph-data/pull/688#event-4465196412
[3]: https://docs.openshift.com/container-platform/4.7/updating/updating-cluster-between-minor.html#fast-4-7-channel
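
In command form, the recovery described above looks roughly like this (a sketch, reusing the patch syntax from comment 22; assumes there are no other spec.overrides worth preserving):

  $ oc patch clusterversion version --type json -p '[{"op": "remove", "path": "/spec/overrides"}]'
  $ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "fast-4.7"}]'
  $ oc adm upgrade --to 4.7.2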

Comment 27 errata-xmlrpc 2021-07-27 22:49:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

