Redacted some URLs from the lab cluster. 4.6 has a 14 second delay between the "Registering Components" log lines, whereas 4.7 has a 42 second delay. This delay is just above the 3 x 10 seconds (30s) liveness probe timeout on the pod, which causes endless failures and restarts.

A potential workaround for the stuck cluster is to copy the machine-controllers deployment, remove all containers except the machineset controller, adjust the liveness probe to 6 x 30, rename, and apply. This should restore the machineset capabilities while we work on a fix (a rough command sketch is at the end of this comment).

I added some debug statements to the binaries in between those "Registering Components" calls.

$ ./machineset-debug-47
2021/03/02 16:30:19 Registering Components.
2021/03/02 16:30:19 after get config
I0302 16:30:21.360936 4015643 request.go:655] Throttling request took 1.043307661s, request: GET:https://my-cluster.example.redhat.com:6443/apis/snapshot.storage.k8s.io/v1beta1?timeout=32s
I0302 16:30:31.410138 4015643 request.go:655] Throttling request took 4.949716425s, request: GET:https://my-cluster.example.redhat.com:6443/apis/authorization.openshift.io/v1?timeout=32s
2021/03/02 16:30:33 After new manager call
2021/03/02 16:30:33 After new defaulter call
I0302 16:30:41.577767 4015643 request.go:655] Throttling request took 1.001682443s, request: GET:https://my-cluster.example.redhat.com:6443/apis/apiregistration.k8s.io/v1beta1?timeout=32s
2021/03/02 16:30:47 After new validator call
2021/03/02 16:30:47 After new msdefaulter call
I0302 16:30:51.625198 4015643 request.go:655] Throttling request took 3.149199313s, request: GET:https://my-cluster.example.redhat.com:6443/apis/coordination.k8s.io/v1?timeout=32s
I0302 16:31:01.652524 4015643 request.go:655] Throttling request took 6.905982449s, request: GET:https://my-cluster.example.redhat.com:6443/apis/operator.openshift.io/v1?timeout=32s
2021/03/02 16:31:01 After new msvalidator call
2021/03/02 16:31:01 Registering Components.
2021/03/02 16:31:01 Starting the Cmd.

$ ./machineset-debug-46
2021/03/02 16:34:23 Registering Components.
I0302 16:34:25.690275 4016198 request.go:645] Throttling request took 1.046962653s, request: GET:https://my-cluster.example.redhat.com:6443/apis/hco.kubevirt.io/v1alpha1?timeout=32s
I0302 16:34:35.740404 4016198 request.go:645] Throttling request took 4.851744443s, request: GET:https://my-cluster.example.redhat.com:6443/apis/operator.cert-manager.io/v1alpha1?timeout=32s
2021/03/02 16:34:37 after manager
2021/03/02 16:34:38 after defaulter
2021/03/02 16:34:38 after validator
2021/03/02 16:34:38 after msdefaulter
2021/03/02 16:34:38 after msvalidator
2021/03/02 16:34:38 Registering Components.
2021/03/02 16:34:38 Starting the Cmd.

So either something changed in our webhooks to require the extra calls to the API server, or something changed in sigs.k8s.io/controller-runtime. Here's the diff with the logging statements for future reference: https://github.com/openshift/machine-api-operator/pull/816/files
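A rough sketch of that workaround, based on my reading of the steps above (it assumes the deployment is named machine-api-controllers, as shown later in this bug; the file and copy names are just illustrations):

$ oc -n openshift-machine-api get deployment machine-api-controllers -o yaml > machineset-standalone.yaml
# In machineset-standalone.yaml:
#   - change metadata.name (e.g. to machineset-controller-standalone) and drop status, metadata.uid,
#     metadata.resourceVersion, and other server-populated fields
#   - delete every entry under spec.template.spec.containers except machineset-controller
#   - in the remaining livenessProbe, set failureThreshold: 6 and periodSeconds: 30
#   - consider adjusting spec.selector and the template labels so the copy does not overlap the original
$ oc -n openshift-machine-api apply -f machineset-standalone.yaml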
The commit that introduced the slowdown: 755c811ec6196a474501679e427ef67b8ec7f16c https://github.com/openshift/machine-api-operator/pull/673
Fix for master: https://github.com/openshift/machine-api-operator/pull/817
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from the previous 4.6 releases. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way.

Who is impacted?
What is the impact?
How involved is remediation?
Is this a regression?

Depending on the answers to the above questions, we can decide whether to add the UpgradeBlocker keyword.
Who is impacted?
Potentially any cluster upgrading to any currently released branch of 4.7. The two (known) affected clusters have been on vSphere, but the code path is not specific to that platform.

What is the impact?
The machineset controller becomes unavailable due to failing liveness probes. This prevents machinesets from scaling up and down to add and remove machines from the cluster. Since the operator goes Degraded in this state, the upgrade hangs and does not progress.

How involved is remediation?
A workaround exists to restore machineset-controller functionality. Unless there is a way to exclude the machine-api from being watched/upgraded during upgrades, there is no way to unblock upgrades on clusters with this condition.

Is this a regression?
Yes, new code was added in 4.7 that triggered this bug.
> A workaround exists to restore machineset-controller functionality. Unless there is a way to exclude the machine-api from being watched/upgraded during upgrades, there is no way to unblock upgrades on clusters with this condition.

Can you explain the steps to restore the machineset-controller functionality?
Could the following be tested by QA before suggesting it to customers who face this situation? I tried it and it feels simple, but I do not have a similar environment exhibiting the same issue, so I'm not able to certify that "this should restore it", as per your notes:

"A potential workaround for the stuck cluster is to copy the machine-controllers deployment, remove all containers except the machineset controller, adjust the liveness probe to 6 x 30, rename, and apply. This should restore the machineset capabilities while we work on a fix."

After it is tested I could create a KCS article for people who try to upgrade to 4.7.0.
Working up mitigation steps:

$ oc get clusterversion -o jsonpath='{.status.desired.version}{"\n"}' version
4.7.0

Add an override, so we can touch the machine-API operator without the cluster-version operator stomping us:

$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/overrides", "value": [{"kind": "Deployment", "group": "apps/v1", "namespace": "openshift-machine-api", "name": "machine-api-operator", "unmanaged": true}]}]'

If you already have overrides in place, you'd want:

$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/overrides/-", "value": {"kind": "Deployment", "group": "apps/v1", "namespace": "openshift-machine-api", "name": "machine-api-operator", "unmanaged": true}}]'

Scale down the machine-API operator, so we can touch machine-api-controllers without getting stomped (this exposes us to unwanted drift, but should be relatively safe in the short term):

$ oc -n openshift-machine-api scale --replicas=0 deployment/machine-api-operator

The probes we need to bump are:

$ oc -n openshift-machine-api get -o json deployment machine-api-controllers | jq -r '.spec.template.spec.containers[] | select(.livenessProbe != null) | .name + " " + (.livenessProbe | tostring)'
machineset-controller {"failureThreshold":3,"httpGet":{"path":"/readyz","port":"healthz","scheme":"HTTP"},"periodSeconds":10,"successThreshold":1,"timeoutSeconds":1}
machine-controller {"failureThreshold":3,"httpGet":{"path":"/readyz","port":"healthz","scheme":"HTTP"},"periodSeconds":10,"successThreshold":1,"timeoutSeconds":1}
machine-healthcheck-controller {"failureThreshold":3,"httpGet":{"path":"/readyz","port":"healthz","scheme":"HTTP"},"periodSeconds":10,"successThreshold":1,"timeoutSeconds":1}

We could probably work up a one-liner 'oc patch' for those, but for now just:

$ oc -n openshift-machine-api edit deployment machine-api-controllers

and set 'failureThreshold: 9' for the three livenessProbes manually. That should be enough to recover from the crashlooping.

If you happen to be blocked mid-update, you may also need to tell the cluster-version operator to ignore the machine-api ClusterOperator:

$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/overrides/-", "value": {"kind": "ClusterOperator", "group": "config.openshift.io/v1", "namespace": "", "name": "machine-api", "unmanaged": true}}]'

Once you update to a patched version, you can remove the overrides. If you have no other overrides that need preserving, you can do that with:

$ oc patch clusterversion version --type json -p '[{"op": "remove", "path": "/spec/overrides"}]'
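For the one-liner mentioned above, something like the following strategic merge patch (keyed on container name, using the three container names from the jq output) should work; treat it as a sketch rather than a verified command:

$ oc -n openshift-machine-api patch deployment machine-api-controllers --type strategic -p '{"spec":{"template":{"spec":{"containers":[
  {"name":"machineset-controller","livenessProbe":{"failureThreshold":9}},
  {"name":"machine-controller","livenessProbe":{"failureThreshold":9}},
  {"name":"machine-healthcheck-controller","livenessProbe":{"failureThreshold":9}}
]}}}}'

A strategic merge patch should leave the rest of each livenessProbe (the httpGet block, periods, and so on) intact and only bump failureThreshold.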
Steps to QA:

Prior to the fix, grep the logs of the machineset-controller container for 'Registering Components.' and note the timestamp difference between the first two entries. For example:

oc logs machine-api-controllers-659d5c8494-zxq6x -n openshift-machine-api -c machineset-controller | grep 'Registering Components.'
2021/03/03 15:25:32 Registering Components.
2021/03/03 15:25:40 Registering Components.

(Remember to use your cluster's pod name instead of machine-api-controllers-659d5c8494-zxq6x.)

After upgrading to the patch in question, this should be reduced to approximately 1/3 of the original amount of time.
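If you want the delta computed rather than eyeballed, a rough one-liner (assumes bash and GNU date, and reuses the example pod name above):

$ oc logs machine-api-controllers-659d5c8494-zxq6x -n openshift-machine-api -c machineset-controller \
    | grep 'Registering Components.' | head -2 | tr '/' '-' | awk '{print $1" "$2}' \
    | { read first; read second; echo "$(( $(date -d "$second" +%s) - $(date -d "$first" +%s) )) seconds"; }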
One angle at ~verification is the hosted-Loki run from the PR's update job [1,2]. Running:

{invoker="openshift-internal-ci/pull-ci-openshift-machine-api-operator-master-e2e-aws-upgrade/1366935850604040192",pod_name=~"machine-api-controllers-.*"} |= "Registering Components"

gives:

2021-03-03 02:37:02 2021/03/03 02:37:02 Registering Components.
2021-03-03 02:37:13 I0303 02:37:13.698984 1 main.go:89] Registering Components.
2021-03-03 02:37:13 I0303 02:37:13.864228 1 main.go:104] Registering Components.
2021-03-03 02:37:18 2021/03/03 02:37:18 Registering Components.
2021-03-03 03:53:00 2021/03/03 03:53:00 Registering Components.
2021-03-03 03:53:03 2021/03/03 03:53:03 Registering Components.
2021-03-03 03:53:09 I0303 03:53:09.461317 1 main.go:89] Registering Components.
2021-03-03 03:53:09 I0303 03:53:09.764114 1 main.go:104] Registering Components.

So that's ~16s in the initial version, which is before the update, and ~3s after the update, which sounds good. I dunno what the logs that start with I0303 are about, but they don't match the comment 16 pattern.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/817/pull-ci-openshift-machine-api-operator-master-e2e-aws-upgrade/1366935850604040192
[2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/817/pull-ci-openshift-machine-api-operator-master-e2e-aws-upgrade/1366935850604040192/artifacts/e2e-aws-upgrade/ipi-install-hosted-loki/build-log.txt
Thanks, Trevor. Here are the results:

Before upgrade [4.8.0-0.nightly-2021-03-03-192757]:

[miyadav@miyadav vsphere]$ oc get pods
NAME                                           READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-84d9bc49d5-jmdnh   2/2     Running   0          109m
cluster-baremetal-operator-7545d7c54f-f89hc    2/2     Running   0          109m
machine-api-controllers-5dfd4f885b-md6f4       7/7     Running   0          109m
machine-api-operator-67965db89d-zp64l          2/2     Running   0          109m

[miyadav@miyadav vsphere]$ oc logs machine-api-controllers-5dfd4f885b-md6f4 -c machineset-controller | grep 'Registering Components'
2021/03/04 07:11:45 Registering Components.
2021/03/04 07:11:53 Registering Components.

After upgrade:

[miyadav@miyadav vsphere]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-03-03-220107   True        False         39s     Cluster version is 4.8.0-0.nightly-2021-03-03-220107

[miyadav@miyadav vsphere]$ oc get pods
NAME                                          READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-9958dcb58-9hcj9   2/2     Running   0          6m22s
cluster-baremetal-operator-869f7994b-hf92j    2/2     Running   0          8m1s
machine-api-controllers-75fc98cc4-6mt75       7/7     Running   0          8m19s
machine-api-operator-5d6c9dfb94-sjb5d         2/2     Running   0          10m

[miyadav@miyadav vsphere]$ oc logs machine-api-controllers-75fc98cc4-6mt75 -c machineset-controller | grep 'Registering Components'
2021/03/04 09:23:05 Registering Components.
2021/03/04 09:23:08 Registering Components.

Additional info:

[miyadav@miyadav vsphere]$ oc describe pod machine-api-controllers-75fc98cc4-6mt75
...
Events:
  Type    Reason          Age   From               Message
  ----    ------          ----  ----               -------
  Normal  Scheduled       15m   default-scheduler  Successfully assigned openshift-machine-api/machine-api-controllers-75fc98cc4-6mt75 to control-plane-2
  Normal  AddedInterface  15m   multus             Add eth0 [10.128.0.74/23]
  Normal  Pulled          15m   kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9fe75f5a7f10e14a65b4b4d7654d4af91c21d0841b7c717cb988640394b4d3bb" already present on machine
  Normal  Created         15m   kubelet            Created container machineset-controller
  Normal  Started         15m   kubelet            Started container machineset-controller
  Normal  Pulled          15m   kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9fe75f5a7f10e14a65b4b4d7654d4af91c21d0841b7c717cb988640394b4d3bb" already present on machine
  Normal  Pulled          15m   kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:47ac7d69fd6a0daf57a3ca3ebf4d40b30a4b77abbec2fd138767abdd195f88bc" already present on machine
  Normal  Started         15m   kubelet            Started container machine-controller
  Normal  Pulled          15m   kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9fe75f5a7f10e14a65b4b4d7654d4af91c21d0841b7c717cb988640394b4d3bb" already present on machine
  Normal  Created         15m   kubelet            Created container nodelink-controller
  Normal  Started         15m   kubelet            Started container nodelink-controller
  Normal  Created         15m   kubelet            Created container machine-controller
  Normal  Created         15m   kubelet            Created container machine-healthcheck-controller
  Normal  Started         15m   kubelet            Started container machine-healthcheck-controller
  Normal  Pulled          15m   kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:47ac7d69fd6a0daf57a3ca3ebf4d40b30a4b77abbec2fd138767abdd195f88bc" already present on machine
  Normal  Created         15m   kubelet            Created container kube-rbac-proxy-machineset-mtrc
  Normal  Started         15m   kubelet            Started container kube-rbac-proxy-machineset-mtrc
  Normal  Pulled          15m   kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9fe75f5a7f10e14a65b4b4d7654d4af91c21d0841b7c717cb988640394b4d3bb" already present on machine
  Normal  Created         15m   kubelet            Created container kube-rbac-proxy-machine-mtrc
  Normal  Started         15m   kubelet            Started container kube-rbac-proxy-machine-mtrc
  Normal  Pulled          15m   kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:47ac7d69fd6a0daf57a3ca3ebf4d40b30a4b77abbec2fd138767abdd195f88bc" already present on machine
  Normal  Created         15m   kubelet            Created container kube-rbac-proxy-mhc-mtrc
  Normal  Started         15m   kubelet            Started container kube-rbac-proxy-mhc-mtrc

Moving it to VERIFIED.
Seeing:

  - lastTransitionTime: "2021-03-09T14:54:45Z"
    message: Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.
    reason: ClusterVersionOverridesSet
    status: "False"
    type: Upgradeable

Overrides disable upgrades, so the procedure is not unsticking the upgrades.
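For anyone checking the same thing on their cluster, that condition can be pulled directly with a jsonpath filter (a sketch, not part of the original report):

$ oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")]}{"\n"}'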
Hrm. Easiest solution is probably updating to 4.7.2 once that comes out with the fix (you can use 'oc adm upgrade --to 4.7.2' to retarget your existing update). You'll have to drop your overrides to get that retarget to 4.7.2 accepted, but I don't think the Upgradeable=False should be blocking you mid-update. Checking with a cluster-bot cluster:

$ oc get -o 'jsonpath={.status.desired.version}{"\n"}' clusterversion version
4.7.0

Kick off an update:

$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "candidate-4.7"}]'
$ oc adm upgrade --to 4.7.1

Give it a minute or so to get into the new CVO pod:

$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.7.1: 70 of 668 done (10% complete)

No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and result in downtime or data loss.

Override the machine-API operator:

$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/overrides", "value": [{"kind": "Deployment", "group": "apps/v1", "namespace": "openshift-machine-api", "name": "machine-api-operator", "unmanaged": true}]}]'

Confirm we got it pre-update:

$ oc get clusteroperator machine-api
NAME          VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-api   4.7.0     True        False         False      16m

Tell the CVO to not care about the machine-api ClusterOperator either, now that nobody will be bumping its version:

$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/overrides/-", "value": {"kind": "ClusterOperator", "group": "config.openshift.io/v1", "namespace": "", "name": "machine-api", "unmanaged": true}}]'

After another minute or two, the ClusterVersion Upgradeable=False condition gets set:

$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.7.1: 70 of 668 done (10% complete)

Upgradeable=False

  Reason: ClusterVersionOverridesSet
  Message: Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.

No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and result in downtime or data loss.

But the current update carries on:

$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.7.1: 96 of 668 done (14% complete)

Upgradeable=False

  Reason: ClusterVersionOverridesSet
  Message: Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.

No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and result in downtime or data loss.

And the update eventually completes. Logs in [1].

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1369526868608290816
Just to clarify: ATM, if a customer is mid-upgrade, there is no workaround that can fix it. Am I right?
Comment 13 was my recommended workaround. Comment 21 claimed it did not unstick the update. Comment 22 was me failing to reproduce the stick, and having my update complete after the comment 13 procedure. So I'm still not clear on what was causing the stick in comment 21.

But 4.7.2 has been supported since a few hours ago [1], and the release is in fast-4.7 [2], so folks who are stuck should be able to recover by dropping their spec.overrides from ClusterVersion, setting their channel to fast-4.7, and updating to 4.7.2. Fast edges are fully supported [3].

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1934798#c5
[2]: https://github.com/openshift/cincinnati-graph-data/pull/688#event-4465196412
[3]: https://docs.openshift.com/container-platform/4.7/updating/updating-cluster-between-minor.html#fast-4-7-channel
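For reference, that recovery roughly amounts to the commands already shown earlier in this bug (a sketch; assumes you have no other overrides worth preserving):

$ oc patch clusterversion version --type json -p '[{"op": "remove", "path": "/spec/overrides"}]'
$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "fast-4.7"}]'
$ oc adm upgrade --to 4.7.2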
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438