Description of problem:

While deploying a cluster with a FeatureGate enabling the TechPreviewNoUpgrade featureSet, some machine resources go to the Failed status and are never reconciled.

$ oc get machines -n openshift-machine-api
NAME                          PHASE     TYPE   REGION   ZONE   AGE
ostest-c5hkx-master-0         Running                          71m
ostest-c5hkx-master-1         Running                          71m
ostest-c5hkx-master-2         Running                          71m
ostest-c5hkx-worker-0-d2lh5   Failed                           48m
ostest-c5hkx-worker-0-qpmmz   Running                          62m
ostest-c5hkx-worker-0-sghfp   Running                          48m

The machine-controller container logs the following for the failed worker:

$ oc logs -n openshift-machine-api machine-api-controllers-7d79b76cd7-6glz9 -c machine-controller | grep ostest-c5hkx-worker-0-d2lh5
I0103 09:29:54.778487       1 controller.go:175] ostest-c5hkx-worker-0-d2lh5: reconciling Machine
I0103 09:30:58.152298       1 controller.go:175] ostest-c5hkx-worker-0-d2lh5: reconciling Machine
I0103 09:31:08.920209       1 controller.go:358] ostest-c5hkx-worker-0-d2lh5: setting phase to Provisioning and requeuing
I0103 09:31:08.920229       1 controller.go:483] ostest-c5hkx-worker-0-d2lh5: going into phase "Provisioning"
I0103 09:31:40.672498       1 controller.go:175] ostest-c5hkx-worker-0-d2lh5: reconciling Machine
I0103 09:31:41.696294       1 controller.go:365] ostest-c5hkx-worker-0-d2lh5: reconciling machine triggers idempotent create
E0103 09:32:23.356103       1 actuator.go:406] Machine error ostest-c5hkx-worker-0-d2lh5: error setting provider ID: etcdserver: request timed out
W0103 09:32:23.356155       1 controller.go:367] ostest-c5hkx-worker-0-d2lh5: failed to create machine: error setting provider ID: etcdserver: request timed out
E0103 09:32:23.356231       1 controller.go:304] controller/machine_controller "msg"="Reconciler error" "error"="error setting provider ID: etcdserver: request timed out" "name"="ostest-c5hkx-worker-0-d2lh5" "namespace"="openshift-machine-api"
I0103 09:32:28.851987       1 controller.go:175] ostest-c5hkx-worker-0-d2lh5: reconciling Machine
I0103 09:32:29.974108       1 controller.go:483] ostest-c5hkx-worker-0-d2lh5: going into phase "Failed"
I0103 09:32:32.510839       1 controller.go:175] ostest-c5hkx-worker-0-d2lh5: reconciling Machine
W0103 09:32:32.510851       1 controller.go:276] ostest-c5hkx-worker-0-d2lh5: machine has gone "Failed" phase. It won't reconcile
I0103 09:37:13.835043       1 controller.go:175] ostest-c5hkx-worker-0-d2lh5: reconciling Machine
W0103 09:37:13.835101       1 controller.go:276] ostest-c5hkx-worker-0-d2lh5: machine has gone "Failed" phase. It won't reconcile

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-12-21-130047
RHOS-16.1-RHEL-8-20210903.n.0

How reproducible:
Sometimes

Steps to Reproduce:
1. Install the latest OCP 4.10 nightly, setting the FeatureGate:

kind: FeatureGate
metadata:
  annotations:
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    release.openshift.io/create-only: "true"
  creationTimestamp: "2022-01-03T09:06:36Z"
  generation: 1
  name: cluster
  resourceVersion: "1360"
  uid: cc6eed62-b26f-4ed0-91fd-34309dfe6e49
spec:
  featureSet: TechPreviewNoUpgrade

Actual results:
Installation fails and/or not all the required workers are deployed.

Expected results:
Despite the possible glitch on the etcd server, MAPO should try to set the provider ID more than once, or it should keep trying to reconcile the machine.
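For illustration, the retry behaviour the expected results ask for could look roughly like the sketch below. This is a minimal, self-contained Go example, not MAPO's actual code: `setProviderIDWithRetry`, `demo`, and the attempt/backoff parameters are all hypothetical names chosen for this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimedOut stands in for the transient "etcdserver: request timed out"
// error seen in the logs above.
var errTimedOut = errors.New("etcdserver: request timed out")

// setProviderIDWithRetry retries a provider-ID update a few times before
// giving up, instead of failing the Machine on the first transient error.
func setProviderIDWithRetry(update func() error, attempts int, backoff time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = update(); err == nil {
			return nil
		}
		time.Sleep(backoff)
	}
	return fmt.Errorf("setting provider ID after %d attempts: %w", attempts, err)
}

// demo simulates an update that times out twice and then succeeds,
// returning how many attempts were made and the final error.
func demo() (int, error) {
	calls := 0
	err := setProviderIDWithRetry(func() error {
		calls++
		if calls < 3 {
			return errTimedOut
		}
		return nil
	}, 5, time.Millisecond)
	return calls, err
}

func main() {
	calls, err := demo()
	fmt.Println(calls, err) // succeeds on the third attempt
}
```

A transient etcd timeout would then only fail the Machine if it persisted across every attempt.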
Let's defer judgement on this until https://github.com/openshift/machine-api-operator/pull/975 and https://github.com/openshift/release/pull/24933 merge. These patches fully enable the MAPO TP feature flag and allow you to test the most recent changes to MAPO, which are very significant.
@jspeed would this be an MAO bug?
Do we know if this error is coming from MAPO or CAPO?

Looks like this could be MAPO: https://github.com/openshift/machine-api-provider-openstack/blob/2401f74dba8112e0dd4a4a1bb77c26887d5da3f4/pkg/machine/actuator.go#L194-L204

Has anyone tried to reproduce this? It seems like this could use some debug logging.

What looks to me like has happened is that the MAPO code created the machine, attempted to set the provider ID, got an error back, and then attempted to delete the instance.

Meanwhile, I think the update to set the provider ID was actually successful: etcd timed out because not all members acked the update, but it did eventually get stored.

So on the next reconcile, the Machine controller sees that the instance has a provider ID, but it was deleted on OpenStack, so it rightfully fails the machine.

I would say the issue here is: why are we deleting the instance in OpenStack if we can't set the provider ID?
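The failure mode described here is an ambiguous write: the update errors, yet the value is persisted. One way to disambiguate is to re-read the object after a failed update before deciding to delete the cloud instance. The sketch below simulates that with a toy store; `fakeStore` and `shouldDeleteInstance` are hypothetical names for illustration only, not MAPO or etcd APIs.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeStore simulates a store where a write can report failure (a timeout)
// yet still be persisted, which is what etcd did in this bug.
type fakeStore struct {
	providerID string
}

func (s *fakeStore) setProviderID(id string) error {
	s.providerID = id                                  // the write lands...
	return errors.New("etcdserver: request timed out") // ...but the ack is lost
}

func (s *fakeStore) get() string { return s.providerID }

// shouldDeleteInstance re-reads the object after a failed update before
// deciding to delete the cloud instance: if the provider ID is actually
// present, the write succeeded and the instance must be kept.
func shouldDeleteInstance(s *fakeStore, id string) bool {
	if err := s.setProviderID(id); err != nil {
		return s.get() != id // delete only if the write truly did not land
	}
	return false
}

func main() {
	s := &fakeStore{}
	// The timed-out write actually landed, so the instance is kept.
	fmt.Println(shouldDeleteInstance(s, "openstack:///abc"))
}
```

A real controller would need the read to go through to etcd (not a stale cache) for this check to be meaningful, and even then the read can race or fail in turn, so this narrows the window rather than closing it.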
(In reply to Joel Speed from comment #3)
> I would say the issue here is, why are we deleting the instance in OpenStack
> if we can't set the provider ID?

The reason is in the comment above: we're treating ProviderID as the canonical source of truth for the creation of the machine. Therefore, if we created the machine but weren't able to record that, we need to delete the machine, because next time round, with no ProviderID, we're going to create another one and leak a VM.

Incidentally, note that I considered the possibility that we might race with another process that shouldn't be running, but it didn't occur to me that etcd would return an error for a transaction that later succeeded. I'm not embarrassed about that: how on earth are you supposed to work with that deterministically?

I think this effectively forces us to fall back to matching on name rather than UUID, which comes with its own problems. I can't see that etcd can ever be canonical for anything with these semantics.
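The fallback described above — matching on name rather than trusting the stored ProviderID — could be sketched as an adopt-before-create lookup. This is an illustrative stand-alone example; `server` and `findOrCreate` are hypothetical names and do not reflect MAPO's actual adoption logic.

```go
package main

import "fmt"

// server is a minimal stand-in for an OpenStack server record.
type server struct{ ID, Name string }

// findOrCreate adopts an existing instance by name instead of treating the
// stored provider ID as the sole source of truth: if an instance with the
// Machine's name already exists, reuse it rather than creating a duplicate
// and leaking a VM.
func findOrCreate(existing []server, name string, create func(string) server) server {
	for _, s := range existing {
		if s.Name == name {
			return s // adopt the instance left over from the failed reconcile
		}
	}
	return create(name)
}

func main() {
	existing := []server{{ID: "abc", Name: "ostest-c5hkx-worker-0-d2lh5"}}
	created := 0
	s := findOrCreate(existing, "ostest-c5hkx-worker-0-d2lh5", func(n string) server {
		created++
		return server{ID: "new-" + n, Name: n}
	})
	fmt.Println(s.ID, created) // adopts the existing instance, creates nothing
}
```

As the comment notes, name matching has its own problems — names are not guaranteed unique in OpenStack — so this trades one ambiguity for another rather than eliminating it.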
I am bumping the priority because I have just seen this in CI, meaning it might be more common than we think.
This should be one of the things fixed by https://github.com/openshift/machine-api-provider-openstack/pull/26
Verified on 4.11.0-0.nightly-2022-03-23-132952 on top of RHOS-16.2-RHEL-8-20220311.n.1 with a FeatureGate enabling MAPO:

apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  annotations:
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    release.openshift.io/create-only: "true"
  name: cluster
spec:
  customNoUpgrade:
    enabled:
    - MachineAPIProviderOpenStack
  featureSet: CustomNoUpgrade

We ran 9 OCP installations (3 installation types * 3 networkTypes) and the workers were successfully deployed in all cases.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069