+++ This bug was initially created as a clone of Bug #1840552 +++ Description of problem: Machine stuck in "Provisioning" when creating a machine with invalid ami, machine controller continuously reconciling status Version-Release number of selected component (if applicable): 4.5.0-0.nightly-2020-05-27-000432 How reproducible: Always Steps to Reproduce: 1. Creating a new machineset with invalid ami spec: metadata: {} providerSpec: value: ami: id: ami-0254f6b4eaa3b9d0f-invalid 2. Check machines and logs Actual results: Machine stuck in Provisioning status, machine controller continuously reconciling status $ oc get machine NAME PHASE TYPE REGION ZONE AGE zhsunaws527-cwprz-master-0 Running m4.xlarge us-east-2 us-east-2a 4h54m zhsunaws527-cwprz-master-1 Running m4.xlarge us-east-2 us-east-2b 4h54m zhsunaws527-cwprz-master-2 Running m4.xlarge us-east-2 us-east-2c 4h54m zhsunaws527-cwprz-worker-us-east-2a-xxxkq Running m4.large us-east-2 us-east-2a 4h45m zhsunaws527-cwprz-worker-us-east-2b-smbs5 Running m4.large us-east-2 us-east-2b 4h45m zhsunaws527-cwprz-worker-us-east-2c-9mnfv Running m4.large us-east-2 us-east-2c 4h45m zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs Provisioning 79m status: errorMessage: "zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: reconciler failed to Create machine: failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: \"ami-0254f6b4eaa3b9d0f-invalid\"\n\tstatus code: 400, request id: e80cef1f-3de7-4987-abf2-a81d2dc6ef1f" lastUpdated: "2020-05-27T06:02:40Z" phase: Provisioning providerStatus: conditions: - lastTransitionTime: "2020-05-27T06:02:41Z" message: "error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: \"ami-0254f6b4eaa3b9d0f-invalid\"\n\tstatus code: 400, request id: 8e5f963e-e4e8-4036-a212-c25e75bdac33" reason: MachineCreationFailed status: "False" type: MachineCreation 27 07:25:34.584813 1 reconciler.go:68] zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: error creating machine: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0254f6b4eaa3b9d0f-invalid" status code: 400, request id: 0ecd4a64-2e37-46e8-86e1-f8e7383aee26 I0527 07:25:34.584829 1 machine_scope.go:138] zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: Updating status I0527 07:25:34.584839 1 machine_scope.go:165] zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: finished calculating AWS status I0527 07:25:34.584854 1 machine_scope.go:84] zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: patching machine E0527 07:25:34.608643 1 actuator.go:66] zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs error: zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: reconciler failed to Create machine: failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0254f6b4eaa3b9d0f-invalid" status code: 400, request id: 0ecd4a64-2e37-46e8-86e1-f8e7383aee26 W0527 07:25:34.608699 1 controller.go:315] zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: failed to create machine: zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: reconciler failed to Create machine: failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0254f6b4eaa3b9d0f-invalid" status code: 400, request id: 0ecd4a64-2e37-46e8-86e1-f8e7383aee26 I0527 07:25:34.608717 1 controller.go:415] Actuator returned invalid configuration error: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0254f6b4eaa3b9d0f-invalid" status code: 400, request id: 0ecd4a64-2e37-46e8-86e1-f8e7383aee26 I0527 07:25:34.608730 1 controller.go:424] zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: going into phase "Failed" I0527 07:25:34.610156 1 recorder.go:52] controller-runtime/manager/events "msg"="Warning" "message"="zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: reconciler failed to Create machine: failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: \"ami-0254f6b4eaa3b9d0f-invalid\"\n\tstatus code: 400, request id: 0ecd4a64-2e37-46e8-86e1-f8e7383aee26" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs","uid":"ea3b9ea5-b5a1-4b0f-ae23-1dac1ceca0b6","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"142183"} "reason"="FailedCreate" I0527 07:25:34.629960 1 controller.go:282] controller-runtime/controller "msg"="Successfully Reconciled" "controller"="machine_controller" "request"={"Namespace":"openshift-machine-api","Name":"zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs"} I0527 07:25:34.630026 1 controller.go:169] zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: reconciling Machine Expected results: The machine phase is set "Failed", machine controller stop reconciling status Additional info: --- Additional comment from Joel Speed on 2020-05-27 08:25:21 UTC --- @sunzhaohua Did this eventually go failed or was it stuck in provisioning forever? It appears that the code is following the right/expected paths, but the status update seems to have failed You can see in the logs that we have reached controller.go:424 where is is going into phase Failed, so the error is being returned and recognised correctly, and for some reason setPhase is failing to actually update the API. I would have expected this to be a transient issue. https://github.com/openshift/cluster-api-provider-aws/blob/5e266b553d8e7c5809d94653e2531167c865a762/vendor/github.com/openshift/machine-api-operator/pkg/controller/machine/controller.go#L422-L447 Could you also check the annotations on the Machine when this happened? There should also be "machine.openshift.io/instance-state": "unknown" in the annotations at this point --- Additional comment from sunzhaohua on 2020-05-27 09:00:40 UTC --- @joel speed It stuck in Provisioning, I tried several times, all stuck in Provisioning, the longest time is 89m. "machine.openshift.io/instance-state": "unknown" annotations could see in the machine. My cluster is here: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/95078/artifact/workdir/install-dir/auth/kubeconfig/*view*/ in case you need to debug, machine still in Provisioning
Verified Clusterversion: 4.5.0-0.nightly-2020-06-02-174025 create a new machine with invalid image, machine became to Failed. $ oc get machine NAME PHASE TYPE REGION ZONE AGE hongli-pl025-vqdlx-master-0 Running n1-standard-4 us-central1 us-central1-a 5h40m hongli-pl025-vqdlx-master-1 Running n1-standard-4 us-central1 us-central1-b 5h40m hongli-pl025-vqdlx-master-2 Running n1-standard-4 us-central1 us-central1-c 5h40m hongli-pl025-vqdlx-worker-a-xnwrk Running n1-standard-4 us-central1 us-central1-a 5h35m hongli-pl025-vqdlx-worker-b-zwz4s Running n1-standard-4 us-central1 us-central1-b 5h35m hongli-pl025-vqdlx-worker-c-9srjr Running n1-standard-4 us-central1 us-central1-c 5h35m hongli-pl025-vqdlx-worker-f-5rk7j Failed 10s
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409