Description of problem: Machine stuck in "Provisioning" when creating a machine with invalid ami, machine controller continuously reconciling status Version-Release number of selected component (if applicable): 4.5.0-0.nightly-2020-05-27-000432 How reproducible: Always Steps to Reproduce: 1. Creating a new machineset with invalid ami spec: metadata: {} providerSpec: value: ami: id: ami-0254f6b4eaa3b9d0f-invalid 2. Check machines and logs Actual results: Machine stuck in Provisioning status, machine controller continuously reconciling status $ oc get machine NAME PHASE TYPE REGION ZONE AGE zhsunaws527-cwprz-master-0 Running m4.xlarge us-east-2 us-east-2a 4h54m zhsunaws527-cwprz-master-1 Running m4.xlarge us-east-2 us-east-2b 4h54m zhsunaws527-cwprz-master-2 Running m4.xlarge us-east-2 us-east-2c 4h54m zhsunaws527-cwprz-worker-us-east-2a-xxxkq Running m4.large us-east-2 us-east-2a 4h45m zhsunaws527-cwprz-worker-us-east-2b-smbs5 Running m4.large us-east-2 us-east-2b 4h45m zhsunaws527-cwprz-worker-us-east-2c-9mnfv Running m4.large us-east-2 us-east-2c 4h45m zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs Provisioning 79m status: errorMessage: "zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: reconciler failed to Create machine: failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: \"ami-0254f6b4eaa3b9d0f-invalid\"\n\tstatus code: 400, request id: e80cef1f-3de7-4987-abf2-a81d2dc6ef1f" lastUpdated: "2020-05-27T06:02:40Z" phase: Provisioning providerStatus: conditions: - lastTransitionTime: "2020-05-27T06:02:41Z" message: "error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: \"ami-0254f6b4eaa3b9d0f-invalid\"\n\tstatus code: 400, request id: 8e5f963e-e4e8-4036-a212-c25e75bdac33" reason: MachineCreationFailed status: "False" type: MachineCreation 27 07:25:34.584813 1 reconciler.go:68] zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: error creating machine: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0254f6b4eaa3b9d0f-invalid" status code: 400, request id: 0ecd4a64-2e37-46e8-86e1-f8e7383aee26 I0527 07:25:34.584829 1 machine_scope.go:138] zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: Updating status I0527 07:25:34.584839 1 machine_scope.go:165] zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: finished calculating AWS status I0527 07:25:34.584854 1 machine_scope.go:84] zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: patching machine E0527 07:25:34.608643 1 actuator.go:66] zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs error: zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: reconciler failed to Create machine: failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0254f6b4eaa3b9d0f-invalid" status code: 400, request id: 0ecd4a64-2e37-46e8-86e1-f8e7383aee26 W0527 07:25:34.608699 1 controller.go:315] zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: failed to create machine: zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: reconciler failed to Create machine: failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0254f6b4eaa3b9d0f-invalid" status code: 400, request id: 0ecd4a64-2e37-46e8-86e1-f8e7383aee26 I0527 07:25:34.608717 1 controller.go:415] Actuator returned invalid configuration error: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0254f6b4eaa3b9d0f-invalid" status code: 400, request id: 0ecd4a64-2e37-46e8-86e1-f8e7383aee26 I0527 07:25:34.608730 1 controller.go:424] zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: going into phase "Failed" I0527 07:25:34.610156 1 recorder.go:52] controller-runtime/manager/events "msg"="Warning" "message"="zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: reconciler failed to Create machine: failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: \"ami-0254f6b4eaa3b9d0f-invalid\"\n\tstatus code: 400, request id: 0ecd4a64-2e37-46e8-86e1-f8e7383aee26" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs","uid":"ea3b9ea5-b5a1-4b0f-ae23-1dac1ceca0b6","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"142183"} "reason"="FailedCreate" I0527 07:25:34.629960 1 controller.go:282] controller-runtime/controller "msg"="Successfully Reconciled" "controller"="machine_controller" "request"={"Namespace":"openshift-machine-api","Name":"zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs"} I0527 07:25:34.630026 1 controller.go:169] zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: reconciling Machine Expected results: The machine phase is set "Failed", machine controller stop reconciling status Additional info:
@sunzhaohua Did this eventually go failed or was it stuck in provisioning forever? It appears that the code is following the right/expected paths, but the status update seems to have failed You can see in the logs that we have reached controller.go:424 where is is going into phase Failed, so the error is being returned and recognised correctly, and for some reason setPhase is failing to actually update the API. I would have expected this to be a transient issue. https://github.com/openshift/cluster-api-provider-aws/blob/5e266b553d8e7c5809d94653e2531167c865a762/vendor/github.com/openshift/machine-api-operator/pkg/controller/machine/controller.go#L422-L447 Could you also check the annotations on the Machine when this happened? There should also be "machine.openshift.io/instance-state": "unknown" in the annotations at this point
@joel speed It stuck in Provisioning, I tried several times, all stuck in Provisioning, the longest time is 89m. "machine.openshift.io/instance-state": "unknown" annotations could see in the machine. My cluster is here: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/95078/artifact/workdir/install-dir/auth/kubeconfig/*view*/ in case you need to debug, machine still in Provisioning
Verified clusterversion: 4.5.0-0.nightly-2020-05-30-025738 create a machine with invalid ami $ oc get machine NAME PHASE TYPE REGION ZONE AGE zhsun601aws-5w9lj-master-0 Running m4.xlarge us-east-2 us-east-2a 129m zhsun601aws-5w9lj-master-1 Running m4.xlarge us-east-2 us-east-2b 129m zhsun601aws-5w9lj-master-2 Running m4.xlarge us-east-2 us-east-2c 129m zhsun601aws-5w9lj-worker-us-east-2a-xfchg Running m4.large us-east-2 us-east-2a 119m zhsun601aws-5w9lj-worker-us-east-2b-b8x9c Running m4.large us-east-2 us-east-2b 119m zhsun601aws-5w9lj-worker-us-east-2c-ph88r Failed 98s zhsun601aws-5w9lj-worker-us-east-2c-vv8wv Running m4.large us-east-2 us-east-2c 93m、 I0601 03:31:14.031906 1 controller.go:169] zhsun601aws-5w9lj-worker-us-east-2c-ph88r: reconciling Machine W0601 03:31:14.031921 1 controller.go:266] zhsun601aws-5w9lj-worker-us-east-2c-ph88r: machine has gone "Failed" phase. It won't reconcile
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409