Description of problem:

In an OCP cluster set up on Azure, some Machine objects show the Failed phase. The OCP node status shows that the corresponding nodes are in the Ready state. From the Azure perspective, the VMs are running with the correct VMId (ProviderID).

~~~
# oc get machines -n openshift-machine-api
NAME                                        PHASE     TYPE               REGION       ZONE   AGE
awe-pp-xxx-ah97u-master-0                   Failed    Standard_D8s_v3    westeurope   1      88d
awe-pp-xxx-ah97u-master-1                   Running   Standard_D8s_v3    westeurope   2      88d
awe-pp-xxx-ah97u-master-2                   Running   Standard_D8s_v3    westeurope   3      88d
awe-pp-xxx-ah97u-worker-westeurope1-fmm25   Failed    Standard_F32s_v2   westeurope   1      88d
awe-pp-xxx-ah97u-worker-westeurope2-w22ln   Running   Standard_F32s_v2   westeurope   2      52d
awe-pp-xxx-ah97u-worker-westeurope3-s7vll   Running   Standard_F32s_v2   westeurope   3      88d

# oc get nodes
NAME                                        STATUS   ROLES    AGE   VERSION
awe-pp-xxx-ah97u-master-0                   Ready    master   88d   v1.19.0+2f3101c
awe-pp-xxx-ah97u-master-1                   Ready    master   88d   v1.19.0+2f3101c
awe-pp-xxx-ah97u-master-2                   Ready    master   88d   v1.19.0+2f3101c
awe-pp-xxx-ah97u-worker-westeurope1-fmm25   Ready    worker   88d   v1.19.0+2f3101c
awe-pp-xxx-ah97u-worker-westeurope2-w22ln   Ready    worker   52d   v1.19.0+2f3101c
awe-pp-xxx-ah97u-worker-westeurope3-s7vll   Ready    worker   88d   v1.19.0+2f3101c
~~~

- Machine awe-pp-xxx-ah97u-master-0 is in phase: Failed
  Machine YAML:
  errorMessage: Can't find created instance.

- The machine-api-controllers logs show:

~~~
2021-04-26T00:58:33.013973757Z I0426 00:58:33.013892 1 controller.go:170] awe-pp-xxx-ah97u-master-0: reconciling Machine
2021-04-26T00:58:33.013973757Z W0426 00:58:33.013918 1 controller.go:267] awe-pp-xxx-ah97u-master-0: machine has gone "Failed" phase. It won't reconcile
2021-04-26T00:58:33.014001757Z I0426 00:58:33.013974 1 controller.go:261] controller "msg"="Successfully Reconciled" "controller"="machine_controller" "name"="awe-pp-xxx-ah97u-master-0" "namespace"="openshift-machine-api"
~~~

- From the Azure point of view, the ProviderID in the node description matches the actual VM.

Version-Release number of selected component (if applicable):
OCP 4.6.21

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
Machine is in the Failed phase.

Expected results:
Machine should be in the Running phase.

Additional info:
The issue was seen after updating OCP from 4.6.12 to 4.6.21, at the time the node was rebooted.

This reminds me of a previous bug, but I can't find it right now. What I believe is happening here is that the VM is passing through a provisioning state that our "exists" check doesn't account for, so the check reports the instance as missing and the Machine goes to Failed. We need to work out exactly what that state is and add it as an allowed state in https://github.com/openshift/cluster-api-provider-azure/blob/b2eda16dd665ab39459c0b686c88ce2d0b97ec6a/pkg/cloud/azure/actuators/machine/reconciler.go#L386-L400
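
To make the suspected failure mode concrete, here is a rough Go sketch of an allow-list style "exists" check. It is not the provider's actual code: `allowedStates`, `vmExists` and the placeholder state `SomeTransientState` are hypothetical names for illustration, and the assumption that only `Succeeded` and `Updating` are accepted is mine, not confirmed from the linked file.

~~~go
package main

import "fmt"

// allowedStates is a hypothetical allow-list of VM provisioning states
// that the check treats as "the instance exists". The exact set used by
// the real provider is an assumption here.
var allowedStates = map[string]bool{
	"Succeeded": true,
	"Updating":  true,
}

// vmExists is a hypothetical stand-in for the provider's exists check.
// Any provisioning state that is not on the allow-list is reported as
// "does not exist", even if the VM is actually present in Azure.
func vmExists(provisioningState string) bool {
	return allowedStates[provisioningState]
}

func main() {
	// "SomeTransientState" is a placeholder for the not-yet-identified
	// state the VM may pass through during a reboot or upgrade.
	for _, state := range []string{"Succeeded", "Updating", "SomeTransientState"} {
		fmt.Printf("state=%q exists=%v\n", state, vmExists(state))
	}
}
~~~

If the real check behaves like this, a VM caught in an unlisted state during reconciliation would be reported as missing, which would be consistent with the "Can't find created instance" error. Once the actual state is identified, the fix would amount to adding it to the allowed set.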
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438