Description of problem:
Example failing CI run: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2470/pull-ci-openshift-installer-master-e2e-azure/213
The test failed with MachineWithNoRunningPhase alerts firing, even though the machines seem to be up.
https://github.com/openshift/cluster-api-provider-azure/pull/85
MachineWithNoRunningPhase fires while a machine is being provisioned and clears soon after the machine reaches the 'Running' phase. I think this is working correctly, so moving to verified.
Something is preventing those machines from being gracefully terminated. That might be e.g. PDBs or something else. Can we please get must-gather logs?
Hey Vedanti, the alert is legitimately triggering since machines are stuck "deleting". From the logs it seems the secret referenced by those machines has no perms to perform the deletion operation:

2020-06-03T05:11:20.873424112Z E0603 05:11:20.873381 1 actuator.go:84] Machine error: failed to delete machine "cluster-wtcln-worker-eastus-4shjw": failed to delete machine: failed to delete vm cluster-wtcln-worker-eastus-4shjw in resource group cluster-wtcln-rg: compute.VirtualMachinesClient#Delete: Failure sending request: StatusCode=403 -- Original Error: Code="AuthorizationFailed" Message="The client '9b1463d5-687a-4397-a17d-b53e20961ee7' with object id '9b1463d5-687a-4397-a17d-b53e20961ee7' does not have authorization to perform action 'Microsoft.Compute/virtualMachines/delete' over scope '/subscriptions/117eba7c-c7b7-43a6-b9d6-d0f257dd71a5/resourceGroups/cluster-wtcln-rg/providers/Microsoft.Compute/virtualMachines/cluster-wtcln-worker-eastus-4shjw' or the scope is invalid. If access was recently granted, please refresh your credentials."

2020-06-03T05:11:30.956944884Z E0603 05:11:30.956903 1 actuator.go:84] Machine error: failed to delete machine "cluster-wtcln-worker-eastus-fnhsj": failed to delete machine: failed to delete vm cluster-wtcln-worker-eastus-fnhsj in resource group cluster-wtcln-rg: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/117eba7c-c7b7-43a6-b9d6-d0f257dd71a5/resourceGroups/cluster-wtcln-rg/providers/Microsoft.Compute/virtualMachines/cluster-wtcln-worker-eastus-fnhsj?api-version=2018-10-01: StatusCode=0 -- Original Error: adal: Failed to execute the refresh request. Error = 'Post https://login.microsoftonline.com/4087a2a7-6506-40c1-86b4-1d0404c4969e/oauth2/token?api-version=1.0: dial tcp: lookup login.microsoftonline.com on 172.30.0.10:53: read udp 10.128.0.4:59119->172.30.0.10:53: i/o timeout'

Have the perms referenced by azure-cloud-credentials been manipulated out of band? Can you please open a new BZ to track and discuss this?
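
The two log entries quoted in this comment fail for different reasons: the first is a 403 AuthorizationFailed response from Azure, the second is a DNS lookup timeout while refreshing the token. When triaging a longer controller log it can help to separate the two. Below is a minimal Python sketch of such a triage helper. The function name classify_delete_error and the category labels are hypothetical and are not part of the Azure actuator; the matching is plain substring checks against the strings seen in the log above, so other failure messages will fall through to "unknown".

```python
import re

# Hypothetical category labels for triage; not defined by the actuator itself.
AUTHZ = "authorization"
TRANSIENT = "transient-network"
UNKNOWN = "unknown"

# Pulls the machine name out of: failed to delete machine "<name>"
MACHINE_RE = re.compile(r'failed to delete machine "([^"]+)"')


def classify_delete_error(line):
    """Return (machine_name, category) for one controller log line.

    machine_name is None when the line does not name a machine.
    """
    match = MACHINE_RE.search(line)
    machine = match.group(1) if match else None

    # A 403 with AuthorizationFailed means the credentials lack the
    # Microsoft.Compute/virtualMachines/delete permission; retrying will not help.
    if "StatusCode=403" in line or 'Code="AuthorizationFailed"' in line:
        return machine, AUTHZ

    # Token refresh failures and i/o timeouts point at DNS or network trouble
    # and may clear on a later reconcile.
    if "Failed to refresh the Token" in line or "i/o timeout" in line:
        return machine, TRANSIENT

    return machine, UNKNOWN


if __name__ == "__main__":
    # Abbreviated forms of the two entries quoted above.
    samples = [
        'Machine error: failed to delete machine "cluster-wtcln-worker-eastus-4shjw": '
        'StatusCode=403 -- Original Error: Code="AuthorizationFailed"',
        'Machine error: failed to delete machine "cluster-wtcln-worker-eastus-fnhsj": '
        "Failed to refresh the Token for request: i/o timeout",
    ]
    for sample in samples:
        print(classify_delete_error(sample))
```

Only the "authorization" category points at the credentials referenced by azure-cloud-credentials; lines in the "transient-network" category would need a separate look at cluster DNS.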
The "If access was recently granted, please refresh your credentials" error seems to have moved to bug 1846292.