Bug 1759659

Summary: [azure] MachineWithNoRunningPhase firing even when machines are running
Product: OpenShift Container Platform Reporter: Abhinav Dahiya <adahiya>
Component: Cloud Compute Assignee: Alberto <agarcial>
Cloud Compute sub component: Other Providers QA Contact: Jianwei Hou <jhou>
Status: CLOSED CURRENTRELEASE Docs Contact:
Severity: medium    
Priority: unspecified CC: agarcial, inecas, jhou, vjaypurk, vlaad, wking, xtian, zhsun
Version: 4.3.0   
Target Milestone: ---   
Target Release: 4.3.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2020-05-14 14:20:51 UTC Type: Bug

Description Abhinav Dahiya 2019-10-08 19:27:16 UTC
Description of problem:


example failing CI run: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2470/pull-ci-openshift-installer-master-e2e-azure/213

The test failed with MachineWithNoRunningPhase alerts firing, even though the machines seem to be up.

Comment 3 Jianwei Hou 2019-10-22 04:59:05 UTC
MachineWithNoRunningPhase fires when a machine is provisioned. The alert clears soon after the machine reaches the 'Running' phase.

I think this is working correctly, so moving to verified.

Comment 10 Alberto 2020-05-22 07:01:31 UTC
There's something preventing those machines from being gracefully terminated. That might be e.g. PDBs or something else.
Can we please get must-gather logs?

Comment 13 Alberto 2020-06-08 09:19:54 UTC
Hey Vedanti, the alert is legitimately triggering since machines are stuck "deleting". From the logs it seems the secret referenced by those machines has no permissions to perform the deletion operation:

2020-06-03T05:11:20.873424112Z E0603 05:11:20.873381       1 actuator.go:84] Machine error: failed to delete machine "cluster-wtcln-worker-eastus-4shjw": failed to delete machine: failed to delete vm cluster-wtcln-worker-eastus-4shjw in resource group cluster-wtcln-rg: compute.VirtualMachinesClient#Delete: Failure sending request: StatusCode=403 -- Original Error: Code="AuthorizationFailed" Message="The client '9b1463d5-687a-4397-a17d-b53e20961ee7' with object id '9b1463d5-687a-4397-a17d-b53e20961ee7' does not have authorization to perform action 'Microsoft.Compute/virtualMachines/delete' over scope '/subscriptions/117eba7c-c7b7-43a6-b9d6-d0f257dd71a5/resourceGroups/cluster-wtcln-rg/providers/Microsoft.Compute/virtualMachines/cluster-wtcln-worker-eastus-4shjw' or the scope is invalid. If access was recently granted, please refresh your credentials."

2020-06-03T05:11:30.956944884Z E0603 05:11:30.956903       1 actuator.go:84] Machine error: failed to delete machine "cluster-wtcln-worker-eastus-fnhsj": failed to delete machine: failed to delete vm cluster-wtcln-worker-eastus-fnhsj in resource group cluster-wtcln-rg: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/117eba7c-c7b7-43a6-b9d6-d0f257dd71a5/resourceGroups/cluster-wtcln-rg/providers/Microsoft.Compute/virtualMachines/cluster-wtcln-worker-eastus-fnhsj?api-version=2018-10-01: StatusCode=0 -- Original Error: adal: Failed to execute the refresh request. Error = 'Post https://login.microsoftonline.com/4087a2a7-6506-40c1-86b4-1d0404c4969e/oauth2/token?api-version=1.0: dial tcp: lookup login.microsoftonline.com on 172.30.0.10:53: read udp 10.128.0.4:59119->172.30.0.10:53: i/o timeout'

Have the perms referenced by azure-cloud-credentials been manipulated out of band?

Can you please open a new BZ to track and discuss this?
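
The two log lines above fail for different reasons: the first is a 403 AuthorizationFailed response from Azure, the second is a DNS timeout while refreshing the token. As a rough triage aid, here is a minimal Python sketch that separates the two kinds of failure. The helper name `classify`, the cause labels, and the shortened sample lines are illustrative assumptions, not part of any OpenShift tooling.

```python
import re

# Shortened copies of the two actuator.go log lines quoted in this comment.
LOG_LINES = [
    'E0603 05:11:20.873381 1 actuator.go:84] Machine error: failed to delete machine '
    '"cluster-wtcln-worker-eastus-4shjw": failed to delete machine: '
    'compute.VirtualMachinesClient#Delete: Failure sending request: '
    'StatusCode=403 -- Original Error: Code="AuthorizationFailed"',
    'E0603 05:11:30.956903 1 actuator.go:84] Machine error: failed to delete machine '
    '"cluster-wtcln-worker-eastus-fnhsj": failed to delete machine: '
    'azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token: '
    'StatusCode=0 -- Original Error: adal: Failed to execute the refresh request: '
    'dial tcp: lookup login.microsoftonline.com on 172.30.0.10:53: i/o timeout',
]

MACHINE_RE = re.compile(r'failed to delete machine "([^"]+)"')
STATUS_RE = re.compile(r'StatusCode=(\d+)')
ERROR_CODE_RE = re.compile(r'Code="([^"]+)"')


def classify(line):
    """Return the machine name, HTTP status, and a coarse cause for one log line.

    The cause labels ("authorization", "network", "unknown") are hypothetical
    categories chosen for this sketch.
    """
    machine = MACHINE_RE.search(line)
    status = STATUS_RE.search(line)
    error_code = ERROR_CODE_RE.search(line)

    if error_code and error_code.group(1) == "AuthorizationFailed":
        cause = "authorization"
    elif "i/o timeout" in line:
        cause = "network"
    else:
        cause = "unknown"

    return {
        "machine": machine.group(1) if machine else None,
        "status": int(status.group(1)) if status else None,
        "cause": cause,
    }


if __name__ == "__main__":
    for line in LOG_LINES:
        print(classify(line))
```

Only the "authorization" entries point at the credentials; a "network" entry says the request never reached Azure, so it tells us nothing about permissions.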

Comment 15 W. Trevor King 2020-06-26 18:31:35 UTC
The "If access was recently granted, please refresh your credentials" error seems to have moved to bug 1846292.