
Bug 1759659

Summary:	[azure] MachineWithNoRunningPhase firing even when machines are running
Product:	OpenShift Container Platform	Reporter:	Abhinav Dahiya <adahiya>
Component:	Cloud Compute	Assignee:	Alberto <agarcial>
Cloud Compute sub component:	Other Providers	QA Contact:	Jianwei Hou <jhou>
Status:	CLOSED CURRENTRELEASE	Docs Contact:
Severity:	medium
Priority:	unspecified	CC:	agarcial, inecas, jhou, vjaypurk, vlaad, wking, xtian, zhsun
Version:	4.3.0
Target Milestone:	---
Target Release:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2020-05-14 14:20:51 UTC	Type:	Bug

Description Abhinav Dahiya 2019-10-08 19:27:16 UTC

Description of problem:


Example failing CI run: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2470/pull-ci-openshift-installer-master-e2e-azure/213

The test failed because MachineWithNoRunningPhase alerts were firing, even though the machines seem to be up.

Comment 1 Alberto 2019-10-08 21:11:29 UTC

https://github.com/openshift/cluster-api-provider-azure/pull/85

Comment 3 Jianwei Hou 2019-10-22 04:59:05 UTC

MachineWithNoRunningPhase fires while a machine is being provisioned. The alert clears soon after the machine reaches the 'Running' phase.

I think this is working correctly, so moving to verified.
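
A rough sketch of the behaviour described above, assuming the alert condition is simply "the machine's phase is anything other than Running". The actual rule expression and any hold-off duration are not shown in this bug, and the function name and the sample machine names are hypothetical.

```python
def machines_keeping_alert_active(machines):
    """Return the names of machines whose phase is not 'Running'.

    `machines` is an iterable of (name, phase) pairs. Assumption: any
    non-Running phase (Provisioning, Provisioned, Deleting, Failed, ...)
    keeps the alert active; the real rule may be more selective.
    """
    return [name for name, phase in machines if phase != "Running"]


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the CI run.
    sample = [
        ("worker-a", "Running"),
        ("worker-b", "Provisioning"),
    ]
    print(machines_keeping_alert_active(sample))
```

Under this assumption, a freshly created machine shows up in the result until it reaches Running, which matches the fire-then-clear behaviour observed during verification.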

Comment 10 Alberto 2020-05-22 07:01:31 UTC

Something is preventing those machines from terminating gracefully. It might be, e.g., PDBs or something else.
Can we please get must-gather logs?

Comment 13 Alberto 2020-06-08 09:19:54 UTC

Hey Vedanti, the alert is firing legitimately because the machines are stuck "deleting". From the logs, it seems the credentials in the secret referenced by those machines lack permission to perform the delete operation:

2020-06-03T05:11:20.873424112Z E0603 05:11:20.873381       1 actuator.go:84] Machine error: failed to delete machine "cluster-wtcln-worker-eastus-4shjw": failed to delete machine: failed to delete vm cluster-wtcln-worker-eastus-4shjw in resource group cluster-wtcln-rg: compute.VirtualMachinesClient#Delete: Failure sending request: StatusCode=403 -- Original Error: Code="AuthorizationFailed" Message="The client '9b1463d5-687a-4397-a17d-b53e20961ee7' with object id '9b1463d5-687a-4397-a17d-b53e20961ee7' does not have authorization to perform action 'Microsoft.Compute/virtualMachines/delete' over scope '/subscriptions/117eba7c-c7b7-43a6-b9d6-d0f257dd71a5/resourceGroups/cluster-wtcln-rg/providers/Microsoft.Compute/virtualMachines/cluster-wtcln-worker-eastus-4shjw' or the scope is invalid. If access was recently granted, please refresh your credentials."

2020-06-03T05:11:30.956944884Z E0603 05:11:30.956903       1 actuator.go:84] Machine error: failed to delete machine "cluster-wtcln-worker-eastus-fnhsj": failed to delete machine: failed to delete vm cluster-wtcln-worker-eastus-fnhsj in resource group cluster-wtcln-rg: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/117eba7c-c7b7-43a6-b9d6-d0f257dd71a5/resourceGroups/cluster-wtcln-rg/providers/Microsoft.Compute/virtualMachines/cluster-wtcln-worker-eastus-fnhsj?api-version=2018-10-01: StatusCode=0 -- Original Error: adal: Failed to execute the refresh request. Error = 'Post https://login.microsoftonline.com/4087a2a7-6506-40c1-86b4-1d0404c4969e/oauth2/token?api-version=1.0: dial tcp: lookup login.microsoftonline.com on 172.30.0.10:53: read udp 10.128.0.4:59119->172.30.0.10:53: i/o timeout'

Have the permissions of the credentials referenced by azure-cloud-credentials been changed out of band?

Can you please open a new BZ to track and discuss this?
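
For triage, note that the two log lines quoted in this comment fail in different ways: the first is a 403 AuthorizationFailed from the Azure API, while the second is a token refresh that failed on a DNS lookup timeout. A minimal sketch for sorting such lines, assuming the message shapes seen in those two lines only; the function name and category labels are hypothetical and not part of any actuator code.

```python
import re

# Extracts the machine name from 'failed to delete machine "<name>"'.
_MACHINE_RE = re.compile(r'failed to delete machine "([^"]+)"')

# Checked in order; the first matching pattern decides the category.
# Patterns are derived from the two log lines quoted in this comment.
_PATTERNS = [
    ("authorization_failed",
     re.compile(r'StatusCode=403|Code="AuthorizationFailed"')),
    ("dns_timeout",
     re.compile(r"lookup \S+ on \S+.*i/o timeout")),
    ("token_refresh_failed",
     re.compile(r"Failed to refresh the Token")),
]


def classify_delete_error(line):
    """Return (machine_name, category) for one deletion error line.

    machine_name is None when the line does not name a machine, and
    category is "unknown" when no pattern matches.
    """
    match = _MACHINE_RE.search(line)
    machine = match.group(1) if match else None
    for category, pattern in _PATTERNS:
        if pattern.search(line):
            return machine, category
    return machine, "unknown"
```

Feeding the actuator log through this function line by line would separate the machines that hit a genuine permissions problem from those that hit a transient network or DNS failure.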

Comment 15 W. Trevor King 2020-06-26 18:31:35 UTC

The "If access was recently granted, please refresh your credentials" error seems to have moved to bug 1846292.