Bug 1759659 - [azure] MachineWithNoRunningPhase firing even when machines are running
Summary: [azure] MachineWithNoRunningPhase firing even when machines are running
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.3.0
Assignee: Alberto
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-10-08 19:27 UTC by Abhinav Dahiya
Modified: 2023-10-06 18:38 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-14 14:20:51 UTC
Target Upstream Version:
Embargoed:



Description Abhinav Dahiya 2019-10-08 19:27:16 UTC
Description of problem:


Example failing CI run: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2470/pull-ci-openshift-installer-master-e2e-azure/213

The test failed with MachineWithNoRunningPhase alerts firing, even though the machines seem to be up.

Comment 3 Jianwei Hou 2019-10-22 04:59:05 UTC
MachineWithNoRunningPhase fires when a machine is provisioned. The alert clears soon after the machine reaches the 'Running' phase.

I think this is working correctly, so moving to verified.
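
For anyone checking a cluster by hand, here is a minimal sketch of the condition described above: it lists machines whose phase is not 'Running'. It assumes input shaped like the output of `oc get machines -n openshift-machine-api -o json` (an `items` list whose entries carry `metadata.name` and `status.phase`). The machine names in the sample are hypothetical, and the alert's actual Prometheus rule is not reproduced here.

```python
import json
from typing import List, Tuple


def machines_not_running(machine_list_json: str) -> List[Tuple[str, str]]:
    """Return (name, phase) for every Machine whose status.phase is not 'Running'."""
    items = json.loads(machine_list_json).get("items", [])
    result = []
    for machine in items:
        name = machine.get("metadata", {}).get("name", "<unknown>")
        # A freshly created Machine may not have a phase set yet.
        phase = machine.get("status", {}).get("phase") or "<none>"
        if phase != "Running":
            result.append((name, phase))
    return result


# Hypothetical sample data, not taken from the CI run linked in this bug.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "worker-a"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "worker-b"}, "status": {"phase": "Provisioning"}},
        {"metadata": {"name": "worker-c"}, "status": {"phase": "Deleting"}},
        {"metadata": {"name": "worker-d"}, "status": {}},
    ]
})

if __name__ == "__main__":
    for name, phase in machines_not_running(SAMPLE):
        print(f"{name}: {phase}")
```

A machine that shows up here only briefly during provisioning matches the behaviour described in this comment; one that stays listed (for example in 'Deleting') is worth investigating.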

Comment 10 Alberto 2020-05-22 07:01:31 UTC
Something is preventing those machines from being gracefully terminated. That might be, e.g., PDBs or something else.
Can we please get must-gather logs?

Comment 13 Alberto 2020-06-08 09:19:54 UTC
Hey Vedanti, the alert is legitimately triggering since the machines are stuck "deleting". From the logs, it seems the secret referenced by those machines lacks the permissions to perform the deletion operation:

2020-06-03T05:11:20.873424112Z E0603 05:11:20.873381       1 actuator.go:84] Machine error: failed to delete machine "cluster-wtcln-worker-eastus-4shjw": failed to delete machine: failed to delete vm cluster-wtcln-worker-eastus-4shjw in resource group cluster-wtcln-rg: compute.VirtualMachinesClient#Delete: Failure sending request: StatusCode=403 -- Original Error: Code="AuthorizationFailed" Message="The client '9b1463d5-687a-4397-a17d-b53e20961ee7' with object id '9b1463d5-687a-4397-a17d-b53e20961ee7' does not have authorization to perform action 'Microsoft.Compute/virtualMachines/delete' over scope '/subscriptions/117eba7c-c7b7-43a6-b9d6-d0f257dd71a5/resourceGroups/cluster-wtcln-rg/providers/Microsoft.Compute/virtualMachines/cluster-wtcln-worker-eastus-4shjw' or the scope is invalid. If access was recently granted, please refresh your credentials."

2020-06-03T05:11:30.956944884Z E0603 05:11:30.956903       1 actuator.go:84] Machine error: failed to delete machine "cluster-wtcln-worker-eastus-fnhsj": failed to delete machine: failed to delete vm cluster-wtcln-worker-eastus-fnhsj in resource group cluster-wtcln-rg: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/117eba7c-c7b7-43a6-b9d6-d0f257dd71a5/resourceGroups/cluster-wtcln-rg/providers/Microsoft.Compute/virtualMachines/cluster-wtcln-worker-eastus-fnhsj?api-version=2018-10-01: StatusCode=0 -- Original Error: adal: Failed to execute the refresh request. Error = 'Post https://login.microsoftonline.com/4087a2a7-6506-40c1-86b4-1d0404c4969e/oauth2/token?api-version=1.0: dial tcp: lookup login.microsoftonline.com on 172.30.0.10:53: read udp 10.128.0.4:59119->172.30.0.10:53: i/o timeout'

Have the permissions referenced by azure-cloud-credentials been manipulated out of band?

Can you please open a new BZ to track and discuss this?
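
When triaging logs like the two entries quoted in this comment, it can help to separate the failure types. The first entry is a 403 with Code="AuthorizationFailed". The second is a token refresh that failed with StatusCode=0 because the DNS lookup for login.microsoftonline.com timed out. The sketch below is a rough, assumption-laden classifier: the regular expressions and category names are illustrative choices rather than part of any machine-api tooling, and the sample strings are shortened excerpts of the log lines above (elisions marked with "...").

```python
import re
from typing import Dict, Optional, Union

MACHINE_RE = re.compile(r'failed to delete machine "([^"]+)"')
STATUS_RE = re.compile(r"StatusCode=(\d+)")
AZURE_CODE_RE = re.compile(r'Code="([^"]+)"')


def classify(line: str) -> Dict[str, Union[str, int, None]]:
    """Extract the machine name and HTTP status, and guess the failure kind."""
    machine = MACHINE_RE.search(line)
    status = STATUS_RE.search(line)
    azure_code = AZURE_CODE_RE.search(line)

    name: Optional[str] = machine.group(1) if machine else None
    status_code: Optional[int] = int(status.group(1)) if status else None

    if azure_code and azure_code.group(1) == "AuthorizationFailed":
        kind = "authorization"   # credentials lack the required role/action
    elif "Failed to refresh the Token" in line:
        kind = "token_refresh"   # could not reach the login endpoint
    else:
        kind = "other"
    return {"machine": name, "status": status_code, "kind": kind}


# Shortened excerpts of the two log lines quoted above.
AUTH_LINE = (
    'Machine error: failed to delete machine "cluster-wtcln-worker-eastus-4shjw": '
    '... StatusCode=403 -- Original Error: Code="AuthorizationFailed" ...'
)
TOKEN_LINE = (
    'Machine error: failed to delete machine "cluster-wtcln-worker-eastus-fnhsj": '
    "... azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token "
    "... StatusCode=0 -- Original Error: adal: Failed to execute the refresh request. "
    "... i/o timeout"
)

if __name__ == "__main__":
    for line in (AUTH_LINE, TOKEN_LINE):
        print(classify(line))
```

Counting entries per kind across a must-gather log would show whether the stuck deletions are dominated by authorization failures or by network/DNS problems.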

Comment 15 W. Trevor King 2020-06-26 18:31:35 UTC
The "If access was recently granted, please refresh your credentials" error seems to have moved to bug 1846292.

