
Bug 1884247

Summary: Master node machine's gone Phase Failed
Product: OpenShift Container Platform Reporter: Lorenzo Dalrio <lorenzo.dalrio>
Component: Cloud Compute Assignee: Alberto <agarcial>
Cloud Compute sub component: Other Providers QA Contact: sunzhaohua <zhsun>
Status: CLOSED DUPLICATE Docs Contact:
Severity: medium    
Priority: unspecified    
Version: 4.5   
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Last Closed: 2020-10-01 12:53:37 UTC

Description Lorenzo Dalrio 2020-10-01 12:42:44 UTC

After a failed VM restart from the Azure console, machine-api reports the Machine for one of the master nodes in phase Failed:

$ oc get machine -n openshift-machine-api
NAME                                            PHASE     TYPE              REGION       ZONE   AGE
ocp-dev-westeu-9qm9r-master-0                   Failed    Standard_D4s_v3   westeurope   1      309d
ocp-dev-westeu-9qm9r-master-1                   Running   Standard_D4s_v3   westeurope   3      309d
ocp-dev-westeu-9qm9r-master-2                   Running   Standard_D4s_v3   westeurope   2      309d
ocp-dev-westeu-9qm9r-worker-westeurope1-mwft7   Running   Standard_D4s_v3   westeurope   1      309d
ocp-dev-westeu-9qm9r-worker-westeurope2-qk2rc   Running   Standard_D4s_v3   westeurope   2      309d
ocp-dev-westeu-9qm9r-worker-westeurope3-9npcw   Running   Standard_D4s_v3   westeurope   3      309d

In the machine-controller container's log in the openshift-machine-api namespace, we found this:

I1001 11:59:15.041452       1 controller.go:169] ocp-dev-westeu-9qm9r-master-0: reconciling Machine
W1001 11:59:15.041551       1 controller.go:266] ocp-dev-westeu-9qm9r-master-0: machine has gone "Failed" phase. It won't reconcile
I1001 11:59:15.041776       1 controller.go:282] controller-runtime/controller "msg"="Successfully Reconciled"  "controller"="machine_controller" "request"={"Namespace":"openshift-machine-api","Name":"ocp-dev-westeu-9qm9r-master-0"}
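
For anyone triaging a longer log, here is a minimal sketch of pulling the affected machine names out of the controller output. It assumes the standard klog line header (severity letter, MMDD, time, thread id, file:line) and the exact warning text shown above. The function name `failed_machines` and the sample variable are made up for illustration and are not part of machine-api.

```python
import re

# klog header: severity letter, MMDD, time, thread id, "file:line]", then the message.
KLOG_LINE = re.compile(
    r'^(?P<severity>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) '
    r'(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+(?P<thread>\d+) '
    r'(?P<source>[^\]]+)\] (?P<message>.*)$'
)

# Warning text copied from the log excerpt in this report (assumed stable).
FAILED_MESSAGE = re.compile(r'^(?P<machine>\S+): machine has gone "Failed" phase')


def failed_machines(log_text):
    """Return machine names reported as Failed, in first-seen order, without duplicates."""
    names = []
    for line in log_text.splitlines():
        header = KLOG_LINE.match(line.strip())
        # Only warning-level lines carry the "Failed" phase message in the excerpt.
        if header is None or header.group("severity") != "W":
            continue
        hit = FAILED_MESSAGE.match(header.group("message"))
        if hit and hit.group("machine") not in names:
            names.append(hit.group("machine"))
    return names


# Two lines taken from the log excerpt above.
SAMPLE_LOG = """
I1001 11:59:15.041452       1 controller.go:169] ocp-dev-westeu-9qm9r-master-0: reconciling Machine
W1001 11:59:15.041551       1 controller.go:266] ocp-dev-westeu-9qm9r-master-0: machine has gone "Failed" phase. It won't reconcile
"""

if __name__ == "__main__":
    print(failed_machines(SAMPLE_LOG))
```

Matching on the message text is brittle if the wording changes between releases, so treat this as a triage aid only.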

The node is working as expected though:

$ oc get node
NAME                                            STATUS   ROLES    AGE    VERSION
ocp-dev-westeu-9qm9r-master-0                   Ready    master   309d   v1.18.3+47c0e71
ocp-dev-westeu-9qm9r-master-1                   Ready    master   309d   v1.18.3+47c0e71
ocp-dev-westeu-9qm9r-master-2                   Ready    master   309d   v1.18.3+47c0e71
ocp-dev-westeu-9qm9r-worker-westeurope1-mwft7   Ready    worker   309d   v1.18.3+47c0e71
ocp-dev-westeu-9qm9r-worker-westeurope2-qk2rc   Ready    worker   309d   v1.18.3+47c0e71
ocp-dev-westeu-9qm9r-worker-westeurope3-9npcw   Ready    worker   309d   v1.18.3+47c0e71
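
The mismatch between the two listings can be checked mechanically. The sketch below parses saved `oc get machine` and `oc get node` output and flags machines that are not Running while the node of the same name is Ready. It assumes that Machine and Node names match, as they do in this cluster, and that no column value is empty or contains spaces. The helper names `parse_table` and `phase_mismatches` are hypothetical.

```python
def parse_table(text):
    """Parse whitespace-aligned `oc get` output into a list of dicts keyed by column header."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    # Assumes every cell is non-empty and contains no spaces.
    return [dict(zip(header, line.split())) for line in lines[1:]]


def phase_mismatches(machine_table, node_table):
    """Return (name, machine phase, node status) for non-Running machines whose node is Ready."""
    # Assumption: a Machine and its Node share the same name.
    nodes = {row["NAME"]: row["STATUS"] for row in parse_table(node_table)}
    return [
        (row["NAME"], row["PHASE"], nodes[row["NAME"]])
        for row in parse_table(machine_table)
        if row["PHASE"] != "Running" and nodes.get(row["NAME"]) == "Ready"
    ]


# Rows copied from the outputs in this report (two per table for brevity).
MACHINES = """
NAME                                            PHASE     TYPE              REGION       ZONE   AGE
ocp-dev-westeu-9qm9r-master-0                   Failed    Standard_D4s_v3   westeurope   1      309d
ocp-dev-westeu-9qm9r-master-1                   Running   Standard_D4s_v3   westeurope   3      309d
"""

NODES = """
NAME                                            STATUS   ROLES    AGE    VERSION
ocp-dev-westeu-9qm9r-master-0                   Ready    master   309d   v1.18.3+47c0e71
ocp-dev-westeu-9qm9r-master-1                   Ready    master   309d   v1.18.3+47c0e71
"""

if __name__ == "__main__":
    for name, phase, status in phase_mismatches(MACHINES, NODES):
        print(f"{name}: machine phase {phase}, node status {status}")
```

On clusters where Machine and Node names differ, the link would have to come from the Machine's node reference instead of a name match.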

Reproducible: Always

IPI cluster in the Azure westeurope region.

Comment 1 Joel Speed 2020-10-01 12:48:14 UTC
This seems to be pretty much identical to https://bugzilla.redhat.com/show_bug.cgi?id=1882169. Are you happy to mark this as a duplicate? (I think the phase transition happens because we are seeing some state other than the ones that are currently allowed.)

Comment 2 Lorenzo Dalrio 2020-10-01 12:53:37 UTC
I agree with you, closing as a duplicate of #1882169.

*** This bug has been marked as a duplicate of bug 1882169 ***