1884247 – Master node machine's gone Phase Failed

Summary: Master node machine's gone Phase Failed

Keywords:
Status:	CLOSED DUPLICATE of bug 1882169
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.5
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Alberto
QA Contact:	sunzhaohua
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-10-01 12:42 UTC by Lorenzo Dalrio
Modified:	2020-10-01 12:53 UTC
CC List:	0 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-10-01 12:53:37 UTC
Target Upstream Version:
Embargoed:

Description Lorenzo Dalrio 2020-10-01 12:42:44 UTC

User-Agent:       Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36
Build Identifier: 

After a failed VM restart from the Azure console, machine-api reports one of the master machines in the Failed phase:

$ oc get machine -n openshift-machine-api
NAME                                            PHASE     TYPE              REGION       ZONE   AGE
ocp-dev-westeu-9qm9r-master-0                   Failed    Standard_D4s_v3   westeurope   1      309d
ocp-dev-westeu-9qm9r-master-1                   Running   Standard_D4s_v3   westeurope   3      309d
ocp-dev-westeu-9qm9r-master-2                   Running   Standard_D4s_v3   westeurope   2      309d
ocp-dev-westeu-9qm9r-worker-westeurope1-mwft7   Running   Standard_D4s_v3   westeurope   1      309d
ocp-dev-westeu-9qm9r-worker-westeurope2-qk2rc   Running   Standard_D4s_v3   westeurope   2      309d
ocp-dev-westeu-9qm9r-worker-westeurope3-9npcw   Running   Standard_D4s_v3   westeurope   3      309d
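With six machines the Failed entry is easy to spot, but on a larger cluster a small filter helps. A minimal sketch, assuming the column layout shown above (NAME first, PHASE second) and a non-empty PHASE on every row; `non_running_machines` is a hypothetical helper name, not an `oc` feature:

```shell
#!/bin/sh
# non_running_machines: hypothetical helper, not part of oc.
# Reads `oc get machine` table output on stdin, skips the header row,
# and prints NAME and PHASE for every machine whose PHASE is not "Running".
# Assumes NAME is column 1 and PHASE is column 2 with no empty PHASE cells
# (an empty cell would shift the remaining columns left).
non_running_machines() {
  awk 'NR > 1 && $2 != "Running" { print $1, $2 }'
}

# Usage against a live cluster:
#   oc get machine -n openshift-machine-api | non_running_machines
```

Parsing the printed table is fragile; reading the Machine's `.status.phase` field through `oc get ... -o jsonpath` avoids depending on column positions.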

In the log of the machine-controller container in the openshift-machine-api namespace, we found the following:

I1001 11:59:15.041452       1 controller.go:169] ocp-dev-westeu-9qm9r-master-0: reconciling Machine
W1001 11:59:15.041551       1 controller.go:266] ocp-dev-westeu-9qm9r-master-0: machine has gone "Failed" phase. It won't reconcile
I1001 11:59:15.041776       1 controller.go:282] controller-runtime/controller "msg"="Successfully Reconciled"  "controller"="machine_controller" "request"={"Namespace":"openshift-machine-api","Name":"ocp-dev-westeu-9qm9r-master-0"}
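As the warning states, the controller stops reconciling a Machine once it is in the Failed phase. To list the machines the controller has given up on, the warning lines can be extracted from the log. A sketch; `failed_phase_machines` is a hypothetical helper, and the message text it matches is copied from the log above, so it may differ in other releases:

```shell
#!/bin/sh
# failed_phase_machines: hypothetical helper.
# Reads machine-controller log lines on stdin and prints each distinct
# machine name that appears in a line of the form
#   ...] <machine-name>: machine has gone "Failed" phase. ...
failed_phase_machines() {
  grep 'machine has gone "Failed" phase' \
    | sed 's/.*\] \([^:]*\): machine has gone.*/\1/' \
    | sort -u
}

# Usage, assuming the controllers run in the machine-api-controllers
# deployment (adjust the name if your cluster differs):
#   oc logs -n openshift-machine-api deployment/machine-api-controllers \
#     -c machine-controller | failed_phase_machines
```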

The corresponding node, however, is working as expected:

$ oc get node
NAME                                            STATUS   ROLES    AGE    VERSION
ocp-dev-westeu-9qm9r-master-0                   Ready    master   309d   v1.18.3+47c0e71
ocp-dev-westeu-9qm9r-master-1                   Ready    master   309d   v1.18.3+47c0e71
ocp-dev-westeu-9qm9r-master-2                   Ready    master   309d   v1.18.3+47c0e71
ocp-dev-westeu-9qm9r-worker-westeurope1-mwft7   Ready    worker   309d   v1.18.3+47c0e71
ocp-dev-westeu-9qm9r-worker-westeurope2-qk2rc   Ready    worker   309d   v1.18.3+47c0e71
ocp-dev-westeu-9qm9r-worker-westeurope3-9npcw   Ready    worker   309d   v1.18.3+47c0e71
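The two listings together show the mismatch this report is about: the Machine is Failed while its Node is Ready. A sketch that flags that combination, assuming machine and node names are identical (true for this Azure cluster, not on every platform) and using the hypothetical helper name `mismatched_machines`:

```shell
#!/bin/sh
# mismatched_machines: hypothetical helper.
#   $1: file containing `oc get machine -n openshift-machine-api` output
#   $2: file containing `oc get node` output
# Prints "NAME MACHINE_PHASE NODE_STATUS" for every machine whose phase is
# not "Running" while the node with the same name reports "Ready".
mismatched_machines() {
  awk 'FNR == 1 { next }                        # skip both header rows
       NR == FNR { phase[$1] = $2; next }       # first file: remember phases
       ($1 in phase) && phase[$1] != "Running" && $2 == "Ready" {
         print $1, phase[$1], $2
       }' "$1" "$2"
}

# Usage in bash (process substitution):
#   mismatched_machines <(oc get machine -n openshift-machine-api) <(oc get node)
```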

Reproducible: Always

IPI cluster on Azure, westeurope region.

Comment 1 Joel Speed 2020-10-01 12:48:14 UTC

This seems to be nearly identical to https://bugzilla.redhat.com/show_bug.cgi?id=1882169. Are you happy to mark this as a duplicate? (I think the phase transition happens because we are seeing a state other than the ones that are currently allowed.)
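The hypothesis above, that the phase flips to Failed when the controller sees a state outside an allowed set, can be illustrated with a toy classifier. This is not the machine-api source: `classify_vm_state` and the contents of `ALLOWED_STATES` are assumptions chosen for illustration, and the list the real controller uses is not given in this report:

```shell
#!/bin/sh
# Toy illustration of the hypothesis in Comment 1; NOT the machine-api code.
# ALLOWED_STATES is an assumed example list, not the controller's real one.
ALLOWED_STATES="Creating Updating Succeeded"

# classify_vm_state: hypothetical helper. Prints "ok" when the given state
# is in ALLOWED_STATES and "Failed" for anything else.
classify_vm_state() {
  for allowed in $ALLOWED_STATES; do
    if [ "$1" = "$allowed" ]; then
      echo "ok"
      return 0
    fi
  done
  # Any unlisted state falls through to a terminal "Failed" result,
  # mirroring the behaviour suspected in the comment above.
  echo "Failed"
}

# Example:
#   classify_vm_state Succeeded   # prints ok
#   classify_vm_state Unknown     # prints Failed
```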

Comment 2 Lorenzo Dalrio 2020-10-01 12:53:37 UTC

I agree; closing as a duplicate of bug 1882169.

*** This bug has been marked as a duplicate of bug 1882169 ***