Bug 1884247 - Master node machine's gone Phase Failed
Summary: Master node machine's gone Phase Failed
Keywords:
Status: CLOSED DUPLICATE of bug 1882169
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Alberto
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-10-01 12:42 UTC by Lorenzo Dalrio
Modified: 2020-10-01 12:53 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-01 12:53:37 UTC
Target Upstream Version:



Description Lorenzo Dalrio 2020-10-01 12:42:44 UTC
User-Agent:       Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36
Build Identifier: 

After a failed VM restart from the Azure console, machine-api reports one of the master nodes as Phase Failed:

$ oc get machine -n openshift-machine-api
NAME                                            PHASE     TYPE              REGION       ZONE   AGE
ocp-dev-westeu-9qm9r-master-0                   Failed    Standard_D4s_v3   westeurope   1      309d
ocp-dev-westeu-9qm9r-master-1                   Running   Standard_D4s_v3   westeurope   3      309d
ocp-dev-westeu-9qm9r-master-2                   Running   Standard_D4s_v3   westeurope   2      309d
ocp-dev-westeu-9qm9r-worker-westeurope1-mwft7   Running   Standard_D4s_v3   westeurope   1      309d
ocp-dev-westeu-9qm9r-worker-westeurope2-qk2rc   Running   Standard_D4s_v3   westeurope   2      309d
ocp-dev-westeu-9qm9r-worker-westeurope3-9npcw   Running   Standard_D4s_v3   westeurope   3      309d

In the machine-controller container's log in the openshift-machine-api namespace we found this:

I1001 11:59:15.041452       1 controller.go:169] ocp-dev-westeu-9qm9r-master-0: reconciling Machine
W1001 11:59:15.041551       1 controller.go:266] ocp-dev-westeu-9qm9r-master-0: machine has gone "Failed" phase. It won't reconcile
I1001 11:59:15.041776       1 controller.go:282] controller-runtime/controller "msg"="Successfully Reconciled"  "controller"="machine_controller" "request"={"Namespace":"openshift-machine-api","Name":"ocp-dev-westeu-9qm9r-master-0"}
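
For longer logs, the warning can be picked out mechanically. The following is a minimal sketch, not part of the original report: it assumes the lines follow the klog header layout seen above (severity letter, date, time, thread id, file:line, message), and the helper name `warnings_in` is hypothetical.

```python
import re

# klog header: <severity><mmdd> <hh:mm:ss.uuuuuu> <thread id> <file>:<line>] <message>
KLOG_LINE = re.compile(
    r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\s+(\d+) ([^:\s]+):(\d+)\] (.*)$"
)

# Two of the lines quoted in this report, used as sample input.
LOG = """\
I1001 11:59:15.041452       1 controller.go:169] ocp-dev-westeu-9qm9r-master-0: reconciling Machine
W1001 11:59:15.041551       1 controller.go:266] ocp-dev-westeu-9qm9r-master-0: machine has gone "Failed" phase. It won't reconcile
"""


def warnings_in(log_text):
    """Return (file, line, message) for every line at Warning severity or above."""
    found = []
    for raw in log_text.splitlines():
        match = KLOG_LINE.match(raw.strip())
        if match and match.group(1) in "WEF":
            found.append((match.group(5), int(match.group(6)), match.group(7)))
    return found


for source_file, line_no, message in warnings_in(LOG):
    print(f"{source_file}:{line_no}: {message}")
```

Lines that do not match the header layout are skipped silently, so continuation lines or differently formatted output would need separate handling.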

The node is working as expected though:

$ oc get node
NAME                                            STATUS   ROLES    AGE    VERSION
ocp-dev-westeu-9qm9r-master-0                   Ready    master   309d   v1.18.3+47c0e71
ocp-dev-westeu-9qm9r-master-1                   Ready    master   309d   v1.18.3+47c0e71
ocp-dev-westeu-9qm9r-master-2                   Ready    master   309d   v1.18.3+47c0e71
ocp-dev-westeu-9qm9r-worker-westeurope1-mwft7   Ready    worker   309d   v1.18.3+47c0e71
ocp-dev-westeu-9qm9r-worker-westeurope2-qk2rc   Ready    worker   309d   v1.18.3+47c0e71
ocp-dev-westeu-9qm9r-worker-westeurope3-9npcw   Ready    worker   309d   v1.18.3+47c0e71
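
The mismatch between the two listings can be checked mechanically. Below is a minimal sketch, not something used in the original report: it parses the plain-text output of the two `oc get` commands and lists machines whose phase is not Running while the node of the same name is Ready. It assumes that Machine and Node names match (as they do in the output above) and that no table cell is empty; the function names are hypothetical.

```python
# Abbreviated copies of the two listings from this report, used as sample input.
MACHINES = """\
NAME                                            PHASE     TYPE              REGION       ZONE   AGE
ocp-dev-westeu-9qm9r-master-0                   Failed    Standard_D4s_v3   westeurope   1      309d
ocp-dev-westeu-9qm9r-master-1                   Running   Standard_D4s_v3   westeurope   3      309d
"""

NODES = """\
NAME                                            STATUS   ROLES    AGE    VERSION
ocp-dev-westeu-9qm9r-master-0                   Ready    master   309d   v1.18.3+47c0e71
ocp-dev-westeu-9qm9r-master-1                   Ready    master   309d   v1.18.3+47c0e71
"""


def parse_table(text):
    """Parse whitespace-separated table output into a list of dicts keyed by header."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    # Assumption: every cell is non-empty, so a plain split lines up with the header.
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def failed_but_ready(machines_text, nodes_text):
    """Names of machines not in the Running phase whose same-named node is Ready."""
    node_status = {row["NAME"]: row["STATUS"] for row in parse_table(nodes_text)}
    return [
        row["NAME"]
        for row in parse_table(machines_text)
        if row["PHASE"] != "Running" and node_status.get(row["NAME"]) == "Ready"
    ]


print(failed_but_ready(MACHINES, NODES))
```

With the sample input this reports only ocp-dev-westeu-9qm9r-master-0, which matches what we see on the cluster.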

Reproducible: Always

IPI cluster in the Azure westeurope region.

Comment 1 Joel Speed 2020-10-01 12:48:14 UTC
This seems to be pretty much identical to https://bugzilla.redhat.com/show_bug.cgi?id=1882169. Are you happy to mark this as a duplicate? (I think the phase transition happens because we are seeing some state other than the ones that are currently allowed.)

Comment 2 Lorenzo Dalrio 2020-10-01 12:53:37 UTC
I agree with you, closing as a duplicate of #1882169.

*** This bug has been marked as a duplicate of bug 1882169 ***

