Bug 1957349 - [Azure] Machine object showing Failed phase even though the node is Ready and the VM is running properly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: dmoiseev
QA Contact: Milind Yadav
URL:
Whiteboard:
Depends On:
Blocks: 1989524
 
Reported: 2021-05-05 16:08 UTC by Aditya Deshpande
Modified: 2021-08-25 03:30 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, due to a strict check of the VM's 'ProvisioningState' value, a machine could undesirably go into the 'Failed' phase during its existence check. This check was relaxed; now only machines whose VMs have actually been deleted go into the 'Failed' phase during the existence check.
Clone Of:
Environment:
Last Closed: 2021-07-27 23:06:36 UTC
Target Upstream Version:
Embargoed:
dmoiseev: needinfo-


Attachments


Links
- GitHub: openshift/cluster-api-provider-azure pull 219 (open) - "Bug 1957349: Avoid machines going into failed phase if unexpected ProvisioningState appears" - last updated 2021-05-27 15:02:02 UTC
- Red Hat Product Errata: RHSA-2021:2438 - last updated 2021-07-27 23:06:57 UTC

Description Aditya Deshpande 2021-05-05 16:08:35 UTC
Description of problem:
In an OCP cluster set up on Azure, some Machine objects are showing the Failed phase.
Checking the OCP node status shows that the respective node is in Ready state, and from the Azure perspective the VM is running with the correct VMId (ProviderID).

~~~
# oc get machines -n openshift-machine-api
NAME                                       PHASE    TYPE              REGION      ZONE  AGE
awe-pp-xxx-ah97u-master-0                  Failed   Standard_D8s_v3   westeurope  1     88d
awe-pp-xxx-ah97u-master-1                  Running  Standard_D8s_v3   westeurope  2     88d
awe-pp-xxx-ah97u-master-2                  Running  Standard_D8s_v3   westeurope  3     88d
awe-pp-xxx-ah97u-worker-westeurope1-fmm25  Failed   Standard_F32s_v2  westeurope  1     88d
awe-pp-xxx-ah97u-worker-westeurope2-w22ln  Running  Standard_F32s_v2  westeurope  2     52d
awe-pp-xxx-ah97u-worker-westeurope3-s7vll  Running  Standard_F32s_v2  westeurope  3     88d

# oc get nodes
NAME                                       STATUS  ROLES   AGE  VERSION
awe-pp-xxx-ah97u-master-0                  Ready   master  88d  v1.19.0+2f3101c
awe-pp-xxx-ah97u-master-1                  Ready   master  88d  v1.19.0+2f3101c
awe-pp-xxx-ah97u-master-2                  Ready   master  88d  v1.19.0+2f3101c
awe-pp-xxx-ah97u-worker-westeurope1-fmm25  Ready   worker  88d  v1.19.0+2f3101c
awe-pp-xxx-ah97u-worker-westeurope2-w22ln  Ready   worker  52d  v1.19.0+2f3101c
awe-pp-xxx-ah97u-worker-westeurope3-s7vll  Ready   worker  88d  v1.19.0+2f3101c

~~~

- machine awe-pp-xxx-ah97u-master-0 is in phase: Failed
  Machine YAML:
    errorMessage: Can't find created instance.

- The machine-api-controllers logs show (see the sketch after this list):
~~~
2021-04-26T00:58:33.013973757Z I0426 00:58:33.013892       1 controller.go:170] awe-pp-xxx-ah97u-master-0: reconciling Machine
2021-04-26T00:58:33.013973757Z W0426 00:58:33.013918       1 controller.go:267] awe-pp-xxx-ah97u-master-0: machine has gone "Failed" phase. It won't reconcile
2021-04-26T00:58:33.014001757Z I0426 00:58:33.013974       1 controller.go:261] controller "msg"="Successfully Reconciled" "controller"="machine_controller" "name"="awe-pp-xxx-ah97u-master-0" "namespace"="openshift-machine-api"
~~~

- From the Azure point of view, the ProviderID in the node description matches the actual VM.
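
The "won't reconcile" warning in the log above reflects a gate in the machine controller: once a machine enters the Failed phase it is skipped on every subsequent reconcile, so the phase never self-corrects even while the node stays Ready. Below is a minimal, self-contained Go sketch of that gate; the type and function names are illustrative stand-ins, not the actual machine-api-operator code.

~~~
// Illustrative sketch only; Machine and reconcile are hypothetical
// stand-ins for the real machine controller types.
package main

import "fmt"

const phaseFailed = "Failed"

// Machine is a trimmed stand-in for the Machine API object.
type Machine struct {
	Name  string
	Phase string
}

// reconcile mirrors the behavior seen in the log: a machine that has
// entered the Failed phase is logged and skipped, never reconciled again.
func reconcile(m *Machine) {
	if m.Phase == phaseFailed {
		fmt.Printf("%s: machine has gone %q phase. It won't reconcile\n", m.Name, phaseFailed)
		return
	}
	fmt.Printf("%s: reconciling Machine\n", m.Name)
}

func main() {
	reconcile(&Machine{Name: "awe-pp-xxx-ah97u-master-0", Phase: phaseFailed})
	reconcile(&Machine{Name: "awe-pp-xxx-ah97u-master-1", Phase: "Running"})
}
~~~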


Version-Release number of selected component (if applicable):
OCP 4.6.21

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
Machine is in the Failed phase.

Expected results:
Machine should be in the Running phase.

Additional info:
The issue was seen after updating OCP from 4.6.12 to 4.6.21, at the time the node was rebooted.

Comment 2 Joel Speed 2021-05-06 10:34:47 UTC
This reminds me of a previous bug, but I can't find it right now.

What I believe is happening here is that the VM is passing through a state that our "exists" check is not aware of, and therefore the machine is going to Failed.

We need to work out exactly what that state is and add it as an allowed state in https://github.com/openshift/cluster-api-provider-azure/blob/b2eda16dd665ab39459c0b686c88ce2d0b97ec6a/pkg/cloud/azure/actuators/machine/reconciler.go#L386-L400
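
For illustration, here is a hedged Go sketch of the strict check described above next to the relaxed behavior the Doc Text describes for the eventual fix (PR 219). The state names, allowlist map, and function names are assumptions for illustration, not the provider's actual code; the real check lives at the reconciler.go link above.

~~~
// Hypothetical sketch; knownStates, existsStrict, existsRelaxed, and
// errNotFound are illustrative names, not the provider's real code.
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the Azure API's "VM not found" error.
var errNotFound = errors.New("vm not found")

// knownStates models the allowlist the strict check relied on: any
// ProvisioningState outside it made the existence check report false,
// driving the machine into the Failed phase ("Can't find created instance.").
var knownStates = map[string]bool{
	"Succeeded": true,
	"Updating":  true,
	"Creating":  true,
}

// existsStrict models the pre-fix behavior: an unexpected state, such
// as one observed transiently while a node reboots, reports false.
func existsStrict(state string) bool {
	return knownStates[state]
}

// existsRelaxed models the post-fix behavior: only a VM that is
// actually gone (the lookup returns not-found) reports false; the
// ProvisioningState no longer gates existence.
func existsRelaxed(lookupErr error) (bool, error) {
	if errors.Is(lookupErr, errNotFound) {
		return false, nil
	}
	if lookupErr != nil {
		return false, lookupErr
	}
	return true, nil
}

func main() {
	fmt.Println("strict, transient state exists:", existsStrict("Unknown")) // false -> Failed (the bug)
	ok, _ := existsRelaxed(nil)
	fmt.Println("relaxed, transient state exists:", ok) // true -> machine stays Running
}
~~~

The design change, per the Doc Text, is that existence is decided by whether the VM lookup finds the VM at all rather than by which ProvisioningState it reports, so a transient state observed during a reboot no longer strands the machine in Failed.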

Comment 34 errata-xmlrpc 2021-07-27 23:06:36 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

