Bug 1851977 - [vsphere] Machine phase should become 'Failed' when its instance is deleted from the vSphere client
Summary: [vsphere] Machine phase should become 'Failed' when its instance is deleted from the vSphere client
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Alexander Demicev
QA Contact: Milind Yadav
URL:
Whiteboard:
Depends On:
Blocks: 1869320
 
Reported: 2020-06-29 14:39 UTC by sunzhaohua
Modified: 2020-12-18 01:43 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1869320 (view as bug list)
Environment:
Last Closed: 2020-10-27 16:10:28 UTC
Target Upstream Version:
agarcial: needinfo+


Attachments
machine controller log (40.14 KB, text/plain)
2020-07-06 03:17 UTC, sunzhaohua


Links
Github openshift machine-api-operator pull 604: [vSphere] Reduce sync period to 10 minutes (closed, last updated 2021-01-28 11:31:49 UTC)
Red Hat Product Errata RHBA-2020:4196 (last updated 2020-10-27 16:10:55 UTC)

Description sunzhaohua 2020-06-29 14:39:12 UTC
Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-06-26-215024

How reproducible:
Always

Steps to Reproduce:
1. Power off a VM from the vSphere client
2. Delete the VM from disk (a scripted equivalent of steps 1-2 is sketched below)
3. Check the machine phase
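
For reference, steps 1 and 2 can also be scripted against vCenter with the govmomi library; this is only a sketch, and the vCenter URL, credentials and VM name below are placeholders:

// Sketch: scripted equivalent of the manual power-off and delete-from-disk
// steps, using the govmomi library. The vCenter URL, credentials and the
// VM inventory path are placeholders.
package main

import (
    "context"
    "log"
    "net/url"

    "github.com/vmware/govmomi"
    "github.com/vmware/govmomi/find"
)

func main() {
    ctx := context.Background()

    // Placeholder vCenter endpoint and credentials.
    u, err := url.Parse("https://user:pass@vcenter.example.com/sdk")
    if err != nil {
        log.Fatal(err)
    }
    c, err := govmomi.NewClient(ctx, u, true /* insecure */)
    if err != nil {
        log.Fatal(err)
    }

    finder := find.NewFinder(c.Client, false)
    dc, err := finder.DefaultDatacenter(ctx)
    if err != nil {
        log.Fatal(err)
    }
    finder.SetDatacenter(dc)

    // Placeholder VM name; use the machine's instance name.
    vm, err := finder.VirtualMachine(ctx, "zhsunvsphere629-z75kh-worker1-vspkl")
    if err != nil {
        log.Fatal(err)
    }

    // Step 1: power off the VM.
    if task, err := vm.PowerOff(ctx); err == nil {
        _ = task.Wait(ctx)
    }

    // Step 2: delete the VM from disk.
    if task, err := vm.Destroy(ctx); err == nil {
        _ = task.Wait(ctx)
    }
}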

Actual results:
The machine is still in the Running phase.

$ oc get node
NAME                                  STATUS     ROLES    AGE    VERSION
zhsunvsphere629-z75kh-master-0        Ready      master   130m   v1.18.3+f291db1
zhsunvsphere629-z75kh-master-1        Ready      master   130m   v1.18.3+f291db1
zhsunvsphere629-z75kh-master-2        Ready      master   130m   v1.18.3+f291db1
zhsunvsphere629-z75kh-worker-gm4x9    Ready      worker   80m    v1.18.3+f291db1
zhsunvsphere629-z75kh-worker-p7xlp    Ready      worker   80m    v1.18.3+f291db1
zhsunvsphere629-z75kh-worker1-vspkl   NotReady   worker   37m    v1.18.3+f291db1

$ oc get machine -o wide
NAME                                  PHASE     TYPE   REGION   ZONE   AGE    NODE                                  PROVIDERID                                       STATE
zhsunvsphere629-z75kh-master-0        Running                          158m   zhsunvsphere629-z75kh-master-0        vsphere://420bf774-c6d5-efeb-e0ad-23d35172b2ac   poweredOn
zhsunvsphere629-z75kh-master-1        Running                          158m   zhsunvsphere629-z75kh-master-1        vsphere://420baa76-6f3b-d0aa-3e33-96060c60cb89   poweredOn
zhsunvsphere629-z75kh-master-2        Running                          158m   zhsunvsphere629-z75kh-master-2        vsphere://420ba873-ec4e-b7b2-867f-62349371f0b3   poweredOn
zhsunvsphere629-z75kh-worker-gm4x9    Running                          114m   zhsunvsphere629-z75kh-worker-gm4x9    vsphere://420bff1c-9f5f-e217-2f7c-04bd9fd618f5   poweredOn
zhsunvsphere629-z75kh-worker-p7xlp    Running                          114m   zhsunvsphere629-z75kh-worker-p7xlp    vsphere://420bdd2f-d235-57db-3fce-b55505111cb8   poweredOn
zhsunvsphere629-z75kh-worker1-vspkl   Running                          67m    zhsunvsphere629-z75kh-worker1-vspkl   vsphere://420bba46-a730-2953-de19-67e0e78897b3   poweredOff


Expected results:
The machine phase is set to "Failed".

Additional info:

Comment 1 Alexander Demicev 2020-07-01 09:00:42 UTC
Hi,
I can't reproduce this bug; everything works as expected. Can you make sure that, after you deleted the VM from disk, the machine controller actually reconciled the machine? It should reconcile each machine roughly every 15 minutes.
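
For context, that periodic reconcile comes from the controller manager's resync interval. Below is a minimal, generic controller-runtime sketch of that knob, not the actual machine-api-operator wiring; the 10-minute value mirrors the sync-period PR linked on this bug:

// Sketch: the periodic "every N minutes" reconcile comes from the
// controller manager's resync interval. Generic controller-runtime
// example, not the actual machine-api-operator code.
package main

import (
    "log"
    "time"

    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
    syncPeriod := 10 * time.Minute

    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
        // Every cached object is re-queued at least this often, so a VM
        // deleted out-of-band is noticed within one sync period.
        SyncPeriod: &syncPeriod,
    })
    if err != nil {
        log.Fatal(err)
    }

    // Machine controller registration with mgr goes here (omitted).

    if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
        log.Fatal(err)
    }
}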

Comment 2 sunzhaohua 2020-07-06 03:16:29 UTC
4.5.0-0.nightly-2020-07-02-190154

After powering off a VM from the vSphere client, the machine controller reconciled the machine.
After deleting the VM from disk, the machine controller did not reconcile the machine; I waited for about 30 minutes.


# oc get machine -o wide
NAME                                PHASE     TYPE   REGION   ZONE   AGE     NODE                                PROVIDERID                                       STATE
zhsunvshpere73-gfbpr-master-0       Running                          2d17h   zhsunvshpere73-gfbpr-master-0       vsphere://422b8f34-078c-1798-b4f4-65580e1516d2   poweredOn
zhsunvshpere73-gfbpr-master-1       Running                          2d17h   zhsunvshpere73-gfbpr-master-1       vsphere://422b24b4-ede7-0a8f-3cef-f21f6c1cb45f   poweredOn
zhsunvshpere73-gfbpr-master-2       Running                          2d17h   zhsunvshpere73-gfbpr-master-2       vsphere://422ba8b4-3d4f-875b-e08a-20c6e33167f6   poweredOn
zhsunvshpere73-gfbpr-worker-dx9j8   Running                          2d17h   zhsunvshpere73-gfbpr-worker-dx9j8   vsphere://422b38f2-35c6-9a79-4339-0c3e750e4e0c   poweredOn
zhsunvshpere73-gfbpr-worker-gvn7j   Running                          2d17h   zhsunvshpere73-gfbpr-worker-gvn7j   vsphere://422bc235-ceeb-b3b1-a28a-bf3cb6310c1a   poweredOff

Comment 3 sunzhaohua 2020-07-06 03:17:39 UTC
Created attachment 1699980 [details]
machine controller log

Comment 4 Alberto 2020-07-09 16:26:22 UTC
Planning to clarify this during the next sprint.

Comment 6 Alexander Demicev 2020-08-04 12:51:34 UTC
Is it possible to check whether this bug appears on 4.6? Can I get access to your test environment, because I can't reproduce the bug? The machine goes to Failed when I delete the VM from disk.

I0804 12:39:52.456421       1 reconciler.go:268] ademicev-6hc7q-worker-q7pzt: reconciling powerstate annotation
I0804 12:39:52.457774       1 reconciler.go:708] ademicev-6hc7q-worker-q7pzt: Updating provider status
I0804 12:39:52.464439       1 machine_scope.go:102] ademicev-6hc7q-worker-q7pzt: patching machine

I0804 12:49:26.494885       1 controller.go:169] ademicev-6hc7q-worker-q7pzt: reconciling Machine
I0804 12:49:26.494915       1 actuator.go:83] ademicev-6hc7q-worker-q7pzt: actuator checking if machine exists
I0804 12:49:31.604472       1 session.go:113] Find template by instance uuid: 30d9461a-cae9-42d7-8369-859a421c6a3a
I0804 12:49:31.623139       1 reconciler.go:175] ademicev-6hc7q-worker-q7pzt: does not exist
I0804 12:49:31.623159       1 controller.go:424] ademicev-6hc7q-worker-q7pzt: going into phase "Failed"
I0804 12:49:31.652301       1 controller.go:169] ademicev-6hc7q-worker-q7pzt: reconciling Machine
I0804 12:49:31.652330       1 actuator.go:83] ademicev-6hc7q-worker-q7pzt: actuator checking if machine exists
I0804 12:49:31.660480       1 session.go:113] Find template by instance uuid: 30d9461a-cae9-42d7-8369-859a421c6a3a
I0804 12:49:31.687004       1 reconciler.go:175] ademicev-6hc7q-worker-q7pzt: does not exist
I0804 12:49:31.687022       1 controller.go:424] ademicev-6hc7q-worker-q7pzt: going into phase "Failed"
I0804 12:49:31.707896       1 controller.go:169] ademicev-6hc7q-worker-q7pzt: reconciling Machine
W0804 12:49:31.707922       1 controller.go:266] ademicev-6hc7q-worker-q7pzt: machine has gone "Failed" phase. It won't reconcile
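
The transition in this log can be summarized with the following simplified sketch; the types and names here are illustrative stand-ins, not the real machine-api-operator symbols:

// Sketch of the control flow visible in the log above: an existence check
// that returns false moves the machine into the "Failed" phase, after
// which it is no longer reconciled.
package main

import (
    "context"
    "fmt"
)

// machine is a simplified stand-in for the Machine API object.
type machine struct {
    Name  string
    Phase string
}

// actuator abstracts the provider-specific existence check, like the
// vSphere actuator's "checking if machine exists" step in the log.
type actuator interface {
    Exists(ctx context.Context, m *machine) (bool, error)
}

func reconcile(ctx context.Context, a actuator, m *machine) error {
    if m.Phase == "Failed" {
        // Mirrors: machine has gone "Failed" phase. It won't reconcile.
        return nil
    }

    exists, err := a.Exists(ctx, m)
    if err != nil {
        return err
    }
    if !exists {
        // Mirrors: "<machine>: does not exist" followed by going into
        // phase "Failed". Once Failed, the machine never recovers.
        m.Phase = "Failed"
        fmt.Printf("%s: going into phase %q\n", m.Name, m.Phase)
    }
    return nil
}

// gone is a stub actuator for a VM that was deleted from disk out-of-band.
type gone struct{}

func (gone) Exists(context.Context, *machine) (bool, error) { return false, nil }

func main() {
    m := &machine{Name: "ademicev-6hc7q-worker-q7pzt", Phase: "Running"}
    _ = reconcile(context.Background(), gone{}, m)
    _ = reconcile(context.Background(), gone{}, m) // second pass: already Failed, skipped
}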

Comment 7 sunzhaohua 2020-08-05 08:13:48 UTC
Alexander Demicev, I can't reproduce this bug on 4.6.
clusterversion: 4.6.0-0.nightly-2020-08-05-041346
# oc get machine -o wide
NAME                                 PHASE     TYPE   REGION   ZONE   AGE   NODE                                 PROVIDERID                                       STATE
zhsun85vsphere1-c7wnq-master-0       Running                          82m   zhsun85vsphere1-c7wnq-master-0       vsphere://422b0dd9-f3cf-c551-cb59-6c4e263c2855   poweredOn
zhsun85vsphere1-c7wnq-master-1       Running                          82m   zhsun85vsphere1-c7wnq-master-1       vsphere://422b2d75-2901-1eb7-ed24-c6cbc3791c02   poweredOn
zhsun85vsphere1-c7wnq-master-2       Running                          82m   zhsun85vsphere1-c7wnq-master-2       vsphere://422bc84f-2509-dad3-07ed-17e6afab3ed2   poweredOn
zhsun85vsphere1-c7wnq-worker-sfnj8   Running                          70m   zhsun85vsphere1-c7wnq-worker-sfnj8   vsphere://422bf3b8-1a9e-0852-d541-8f9f31ee6055   poweredOn
zhsun85vsphere1-c7wnq-worker-zbh2n   Failed                           70m   zhsun85vsphere1-c7wnq-worker-zbh2n   vsphere://422be640-7278-6637-e094-616fb3e61468   Unknown

Comment 8 Alexander Demicev 2020-08-17 13:48:09 UTC
Closing this BZ because the bug appears only on 4.5. All progress can be tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1869320

Comment 10 Joel Speed 2020-08-18 14:00:55 UTC
I've linked the PR that is being cherry-picked in the dependent PR. Once it's verified that this fixes the issue, we can move on with the backport.

Comment 13 Milind Yadav 2020-08-24 15:42:25 UTC
Validated on:
4.6.0-0.nightly-2020-08-23-214712

Steps:

Deleted the VM from vSphere and deleted it from disk.

The machine status became Failed after some time.

[miyadav@miyadav vsp]$ oc get machines -o wide --config vsp
NAME                            PHASE     TYPE   REGION   ZONE   AGE   NODE                            PROVIDERID                                       STATE
jima082401-hxvhx-master-0       Running                          9h    jima082401-hxvhx-master-0       vsphere://422be929-b6f7-c263-2e26-78fc44f17e8c   poweredOn
jima082401-hxvhx-master-1       Running                          9h    jima082401-hxvhx-master-1       vsphere://422bc938-1b49-aad2-fc95-bf367c3e387f   poweredOn
jima082401-hxvhx-master-2       Running                          9h    jima082401-hxvhx-master-2       vsphere://422b4670-78cb-e14e-695c-db98041ef7bb   poweredOn
jima082401-hxvhx-worker-tpplc   Failed                           8h    jima082401-hxvhx-worker-tpplc   vsphere://422ba266-fcf6-a399-4039-0f62a14b3f52   Unknown


Additional info:
Moved to VERIFIED

Comment 15 errata-xmlrpc 2020-10-27 16:10:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

