2089295 – [Nutanix]machine stuck in Deleting phase when delete a machineset whose replicas>=2 and machine is Provisioning phase on Nutanix


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.11
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	4.11.0
Assignee:	Yanhua Li
QA Contact:	Huali Liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-05-23 11:25 UTC by Huali Liu
Modified:	2022-08-10 11:13 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-08-10 11:13:27 UTC
Target Upstream Version:
Embargoed:
Flags:	huliu: needinfo-


Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-api-provider-nutanix pull 14	0	None	open	Bug 2089295: [Nutanix]machine stuck in Deleting phase when delete a machineset whose replicas>=2 and machine is Provisio...	2022-05-26 07:15:36 UTC
Red Hat Product Errata	RHSA-2022:5069	0	None	None	None	2022-08-10 11:13:46 UTC

Description Huali Liu 2022-05-23 11:25:14 UTC

Description of problem:
A machine gets stuck in the Deleting phase when a machineset with replicas>=2 is deleted while its machines are still in the Provisioning phase on Nutanix.

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-05-20-213928

How reproducible:
Always

Steps to Reproduce:
1. Create a machineset with 2 replicas

liuhuali@Lius-MacBook-Pro huali-test % oc create -f ms1.yaml 
machineset.machine.openshift.io/huliu-n4-jvqgr-1 created

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                          PHASE          TYPE   REGION   ZONE   AGE
huliu-n4-jvqgr-1-6xd22        Provisioning                          13s
huliu-n4-jvqgr-1-c4pzl        Provisioning                          13s
huliu-n4-jvqgr-master-0       Running                               106m
huliu-n4-jvqgr-master-1       Running                               106m
huliu-n4-jvqgr-master-2       Running                               106m
huliu-n4-jvqgr-worker-ln9qj   Running                               104m
huliu-n4-jvqgr-worker-pz5jc   Running                               104m

2. Delete the machineset while its machines are in the Provisioning phase

liuhuali@Lius-MacBook-Pro huali-test % oc delete machineset huliu-n4-jvqgr-1
machineset.machine.openshift.io "huliu-n4-jvqgr-1" deleted

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                          PHASE      TYPE   REGION   ZONE   AGE
huliu-n4-jvqgr-1-6xd22        Deleting                          63m
huliu-n4-jvqgr-master-0       Running                           169m
huliu-n4-jvqgr-master-1       Running                           169m
huliu-n4-jvqgr-master-2       Running                           169m
huliu-n4-jvqgr-worker-ln9qj   Running                           167m
huliu-n4-jvqgr-worker-pz5jc   Running                           167m

liuhuali@Lius-MacBook-Pro huali-test % oc logs machine-api-controllers-8678477b8c-d5wf4 -c machine-controller
…
I0523 10:41:47.247646       1 reconciler.go:215] huliu-n4-jvqgr-1-6xd22: vm exists
I0523 10:41:47.247670       1 controller.go:273] huliu-n4-jvqgr-1-6xd22: can't proceed deleting machine while cloud instance is being terminated, requeuing


Actual results:
One machine is stuck in the Deleting phase.

Expected results:
All machines are deleted successfully.

Additional info:
I can reproduce the issue when the machineset has 2 or 3 replicas. With 1 replica, everything works and the machine is deleted successfully.

Workaround: if the finalizer is removed manually, the machine is deleted successfully.
liuhuali@Lius-MacBook-Pro huali-test % oc edit machine huliu-n4-jvqgr-1-6xd22
machine.machine.openshift.io/huliu-n4-jvqgr-1-6xd22 edited

Delete these lines:
 finalizers:
  - machine.machine.openshift.io

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                          PHASE      TYPE   REGION   ZONE   AGE
huliu-n4-jvqgr-master-0       Running                           172m
huliu-n4-jvqgr-master-1       Running                           172m
huliu-n4-jvqgr-master-2       Running                           172m
huliu-n4-jvqgr-worker-ln9qj   Running                           170m
huliu-n4-jvqgr-worker-pz5jc   Running                           170m

Must Gather - 
https://drive.google.com/file/d/1cO0flT1SqCwcD2ZUAaUDmUt-tajUyrop/view?usp=sharing

Comment 1 Yanhua Li 2022-05-23 20:39:17 UTC

@huliu I cannot reproduce the described issue on the OCP cluster I created on Nutanix infrastructure. Can you give me access to an OCP cluster where the issue reproduces so that I can investigate?

Thanks, 
Yanhua Li

Comment 2 Huali Liu 2022-05-24 12:12:31 UTC

@yanhli The cluster access info has been sent to your email.

Comment 3 Yanhua Li 2022-05-24 22:43:03 UTC

I could reproduce the issue on the OCP cluster with the provided kubeconfig. My initial investigation shows this is a synchronization issue.

The likely root cause: the Machine CR is deleted shortly after it is created, when the mapi-nutanix-controller has just called the Prism API to create the VM and the VM is not ready yet (so machine.status.providerStatus.vmUUID has not been filled in). Currently, when the mapi-nutanix-controller handles the deletion call, it only tries to delete the VM by machine.status.providerStatus.vmUUID. When this is nil, it cannot delete the VM. We may fix the issue by deleting the VM by name when machine.status.providerStatus.vmUUID is nil (the VM name is the same as the machine name, which should be unique thanks to the generated suffix string).

Comment 4 Yanhua Li 2022-05-25 22:18:45 UTC

My investigation shows the root cause: the Machine CR is deleted shortly after creation, when the mapi-nutanix-controller has called the Prism API to create the VM but the VM is not ready yet (so machine.status.providerStatus.vmUUID has not been filled in). Currently, when the mapi-nutanix-controller handles the machine delete() call, it only tries to delete the VM by machine.status.providerStatus.vmUUID. When this is nil, it cannot delete the VM, and the machine gets stuck in the Deleting phase. The PR fixes the issue by finding the VM's UUID via the VM name (same as the machine name) when machine.status.providerStatus.vmUUID is nil, and then deleting the machine's VM if it exists.
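
For readers following along, here is a minimal sketch of the fallback described above. It is not the code from the PR. The vmClient interface, the fakePrism type, and the deleteMachineVM function are hypothetical names made up for illustration, and the real Prism client API may look quite different. The sketch only shows the idea: use the recorded vmUUID when it is set, otherwise look the VM up by the machine name before deleting.

```go
package main

import (
	"errors"
	"fmt"
)

// vmClient is a hypothetical stand-in for the Prism API client.
// The real provider's client and method names may differ.
type vmClient interface {
	FindVMUUIDByName(name string) (string, bool)
	DeleteVM(uuid string) error
}

// fakePrism is an in-memory implementation used only for this illustration.
type fakePrism struct {
	vms map[string]string // VM name -> VM UUID
}

func (f *fakePrism) FindVMUUIDByName(name string) (string, bool) {
	uuid, ok := f.vms[name]
	return uuid, ok
}

func (f *fakePrism) DeleteVM(uuid string) error {
	for name, id := range f.vms {
		if id == uuid {
			delete(f.vms, name)
			return nil
		}
	}
	return errors.New("vm not found: " + uuid)
}

// deleteMachineVM deletes the VM backing a machine. statusVMUUID mirrors
// machine.status.providerStatus.vmUUID and is nil when the UUID was never
// recorded, for example because the machine was deleted while Provisioning.
func deleteMachineVM(c vmClient, machineName string, statusVMUUID *string) error {
	var uuid string
	if statusVMUUID != nil && *statusVMUUID != "" {
		uuid = *statusVMUUID
	} else {
		// Fallback: the VM name is the same as the machine name,
		// so resolve the UUID by name.
		found, ok := c.FindVMUUIDByName(machineName)
		if !ok {
			// No VM with that name exists, so there is nothing to delete.
			return nil
		}
		uuid = found
	}
	return c.DeleteVM(uuid)
}

func main() {
	prism := &fakePrism{vms: map[string]string{"huliu-n4-jvqgr-1-6xd22": "example-uuid"}}

	// The machine status has no vmUUID yet, so the lookup by name is used.
	err := deleteMachineVM(prism, "huliu-n4-jvqgr-1-6xd22", nil)
	_, stillExists := prism.FindVMUUIDByName("huliu-n4-jvqgr-1-6xd22")
	fmt.Println("error:", err, "vm still exists:", stillExists)
}
```

Without the name lookup, a nil vmUUID would leave the VM in place, which matches the "vm exists" and "requeuing" messages in the controller log from the description.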

Comment 7 Huali Liu 2022-06-06 02:50:23 UTC

Verified on 4.11.0-0.nightly-2022-06-04-014713

1. Create a machineset with 2 replicas

liuhuali@Lius-MacBook-Pro huali-test % oc create -f ms1.yaml 
machineset.machine.openshift.io/huliu-n11-kcjhx-1 created

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                           PHASE          TYPE   REGION   ZONE   AGE
huliu-n11-kcjhx-1-f2wcw        Provisioning                          16s
huliu-n11-kcjhx-1-r72nd        Provisioning                          16s
huliu-n11-kcjhx-master-0       Running                               69m
huliu-n11-kcjhx-master-1       Running                               69m
huliu-n11-kcjhx-master-2       Running                               69m
huliu-n11-kcjhx-worker-bwbjf   Running                               66m
huliu-n11-kcjhx-worker-gvsfp   Running                               66m


2. Delete the machineset while its machines are in the Provisioning phase

liuhuali@Lius-MacBook-Pro huali-test % oc delete machineset huliu-n11-kcjhx-1
machineset.machine.openshift.io "huliu-n11-kcjhx-1" deleted

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                           PHASE     TYPE   REGION   ZONE   AGE
huliu-n11-kcjhx-master-0       Running                          71m
huliu-n11-kcjhx-master-1       Running                          71m
huliu-n11-kcjhx-master-2       Running                          71m
huliu-n11-kcjhx-worker-bwbjf   Running                          67m
huliu-n11-kcjhx-worker-gvsfp   Running                          67m

Comment 9 errata-xmlrpc 2022-08-10 11:13:27 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
