Bug 2089295
| Summary: | [Nutanix] Machine stuck in Deleting phase when deleting a MachineSet whose replicas >= 2 and whose machine is in the Provisioning phase on Nutanix | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Huali Liu <huliu> |
| Component: | Cloud Compute | Assignee: | Yanhua Li <yanhli> |
| Cloud Compute sub component: | Other Providers | QA Contact: | Huali Liu <huliu> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | low | Priority: | low |
| Version: | 4.11 | CC: | yanhli |
| Target Release: | 4.11.0 | Flags: | huliu: needinfo- |
| Hardware: | Unspecified | OS: | Unspecified |
| Doc Type: | If docs needed, set a value | Type: | Bug |
| Last Closed: | 2022-08-10 11:13:27 UTC | | |
Description (Huali Liu, 2022-05-23 11:25:14 UTC)

Comment (Yanhua Li):
@huliu I cannot reproduce the described issue on the OCP cluster I created with the Nutanix infrastructure. Can you give me access to an OCP cluster where the issue can be reproduced, so that I can investigate it? Thanks, Yanhua Li

Comment (Huali Liu):
@yanhli The cluster access info has already been sent to your email.

Comment (Yanhua Li):
I could reproduce the issue on the OCP cluster with the provided kubeconfig. My initial investigation shows this is a synchronization issue. The root cause appears to be that when the Machine CR is deleted shortly after creation, the mapi-nutanix-controller has just called the Prism API to create the VM and the VM is not ready yet, so machine.status.providerStatus.vmUUID has not been filled in. Currently, when the mapi-nutanix-controller handles the deletion, it only tries to delete the VM by machine.status.providerStatus.vmUUID; when that field is nil, it cannot delete the VM. We may fix the issue by falling back to deleting the VM by name when machine.status.providerStatus.vmUUID is nil (the VM name is the same as the machine name, which should be unique because of the generated suffix string).

Comment (Yanhua Li):
My investigation confirms the root cause: when the Machine CR is deleted shortly after creation, the mapi-nutanix-controller has called the Prism API to create the VM, but the VM is not ready yet, so machine.status.providerStatus.vmUUID has not been filled in. Currently, when the mapi-nutanix-controller handles the machine delete() call, it only tries to delete the VM by machine.status.providerStatus.vmUUID; when that is nil it cannot delete the VM, and the machine gets stuck in the Deleting phase. The PR fixes the issue by finding the VM's UUID via the VM name (the same as the machine name) when machine.status.providerStatus.vmUUID is nil, and then deleting the VM when it exists.
Comment (Huali Liu):
Verified on 4.11.0-0.nightly-2022-06-04-014713

1. Create a MachineSet with replicas: 2

```
liuhuali@Lius-MacBook-Pro huali-test % oc create -f ms1.yaml
machineset.machine.openshift.io/huliu-n11-kcjhx-1 created
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                           PHASE          TYPE   REGION   ZONE   AGE
huliu-n11-kcjhx-1-f2wcw        Provisioning                          16s
huliu-n11-kcjhx-1-r72nd        Provisioning                          16s
huliu-n11-kcjhx-master-0       Running                               69m
huliu-n11-kcjhx-master-1       Running                               69m
huliu-n11-kcjhx-master-2       Running                               69m
huliu-n11-kcjhx-worker-bwbjf   Running                               66m
huliu-n11-kcjhx-worker-gvsfp   Running                               66m
```

2. Delete the MachineSet while its machines are in the Provisioning phase

```
liuhuali@Lius-MacBook-Pro huali-test % oc delete machineset huliu-n11-kcjhx-1
machineset.machine.openshift.io "huliu-n11-kcjhx-1" deleted
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                           PHASE     TYPE   REGION   ZONE   AGE
huliu-n11-kcjhx-master-0       Running                          71m
huliu-n11-kcjhx-master-1       Running                          71m
huliu-n11-kcjhx-master-2       Running                          71m
huliu-n11-kcjhx-worker-bwbjf   Running                          67m
huliu-n11-kcjhx-worker-gvsfp   Running                          67m
```

Comment (closing):
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069