Description of problem:
A machine gets stuck in the Deleting phase when deleting a machineset with replicas >= 2 whose machines are still in the Provisioning phase on Nutanix.

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-05-20-213928

How reproducible:
Always

Steps to Reproduce:

1. Create a machineset with replicas set to 2:

liuhuali@Lius-MacBook-Pro huali-test % oc create -f ms1.yaml
machineset.machine.openshift.io/huliu-n4-jvqgr-1 created
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                          PHASE          TYPE   REGION   ZONE   AGE
huliu-n4-jvqgr-1-6xd22        Provisioning                          13s
huliu-n4-jvqgr-1-c4pzl        Provisioning                          13s
huliu-n4-jvqgr-master-0       Running                               106m
huliu-n4-jvqgr-master-1       Running                               106m
huliu-n4-jvqgr-master-2       Running                               106m
huliu-n4-jvqgr-worker-ln9qj   Running                               104m
huliu-n4-jvqgr-worker-pz5jc   Running                               104m

2. Delete the machineset while its machines are still in the Provisioning phase:

liuhuali@Lius-MacBook-Pro huali-test % oc delete machineset huliu-n4-jvqgr-1
machineset.machine.openshift.io "huliu-n4-jvqgr-1" deleted
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                          PHASE      TYPE   REGION   ZONE   AGE
huliu-n4-jvqgr-1-6xd22        Deleting                          63m
huliu-n4-jvqgr-master-0       Running                           169m
huliu-n4-jvqgr-master-1       Running                           169m
huliu-n4-jvqgr-master-2       Running                           169m
huliu-n4-jvqgr-worker-ln9qj   Running                           167m
huliu-n4-jvqgr-worker-pz5jc   Running                           167m
liuhuali@Lius-MacBook-Pro huali-test % oc logs machine-api-controllers-8678477b8c-d5wf4 -c machine-controller
…
I0523 10:41:47.247646 1 reconciler.go:215] huliu-n4-jvqgr-1-6xd22: vm exists
I0523 10:41:47.247670 1 controller.go:273] huliu-n4-jvqgr-1-6xd22: can't proceed deleting machine while cloud instance is being terminated, requeuing

Actual results:
One machine is stuck in the Deleting phase.

Expected results:
All machines can be deleted successfully.

Additional info:
The issue reproduces with machineset replicas of 2 and 3. With replicas set to 1, everything works well and the machine is deleted successfully.
Workaround - if the finalizer is removed manually, the machine can be deleted successfully:

liuhuali@Lius-MacBook-Pro huali-test % oc edit machine huliu-n4-jvqgr-1-6xd22
machine.machine.openshift.io/huliu-n4-jvqgr-1-6xd22 edited

Delete these lines:
  finalizers:
  - machine.machine.openshift.io

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                          PHASE     TYPE   REGION   ZONE   AGE
huliu-n4-jvqgr-master-0       Running                          172m
huliu-n4-jvqgr-master-1       Running                          172m
huliu-n4-jvqgr-master-2       Running                          172m
huliu-n4-jvqgr-worker-ln9qj   Running                          170m
huliu-n4-jvqgr-worker-pz5jc   Running                          170m

Must Gather - https://drive.google.com/file/d/1cO0flT1SqCwcD2ZUAaUDmUt-tajUyrop/view?usp=sharing
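The same workaround can be applied non-interactively with `oc patch` (a sketch, not from the original report; it assumes the machine lives in the default openshift-machine-api namespace, and note it clears all finalizers on the object, so use it only on a machine that is already confirmed stuck):

```shell
oc patch machine huliu-n4-jvqgr-1-6xd22 \
  -n openshift-machine-api \
  --type=merge -p '{"metadata":{"finalizers":null}}'
```

This requires access to a live cluster, so it is shown here only as a command-line fragment.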
@huliu I cannot reproduce the described issue on the OCP cluster I created with the Nutanix infrastructure. Can you give me access to an OCP cluster where the issue can be reproduced, so that I can investigate? Thanks, Yanhua Li
@yanhli The cluster access info has already been sent to your email.
I could reproduce the issue on the OCP cluster with the provided kubeconfig. My initial investigation shows this is a synchronization issue. The root cause could be that when the Machine CR is deleted shortly after creation, the mapi-nutanix-controller has just called the Prism API to create the VM and the VM is not ready yet (so machine.status.providerStatus.vmUUID has not been filled in yet). Currently, when the mapi-nutanix-controller handles the deletion call, it only tries to delete the VM by machine.status.providerStatus.vmUUID; when this is nil, it cannot delete the VM. We may fix the issue by trying to delete the VM by name (the VM name is the same as the machine name, which should be unique thanks to the generated suffix string) when machine.status.providerStatus.vmUUID is nil.
My investigation shows the root cause is that when the Machine CR is deleted shortly after creation, the mapi-nutanix-controller has already called the Prism API to create the VM but the VM is not ready yet (so machine.status.providerStatus.vmUUID has not been filled in yet). Currently, when the mapi-nutanix-controller handles the machine delete() call, it only tries to delete the VM by machine.status.providerStatus.vmUUID; when this is nil, it cannot delete the VM, and the machine gets stuck in the Deleting phase. This PR fixes the issue by looking up the VM's UUID via the VM name (the same as the machine name) when machine.status.providerStatus.vmUUID is nil, and then deleting the VM if it exists.
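The fallback logic described above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual Go controller code: the client interface, its method names, and the in-memory fake are all assumptions made for demonstration.

```python
class VMNotFound(Exception):
    """Raised when no VM matches the given name (hypothetical error type)."""


def delete_machine_vm(client, machine):
    """Delete the VM backing a Machine (sketch of the fixed behavior).

    Try the recorded vmUUID first; if it was never filled in (VM creation
    was still in flight when the Machine was deleted), fall back to looking
    the VM up by name, which matches the machine name and is unique thanks
    to the generated suffix. Returns True if a VM was deleted.
    """
    vm_uuid = machine.get("status", {}).get("providerStatus", {}).get("vmUUID")
    if vm_uuid is None:
        # Fallback added by the fix: resolve the UUID via the VM name.
        try:
            vm_uuid = client.find_vm_uuid_by_name(machine["metadata"]["name"])
        except VMNotFound:
            # The VM was never created: nothing to delete, so the
            # machine's finalizer can be removed and deletion completes.
            return False
    client.delete_vm(vm_uuid)
    return True


class FakePrismClient:
    """In-memory stand-in for the Prism API, for demonstration only."""

    def __init__(self, vms):
        self.vms = dict(vms)  # uuid -> vm name

    def find_vm_uuid_by_name(self, name):
        for uuid, vm_name in self.vms.items():
            if vm_name == name:
                return uuid
        raise VMNotFound(name)

    def delete_vm(self, uuid):
        del self.vms[uuid]
```

With a machine whose providerStatus.vmUUID is unset but whose VM already exists under the machine's name, the fallback finds and deletes it; if no such VM exists either, deletion is a no-op and the machine can be released instead of being requeued forever.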
Verified on 4.11.0-0.nightly-2022-06-04-014713.

1. Create a machineset with replicas set to 2:

liuhuali@Lius-MacBook-Pro huali-test % oc create -f ms1.yaml
machineset.machine.openshift.io/huliu-n11-kcjhx-1 created
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                           PHASE          TYPE   REGION   ZONE   AGE
huliu-n11-kcjhx-1-f2wcw        Provisioning                          16s
huliu-n11-kcjhx-1-r72nd        Provisioning                          16s
huliu-n11-kcjhx-master-0       Running                               69m
huliu-n11-kcjhx-master-1       Running                               69m
huliu-n11-kcjhx-master-2       Running                               69m
huliu-n11-kcjhx-worker-bwbjf   Running                               66m
huliu-n11-kcjhx-worker-gvsfp   Running                               66m

2. Delete the machineset while its machines are still in the Provisioning phase:

liuhuali@Lius-MacBook-Pro huali-test % oc delete machineset huliu-n11-kcjhx-1
machineset.machine.openshift.io "huliu-n11-kcjhx-1" deleted
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                           PHASE     TYPE   REGION   ZONE   AGE
huliu-n11-kcjhx-master-0       Running                          71m
huliu-n11-kcjhx-master-1       Running                          71m
huliu-n11-kcjhx-master-2       Running                          71m
huliu-n11-kcjhx-worker-bwbjf   Running                          67m
huliu-n11-kcjhx-worker-gvsfp   Running                          67m
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069