Bug 2089295
| Summary: | [Nutanix] Machine stuck in Deleting phase when deleting a MachineSet whose replicas >= 2 and whose machine is in the Provisioning phase on Nutanix | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Huali Liu <huliu> |
| Component: | Cloud Compute | Assignee: | Yanhua Li <yanhli> |
| Cloud Compute sub component: | Other Providers | QA Contact: | Huali Liu <huliu> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | low | Priority: | low |
| Version: | 4.11 | CC: | yanhli |
| Target Release: | 4.11.0 | Flags: | huliu: needinfo- |
| Hardware: | Unspecified | OS: | Unspecified |
| Doc Type: | If docs needed, set a value | Type: | Bug |
| Last Closed: | 2022-08-10 11:13:27 UTC | | |
Description (Huali Liu, 2022-05-23 11:25:14 UTC)

Comment (Yanhua Li):
@huliu I cannot reproduce the described issue on the OCP cluster I created with the Nutanix infrastructure. Can you give me access to an OCP cluster where the issue can be reproduced, so that I can investigate it? Thanks, Yanhua Li

Comment (Huali Liu):
@yanhli The cluster access info has already been sent to your email.

Comment (Yanhua Li):
I could reproduce the issue on the OCP cluster with the provided kubeconfig. My initial investigation shows this is a synchronization issue. The root cause appears to be that when the Machine CR is deleted shortly after creation, the mapi-nutanix-controller has just called the Prism API to create the VM and the VM is not ready yet, so machine.status.providerStatus.vmUUID has not been filled in. Currently, when the mapi-nutanix-controller handles the deletion, it only tries to delete the VM by machine.status.providerStatus.vmUUID; when that field is nil, it cannot delete the VM. We may fix the issue by falling back to deleting the VM by name when machine.status.providerStatus.vmUUID is nil (the VM name is the same as the machine name, which should be unique because of the generated suffix string).

Comment (Yanhua Li):
My investigation confirms the root cause: when the Machine CR is deleted shortly after creation, the mapi-nutanix-controller has called the Prism API to create the VM, but the VM is not ready yet, so machine.status.providerStatus.vmUUID has not been filled in. Currently, when the mapi-nutanix-controller handles the machine delete() call, it only tries to delete the VM by machine.status.providerStatus.vmUUID; when that is nil it cannot delete the VM, and the machine gets stuck in the Deleting phase. The PR fixes the issue by finding the VM's UUID via the VM name (the same as the machine name) when machine.status.providerStatus.vmUUID is nil, and then deleting the VM when it exists.
Comment (Huali Liu):
Verified on 4.11.0-0.nightly-2022-06-04-014713

1. Create a MachineSet with replicas: 2

```
liuhuali@Lius-MacBook-Pro huali-test % oc create -f ms1.yaml
machineset.machine.openshift.io/huliu-n11-kcjhx-1 created
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                           PHASE          TYPE   REGION   ZONE   AGE
huliu-n11-kcjhx-1-f2wcw        Provisioning                          16s
huliu-n11-kcjhx-1-r72nd        Provisioning                          16s
huliu-n11-kcjhx-master-0       Running                               69m
huliu-n11-kcjhx-master-1       Running                               69m
huliu-n11-kcjhx-master-2       Running                               69m
huliu-n11-kcjhx-worker-bwbjf   Running                               66m
huliu-n11-kcjhx-worker-gvsfp   Running                               66m
```

2. Delete the MachineSet while its machines are in the Provisioning phase

```
liuhuali@Lius-MacBook-Pro huali-test % oc delete machineset huliu-n11-kcjhx-1
machineset.machine.openshift.io "huliu-n11-kcjhx-1" deleted
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                           PHASE     TYPE   REGION   ZONE   AGE
huliu-n11-kcjhx-master-0       Running                          71m
huliu-n11-kcjhx-master-1       Running                          71m
huliu-n11-kcjhx-master-2       Running                          71m
huliu-n11-kcjhx-worker-bwbjf   Running                          67m
huliu-n11-kcjhx-worker-gvsfp   Running                          67m
```

Comment (closing):
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069