Description of problem:
I ran into this when trying to configure a Machine with 2 interfaces. The VM failed with "Failed to allocate the network(s)" and landed in ERROR state. The Machine remained in the Provisioning state, and even deleting the MachineSet didn't help. So I deleted the VM manually and then tried deleting the Machine. It ended up like this:

I0128 11:11:48.523301       1 controller.go:171] ostest-4mjhl-double-7xv92: reconciling Machine
I0128 11:11:48.523339       1 controller.go:211] ostest-4mjhl-double-7xv92: reconciling machine triggers delete
W0128 11:11:51.600815       1 machineservice.go:825] Couldn't delete all instance ports: Resource not found
E0128 11:11:51.628437       1 actuator.go:574] Machine error ostest-4mjhl-double-7xv92: error deleting Openstack instance: Resource not found
E0128 11:11:51.628481       1 controller.go:232] ostest-4mjhl-double-7xv92: failed to delete machine: error deleting Openstack instance: Resource not found
E0128 11:11:51.628542       1 controller.go:237] controller "msg"="Reconciler error" "error"="error deleting Openstack instance: Resource not found" "controller"="machine_controller" "name"="ostest-4mjhl-double-7xv92" "namespace"="openshift-machine-api"

I bet there are two problems here: the VM got into ERROR state, yet the Machine remained in Provisioning; and then deleting the VM manually didn't really help either.

Version-Release number of selected component (if applicable):

How reproducible:
?

Steps to Reproduce:
See above.

Actual results:
The finalizer needs to be removed manually.

Expected results:
The Machine accepts that the VM is gone and treats the inability to delete it as a minor problem.

Additional info:
Potentially a duplicate. Emilio to investigate.
It seems some OpenStack clusters have a bug where machines that error out cannot be deleted by the Machine API. Just for my sanity, verify that the machine-controller is actually trying to delete the machine and failing. If that is the case, then the machine object is correct to remain stuck in a deleting state; see: https://bugzilla.redhat.com/show_bug.cgi?id=1856270
I was able to delete the VM easily as my SoS tenant, so my bet is that the problem is not with the OpenStack API itself. If you want an easy reproducer, here it is: create a new subnet on the main network, then try to add that subnet as a secondary interface in a MachineSet. OpenStack won't allow that VM to spawn, and you'll end up with a VM stuck in ERROR, causing the problem.
That's good to know, I'll try that and take a look. It's possible that something is not correct in our control flow.
Looks like there are 2 action items here:
1. If a user runs oc delete machine xxxx, they expect to be able to delete that machine, even if it has not finished deploying and has finalizers.
2. If an admin deletes a VM in OpenStack, they expect CAPO to be smart enough to recognize that the instance is gone and to delete the machine in OpenShift as a result.
(In reply to egarcia from comment #5)
> Looks like there are 2 action items here:
> 1. If you run oc delete machine xxxx the user expects to be able to delete
> that machine, even if it is not finished deploying and has finalizers.
> 2. If an admin deletes a VM in OpenStack, they expect that CAPO is smart
> enough to recognize that the instance is gone, and will delete the machine
> in OpenShift as a result.

Hm, IMO #1 is true. It should also be possible to delete a Machine that is in ERROR state. #2 may be problematic: monitoring the OpenStack API for resource existence is heavy (OpenStack APIs are heavy in general). But the machine-api should certainly be able to handle deletion of a Machine object even if the corresponding VM is gone.
(In reply to Michał Dulko from comment #6)
> (In reply to egarcia from comment #5)
> > Looks like there are 2 action items here:
> > 1. If you run oc delete machine xxxx the user expects to be able to delete
> > that machine, even if it is not finished deploying and has finalizers.
> > 2. If an admin deletes a VM in OpenStack, they expect that CAPO is smart
> > enough to recognize that the instance is gone, and will delete the machine
> > in OpenShift as a result.
>
> Hm, IMO #1 is true. Also it should be able to delete it if it's in ERROR
> state too. #2 may be problematic, monitoring OpenStack API for resources
> existence is heavy (OpenStack APIs are heavy in general). But for sure
> machine-api should be able to handle deletion of a Machine object even if
> corresponding VM is gone.

Lucky for us, CAPO is constantly hitting the OpenStack APIs to check on the state of machines XD.
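The fix being discussed boils down to the delete path treating "Resource not found" as success rather than an error. Here is a minimal Go sketch of that idea; it is an illustration, not the actual CAPO code — ErrNotFound, deleteInstance, and reconcileDelete are hypothetical stand-ins for the real gophercloud 404 error and the actuator's delete path:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is a hypothetical stand-in for the "Resource not found"
// (HTTP 404) error the OpenStack API returns.
var ErrNotFound = errors.New("Resource not found")

// deleteInstance simulates the actuator's call to delete a VM: a VM that an
// admin already removed out of band yields ErrNotFound.
func deleteInstance(vmExists bool) error {
	if !vmExists {
		return ErrNotFound
	}
	return nil
}

// reconcileDelete sketches the proposed behavior: a missing instance is
// treated as a successful delete, so the Machine's finalizer can be removed
// instead of the Machine getting stuck in a Deleting state.
func reconcileDelete(vmExists bool) error {
	if err := deleteInstance(vmExists); err != nil {
		if errors.Is(err, ErrNotFound) {
			// The VM is already gone: nothing left to clean up.
			return nil
		}
		return fmt.Errorf("error deleting OpenStack instance: %w", err)
	}
	return nil
}

func main() {
	// Both paths end the reconcile successfully; only a real API error
	// (auth, quota, timeout) should keep the finalizer in place.
	fmt.Println("existing VM:", reconcileDelete(true))
	fmt.Println("missing VM:", reconcileDelete(false))
}
```

With this shape, the log lines in the report above would end the reconcile cleanly instead of looping on "failed to delete machine".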
*** Bug 1994625 has been marked as a duplicate of this bug. ***
Removing the Triaged keyword because:
* the QE automation assessment (flag qe_test_coverage) is missing
Verified in 4.10.26 on top of OSP 16.2.2.

Verification steps:

1. Check the VMs in OSP and the machines and nodes in OCP:

$ openstack server list
+--------------------------------------+-----------------------------+--------+-------------------------------------+--------------------+-----------+
| ID                                   | Name                        | Status | Networks                            | Image              | Flavor    |
+--------------------------------------+-----------------------------+--------+-------------------------------------+--------------------+-----------+
| dc1f798c-2287-44fb-8c6d-56630ad2129c | ostest-pl27z-worker-0-grp7q | ACTIVE | ostest-pl27z-openshift=10.196.1.134 | ostest-pl27z-rhcos | m4.xlarge |
| 71442f4d-b957-4ab6-bf37-96ba7b06e0d0 | ostest-pl27z-worker-0-vdtfd | ACTIVE | ostest-pl27z-openshift=10.196.0.31  | ostest-pl27z-rhcos | m4.xlarge |
| c56a29c4-71de-4b9d-acce-21a3318bb4f1 | ostest-pl27z-worker-0-p7wxm | ACTIVE | ostest-pl27z-openshift=10.196.2.125 | ostest-pl27z-rhcos | m4.xlarge |
| 48b08730-cea3-4920-b9fc-bd5a6b49dde6 | ostest-pl27z-master-2       | ACTIVE | ostest-pl27z-openshift=10.196.3.149 | ostest-pl27z-rhcos | m4.xlarge |
| c361a70f-b0cb-4e37-afc5-22a4d7c2ab65 | ostest-pl27z-master-1       | ACTIVE | ostest-pl27z-openshift=10.196.1.21  | ostest-pl27z-rhcos | m4.xlarge |
| 5fdae607-c1bd-417d-a0b0-3d516fd1cf23 | ostest-pl27z-master-0       | ACTIVE | ostest-pl27z-openshift=10.196.1.76  | ostest-pl27z-rhcos | m4.xlarge |
+--------------------------------------+-----------------------------+--------+-------------------------------------+--------------------+-----------+

$ oc -n openshift-machine-api get machineset
NAME                    DESIRED   CURRENT   READY   AVAILABLE   AGE
ostest-pl27z-worker-0   3         3         3       3           83m

$ oc get machines -A
NAMESPACE               NAME                          PHASE     TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-pl27z-master-0         Running                                  83m
openshift-machine-api   ostest-pl27z-master-1         Running                                  83m
openshift-machine-api   ostest-pl27z-master-2         Running                                  83m
openshift-machine-api   ostest-pl27z-worker-0-grp7q   Running                                  73m
openshift-machine-api   ostest-pl27z-worker-0-p7wxm   Running   m4.xlarge   regionOne   nova   73m
openshift-machine-api   ostest-pl27z-worker-0-vdtfd   Running   m4.xlarge   regionOne   nova   73m

2. Scale the machineset up and immediately back down, so there is no time for the VM to reach ACTIVE status (the machine will be deleted while it is still being created):

$ oc scale machineset ostest-htm84-worker-0 -n openshift-machine-api --replicas=4; oc scale machineset ostest-htm84-worker-0 -n openshift-machine-api --replicas=3

$ oc get machines -A
NAMESPACE               NAME                          PHASE     TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-pl27z-master-0         Running                                  85m
openshift-machine-api   ostest-pl27z-master-1         Running                                  85m
openshift-machine-api   ostest-pl27z-master-2         Running                                  85m
openshift-machine-api   ostest-pl27z-worker-0-grp7q   Running                                  76m
openshift-machine-api   ostest-pl27z-worker-0-p7wxm   Running   m4.xlarge   regionOne   nova   76m
openshift-machine-api   ostest-pl27z-worker-0-r62vt                                            0s
openshift-machine-api   ostest-pl27z-worker-0-vdtfd   Running   m4.xlarge   regionOne   nova   76m

$ oc get machines -A
NAMESPACE               NAME                          PHASE      TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-pl27z-master-0         Running                                   85m
openshift-machine-api   ostest-pl27z-master-1         Running                                   85m
openshift-machine-api   ostest-pl27z-master-2         Running                                   85m
openshift-machine-api   ostest-pl27z-worker-0-grp7q   Running                                   76m
openshift-machine-api   ostest-pl27z-worker-0-p7wxm   Running    m4.xlarge   regionOne   nova   76m
openshift-machine-api   ostest-pl27z-worker-0-r62vt   Deleting                                  1s
openshift-machine-api   ostest-pl27z-worker-0-vdtfd   Running    m4.xlarge   regionOne   nova   76m

3. Check that the new machine is deleted (and not stuck in a continuous Deleting status):

$ oc get machines -A
NAMESPACE               NAME                          PHASE     TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-pl27z-master-0         Running                                  85m
openshift-machine-api   ostest-pl27z-master-1         Running                                  85m
openshift-machine-api   ostest-pl27z-master-2         Running                                  85m
openshift-machine-api   ostest-pl27z-worker-0-grp7q   Running                                  76m
openshift-machine-api   ostest-pl27z-worker-0-p7wxm   Running   m4.xlarge   regionOne   nova   76m
openshift-machine-api   ostest-pl27z-worker-0-vdtfd   Running   m4.xlarge   regionOne   nova   76m
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.26 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5875