Bug 2097728
| Summary: | A machine is stuck in Deleting phase | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Itzik Brown <itbrown> |
| Component: | Cloud Compute | Assignee: | Stephen Finucane <stephenfin> |
| Cloud Compute sub component: | OpenStack Provider | QA Contact: | Jon Uriarte <juriarte> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | akesarka, frigo, m.andre, mfedosin, mkrejci, pprinett |
| Version: | 4.10 | Keywords: | Triaged |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-09-27 11:37:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Looks like a valid bug. An HTTP 404 response to a delete request should be treated as a successful delete. This is what upstream CAPO does [1]. We need to do the same (a minimal illustration is sketched below, after these comments).

[1] https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/c5ac70e5f13d303e39a6919c0214345f51c4c6bc/pkg/cloud/services/compute/client.go#L100-L104

Removing the Triaged keyword because:
* the QE automation assessment (flag qe_test_coverage) is missing

(In reply to Stephen Finucane from comment #2)
> Looks like a valid bug. An HTTP 404 response to a delete request should be
> treated as a successful delete

We experienced a similar case, and the actual request to OpenStack reads "DELETE /v2.1/servers/" with a 404 status, i.e. the server ID is missing from the URL (possibly explained by https://bugzilla.redhat.com/show_bug.cgi?id=1994625#c0, as the OpenStack resource ID is missing from the machine resource).

As noted in the upstream patch [1], we no longer use CAPO in 4.11 and 4.12, so this issue is not present there. I'm going to close this as CURRENTRELEASE since the issue should no longer be present. If this is a priority for the customer, we can open a new bug targeting 4.10 specifically, but it doesn't seem worth it right now.

[1] https://github.com/openshift/cluster-api-provider-openstack/pull/238#issuecomment-1165517554
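For illustration only, here is a minimal sketch of the delete handling described in the comments above, assuming a pre-v2 gophercloud compute client; the deleteInstance and isNotFound names are hypothetical and not the provider's actual API. It treats a 404 from Nova as "already deleted" and guards against issuing a delete with an empty server ID (which would produce the truncated "DELETE /v2.1/servers/" request mentioned above).

```go
package compute

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/servers"
)

// isNotFound reports whether err is an HTTP 404 from the OpenStack API.
// Illustrative helper; newer gophercloud releases expose different error types.
func isNotFound(err error) bool {
	if err == nil {
		return false
	}
	_, ok := err.(gophercloud.ErrDefault404)
	return ok
}

// deleteInstance deletes a Nova server and treats "not found" as success, so a
// Machine whose instance has already disappeared can finish its Deleting phase
// instead of looping on "error deleting Openstack instance: Resource not found".
func deleteInstance(client *gophercloud.ServiceClient, instanceID string) error {
	// Without an ID the request URL would be /v2.1/servers/ and Nova answers
	// 404; the caller should resolve the ID first (e.g. by server name) or
	// treat the instance as already gone.
	if instanceID == "" {
		return nil
	}

	err := servers.Delete(client, instanceID).ExtractErr()
	if isNotFound(err) {
		// The instance no longer exists, so the desired state is reached.
		return nil
	}
	return err
}
```

Treating 404 as idempotent success is what makes the delete reconciliation converge, which is the behaviour the linked upstream CAPO client already has.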
Description of problem:
With 3 workers, one VM couldn't launch because of a lack of resources and moved to ERROR. After scaling down the worker machineset, the VM is not deleted and the machine is stuck in the Deleting phase.

Version:
OCP 4.10.0-0.nightly-2022-06-08-150219
OSP RHOS-16.2-RHEL-8-20220311.n.1

Platform (AWS, VSphere, Metal, etc.):
OpenShift on OpenStack

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)? (Y/N/Not sure):
Not sure

How reproducible:

Steps to Reproduce:
As described

Actual results:

Expected results:

Additional info:

(shiftstack) [stack@undercloud-0 ~]$ openstack server list

| ID | Name | Status | Networks | Image | Flavor |
|---|---|---|---|---|---|
| 4be271f6-d47c-4ea9-befc-ce49dc1995f8 | ostest-lvlxl-worker-0-z5jpm | ERROR | | ostest-lvlxl-rhcos | m4.xlarge |
| fcfa8630-781e-4aef-b8d0-aaf80e762f7a | ostest-lvlxl-worker-0-87xpt | ACTIVE | restricted_network=172.16.0.239 | ostest-lvlxl-rhcos | m4.xlarge |
| ff79e1c2-839a-4f18-b416-57a93d3b4cd5 | ostest-lvlxl-worker-0-rlmv8 | ACTIVE | restricted_network=172.16.0.57 | ostest-lvlxl-rhcos | m4.xlarge |
| 5e6cb585-7e21-4045-9b23-a7c58cd71696 | ostest-lvlxl-master-2 | ACTIVE | restricted_network=172.16.0.21 | ostest-lvlxl-rhcos | m4.xlarge |
| f4377b56-90b5-4d11-8344-4a7642ec6622 | ostest-lvlxl-master-1 | ACTIVE | restricted_network=172.16.0.195 | ostest-lvlxl-rhcos | m4.xlarge |
| 99a1fe73-449d-4738-aa35-71d8b32c3c8c | ostest-lvlxl-master-0 | ACTIVE | restricted_network=172.16.0.50 | ostest-lvlxl-rhcos | m4.xlarge |
| a455fedc-6fb9-47c2-baae-8ad6e830f036 | installer_host | ACTIVE | installer_host-network=172.16.40.119, 10.0.0.202; restricted_network=172.16.0.3 | rhel-guest-image-8.5-1174.x86_64.qcow2 | m1.medium |

(shiftstack) [cloud-user@installer-host ~]$ oc get machines -A

| NAMESPACE | NAME | PHASE | TYPE | REGION | ZONE | AGE |
|---|---|---|---|---|---|---|
| openshift-machine-api | ostest-lvlxl-master-0 | Running | | | | 174m |
| openshift-machine-api | ostest-lvlxl-master-1 | Running | | | | 174m |
| openshift-machine-api | ostest-lvlxl-master-2 | Running | | | | 174m |
| openshift-machine-api | ostest-lvlxl-worker-0-87xpt | Running | m4.xlarge | regionOne | nova | 141m |
| openshift-machine-api | ostest-lvlxl-worker-0-rlmv8 | Running | m4.xlarge | regionOne | nova | 153m |
| openshift-machine-api | ostest-lvlxl-worker-0-z5jpm | Deleting | | | | 7m54s |

The machine controller keeps failing to delete the instance:

E0616 11:07:43.080215 1 controller.go:317] controller/machine_controller "msg"="Reconciler error" "error"="error deleting Openstack instance: Resource not found" "name"="ostest-lvlxl-worker-0-z5jpm" "namespace"="openshift-machine-api"

(shiftstack) [cloud-user@installer-host ~]$ oc logs machine-api-controllers-58d97cddf5-d25rb -n openshift-machine-api -c machine-controller |grep ostest-lvlxl-worker-0-z5jpm |tail -10

I0616 11:06:18.707389 1 logr.go:252] events "msg"="Warning" "message"="DeleteError" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"ostest-lvlxl-worker-0-z5jpm","uid":"9a7ea915-5868-4e8b-a943-3fd946017f54","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"79094"} "reason"="FailedDelete"
E0616 11:06:18.732869 1 actuator.go:415] Machine error ostest-lvlxl-worker-0-z5jpm: error deleting Openstack instance: Resource not found
E0616 11:06:18.732926 1 controller.go:261] ostest-lvlxl-worker-0-z5jpm: failed to delete machine: error deleting Openstack instance: Resource not found
E0616 11:06:18.733186 1 controller.go:317] controller/machine_controller "msg"="Reconciler error" "error"="error deleting Openstack instance: Resource not found" "name"="ostest-lvlxl-worker-0-z5jpm" "namespace"="openshift-machine-api"
I0616 11:07:40.653944 1 controller.go:175] ostest-lvlxl-worker-0-z5jpm: reconciling Machine
I0616 11:07:40.684395 1 controller.go:219] ostest-lvlxl-worker-0-z5jpm: reconciling machine triggers delete
I0616 11:07:43.053597 1 logr.go:252] events "msg"="Warning" "message"="DeleteError" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"ostest-lvlxl-worker-0-z5jpm","uid":"9a7ea915-5868-4e8b-a943-3fd946017f54","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"79094"} "reason"="FailedDelete"
E0616 11:07:43.080010 1 actuator.go:415] Machine error ostest-lvlxl-worker-0-z5jpm: error deleting Openstack instance: Resource not found
E0616 11:07:43.080142 1 controller.go:261] ostest-lvlxl-worker-0-z5jpm: failed to delete machine: error deleting Openstack instance: Resource not found
E0616 11:07:43.080215 1 controller.go:317] controller/machine_controller "msg"="Reconciler error" "error"="error deleting Openstack instance: Resource not found" "name"="ostest-lvlxl-worker-0-z5jpm" "namespace"="openshift-machine-api"