Bug 2097728
| Summary: | A machine is stuck in Deleting phase | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Itzik Brown <itbrown> |
| Component: | Cloud Compute | Assignee: | Stephen Finucane <stephenfin> |
| Cloud Compute sub component: | OpenStack Provider | QA Contact: | Jon Uriarte <juriarte> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | akesarka, frigo, m.andre, mfedosin, mkrejci, pprinett |
| Version: | 4.10 | Keywords: | Triaged |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-09-27 11:37:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Looks like a valid bug. An HTTP 404 response to a delete request should be treated as a successful delete. This is what upstream CAPO does [1]. We need to do the same (a minimal illustration is sketched below, after these comments).

[1] https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/c5ac70e5f13d303e39a6919c0214345f51c4c6bc/pkg/cloud/services/compute/client.go#L100-L104

Removing the Triaged keyword because:
* the QE automation assessment (flag qe_test_coverage) is missing

(In reply to Stephen Finucane from comment #2)
> Looks like a valid bug. An HTTP 404 response to a delete request should be
> treated as a successful delete

We experienced a similar case, and the actual request to OpenStack reads "DELETE /v2.1/servers/" with a 404 status, i.e. the server ID is missing from the URL (possibly explained by https://bugzilla.redhat.com/show_bug.cgi?id=1994625#c0, as the OpenStack resource ID is missing from the machine resource).

As noted in the upstream patch [1], we no longer use CAPO in 4.11 and 4.12, so this issue is not present there. I'm going to close this as CURRENTRELEASE since the issue should no longer be present. If this is a priority for the customer, we can open a new bug targeting 4.10 specifically, but it doesn't seem worth it right now.

[1] https://github.com/openshift/cluster-api-provider-openstack/pull/238#issuecomment-1165517554
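For illustration only, here is a minimal sketch of the delete handling described in the comments above, assuming a pre-v2 gophercloud compute client; the deleteInstance and isNotFound names are hypothetical and not the provider's actual API. It treats a 404 from Nova as "already deleted" and guards against issuing a delete with an empty server ID (which would produce the truncated "DELETE /v2.1/servers/" request mentioned above).

```go
package compute

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/servers"
)

// isNotFound reports whether err is an HTTP 404 from the OpenStack API.
// Illustrative helper; newer gophercloud releases expose different error types.
func isNotFound(err error) bool {
	if err == nil {
		return false
	}
	_, ok := err.(gophercloud.ErrDefault404)
	return ok
}

// deleteInstance deletes a Nova server and treats "not found" as success, so a
// Machine whose instance has already disappeared can finish its Deleting phase
// instead of looping on "error deleting Openstack instance: Resource not found".
func deleteInstance(client *gophercloud.ServiceClient, instanceID string) error {
	// Without an ID the request URL would be /v2.1/servers/ and Nova answers
	// 404; the caller should resolve the ID first (e.g. by server name) or
	// treat the instance as already gone.
	if instanceID == "" {
		return nil
	}

	err := servers.Delete(client, instanceID).ExtractErr()
	if isNotFound(err) {
		// The instance no longer exists, so the desired state is reached.
		return nil
	}
	return err
}
```

Treating 404 as idempotent success is what makes the delete reconciliation converge, which is the behaviour the linked upstream CAPO client already has.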
Description of problem:
With 3 workers, one VM couldn't launch because of a lack of resources and moved to ERROR. After scaling down the worker machineset, the VM is not deleted and the machine is stuck in the Deleting phase.

Version:
OCP 4.10.0-0.nightly-2022-06-08-150219
OSP RHOS-16.2-RHEL-8-20220311.n.1

Platform (AWS, VSphere, Metal, etc.):
OpenShift on OpenStack

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)? (Y/N/Not sure):
Not sure

How reproducible:

Steps to Reproduce:
As described

Actual results:

Expected results:

Additional info:

(shiftstack) [stack@undercloud-0 ~]$ openstack server list

| ID | Name | Status | Networks | Image | Flavor |
|---|---|---|---|---|---|
| 4be271f6-d47c-4ea9-befc-ce49dc1995f8 | ostest-lvlxl-worker-0-z5jpm | ERROR | | ostest-lvlxl-rhcos | m4.xlarge |
| fcfa8630-781e-4aef-b8d0-aaf80e762f7a | ostest-lvlxl-worker-0-87xpt | ACTIVE | restricted_network=172.16.0.239 | ostest-lvlxl-rhcos | m4.xlarge |
| ff79e1c2-839a-4f18-b416-57a93d3b4cd5 | ostest-lvlxl-worker-0-rlmv8 | ACTIVE | restricted_network=172.16.0.57 | ostest-lvlxl-rhcos | m4.xlarge |
| 5e6cb585-7e21-4045-9b23-a7c58cd71696 | ostest-lvlxl-master-2 | ACTIVE | restricted_network=172.16.0.21 | ostest-lvlxl-rhcos | m4.xlarge |
| f4377b56-90b5-4d11-8344-4a7642ec6622 | ostest-lvlxl-master-1 | ACTIVE | restricted_network=172.16.0.195 | ostest-lvlxl-rhcos | m4.xlarge |
| 99a1fe73-449d-4738-aa35-71d8b32c3c8c | ostest-lvlxl-master-0 | ACTIVE | restricted_network=172.16.0.50 | ostest-lvlxl-rhcos | m4.xlarge |
| a455fedc-6fb9-47c2-baae-8ad6e830f036 | installer_host | ACTIVE | installer_host-network=172.16.40.119, 10.0.0.202; restricted_network=172.16.0.3 | rhel-guest-image-8.5-1174.x86_64.qcow2 | m1.medium |

(shiftstack) [cloud-user@installer-host ~]$ oc get machines -A

| NAMESPACE | NAME | PHASE | TYPE | REGION | ZONE | AGE |
|---|---|---|---|---|---|---|
| openshift-machine-api | ostest-lvlxl-master-0 | Running | | | | 174m |
| openshift-machine-api | ostest-lvlxl-master-1 | Running | | | | 174m |
| openshift-machine-api | ostest-lvlxl-master-2 | Running | | | | 174m |
| openshift-machine-api | ostest-lvlxl-worker-0-87xpt | Running | m4.xlarge | regionOne | nova | 141m |
| openshift-machine-api | ostest-lvlxl-worker-0-rlmv8 | Running | m4.xlarge | regionOne | nova | 153m |
| openshift-machine-api | ostest-lvlxl-worker-0-z5jpm | Deleting | | | | 7m54s |

The machine controller keeps failing to delete the instance:

E0616 11:07:43.080215 1 controller.go:317] controller/machine_controller "msg"="Reconciler error" "error"="error deleting Openstack instance: Resource not found" "name"="ostest-lvlxl-worker-0-z5jpm" "namespace"="openshift-machine-api"

(shiftstack) [cloud-user@installer-host ~]$ oc logs machine-api-controllers-58d97cddf5-d25rb -n openshift-machine-api -c machine-controller |grep ostest-lvlxl-worker-0-z5jpm |tail -10

I0616 11:06:18.707389 1 logr.go:252] events "msg"="Warning" "message"="DeleteError" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"ostest-lvlxl-worker-0-z5jpm","uid":"9a7ea915-5868-4e8b-a943-3fd946017f54","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"79094"} "reason"="FailedDelete"
E0616 11:06:18.732869 1 actuator.go:415] Machine error ostest-lvlxl-worker-0-z5jpm: error deleting Openstack instance: Resource not found
E0616 11:06:18.732926 1 controller.go:261] ostest-lvlxl-worker-0-z5jpm: failed to delete machine: error deleting Openstack instance: Resource not found
E0616 11:06:18.733186 1 controller.go:317] controller/machine_controller "msg"="Reconciler error" "error"="error deleting Openstack instance: Resource not found" "name"="ostest-lvlxl-worker-0-z5jpm" "namespace"="openshift-machine-api"
I0616 11:07:40.653944 1 controller.go:175] ostest-lvlxl-worker-0-z5jpm: reconciling Machine
I0616 11:07:40.684395 1 controller.go:219] ostest-lvlxl-worker-0-z5jpm: reconciling machine triggers delete
I0616 11:07:43.053597 1 logr.go:252] events "msg"="Warning" "message"="DeleteError" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"ostest-lvlxl-worker-0-z5jpm","uid":"9a7ea915-5868-4e8b-a943-3fd946017f54","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"79094"} "reason"="FailedDelete"
E0616 11:07:43.080010 1 actuator.go:415] Machine error ostest-lvlxl-worker-0-z5jpm: error deleting Openstack instance: Resource not found
E0616 11:07:43.080142 1 controller.go:261] ostest-lvlxl-worker-0-z5jpm: failed to delete machine: error deleting Openstack instance: Resource not found
E0616 11:07:43.080215 1 controller.go:317] controller/machine_controller "msg"="Reconciler error" "error"="error deleting Openstack instance: Resource not found" "name"="ostest-lvlxl-worker-0-z5jpm" "namespace"="openshift-machine-api"