Bug 1994625 - cluster-api-provider-openstack machine-controller dangling openstack resources of machine deleted while in Provisioning state
Summary: cluster-api-provider-openstack machine-controller dangling openstack resource...
Keywords:
Status: CLOSED DUPLICATE of bug 1921656
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Adolfo Duarte
QA Contact: Jon Uriarte
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-08-17 14:58 UTC by Andrew Collins
Modified: 2021-08-20 15:26 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-20 04:03:07 UTC
Target Upstream Version:
Embargoed:


Description Andrew Collins 2021-08-17 14:58:23 UTC
Description of problem:
If a Machine is deleted while it is in the Provisioning state, its OpenStack resources are not cleaned up and the Machine remains in the Deleting state indefinitely.


Version-Release number of selected component (if applicable):
4.8.2


How reproducible:
100% Reproducible


Steps to Reproduce:
1. Delete a Machine owned by a MachineSet such that a new Machine is created.
2. When the new Machine reaches the Provisioning state, run `oc delete machine` on this machine.

Actual results:
* The machine-controller container in the machine-api-controller pod begins to reconcile the delete, and errors out trying to update the Machine with the resourceID (OpenStack server ID).
* The OpenStack Server and Port are still created for this Machine, but machine-controller complains that they cannot be found because the openstack-resourceId annotation is missing on the Machine.
* The Machine is stuck in Deleting indefinitely, and the OpenStack resources are left behind for manual cleanup.
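
The sequence in the controller logs can be modeled as an optimistic-concurrency race. The Go sketch below is a simplified simulation, not the provider's code: the `machine` and `store` types, `simulateRace`, and the server ID are hypothetical stand-ins, and the annotation key is taken from the wording of this report. It shows how an update made from a stale copy of the Machine is rejected once the delete has bumped the resource version, which leaves the stored Machine without the server ID.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the API server's "object has been modified" error.
var errConflict = errors.New("conflict: the object has been modified")

// machine is a hypothetical, reduced stand-in for the Machine resource.
type machine struct {
	resourceVersion int
	deleting        bool // true once a deletion timestamp has been set
	annotations     map[string]string
}

// store mimics the API server's optimistic concurrency check.
type store struct{ current machine }

// get returns a deep copy so callers cannot mutate the stored object directly.
func (s *store) get() machine {
	c := s.current
	c.annotations = make(map[string]string, len(s.current.annotations))
	for k, v := range s.current.annotations {
		c.annotations[k] = v
	}
	return c
}

// update rejects writes made from a stale resourceVersion.
func (s *store) update(m machine) error {
	if m.resourceVersion != s.current.resourceVersion {
		return errConflict
	}
	m.resourceVersion++
	s.current = m
	return nil
}

// simulateRace replays the reported sequence and returns the stored Machine
// together with the error seen by the create path.
func simulateRace() (machine, error) {
	s := &store{current: machine{resourceVersion: 1, annotations: map[string]string{}}}

	// The create path reads the Machine, then spends time creating the server.
	stale := s.get()

	// Meanwhile the Machine is deleted, which bumps the resourceVersion.
	del := s.get()
	del.deleting = true
	_ = s.update(del)

	// The server now exists; the create path tries to record its ID
	// using the copy it read before the delete.
	stale.annotations["openstack-resourceId"] = "hypothetical-server-id"
	err := s.update(stale)
	return s.get(), err
}

func main() {
	final, err := simulateRace()
	_, recorded := final.annotations["openstack-resourceId"]
	fmt.Println("update error:", err)
	fmt.Println("deleting:", final.deleting)
	fmt.Println("resource ID recorded:", recorded)
}
```

In this model the server has been created but its ID was never persisted, so a delete that relies on the annotation has nothing to look up.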


Expected results:
Machine-controller retries until it is able to update the Machine with the openstack-resourceId, allowing the delete reconcile to succeed.
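
One way to get the behavior described above is to re-read the Machine and reapply the change whenever the update is rejected with a conflict. The Go sketch below illustrates that loop against an in-memory stand-in for the API server. All type and function names (`machine`, `store`, `updateWithRetry`, `demo`) and the server ID are hypothetical, and this is not necessarily how the provider addresses the problem. In a real controller, client-go's `retry.RetryOnConflict` helper provides a comparable loop.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the API server's "object has been modified" error.
var errConflict = errors.New("conflict: the object has been modified")

// machine is a hypothetical, reduced stand-in for the Machine resource.
type machine struct {
	resourceVersion int
	deleting        bool // true once a deletion timestamp has been set
	annotations     map[string]string
}

// store mimics the API server's optimistic concurrency check.
type store struct{ current machine }

// get returns a deep copy so callers cannot mutate the stored object directly.
func (s *store) get() machine {
	c := s.current
	c.annotations = make(map[string]string, len(s.current.annotations))
	for k, v := range s.current.annotations {
		c.annotations[k] = v
	}
	return c
}

// update rejects writes made from a stale resourceVersion.
func (s *store) update(m machine) error {
	if m.resourceVersion != s.current.resourceVersion {
		return errConflict
	}
	m.resourceVersion++
	s.current = m
	return nil
}

// updateWithRetry re-reads the latest Machine before every attempt and
// reapplies mutate. It stops on success, on a non-conflict error, or after
// the given number of attempts (which should be at least 1).
func updateWithRetry(s *store, attempts int, mutate func(*machine)) error {
	err := errConflict
	for i := 0; i < attempts && errors.Is(err, errConflict); i++ {
		latest := s.get()
		mutate(&latest)
		err = s.update(latest)
	}
	return err
}

// demo forces one conflict by letting a delete land between the first read
// and the first write, then reports the final state and attempt count.
func demo() (machine, int, error) {
	s := &store{current: machine{resourceVersion: 1, annotations: map[string]string{}}}
	calls := 0
	err := updateWithRetry(s, 3, func(m *machine) {
		calls++
		if calls == 1 {
			// Simulate the concurrent delete bumping the resourceVersion.
			d := s.get()
			d.deleting = true
			_ = s.update(d)
		}
		m.annotations["openstack-resourceId"] = "hypothetical-server-id"
	})
	return s.get(), calls, err
}

func main() {
	final, calls, err := demo()
	fmt.Println("error:", err)
	fmt.Println("attempts:", calls)
	fmt.Println("deleting:", final.deleting)
	fmt.Println("resource ID:", final.annotations["openstack-resourceId"])
}
```

Because each attempt starts from a fresh read, the deletion marker set by the concurrent delete is preserved while the server ID is added, so a later delete reconcile has the ID it needs.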

Additional info:
In the customer environment, the Machine was deleted because of a MachineHealthCheck. The reproducer involves deleting the Machine to mimic this behavior.

Below are the controller logs that show this sequence.

```
I0814 10:12:40.032222       1 machine_webhook.go:490] Validate webhook called for Machine: np-rtp-01-bl4dg-ext-worker-1c-l94ks
I0814 10:12:40.042837       1 machinehealthcheck_controller.go:470] Reconciling openshift-machine-api/workers-notready-unknown/np-rtp-01-bl4dg-ext-worker-1c-l94ks/: is likely to go unhealthy in 10m0s
I0814 10:14:42.025450       1 controller.go:174] np-rtp-01-bl4dg-ext-worker-1c-l94ks: reconciling Machine
I0814 10:14:42.235133       1 controller.go:357] np-rtp-01-bl4dg-ext-worker-1c-l94ks: setting phase to Provisioning and requeuing
I0814 10:14:42.245750       1 machinehealthcheck_controller.go:470] Reconciling openshift-machine-api/workers-notready-unknown/np-rtp-01-bl4dg-ext-worker-1c-l94ks/: is likely to go unhealthy in 10m0.754252052s
I0814 10:23:55.706132       1 controller.go:174] np-rtp-01-bl4dg-ext-worker-1c-l94ks: reconciling Machine
I0814 10:23:55.946367       1 controller.go:364] np-rtp-01-bl4dg-ext-worker-1c-l94ks: reconciling machine triggers idempotent create
I0814 10:24:42.043465       1 machinehealthcheck_controller.go:652] openshift-machine-api/workers-notready-unknown/np-rtp-01-bl4dg-ext-worker-1c-l94ks/: deleting
I0814 10:24:42.050795       1 nodelink_controller.go:306] No-op: Machine "np-rtp-01-bl4dg-ext-worker-1c-l94ks" has a deletion timestamp
I0814 10:25:44.369860       1 actuator.go:595] Found the primary address for the machine np-rtp-01-bl4dg-ext-worker-1c-l94ks: 64.101.120.198
W0814 10:25:44.378811       1 controller.go:366] np-rtp-01-bl4dg-ext-worker-1c-l94ks: failed to create machine: Operation cannot be fulfilled on machines.machine.openshift.io "np-rtp-01-bl4dg-ext-worker-1c-l94ks": the object has been modified; please apply your changes to the latest version and try again
E0814 10:25:44.379006       1 controller.go:302] controller-runtime/manager/controller/machine_controller "msg"="Reconciler error" "error"="Operation cannot be fulfilled on machines.machine.openshift.io \"np-rtp-01-bl4dg-ext-worker-1c-l94ks\": the object has been modified; please apply your changes to the latest version and try again" "name"="np-rtp-01-bl4dg-ext-worker-1c-l94ks" "namespace"="openshift-machine-api"
I0814 10:26:35.892133       1 controller.go:174] np-rtp-01-bl4dg-ext-worker-1c-l94ks: reconciling Machine
I0814 10:26:35.892144       1 controller.go:482] np-rtp-01-bl4dg-ext-worker-1c-l94ks: going into phase "Deleting"
I0814 10:26:35.900778       1 controller.go:218] np-rtp-01-bl4dg-ext-worker-1c-l94ks: reconciling machine triggers delete
W0814 10:26:36.608880       1 machineservice.go:953] Couldn't delete all instance  ports: Resource not found
E0814 10:26:36.628757       1 actuator.go:574] Machine error np-rtp-01-bl4dg-ext-worker-1c-l94ks: error deleting Openstack instance: Resource not found
E0814 10:26:36.628797       1 controller.go:239] np-rtp-01-bl4dg-ext-worker-1c-l94ks: failed to delete machine: error deleting Openstack instance: Resource not found
E0814 10:26:36.628867       1 controller.go:302] controller-runtime/manager/controller/machine_controller "msg"="Reconciler error" "error"="error deleting Openstack instance: Resource not found" "name"="np-rtp-01-bl4dg-ext-worker-1c-l94ks" "namespace"="openshift-machine-api" 
```

Comment 1 Adolfo Duarte 2021-08-17 18:08:19 UTC
@andrew, regarding:

"Steps to Reproduce:
1. Delete a Machine owned by a MachineSet such that a new Machine is created
2. When new Machine reaches Provisioning state, `oc delete machine` on this machine."

When you say "Delete a Machine owned by a MachineSet...", do you mean with the oc command (`oc delete machine`) or with the openstack command (`openstack server delete ...`)?

Thanks.

Comment 2 Andrew Collins 2021-08-17 18:41:11 UTC
I mean: "Delete the machine API resource with oc command i.e. `oc delete machine`"

Comment 3 Andrew Collins 2021-08-17 18:58:14 UTC
For what it's worth, the first step is only a means of creating a new Machine to get it into the Provisioning state.

