Description of problem:
-----------------------
After scaling down the cluster by one worker node, the delete operation for the node gets stuck:

    oc delete nodes/openshift-worker-4
    node "openshift-worker-4" deleted
    ...

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
4.6.0-0.nightly-2020-10-02-065738

Steps to Reproduce:
-------------------
1. Annotate the machine corresponding to the node you are about to delete, e.g.:
   oc annotate machine worker-0-n7q5s machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api

2. Delete the BMH consumed by the annotated machine, e.g.:
   oc delete bmh worker-X -n openshift-machine-api

3. Scale down the machine set by one, e.g.:
   oc scale machineset zwlcp-worker-0 --replicas=X-1 -n openshift-machine-api

4. Try to delete the corresponding node:
   oc delete nodes/openshift-worker-4

Actual results:
---------------
Node deletion is stuck.

Expected results:
-----------------
Node is deleted.

Additional info:
----------------
BM IPI setup: 3 masters + 8 workers
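Some background on why `oc delete` prints "deleted" yet the Node lingers: the API server sets `metadata.deletionTimestamp` immediately, but the object is only removed once its `metadata.finalizers` list is empty. Below is a minimal sketch over a mock node manifest (the finalizer name and the mock itself are illustrative, not taken from this cluster); on a live cluster the equivalent check would be `oc get node openshift-worker-4 -o jsonpath='{.metadata.finalizers}'`.

```shell
# Mock of what `oc get node openshift-worker-4 -o json` might look like for a
# node stuck in deletion (finalizer name is illustrative, not the exact one):
cat > /tmp/stuck-node.json <<'EOF'
{
  "metadata": {
    "name": "openshift-worker-4",
    "deletionTimestamp": "2020-10-07T07:30:00Z",
    "finalizers": ["machine.openshift.io/node-finalizer"]
  }
}
EOF

# A deletionTimestamp together with a non-empty finalizers list means the
# object stays around until some controller clears the finalizer:
jq -r '.metadata.finalizers[]' /tmp/stuck-node.json
```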
Moving to the cluster-api team. Please run the deletion with more verbosity. Chances are high that cluster-api is using finalizers or something like that, and oc blocks by design.
Machine API doesn't normally set any finalizers on Node objects. However, since this is a BMH, there may be some differences there, and the baremetal folks may be setting something; transferring to them to check.
(In reply to Yurii Prokulevych from comment #0)
> Description of problem:
> -----------------------
> After scaling down cluster by one worker node, the delete operation for node
> stucks
>
>
> oc delete nodes/openshift-worker-4
> node "openshift-worker-4" deleted
> ...
>
>
> Version-Release number of selected component (if applicable):
> -------------------------------------------------------------
> 4.6.0-0.nightly-2020-10-02-065738
>
>
> Steps to Reproduce:
> -------------------
> 1. Annotate machine, corresponding to a node U r about to delete e.g.:
> oc annotate machine worker-0-n7q5s
> machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
>
> 2. Delete BMH consumed but annotated machine, for e.g:
> oc delete bmh worker-X -n openshift-machine-api
>
> 3. Scale down machine set by one for e.g.:
> oc scale machineset zwlcp-worker-0 --replicas=X-1 -n
> openshift-machine-api
>
> 4. Try to delete node corresponding node:
> oc delete nodes/openshift-worker-4

I don't think those are the right steps for scaling a MachineSet down. Step 2 should come last, and step 4 should not be needed at all.

Could you attach the output of `oc get node` for the node that you're trying to delete, so we can see if there is a finalizer on it?
The order for steps 1, 2, and 3 does seem to be right according to https://github.com/metal3-io/metal3-docs/blob/master/design/baremetal-operator/remove-host.md - the node should be removed automatically when the machine is removed. I would still like to see the details of the node resource to see if there is a finalizer on it.
This is a variation on bug 1869318. In the fix for that, we ensured that the finalizer is removed from the Node, but it doesn't work if the Host is already deleted. IMHO the priority of this is overstated, since there's no reason to delete the Host before scaling down the MachineSet, but it ought to be fixed so that it works in any order.
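For anyone hitting this before the fix lands, the stuck Node can be unblocked by clearing the leftover finalizer by hand. Below is a sketch of what the merge patch does, demonstrated with jq on a mock object (the finalizer name is illustrative). The live-cluster equivalent would be `oc patch node openshift-worker-4 --type=merge -p '{"metadata":{"finalizers":null}}'`, which should only be used once it's certain no controller still needs to act on the Node.

```shell
# Demonstrate the merge-patch semantics: setting finalizers to null empties
# the list, after which the API server can garbage-collect the object.
echo '{"metadata":{"name":"openshift-worker-4","finalizers":["machine.openshift.io/node-finalizer"]}}' \
  | jq '.metadata.finalizers = null'
```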
*** Bug 1885921 has been marked as a duplicate of this bug. ***
Examining the must-gather confirms that the Machine and Host are both gone, and that the issue is the CAPBM finalizer left behind on the Node.

The mechanism by which we intended to prevent this bug was the Machine maintaining a finalizer on the BareMetalHost, so that the BareMetalHost could not actually disappear. It's clear that this mechanism is not working:

    2020-10-07T13:04:46.166 cleanup is complete, removed finalizer {Request.Namespace: 'openshift-machine-api', Request.Name: 'openshift-worker-4', provisioningState: 'deleting', remaining: []} 2

("remaining" here is the list of finalizers remaining after removing the baremetal-operator's own finalizer from the Host - the list is empty, so the Host disappears.)

The issue is that we don't wait for Delete() to be called to remove the finalizer from the Host. We also remove it in Update() if we notice that the ProvisioningState of the Host is Deleting, just before we mark the Machine itself for deletion (which will trigger Delete() to be called later):

    2020-10-07T07:26:27.851012763Z I1007 07:26:27.850977 1 controller.go:169] ocp-edge1-zwlcp-worker-0-2crvl: reconciling Machine
    2020-10-07T07:26:27.851012763Z 2020/10/07 07:26:27 Checking if machine ocp-edge1-zwlcp-worker-0-2crvl exists.
    2020-10-07T07:26:27.851184122Z 2020/10/07 07:26:27 Machine ocp-edge1-zwlcp-worker-0-2crvl exists.
    2020-10-07T07:26:27.851184122Z I1007 07:26:27.851036 1 controller.go:277] ocp-edge1-zwlcp-worker-0-2crvl: reconciling machine triggers idempotent update
    2020-10-07T07:26:27.851184122Z 2020/10/07 07:26:27 Updating machine ocp-edge1-zwlcp-worker-0-2crvl .
    2020-10-07T07:26:27.851184122Z 2020/10/07 07:26:27 Removing finalizer for host: openshift-worker-4
    2020-10-07T07:26:27.870741201Z 2020/10/07 07:26:27 Removed finalizer for host: openshift-worker-4
    2020-10-07T07:26:27.870741201Z 2020/10/07 07:26:27 Deleting machine whose associated host is gone: ocp-edge1-zwlcp-worker-0-2crvl
    2020-10-07T07:26:27.883849976Z 2020/10/07 07:26:27 Deleted machine whose associated host is gone: ocp-edge1-zwlcp-worker-0-2crvl

This is wrong in principle (even ignoring that we probably shouldn't be deleting the Machine just because the Host has gone away - bug 1868104), because if we fail to mark the Machine for deletion then the Host may go away before we get another chance, in which case we won't see the ProvisioningState and thus won't try again to mark the Machine for deletion. In practice, though, the timing makes that unlikely.

The fix for this bug - removing the Node finalizer as soon as we start Delete(), without checking whether the Host still exists - is valid. We could still remove the code from Update() that deletes the Host finalizer, or just wait for the fix for bug 1868104 (which eliminates the need for a finalizer on the Host at all).
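To make the "remaining: []" log line above concrete: the Host object only disappears once every finalizer on it is gone. Had the Machine's finalizer still been on the Host as intended, removing the baremetal-operator's own finalizer would have left a non-empty list and kept the Host pinned. A sketch of the two cases with jq (both finalizer names here are illustrative, not necessarily the exact strings the controllers use):

```shell
# What the bug produced: only the baremetal-operator's finalizer is present,
# so removing it leaves an empty list and the Host vanishes.
echo '["baremetalhost.metal3.io"]' \
  | jq 'map(select(. != "baremetalhost.metal3.io"))'

# What was intended: the Machine's finalizer remains, so the Host survives
# long enough for Delete() to run and clean up the Node.
echo '["baremetalhost.metal3.io","machine.openshift.io/host-finalizer"]' \
  | jq 'map(select(. != "baremetalhost.metal3.io"))'
```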
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633