Bug 1886028 - [BM][IPI] Failed to delete node after scale down
Summary: [BM][IPI] Failed to delete node after scale down
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Zane Bitter
QA Contact: Daniel
URL:
Whiteboard:
Duplicates: 1885921
Depends On:
Blocks: 1886582
 
Reported: 2020-10-07 13:56 UTC by Yurii Prokulevych
Modified: 2021-02-24 15:24 UTC
CC: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:23:52 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-api-provider-baremetal pull 122 0 None closed Bug 1886028: Remove Node finalizer first on delete 2021-02-09 12:01:46 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update 2022-04-26 16:01:51 UTC

Description Yurii Prokulevych 2020-10-07 13:56:58 UTC
Description of problem:
-----------------------
After scaling down the cluster by one worker node, the delete operation for the node gets stuck.


oc delete nodes/openshift-worker-4
node "openshift-worker-4" deleted
...


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
4.6.0-0.nightly-2020-10-02-065738


Steps to Reproduce:
-------------------
1. Annotate the machine corresponding to the node you are about to delete, e.g.:
    oc annotate machine worker-0-n7q5s machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api

2. Delete the BMH consumed by the annotated machine, e.g.:
     oc delete bmh worker-X -n openshift-machine-api

3. Scale down the machine set by one, e.g.:
    oc scale machineset zwlcp-worker-0 --replicas=X-1 -n openshift-machine-api

4. Try to delete the corresponding node:
    oc delete nodes/openshift-worker-4

Actual results:
---------------
Node deletion stuck

Expected results:
-----------------
Node is deleted


Additional info:
----------------
BM IPI setup: 3 masters + 8 workers

Comment 2 Stefan Schimanski 2020-10-07 15:21:24 UTC
Moving to the cluster-api team. Please run the deletion with more verbosity. Chances are high that cluster-api is using finalizers or something similar, and oc blocks by design.
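
For illustration, "more verbosity" could mean re-running the deletion with the generic oc/kubectl client log-level flag to see which API call the client is waiting on; this exact invocation is a sketch, not a command taken from this report:

    oc delete nodes/openshift-worker-4 -v=8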

Comment 3 Joel Speed 2020-10-07 15:37:10 UTC
Machine API doesn't normally set any finalizers on node objects. However, since this is a BMH there may be some differences, and the baremetal folks may be setting something; transferring to them to check.

Comment 6 Doug Hellmann 2020-10-08 17:48:57 UTC
(In reply to Yurii Prokulevych from comment #0)
> Description of problem:
> -----------------------
> After scaling down the cluster by one worker node, the delete operation for the
> node gets stuck
> 
> 
> oc delete nodes/openshift-worker-4
> node "openshift-worker-4" deleted
> ...
> 
> 
> Version-Release number of selected component (if applicable):
> -------------------------------------------------------------
> 4.6.0-0.nightly-2020-10-02-065738
> 
> 
> Steps to Reproduce:
> -------------------
> 1. Annotate the machine corresponding to the node you are about to delete, e.g.:
>     oc annotate machine  worker-0-n7q5s
> machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
> 
> 2. Delete the BMH consumed by the annotated machine, e.g.:
>      oc delete bmh worker-X -n openshift-machine-api
> 
> 3. Scale down the machine set by one, e.g.:
>     oc scale machineset zwlcp-worker-0 --replicas=X-1 -n
> openshift-machine-api
> 
> 4. Try to delete the corresponding node:
>     oc delete nodes/openshift-worker-4

I don't think those are the right steps for scaling a MachineSet down. Step 2 should come last and step 4 should not be needed at all.

Could you attach the output of `oc get node` for the node that you're trying to delete, so we can see if there is a finalizer on it.

Comment 7 Doug Hellmann 2020-10-08 17:55:07 UTC
The order for 1, 2, and 3 does seem to be right according to https://github.com/metal3-io/metal3-docs/blob/master/design/baremetal-operator/remove-host.md

The node should be removed automatically, when the machine is removed. I would still like to see the details of the node resource to see if there is a finalizer on it.
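
For example (a sketch, not commands or output from this bug), the deletion timestamp and any finalizers on the stuck node could be checked with:

    oc get node openshift-worker-4 -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'

or by dumping the full resource with `oc get node openshift-worker-4 -o yaml`.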

Comment 8 Zane Bitter 2020-10-08 18:01:23 UTC
This is a variation on bug 1869318. In the fix for that, we ensured that the finalizer is removed from the Node, but it doesn't work if the Host is already deleted.

IMHO the priority of this is overstated, since there's no reason to delete the Host before scaling down the MachineSet, but it ought to be fixed so that it works in any order.
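
To spell that out (a sketch reusing the placeholder names from the report, not an ordering prescribed elsewhere in this bug), the less error-prone sequence would be to scale the MachineSet down first and delete the BareMetalHost only once the Machine and its Node are gone:

    oc annotate machine worker-0-n7q5s machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
    oc scale machineset zwlcp-worker-0 --replicas=X-1 -n openshift-machine-api
    # only after the Machine and its Node have been removed:
    oc delete bmh worker-X -n openshift-machine-api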

Comment 9 Zane Bitter 2020-10-08 18:15:03 UTC
*** Bug 1885921 has been marked as a duplicate of this bug. ***

Comment 11 Zane Bitter 2020-10-09 02:23:54 UTC
Examining the must-gather confirms that the Machine and Host are both gone, and that the issue is the CAPBM's finalizer left behind on the Node.

The mechanism by which we intended to prevent this bug was the Machine maintaining a finalizer on the BareMetalHost, so that the Host could not actually disappear while the Machine still needed it. It's clear that this mechanism is not working:

2020-10-07T13:04:46.166 cleanup is complete, removed finalizer {Request.Namespace: 'openshift-machine-api', Request.Name: 'openshift-worker-4', provisioningState: 'deleting', remaining: []}

("remaining" here is the list of finalizers remaining after removing the baremetal-operator's own finalizer from the Host - the list is empty so the Host disappears.)

The issue is that we don't wait for Delete() to be called to remove the finalizer from the Host. We also remove it in Update() if we notice that the ProvisioningState of the Host is Deleting, just before we mark the Machine itself for deletion (which will trigger Delete() to be called later):

2020-10-07T07:26:27.851012763Z I1007 07:26:27.850977       1 controller.go:169] ocp-edge1-zwlcp-worker-0-2crvl: reconciling Machine
2020-10-07T07:26:27.851012763Z 2020/10/07 07:26:27 Checking if machine ocp-edge1-zwlcp-worker-0-2crvl exists.
2020-10-07T07:26:27.851184122Z 2020/10/07 07:26:27 Machine ocp-edge1-zwlcp-worker-0-2crvl exists.
2020-10-07T07:26:27.851184122Z I1007 07:26:27.851036       1 controller.go:277] ocp-edge1-zwlcp-worker-0-2crvl: reconciling machine triggers idempotent update
2020-10-07T07:26:27.851184122Z 2020/10/07 07:26:27 Updating machine ocp-edge1-zwlcp-worker-0-2crvl .
2020-10-07T07:26:27.851184122Z 2020/10/07 07:26:27 Removing finalizer for host: openshift-worker-4
2020-10-07T07:26:27.870741201Z 2020/10/07 07:26:27 Removed finalizer for host: openshift-worker-4
2020-10-07T07:26:27.870741201Z 2020/10/07 07:26:27 Deleting machine whose associated host is gone: ocp-edge1-zwlcp-worker-0-2crvl
2020-10-07T07:26:27.883849976Z 2020/10/07 07:26:27 Deleted machine whose associated host is gone: ocp-edge1-zwlcp-worker-0-2crvl

This is wrong in principle, even ignoring that we probably shouldn't be deleting the Machine just because the Host has gone away (bug 1868104), because if we fail to mark the Machine for deletion then the Host may go away before we get another chance, in which case we won't see the ProvisioningState and thus won't try again to mark the Machine for deletion. In practice, though, the timing makes that unlikely.

The fix for this bug, where we remove the Node finalizer as soon as we start Delete() without checking whether the Host still exists, is valid. We could still remove the code from Update() that deletes the Host finalizer, or just wait for the fix for bug 1868104 (which eliminates the need for a finalizer on the Host at all).
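
As an interim workaround until the fix lands (a sketch, not a procedure documented in this bug), the stranded finalizer can be cleared from the Node by hand once the Machine and BareMetalHost are confirmed gone, which lets the pending deletion complete:

    # inspect which finalizers are left on the stuck node
    oc get node openshift-worker-4 -o jsonpath='{.metadata.finalizers}{"\n"}'
    # clear the remaining finalizers (blunt, and only safe once the Machine and Host are gone)
    oc patch node openshift-worker-4 --type=merge -p '{"metadata":{"finalizers":null}}'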

Comment 16 errata-xmlrpc 2021-02-24 15:23:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

