Bug 1871807 - Scale-down of machineset deprovisions BMHs but does not delete nodes
Keywords:
Status: CLOSED DUPLICATE of bug 1863010
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Assignee: Zane Bitter
QA Contact: Amit Ugol
Reported: 2020-08-24 10:44 UTC by Andrew Bays
Modified: 2020-09-08 14:41 UTC

Last Closed: 2020-08-24 17:50:31 UTC

Description Andrew Bays 2020-08-24 10:44:26 UTC
Description of problem:

I have a 4.5.6 OCP cluster deployed with 3 masters and 5 workers.  If I scale down the default worker machineset:

# oc scale machineset/ostest-worker-0 --replicas=4 -n openshift-machine-api

...I see that an associated BMH is deprovisioned:

# oc get bmh -A
openshift-machine-api   worker-4   OK       ready                                            ipmi://10.0.1.2:6254   libvirt            false

...but the associated node remains stuck like so:

# oc get nodes
ostest-worker-4   NotReady,SchedulingDisabled   worker   2d17h   v1.18.3+002a51f


Version-Release number of selected component (if applicable): 4.5.6


How reproducible:

100%

Steps to Reproduce:
1. Deploy a 4.5.6 OCP cluster
2. Scale down the default worker machineset
3. See that the node(s) removed are not actually deleted

Actual results:

Deprovisioned node is not deleted

Expected results:

Deprovisioned node is deleted

Additional info:

Comment 1 Stephen Benjamin 2020-08-24 11:27:26 UTC
This looks just like https://bugzilla.redhat.com/show_bug.cgi?id=1869318; we probably need to get that fix backported to 4.5. @Zane, could you have a look at this? The 4.6 PR didn't apply cleanly, so we can probably use this BZ to backport it for you (if these are indeed the same problem).

Comment 2 Zane Bitter 2020-08-24 13:40:22 UTC
It *looks* like the same bug, but the patch that added a finalizer isn't present in the release-4.5 branch.

Can you do "oc edit node ostest-worker-4" and confirm whether there is a DeletionTimestamp and what finalizers, if any, are present? (Note that this information is hidden in "oc describe", so don't bother looking there.)
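
For reference, the same check can be scripted instead of scanning the "oc edit" buffer by hand. This is only a sketch: it assumes "oc" is already logged in to the cluster, it relies on the standard "-o jsonpath" output option, and node_deletion_state is a hypothetical helper name, not something shipped with oc.

```shell
# Sketch: report whether a Node is marked for deletion and which
# finalizers (if any) are holding it.
# Assumption: `oc` is on PATH and logged in to the cluster.
# node_deletion_state is a hypothetical helper name.
node_deletion_state() {
  node="$1"
  # Both fields live under .metadata; jsonpath prints nothing when a field is unset.
  ts=$(oc get node "$node" -o jsonpath='{.metadata.deletionTimestamp}')
  fin=$(oc get node "$node" -o jsonpath='{.metadata.finalizers}')
  echo "deletionTimestamp: ${ts:-<none>}"
  echo "finalizers: ${fin:-<none>}"
}
```

Usage would be "node_deletion_state ostest-worker-4". If both lines show <none>, nothing has requested deletion of the Node object yet.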

Comment 3 Zane Bitter 2020-08-24 13:52:18 UTC
Also check the status of the Machine - the most likely cause is bug 1863010, which was fixed last week. If that's the case you will likely see that the Machine still exists and is in the Deleting phase, and that's why the Node has not been deleted.
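
A quick way to spot this, as a sketch only: filter the Machine list for the Deleting phase. It assumes the table printed by "oc get machines -A" has NAMESPACE, NAME and PHASE as its first three columns, and deleting_machines is a hypothetical helper name.

```shell
# Sketch: list Machines whose PHASE column reads "Deleting".
# Reads the tabular output of `oc get machines -A` on stdin.
# Assumption: columns are NAMESPACE NAME PHASE ... in that order.
# deleting_machines is a hypothetical helper name.
deleting_machines() {
  # Skip the header row, then match on the third column.
  awk 'NR > 1 && $3 == "Deleting" { print $1 "/" $2 }'
}
```

Usage would be "oc get machines -A | deleting_machines". Any namespace/name printed is a Machine that still exists and has not finished deleting.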

Comment 4 Andrew Bays 2020-08-24 17:05:25 UTC
I don't see any DeletionTimestamp or finalizers on the node itself (I reproduced using a different node):

# oc get node/ostest-worker-2 -o yaml | grep -i delet
# oc get node/ostest-worker-2 -o yaml | grep -i final

I used "oc edit" as well and scanned through it manually.

It does, however, appear that the associated machine is stuck in the deleting state:

# oc get machines -A
NAMESPACE               NAME                    PHASE      TYPE   REGION   ZONE   AGE
openshift-machine-api   ostest-worker-0-bhdbk   Deleting                          3d

Comment 5 Zane Bitter 2020-08-24 17:50:31 UTC
That's completely consistent with bug 1863010. I'll close as a duplicate, but feel free to reopen if you reproduce this in a build that has the fix.

*** This bug has been marked as a duplicate of bug 1863010 ***

