Bug 1871807

Summary: Scale-down of machineset deprovisions BMHs but does not delete nodes
Product: OpenShift Container Platform
Component: Bare Metal Hardware Provisioning
Sub component: baremetal-operator
Version: 4.5
Reporter: Andrew Bays <abays>
Assignee: Zane Bitter <zbitter>
QA Contact: Amit Ugol <augol>
CC: mbooth, mschuppe, stbenjam, zbitter
Status: CLOSED DUPLICATE
Severity: unspecified
Priority: unspecified
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Regression: ---
Last Closed: 2020-08-24 17:50:31 UTC

Description Andrew Bays 2020-08-24 10:44:26 UTC
Description of problem:

I have a 4.5.6 OCP cluster deployed with 3 masters and 5 workers.  If I scale down the default worker machineset:

# oc scale machineset/ostest-worker-0 --replicas=4 -n openshift-machine-api

...I see that an associated BMH is deprovisioned:

# oc get bmh -A
openshift-machine-api   worker-4   OK       ready                                            ipmi://10.0.1.2:6254   libvirt            false

...but the associated node remains stuck like so:

# oc get nodes
ostest-worker-4   NotReady,SchedulingDisabled   worker   2d17h   v1.18.3+002a51f
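
For reference, one way to cross-check which Machine still claims the stuck node is the jsonpath query below. This is only a sketch of mine; it assumes the machine controller has populated status.phase and status.nodeRef on each Machine:

# oc get machines -n openshift-machine-api -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.status.nodeRef.name}{"\n"}{end}'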


Version-Release number of selected component (if applicable): 4.5.6


How reproducible:

100%

Steps to Reproduce:
1. Deploy a 4.5.6 OCP cluster
2. Scale down the default worker machineset
3. Observe that the node(s) corresponding to the removed machine(s) are not actually deleted

Actual results:

Deprovisioned node is not deleted

Expected results:

Deprovisioned node is deleted

Additional info:

Comment 1 Stephen Benjamin 2020-08-24 11:27:26 UTC
This looks just like https://bugzilla.redhat.com/show_bug.cgi?id=1869318; we probably need to get that fix backported to 4.5. @Zane, could you have a look at this? The 4.6 PR didn't apply cleanly, so we can probably use this BZ to backport it for you (if these are indeed the same problem).

Comment 2 Zane Bitter 2020-08-24 13:40:22 UTC
It *looks* like the same bug, but the patch that added a finalizer isn't present in the release-4.5 branch.

Can you do "oc edit node ostest-worker-4" and confirm whether there is a DeletionTimestamp and what finalizers, if any, are present? (Note that this information is hidden in "oc describe", so don't bother looking there.)
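
If it's quicker, those two fields can also be read directly with jsonpath instead of scanning the editor buffer. A sketch, assuming the node name above:

# oc get node ostest-worker-4 -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'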

Comment 3 Zane Bitter 2020-08-24 13:52:18 UTC
Also check the status of the Machine - the most likely cause is bug 1863010, which was fixed last week. If that's the case you will likely see that the Machine still exists and is in the Deleting phase, and that's why the Node has not been deleted.
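
A quick way to check is something like the following sketch, where <machine-name> stands for whichever Machine backed the deprovisioned host:

# oc get machine <machine-name> -n openshift-machine-api -o jsonpath='{.status.phase}{"\n"}{.metadata.finalizers}{"\n"}'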

Comment 4 Andrew Bays 2020-08-24 17:05:25 UTC
I don't see a DeletionTimestamp or any finalizers on the node itself (I reproduced using a different node):

# oc get node/ostest-worker-2 -o yaml | grep -i delet
# oc get node/ostest-worker-2 -o yaml | grep -i final

I used "oc edit" as well and scanned through it manually.

It does, however, appear that the associated machine is stuck in the deleting state:

# oc get machines -A
NAMESPACE               NAME                    PHASE      TYPE   REGION   ZONE   AGE
openshift-machine-api   ostest-worker-0-bhdbk   Deleting                          3d
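
For completeness, the controller's view of why the Machine is stuck should show up in its conditions and events, e.g. (a sketch using the machine name above):

# oc describe machine ostest-worker-0-bhdbk -n openshift-machine-api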

Comment 5 Zane Bitter 2020-08-24 17:50:31 UTC
That's completely consistent with bug 1863010. I'll close as a duplicate, but feel free to reopen if you reproduce this in a build that has the fix.

*** This bug has been marked as a duplicate of bug 1863010 ***