Bug 1871807

Summary: Scale-down of machineset deprovisions BMHs but does not delete nodes
Product: OpenShift Container Platform
Component: Bare Metal Hardware Provisioning
Sub component: baremetal-operator
Version: 4.5
Reporter: Andrew Bays <abays>
Assignee: Zane Bitter <zbitter>
QA Contact: Amit Ugol <augol>
CC: mbooth, mschuppe, stbenjam, zbitter
Status: CLOSED DUPLICATE
Severity: unspecified
Priority: unspecified
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Regression: ---
Last Closed: 2020-08-24 17:50:31 UTC

Description Andrew Bays 2020-08-24 10:44:26 UTC
Description of problem:

I have a 4.5.6 OCP cluster deployed with 3 masters and 5 workers.  If I scale down the default worker machineset:

# oc scale machineset/ostest-worker-0 --replicas=4 -n openshift-machine-api

...I see that an associated BMH is deprovisioned:

# oc get bmh -A
openshift-machine-api   worker-4   OK       ready                                            ipmi://10.0.1.2:6254   libvirt            false

...but the associated node remains stuck like so:

# oc get nodes
ostest-worker-4   NotReady,SchedulingDisabled   worker   2d17h   v1.18.3+002a51f
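
For reference, one way to cross-check which Machine still claims the stuck node is the jsonpath query below. This is only a sketch of mine; it assumes the machine controller has populated status.phase and status.nodeRef on each Machine:

# oc get machines -n openshift-machine-api -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.status.nodeRef.name}{"\n"}{end}'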


Version-Release number of selected component (if applicable): 4.5.6


How reproducible:

100%

Steps to Reproduce:
1. Deploy a 4.5.6 OCP cluster
2. Scale down the default worker machineset
3. Observe that the node(s) corresponding to the removed machine(s) are not actually deleted

Actual results:

Deprovisioned node is not deleted

Expected results:

Deprovisioned node is deleted

Additional info:

Comment 1 Stephen Benjamin 2020-08-24 11:27:26 UTC
This looks just like https://bugzilla.redhat.com/show_bug.cgi?id=1869318; we probably need to get that fix backported to 4.5. @Zane, could you have a look at this? The 4.6 PR didn't apply cleanly, so we can probably use this BZ to backport it for you (if these are indeed the same problem).

Comment 2 Zane Bitter 2020-08-24 13:40:22 UTC
It *looks* like the same bug, but the patch that added a finalizer isn't present in the release-4.5 branch.

Can you do "oc edit node ostest-worker-4" and confirm whether there is a DeletionTimestamp and what finalizers, if any, are present? (Note that this information is hidden in "oc describe", so don't bother looking there.)
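
If it's quicker, those two fields can also be read directly with jsonpath instead of scanning the editor buffer. A sketch, assuming the node name above:

# oc get node ostest-worker-4 -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'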

Comment 3 Zane Bitter 2020-08-24 13:52:18 UTC
Also check the status of the Machine - the most likely cause is bug 1863010, which was fixed last week. If that's the case you will likely see that the Machine still exists and is in the Deleting phase, and that's why the Node has not been deleted.
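
A quick way to check is something like the following sketch, where <machine-name> stands for whichever Machine backed the deprovisioned host:

# oc get machine <machine-name> -n openshift-machine-api -o jsonpath='{.status.phase}{"\n"}{.metadata.finalizers}{"\n"}'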

Comment 4 Andrew Bays 2020-08-24 17:05:25 UTC
I don't see a DeletionTimestamp or any finalizers on the node itself (I reproduced using a different node):

# oc get node/ostest-worker-2 -o yaml | grep -i delet
# oc get node/ostest-worker-2 -o yaml | grep -i final

I used "oc edit" as well and scanned through it manually.

It does, however, appear that the associated machine is stuck in the deleting state:

# oc get machines -A
NAMESPACE               NAME                    PHASE      TYPE   REGION   ZONE   AGE
openshift-machine-api   ostest-worker-0-bhdbk   Deleting                          3d
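
For completeness, the controller's view of why the Machine is stuck should show up in its conditions and events, e.g. (a sketch using the machine name above):

# oc describe machine ostest-worker-0-bhdbk -n openshift-machine-api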

Comment 5 Zane Bitter 2020-08-24 17:50:31 UTC
That's completely consistent with bug 1863010. I'll close as a duplicate, but feel free to reopen if you reproduce this in a build that has the fix.

*** This bug has been marked as a duplicate of bug 1863010 ***