Description of problem:

When performing a scale down of the worker machineset on a bare metal IPI deployment, machines get stuck in the Deleting state and the machineconfigpool in the updating state. Following the docs at https://github.com/metal3-io/metal3-docs/blob/master/design/baremetal-operator/remove-host.md#scale-down-the-machineset

Version-Release number of selected component (if applicable):
4.5.0-rc.7

How reproducible:
100%

Steps to Reproduce:
1. Deploy bare metal IPI with 3 masters and 2 workers
2. Annotate the machine of the worker that you want to remove:
   oc annotate machine ocp-edge-cluster-0-worker-0-xcd9n machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
3. Scale down the worker machineset:
   oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-worker-0 --replicas=1
4. Wait for the node to get deprovisioned

Actual results:

The BMH goes into the ready state:
openshift-worker-0-0   OK   ready   redfish://192.168.123.1:8000/redfish/v1/Systems/80db2d06-2c8e-4880-a060-6ead6b5b7415   unknown   false

The node goes into the NotReady state:
worker-0-0   NotReady,SchedulingDisabled   worker   34h   v1.18.3+6025c28

The worker MCP is updating:
worker   rendered-worker-02d73d28b403f1ee02c382c93aad78c0   False   True   False   2   1   2   0   35h

The machine is stuck in Deleting:
NAME                                PHASE      TYPE   REGION   ZONE   AGE
ocp-edge-cluster-0-worker-0-xcd9n   Deleting                          35h

Expected results:

The node gets deprovisioned and no resources get stuck in transitory states.

Additional info:

I tried annotating the machine with exclude-node-draining, but it didn't make any difference.
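For the record, this is roughly the annotation I applied for that test (the annotation key is the one the machine-api controllers use to skip node draining on machine deletion; as far as I know only the presence of the annotation matters, the value is just a marker):

oc annotate machine ocp-edge-cluster-0-worker-0-xcd9n machine.openshift.io/exclude-node-draining=true -n openshift-machine-api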
Note: the same result occurs when I delete the bmh before scaling the machineset:

oc annotate machine ocp-edge-cluster-0-worker-0-sgc79 machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
oc -n openshift-machine-api delete bmh openshift-worker-0-0
oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-worker-0 --replicas=1

oc get nodes
NAME         STATUS                        ROLES    AGE   VERSION
master-0-0   Ready                         master   66m   v1.18.3+6025c28
master-0-1   Ready                         master   65m   v1.18.3+6025c28
master-0-2   Ready                         master   66m   v1.18.3+6025c28
worker-0-0   NotReady,SchedulingDisabled   worker   40m   v1.18.3+6025c28
worker-0-1   Ready,SchedulingDisabled      worker   40m   v1.18.3+6025c28

oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-f80ec0279b7efd7ba34a2e43f0a02508   True      False      False      3              3                   3                     0                      64m
worker   rendered-worker-f53d854416c9167e722d292c8bfa1fae   False     True       False      2              0                   2                     0                      64m

oc -n openshift-machine-api get machine
NAME                                PHASE          TYPE   REGION   ZONE   AGE
ocp-edge-cluster-0-master-0         Running                               80m
ocp-edge-cluster-0-master-1         Running                               80m
ocp-edge-cluster-0-master-2         Running                               80m
ocp-edge-cluster-0-worker-0-fsmtz   Deleting                              60m
ocp-edge-cluster-0-worker-0-r6rxt   Provisioning                          2m40s
ocp-edge-cluster-0-worker-0-sgc79   Deleting                              60m
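To narrow down where the deletion hangs, something like the following should show whether a finalizer is still left on one of the stuck machine objects (machine name taken from the output above):

oc -n openshift-machine-api get machine ocp-edge-cluster-0-worker-0-sgc79 -o jsonpath='{.metadata.finalizers}'
oc -n openshift-machine-api describe machine ocp-edge-cluster-0-worker-0-sgc79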
The same problem happens when trying to scale down from 3 workers to 2
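That case uses the same commands with a different replica count, roughly (where <machine-of-worker-to-remove> stands for one of the three worker machines):

oc annotate machine <machine-of-worker-to-remove> machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-worker-0 --replicas=2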
See https://bugzilla.redhat.com/show_bug.cgi?id=1845137 - it looks like the same problem, though the scenario is different.
The fix in https://github.com/openshift/cluster-api-provider-baremetal/pull/87 has merged.
*** Bug 1845137 has been marked as a duplicate of this bug. ***
Verified on:
Client Version: 4.6.0-0.nightly-2020-08-16-072105
Server Version: 4.6.0-0.nightly-2020-08-16-072105
Kubernetes Version: v1.19.0-rc.2+99cb93a-dirty

The machine is deleted from the machine list and the machineset counters are reduced.
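Roughly the sequence used for verification, with <worker-machine-name> as a placeholder for the machine selected for removal (resource names otherwise match the reproduction steps above):

oc annotate machine <worker-machine-name> machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-worker-0 --replicas=1
oc -n openshift-machine-api get machine     # the annotated machine disappears instead of hanging in Deleting
oc -n openshift-machine-api get machineset  # DESIRED/CURRENT/READY drop to the new replica count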
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196