Bug 1855823

Summary: Scaling down worker machineset on bare metal leaves machines stuck in Deleting state and machineconfigpool in updating state

Product: OpenShift Container Platform
Component: Cloud Compute
Sub-component: BareMetal Provider
Version: 4.5
Target Release: 4.6.0
Status: CLOSED ERRATA
Severity: urgent
Priority: high
Keywords: TestBlocker, Triaged
Reporter: Marius Cornea <mcornea>
Assignee: Doug Hellmann <dhellmann>
QA Contact: Lubov <lshilin>
CC: agurenko, beth.white, dhellmann, lshilin, sasha, stbenjam, xtian, yprokule, zbitter
Doc Type: No Doc Update
Type: Bug
Bug Blocks: 1863010
Last Closed: 2020-10-27 16:13:54 UTC

Description Marius Cornea 2020-07-10 15:28:00 UTC
Description of problem:

When performing a scale down of the worker machineset on a bare metal IPI deployment, machines get stuck in Deleting state and machineconfigpool in updating state.

Following the docs @ https://github.com/metal3-io/metal3-docs/blob/master/design/baremetal-operator/remove-host.md#scale-down-the-machineset

Version-Release number of selected component (if applicable):
4.5.0-rc.7

How reproducible:
100%

Steps to Reproduce:
1. Deploy bare metal IPI with 3 masters and 2 workers
2. Annotate the machine of the worker that you want to remove

oc annotate machine ocp-edge-cluster-0-worker-0-xcd9n machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api

3. Scale down the worker machineset

oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-worker-0 --replicas=1

4. Wait for the node to get deprovisioned
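A hedged sketch of how the deprovisioning can be watched (namespace and resource kinds as used elsewhere in this report; the exact commands used are not recorded here):

oc get bmh -n openshift-machine-api -w
oc get machine -n openshift-machine-api -w
oc get nodes -w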

Actual results:

The BMH (BareMetalHost) goes into the ready state:

openshift-worker-0-0   OK       ready                                                        redfish://192.168.123.1:8000/redfish/v1/Systems/80db2d06-2c8e-4880-a060-6ead6b5b7415   unknown            false    

The node goes into the NotReady state:

worker-0-0   NotReady,SchedulingDisabled   worker   34h   v1.18.3+6025c28

worker MCP is updating:

worker   rendered-worker-02d73d28b403f1ee02c382c93aad78c0   False     True       False      2              1                   2                     0                      35h


The machine is stuck in Deleting:
NAME                                PHASE      TYPE   REGION   ZONE   AGE
ocp-edge-cluster-0-worker-0-xcd9n   Deleting                          35h


Expected results:

The node gets deprovisioned and no resources get stuck in transitory states.

Additional info:

I tried annotating the machine with exclude-node-draining but it didn't make any difference.
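For reference, a hedged sketch of what that attempt would look like, assuming the full annotation key is machine.openshift.io/exclude-node-draining and using the machine name from the steps above (the exact command used is not recorded in this report):

oc annotate machine ocp-edge-cluster-0-worker-0-xcd9n machine.openshift.io/exclude-node-draining="" -n openshift-machine-api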

Comment 2 Marius Cornea 2020-07-10 17:33:31 UTC
Note: the same result occurs when I delete the bmh before scaling the machineset:

oc annotate machine ocp-edge-cluster-0-worker-0-sgc79 machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
oc -n openshift-machine-api delete bmh openshift-worker-0-0
oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-worker-0 --replicas=1

oc get nodes
NAME         STATUS                        ROLES    AGE   VERSION
master-0-0   Ready                         master   66m   v1.18.3+6025c28
master-0-1   Ready                         master   65m   v1.18.3+6025c28
master-0-2   Ready                         master   66m   v1.18.3+6025c28
worker-0-0   NotReady,SchedulingDisabled   worker   40m   v1.18.3+6025c28
worker-0-1   Ready,SchedulingDisabled      worker   40m   v1.18.3+6025c28


oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-f80ec0279b7efd7ba34a2e43f0a02508   True      False      False      3              3                   3                     0                      64m
worker   rendered-worker-f53d854416c9167e722d292c8bfa1fae   False     True       False      2              0                   2                     0                      64m


oc -n openshift-machine-api get machine
NAME                                PHASE          TYPE   REGION   ZONE   AGE
ocp-edge-cluster-0-master-0         Running                               80m
ocp-edge-cluster-0-master-1         Running                               80m
ocp-edge-cluster-0-master-2         Running                               80m
ocp-edge-cluster-0-worker-0-fsmtz   Deleting                              60m
ocp-edge-cluster-0-worker-0-r6rxt   Provisioning                          2m40s
ocp-edge-cluster-0-worker-0-sgc79   Deleting                              60m

Comment 3 Lubov 2020-07-13 15:56:02 UTC
The same problem happens when trying to scale down from 3 workers to 2
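For example, a hedged sketch of that variant, assuming the same machineset name as in the original steps:

oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-worker-0 --replicas=2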

Comment 4 Lubov 2020-07-13 16:19:12 UTC
See https://bugzilla.redhat.com/show_bug.cgi?id=1845137, which looks like the same problem, though the scenario is different.

Comment 5 Doug Hellmann 2020-07-22 17:46:12 UTC
The fix in https://github.com/openshift/cluster-api-provider-baremetal/pull/87 has merged.

Comment 6 Doug Hellmann 2020-07-27 19:57:28 UTC
*** Bug 1845137 has been marked as a duplicate of this bug. ***

Comment 9 Lubov 2020-08-17 13:44:19 UTC
Verified on
Client Version: 4.6.0-0.nightly-2020-08-16-072105
Server Version: 4.6.0-0.nightly-2020-08-16-072105
Kubernetes Version: v1.19.0-rc.2+99cb93a-dirty

The machine is deleted from the machine list and the machineset counters are reduced.
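A hedged sketch of the checks this implies (resource names as used earlier in this report):

oc -n openshift-machine-api get machine
oc -n openshift-machine-api get machineset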

Comment 11 errata-xmlrpc 2020-10-27 16:13:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196