Bug 1855823 - Scaling down worker machineset on bare metal leaves machines stuck in Deleting state and machineconfigpool in updating state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Doug Hellmann
QA Contact: Lubov
URL:
Whiteboard:
Duplicates: 1845137
Depends On:
Blocks: 1863010
 
Reported: 2020-07-10 15:28 UTC by Marius Cornea
Modified: 2020-10-27 16:14 UTC
CC List: 9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1863010
Environment:
Last Closed: 2020-10-27 16:13:54 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/cluster-api-provider-baremetal pull 87 (closed): ensure delete removes the link between machine and host (last updated 2020-12-30 18:35:02 UTC)
Red Hat Product Errata RHBA-2020:4196 (last updated 2020-10-27 16:14:21 UTC)

Description Marius Cornea 2020-07-10 15:28:00 UTC
Description of problem:

When performing a scale down of the worker machineset on a bare metal IPI deployment, machines get stuck in Deleting state and machineconfigpool in updating state.

Following the docs @ https://github.com/metal3-io/metal3-docs/blob/master/design/baremetal-operator/remove-host.md#scale-down-the-machineset

Version-Release number of selected component (if applicable):
4.5.0-rc.7

How reproducible:
100%

Steps to Reproduce:
1. Deploy bare metal IPI with 3 masters and 2 workers
2. Annotate the machine of the worker that you want to remove

oc annotate machine ocp-edge-cluster-0-worker-0-xcd9n machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api

3. Scale down the worker machineset

oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-worker-0 --replicas=1

4. Wait for the node to get deprovisioned
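One rough way to watch step 4 complete, using the namespace and resource types from the steps above (output will vary by cluster):

oc -n openshift-machine-api get baremetalhosts -w   # host should move through deprovisioning back to ready
oc -n openshift-machine-api get machines -w         # the annotated machine should finish Deleting and disappear
oc get nodes -w                                     # the corresponding node should be drained and removed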

Actual results:

BMH goes into ready state:

openshift-worker-0-0   OK       ready                                                        redfish://192.168.123.1:8000/redfish/v1/Systems/80db2d06-2c8e-4880-a060-6ead6b5b7415   unknown            false    

node gets into NotReady state:

worker-0-0   NotReady,SchedulingDisabled   worker   34h   v1.18.3+6025c28

worker MCP is updating:

worker   rendered-worker-02d73d28b403f1ee02c382c93aad78c0   False     True       False      2              1                   2                     0                      35h


machine is stuck in Deleting
NAME                                PHASE      TYPE   REGION   ZONE   AGE
ocp-edge-cluster-0-worker-0-xcd9n   Deleting                          35h


Expected results:

The node gets deprovisioned and no resources get stuck in transitory states.

Additional info:

I tried annotating the machine with exclude-node-draining but it didn't make any difference.
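The exact command is not captured here; it was presumably something along these lines, assuming the machine.openshift.io/exclude-node-draining annotation key (only the key's presence is usually checked, not the value):

oc annotate machine ocp-edge-cluster-0-worker-0-xcd9n machine.openshift.io/exclude-node-draining=true -n openshift-machine-api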

Comment 2 Marius Cornea 2020-07-10 17:33:31 UTC
Note: the same result occurs when I delete the bmh before scaling the machineset:

oc annotate machine ocp-edge-cluster-0-worker-0-sgc79 machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
oc -n openshift-machine-api delete bmh openshift-worker-0-0
oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-worker-0 --replicas=1

oc get nodes
NAME         STATUS                        ROLES    AGE   VERSION
master-0-0   Ready                         master   66m   v1.18.3+6025c28
master-0-1   Ready                         master   65m   v1.18.3+6025c28
master-0-2   Ready                         master   66m   v1.18.3+6025c28
worker-0-0   NotReady,SchedulingDisabled   worker   40m   v1.18.3+6025c28
worker-0-1   Ready,SchedulingDisabled      worker   40m   v1.18.3+6025c28


oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-f80ec0279b7efd7ba34a2e43f0a02508   True      False      False      3              3                   3                     0                      64m
worker   rendered-worker-f53d854416c9167e722d292c8bfa1fae   False     True       False      2              0                   2                     0                      64m


 oc -n openshift-machine-api get machine
NAME                                PHASE          TYPE   REGION   ZONE   AGE
ocp-edge-cluster-0-master-0         Running                               80m
ocp-edge-cluster-0-master-1         Running                               80m
ocp-edge-cluster-0-master-2         Running                               80m
ocp-edge-cluster-0-worker-0-fsmtz   Deleting                              60m
ocp-edge-cluster-0-worker-0-r6rxt   Provisioning                          2m40s
ocp-edge-cluster-0-worker-0-sgc79   Deleting                              60m

Comment 3 Lubov 2020-07-13 15:56:02 UTC
The same problem happens when trying to scale down from 3 workers to 2

Comment 4 Lubov 2020-07-13 16:19:12 UTC
See https://bugzilla.redhat.com/show_bug.cgi?id=1845137 - it looks like the same problem, though the scenario is different.

Comment 5 Doug Hellmann 2020-07-22 17:46:12 UTC
The fix in https://github.com/openshift/cluster-api-provider-baremetal/pull/87 has merged.
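Per the PR title ("ensure delete removes the link between machine and host"), the link in question should be the BareMetalHost's consumerRef; assuming that, a quick check after deleting a machine would be something like:

oc -n openshift-machine-api get bmh openshift-worker-0-0 -o jsonpath='{.spec.consumerRef}'

With the fix in place this should come back empty once the corresponding machine is gone, leaving the host free to be provisioned again.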

Comment 6 Doug Hellmann 2020-07-27 19:57:28 UTC
*** Bug 1845137 has been marked as a duplicate of this bug. ***

Comment 9 Lubov 2020-08-17 13:44:19 UTC
Verified on
Client Version: 4.6.0-0.nightly-2020-08-16-072105
Server Version: 4.6.0-0.nightly-2020-08-16-072105
Kubernetes Version: v1.19.0-rc.2+99cb93a-dirty

The machine is deleted from the machine list and the machineset counters are reduced.
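A check along these lines shows that result (not necessarily the exact commands used during verification):

oc -n openshift-machine-api get machines      # the deleted machine is no longer listed
oc -n openshift-machine-api get machinesets   # DESIRED/CURRENT/READY/AVAILABLE reflect the new replica count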

Comment 11 errata-xmlrpc 2020-10-27 16:13:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

