Bug 1869318 - Scaling down worker machineset on bare metal leaves node in NotReady,SchedulingDisabled state
Summary: Scaling down worker machineset on bare metal leaves node in NotReady,Scheduli...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: 4.6.0
Assignee: Zane Bitter
QA Contact: Lubov
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-08-17 13:39 UTC by Lubov
Modified: 2020-10-27 16:28 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:28:36 UTC
Target Upstream Version:


Attachments (Terms of Use)
Log of Machine object deletion (16.35 KB, text/plain)
2020-08-18 20:25 UTC, Zane Bitter
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-api-provider-baremetal pull 97 0 None closed Bug 1869318: Remove finalizer from Node on delete 2020-11-05 04:27:45 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:28:53 UTC

Description Lubov 2020-08-17 13:39:45 UTC
Description of problem:
After scaling down worker machineset, the machine is deleted from machinelist, machineset counters reduced, but the node of deleted machine stuck in NotReady,SchedulingDisabled state even after corresponding BMH deleted

Version-Release number of the following components:
Client Version: 4.6.0-0.nightly-2020-08-16-072105
Server Version: 4.6.0-0.nightly-2020-08-16-072105
Kubernetes Version: v1.19.0-rc.2+99cb93a-dirty

How reproducible:
always

Steps to Reproduce:
1. Annotate a node to be deleted from machineset
$ oc annotate machine <name-CONSUMER> machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api

2. Scale down the machineset
$ oc scale machineset -n openshift-machine-api <name-MACHINESET> --replicas=<N-1>

Actual results:
The corresponding node is stuck in nodelist in NotReady,SchedulingDisabled
[kni@provisionhost-0-0 ~]$ oc get nodes
NAME         STATUS                        ROLES    AGE    VERSION
master-0-0   Ready                         master   157m   v1.19.0-rc.2+3d1d343-dirty
master-0-1   Ready                         master   157m   v1.19.0-rc.2+3d1d343-dirty
master-0-2   Ready                         master   157m   v1.19.0-rc.2+3d1d343-dirty
worker-0-0   Ready                         worker   130m   v1.19.0-rc.2+3d1d343-dirty
worker-0-1   Ready                         worker   130m   v1.19.0-rc.2+3d1d343-dirty
worker-0-2   NotReady,SchedulingDisabled   worker   130m   v1.19.0-rc.2+3d1d343-dirty

Expected results:
The corresponding node is deleted from nodelist

Additional info:
The node still listed in NotReady,SchedulingDisabled even after corresponding BMH is delete

Some operators are degraded/progressing
[kni@provisionhost-0-0 ~]$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-08-16-072105   True        False         False      13m
cloud-credential                           4.6.0-0.nightly-2020-08-16-072105   True        False         False      117m
cluster-autoscaler                         4.6.0-0.nightly-2020-08-16-072105   True        False         False      99m
config-operator                            4.6.0-0.nightly-2020-08-16-072105   True        False         False      106m
console                                    4.6.0-0.nightly-2020-08-16-072105   True        False         False      78m
csi-snapshot-controller                    4.6.0-0.nightly-2020-08-16-072105   True        False         False      82m
dns                                        4.6.0-0.nightly-2020-08-16-072105   True        True          False      105m
etcd                                       4.6.0-0.nightly-2020-08-16-072105   True        False         False      105m
image-registry                             4.6.0-0.nightly-2020-08-16-072105   True        False         False      100m
ingress                                    4.6.0-0.nightly-2020-08-16-072105   True        False         False      82m
insights                                   4.6.0-0.nightly-2020-08-16-072105   True        False         False      100m
kube-apiserver                             4.6.0-0.nightly-2020-08-16-072105   True        False         False      104m
kube-controller-manager                    4.6.0-0.nightly-2020-08-16-072105   True        False         False      102m
kube-scheduler                             4.6.0-0.nightly-2020-08-16-072105   True        False         False      103m
kube-storage-version-migrator              4.6.0-0.nightly-2020-08-16-072105   True        False         False      82m
machine-api                                4.6.0-0.nightly-2020-08-16-072105   True        False         False      91m
machine-approver                           4.6.0-0.nightly-2020-08-16-072105   True        False         False      102m
machine-config                             4.6.0-0.nightly-2020-08-16-072105   False       False         True       3m3s
marketplace                                4.6.0-0.nightly-2020-08-16-072105   True        False         False      99m
monitoring                                 4.6.0-0.nightly-2020-08-16-072105   False       True          True       5m5s
network                                    4.6.0-0.nightly-2020-08-16-072105   True        True          True       106m
node-tuning                                4.6.0-0.nightly-2020-08-16-072105   True        False         False      106m
openshift-apiserver                        4.6.0-0.nightly-2020-08-16-072105   True        False         False      34m
openshift-controller-manager               4.6.0-0.nightly-2020-08-16-072105   True        False         False      98m
openshift-samples                          4.6.0-0.nightly-2020-08-16-072105   True        False         False      91m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-08-16-072105   True        False         False      106m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-08-16-072105   True        False         False      106m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-08-16-072105   True        False         False      80m
service-ca                                 4.6.0-0.nightly-2020-08-16-072105   True        False         False      106m
storage                                    4.6.0-0.nightly-2020-08-16-072105   True        False         False      106m

Comment 1 Zane Bitter 2020-08-18 20:14:28 UTC
Changing component since this is definitely nothing to do with the installer.

Comment 2 Zane Bitter 2020-08-18 20:25:42 UTC
Created attachment 1711777 [details]
Log of Machine object deletion

I reproduced the issue and attached the log from the machine-api-controller.

The log looks like about what I'd expect... when the Machine gets deleted due to the scale-down, it drains the node, and once that has succeeded tells the actuator to deprovision the Host and then remove the finalizer on the Machine, allowing it to go away.

Comment 3 Zane Bitter 2020-08-18 20:34:13 UTC
I'm not clear on exactly which actor is responsible for cleaning up the Node in response to the Machine being deleted in a scale-down. The Machine healthcheck controller definitely has different behaviour on baremetal deployments than cloud deployments (it tries to remediate by restarting rather than deleting Machines - and remediation usually concludes with deleting the Node). So that is one possibility.

If it's not that then I don't know how this could be a baremetal-specific issue. The baremetal actuator implements Delete() the exact same way as all other actuators.

Comment 4 Zane Bitter 2020-08-18 21:41:45 UTC
Possibly relevant: at least in 4.5, the docs list different procedures for removing a Node on baremetal vs other platforms:

https://docs.openshift.com/container-platform/4.5/nodes/nodes/nodes-nodes-working.html#nodes-nodes-working-deleting_nodes-nodes-working
https://docs.openshift.com/container-platform/4.5/nodes/nodes/nodes-nodes-working.html#nodes-nodes-working-deleting-bare-metal_nodes-nodes-working

That said, these don't make a lot of sense to me because they do completely different things (and the command to manually delete the Node at the end has a copy-and-paste typo). But perhaps it's a clue that there is some expected difference in how baremetal platforms react to scaling down a MachineSet that requires manual intervention. Can anybody on the Cluster API side comment?

Comment 5 Andrew Beekhof 2020-08-19 00:40:31 UTC
MHC shouldn't be involved in the orderly removal of a machine/node, so I don't think thats the source.

Comment 6 Joel Speed 2020-08-19 10:10:24 UTC
Typically in a cloud install, say on AWS, it is the cloud-controller-manager that is responsible for the removal of the Node. This is a controller that is built into kube-controller-manager at present, though eventually will be split out.
In a normal Kubernetes cluster and within OpenShift, the cloud-controller is expected to monitor the cloud provider to check if the instance for each node exists, by matching provider IDs. If any of the Node's instances are removed, this controller should notice that and delete the Node object.

I'm not sure if there is an analogous component in the baremetal case.

Comment 7 Zane Bitter 2020-08-20 02:12:05 UTC
So it appears to me that there is code in the Machine controller to delete the Node:

https://github.com/openshift/machine-api-operator/blob/master/pkg/controller/machine/controller.go#L248

And logs show that this code is being run:

I0818 18:05:15.247313       1 controller.go:247] worker-1: deleting node "ostest-9lldd-worker-0-rpcmq" for machine

and yet the Node continues to exist. It has no DeletionTimestamp. The status is NodeStatusUnknown, indicating that the kubelet is no longer reporting the results (and I verified that the baremetal operator had indeed successfully deprovisioned and powered down the host). Subsequently running through the manual procedure from https://docs.openshift.com/container-platform/4.5/nodes/nodes/nodes-nodes-working.html#nodes-nodes-working-deleting-bare-metal_nodes-nodes-working has the same result:

$ oc adm cordon worker-1
node/worker-1 already cordoned
$ oc adm drain worker-1 --ignore-daemonsets
node/worker-1 already cordoned
WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-nssh4, openshift-dns/dns-default-mqx8v, openshift-image-registry/node-ca-6k8tb, openshift-machine-config-operator/machine-config-daemon-rrfxd, openshift-monitoring/node-exporter-bjgvp, openshift-multus/multus-njx6n, openshift-multus/network-metrics-daemon-9p6zb, openshift-sdn/ovs-8n7hx, openshift-sdn/sdn-metrics-fbmcd, openshift-sdn/sdn-n8hmz
node/worker-1 drained
$ oc delete node worker-1
node "worker-1" deleted

The delete command hangs forever, because even though it reports that the Node is deleted, it never goes away.

I'm not familiar with the Node controller, and of course Node is a core resource so it doesn't even necessarily operate like a CRD. My only guess is that we have changed something on the baremetal platform that is causing the Node to decide not to delete itself. The only obvious candidate is that since bug 1852959 we set the ProviderID in the Node from the machine actuator.

Comment 8 Andrew Beekhof 2020-08-20 04:05:29 UTC
No DeletionTimestamp... maybe a race condition where the Node was deleted but something recreated it afterwards?

Comment 9 Joel Speed 2020-08-20 11:43:51 UTC
> No DeletionTimestamp... maybe a race condition where the Node was deleted but something recreated it afterwards?

Normally, the only thing that should be creating a Node is the Kubelet on the machine itself, if the Machine is gone then I hope nothings recreating it, could check the creation timestamp to confirm (from the initial comment, I don't think it's that).
On some resources, the Kubernetes storage has special logic for the built in types to prevent people doing certain things, for example deleting a namespace before it is empty.
I had a look through the code for the node storage and it didn't appear there would be anything there blocking the Node deletion, but there could be somewhere else something that does this (eg Node admission controller?)

Comment 10 Zane Bitter 2020-08-20 15:02:31 UTC
(In reply to Joel Speed from comment #9)
> > No DeletionTimestamp... maybe a race condition where the Node was deleted but something recreated it afterwards?
> 
> Normally, the only thing that should be creating a Node is the Kubelet on
> the machine itself, if the Machine is gone then I hope nothings recreating
> it, could check the creation timestamp to confirm (from the initial comment,
> I don't think it's that).

Yeah, that was my first thought too, but I reproduced and dumped the Node resource before and after, and indeed the CreationTimestamp does not change.

> On some resources, the Kubernetes storage has special logic for the built in
> types to prevent people doing certain things, for example deleting a
> namespace before it is empty.
> I had a look through the code for the node storage and it didn't appear
> there would be anything there blocking the Node deletion, but there could be
> somewhere else something that does this (eg Node admission controller?)

Maybe, but wouldn't you expect that to return an error rather than report success and just not do it?

Comment 11 Steven Hardy 2020-08-20 16:17:16 UTC
> The only obvious candidate is that since bug 1852959 we set the ProviderID in the Node from the machine actuator.

I guess testing with a revert of that would be worthwhile to confirm - do you think we need to clear the ProviderID along with the ConsumerRef when deleting in the actuator?

https://github.com/openshift/cluster-api-provider-baremetal/blob/master/pkg/cloud/baremetal/actuators/machine/actuator.go#L221

Comment 12 Zane Bitter 2020-08-20 18:52:38 UTC
(In reply to Steven Hardy from comment #11)
> > The only obvious candidate is that since bug 1852959 we set the ProviderID in the Node from the machine actuator.
> 
> I guess testing with a revert of that would be worthwhile to confirm - do
> you think we need to clear the ProviderID along with the ConsumerRef when
> deleting in the actuator?

That was my second thought of, but it turns out there's a validation hook that prevents you removing the ProviderID (you can only change it from "" -> something). So, no, nothing that easy :(

Comment 13 Zane Bitter 2020-08-20 19:12:36 UTC
OK, I tried that again and it actually led to me finding the real problem: because Node is a core resource type, kubectl silently drops all of the fields that it doesn't think you ought to be interested in... such as DeletionTimestamp... and finalizers.

If you do "oc edit node worker-1" then you can see that in fact there *is* a deletion timestamp, and also we've left the finalizer "metal3.io/capbm" on there. So if we make sure we clean that up, it goes away as expected.

Comment 17 Lubov 2020-08-23 08:55:40 UTC
Verified on 
Client Version: 4.6.0-0.nightly-2020-08-22-221008
Server Version: 4.6.0-0.nightly-2020-08-22-221008
Kubernetes Version: v1.19.0-rc.2+3e083ac-dirty

Comment 20 errata-xmlrpc 2020-10-27 16:28:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196


Note You need to log in before you can comment on or make changes to this bug.