Description of problem:
After scaling down a worker MachineSet, the Machine is deleted from the machine list and the MachineSet counters are reduced, but the node of the deleted machine is stuck in the NotReady,SchedulingDisabled state even after the corresponding BareMetalHost (BMH) is deleted.
Version-Release number of the following components:
Client Version: 4.6.0-0.nightly-2020-08-16-072105
Server Version: 4.6.0-0.nightly-2020-08-16-072105
Kubernetes Version: v1.19.0-rc.2+99cb93a-dirty
Steps to Reproduce:
1. Annotate the Machine to be deleted from the MachineSet:
$ oc annotate machine <name-CONSUMER> machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
2. Scale down the MachineSet:
$ oc scale machineset -n openshift-machine-api <name-MACHINESET> --replicas=<N-1>
3. Observe that the corresponding node is stuck in the node list in NotReady,SchedulingDisabled:
[kni@provisionhost-0-0 ~]$ oc get nodes
NAME         STATUS                        ROLES    AGE    VERSION
master-0-0   Ready                         master   157m   v1.19.0-rc.2+3d1d343-dirty
master-0-1   Ready                         master   157m   v1.19.0-rc.2+3d1d343-dirty
master-0-2   Ready                         master   157m   v1.19.0-rc.2+3d1d343-dirty
worker-0-0   Ready                         worker   130m   v1.19.0-rc.2+3d1d343-dirty
worker-0-1   Ready                         worker   130m   v1.19.0-rc.2+3d1d343-dirty
worker-0-2   NotReady,SchedulingDisabled   worker   130m   v1.19.0-rc.2+3d1d343-dirty
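The stuck node can be flagged mechanically from output like the listing above. A minimal offline sketch (the node listing is inlined here so the filter itself runs anywhere, without a cluster):

```shell
# Inlined sample of "oc get nodes" output (NAME STATUS ROLES columns only),
# and an awk filter that flags any node left in a NotReady state.
nodes_output='worker-0-0   Ready                         worker
worker-0-1   Ready                         worker
worker-0-2   NotReady,SchedulingDisabled   worker'
printf '%s\n' "$nodes_output" | awk '$2 ~ /NotReady/ {print $1 " is stuck"}'
```

Against the listing above this prints only worker-0-2, the node left behind by the scale-down.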
Expected results:
The corresponding node is deleted from the node list.
Actual results:
The node is still listed in NotReady,SchedulingDisabled even after the corresponding BMH is deleted.
Some operators are degraded/progressing:
[kni@provisionhost-0-0 ~]$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-08-16-072105   True        False         False      13m
cloud-credential                           4.6.0-0.nightly-2020-08-16-072105   True        False         False      117m
cluster-autoscaler                         4.6.0-0.nightly-2020-08-16-072105   True        False         False      99m
config-operator                            4.6.0-0.nightly-2020-08-16-072105   True        False         False      106m
console                                    4.6.0-0.nightly-2020-08-16-072105   True        False         False      78m
csi-snapshot-controller                    4.6.0-0.nightly-2020-08-16-072105   True        False         False      82m
dns                                        4.6.0-0.nightly-2020-08-16-072105   True        True          False      105m
etcd                                       4.6.0-0.nightly-2020-08-16-072105   True        False         False      105m
image-registry                             4.6.0-0.nightly-2020-08-16-072105   True        False         False      100m
ingress                                    4.6.0-0.nightly-2020-08-16-072105   True        False         False      82m
insights                                   4.6.0-0.nightly-2020-08-16-072105   True        False         False      100m
kube-apiserver                             4.6.0-0.nightly-2020-08-16-072105   True        False         False      104m
kube-controller-manager                    4.6.0-0.nightly-2020-08-16-072105   True        False         False      102m
kube-scheduler                             4.6.0-0.nightly-2020-08-16-072105   True        False         False      103m
kube-storage-version-migrator              4.6.0-0.nightly-2020-08-16-072105   True        False         False      82m
machine-api                                4.6.0-0.nightly-2020-08-16-072105   True        False         False      91m
machine-approver                           4.6.0-0.nightly-2020-08-16-072105   True        False         False      102m
machine-config                             4.6.0-0.nightly-2020-08-16-072105   False       False         True       3m3s
marketplace                                4.6.0-0.nightly-2020-08-16-072105   True        False         False      99m
monitoring                                 4.6.0-0.nightly-2020-08-16-072105   False       True          True       5m5s
network                                    4.6.0-0.nightly-2020-08-16-072105   True        True          True       106m
node-tuning                                4.6.0-0.nightly-2020-08-16-072105   True        False         False      106m
openshift-apiserver                        4.6.0-0.nightly-2020-08-16-072105   True        False         False      34m
openshift-controller-manager               4.6.0-0.nightly-2020-08-16-072105   True        False         False      98m
openshift-samples                          4.6.0-0.nightly-2020-08-16-072105   True        False         False      91m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-08-16-072105   True        False         False      106m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-08-16-072105   True        False         False      106m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-08-16-072105   True        False         False      80m
service-ca                                 4.6.0-0.nightly-2020-08-16-072105   True        False         False      106m
storage                                    4.6.0-0.nightly-2020-08-16-072105   True        False         False      106m
Changing the component, since this definitely has nothing to do with the installer.
Created attachment 1711777 [details]
Log of Machine object deletion
I reproduced the issue and attached the log from the machine-api-controller.
The log looks about like what I'd expect... when the Machine gets deleted due to the scale-down, the controller drains the node, and once that has succeeded it tells the actuator to deprovision the Host and then removes the finalizer on the Machine, allowing it to go away.
I'm not clear on exactly which actor is responsible for cleaning up the Node in response to the Machine being deleted in a scale-down. The Machine healthcheck controller definitely has different behaviour on baremetal deployments than on cloud deployments (it tries to remediate by restarting rather than deleting Machines - and remediation usually concludes with deleting the Node), so that is one possibility.
If it's not that, then I don't know how this could be a baremetal-specific issue. The baremetal actuator implements Delete() in exactly the same way as all the other actuators.
Possibly relevant: at least in 4.5, the docs list different procedures for removing a Node on baremetal vs other platforms:
That said, these don't make a lot of sense to me because they do completely different things (and the command to manually delete the Node at the end has a copy-and-paste typo). But perhaps it's a clue that there is some expected difference in how baremetal platforms react to scaling down a MachineSet that requires manual intervention. Can anybody on the Cluster API side comment?
MHC shouldn't be involved in the orderly removal of a machine/node, so I don't think that's the source.
Typically in a cloud install, say on AWS, it is the cloud-controller-manager that is responsible for the removal of the Node. This is a controller that is built into kube-controller-manager at present, though it will eventually be split out.
In a normal Kubernetes cluster, and within OpenShift, the cloud controller is expected to monitor the cloud provider to check that the instance for each node still exists, by matching provider IDs. If a Node's instance is removed, this controller should notice that and delete the Node object.
I'm not sure if there is an analogous component in the baremetal case.
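The matching that controller performs amounts to something like the following sketch (node name and provider-ID format are illustrative, not what any real cloud provider reports):

```shell
# Illustrative only: a Node whose spec.providerID no longer matches any live
# instance ID reported by the infrastructure is a candidate for deletion.
node_provider_id='baremetalhost:///openshift-machine-api/worker-0-2'
live_instance_ids='baremetalhost:///openshift-machine-api/worker-0-0
baremetalhost:///openshift-machine-api/worker-0-1'
if printf '%s\n' "$live_instance_ids" | grep -qxF "$node_provider_id"; then
  echo "instance exists: keep Node"
else
  echo "instance gone: delete Node object"
fi
```

On real clusters the Node's provider ID can be read with "oc get node <name> -o jsonpath='{.spec.providerID}'"; here the IDs are inlined so the comparison itself is runnable.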
So it appears to me that there is code in the Machine controller to delete the Node:
And logs show that this code is being run:
I0818 18:05:15.247313 1 controller.go:247] worker-1: deleting node "ostest-9lldd-worker-0-rpcmq" for machine
and yet the Node continues to exist. It has no DeletionTimestamp. The status is NodeStatusUnknown, indicating that the kubelet is no longer reporting the results (and I verified that the baremetal operator had indeed successfully deprovisioned and powered down the host). Subsequently running through the manual procedure from https://docs.openshift.com/container-platform/4.5/nodes/nodes/nodes-nodes-working.html#nodes-nodes-working-deleting-bare-metal_nodes-nodes-working has the same result:
$ oc adm cordon worker-1
node/worker-1 already cordoned
$ oc adm drain worker-1 --ignore-daemonsets
node/worker-1 already cordoned
WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-nssh4, openshift-dns/dns-default-mqx8v, openshift-image-registry/node-ca-6k8tb, openshift-machine-config-operator/machine-config-daemon-rrfxd, openshift-monitoring/node-exporter-bjgvp, openshift-multus/multus-njx6n, openshift-multus/network-metrics-daemon-9p6zb, openshift-sdn/ovs-8n7hx, openshift-sdn/sdn-metrics-fbmcd, openshift-sdn/sdn-n8hmz
$ oc delete node worker-1
node "worker-1" deleted
The delete command hangs forever, because even though it reports that the Node is deleted, it never goes away.
I'm not familiar with the Node controller, and of course Node is a core resource so it doesn't even necessarily operate like a CRD. My only guess is that we have changed something on the baremetal platform that is causing the Node to decide not to delete itself. The only obvious candidate is that since bug 1852959 we set the ProviderID in the Node from the machine actuator.
No DeletionTimestamp... maybe a race condition where the Node was deleted but something recreated it afterwards?
> No DeletionTimestamp... maybe a race condition where the Node was deleted but something recreated it afterwards?
Normally, the only thing that should be creating a Node is the kubelet on the machine itself. If the Machine is gone then I hope nothing's recreating it; you could check the creation timestamp to confirm (from the initial comment, I don't think it's that).
On some resources, the Kubernetes storage has special logic for the built-in types to prevent people doing certain things, for example deleting a namespace before it is empty.
I had a look through the code for the node storage and it didn't appear there would be anything there blocking the Node deletion, but there could be something somewhere else that does this (e.g. a Node admission controller?)
(In reply to Joel Speed from comment #9)
> > No DeletionTimestamp... maybe a race condition where the Node was deleted but something recreated it afterwards?
> Normally, the only thing that should be creating a Node is the Kubelet on
> the machine itself, if the Machine is gone then I hope nothings recreating
> it, could check the creation timestamp to confirm (from the initial comment,
> I don't think it's that).
Yeah, that was my first thought too, but I reproduced and dumped the Node resource before and after, and indeed the CreationTimestamp does not change.
> On some resources, the Kubernetes storage has special logic for the built in
> types to prevent people doing certain things, for example deleting a
> namespace before it is empty.
> I had a look through the code for the node storage and it didn't appear
> there would be anything there blocking the Node deletion, but there could be
> somewhere else something that does this (eg Node admission controller?)
Maybe, but wouldn't you expect that to return an error rather than report success and just not do it?
> The only obvious candidate is that since bug 1852959 we set the ProviderID in the Node from the machine actuator.
I guess testing with a revert of that would be worthwhile to confirm - do you think we need to clear the ProviderID along with the ConsumerRef when deleting in the actuator?
(In reply to Steven Hardy from comment #11)
> > The only obvious candidate is that since bug 1852959 we set the ProviderID in the Node from the machine actuator.
> I guess testing with a revert of that would be worthwhile to confirm - do
> you think we need to clear the ProviderID along with the ConsumerRef when
> deleting in the actuator?
That was my second thought too, but it turns out there's a validation hook that prevents you from removing the ProviderID (you can only change it from "" -> something). So, no, nothing that easy :(
OK, I tried that again and it actually led me to find the real problem: because Node is a core resource type, kubectl silently drops all of the fields that it doesn't think you ought to be interested in... such as DeletionTimestamp... and finalizers.
If you do "oc edit node worker-1" then you can see that in fact there *is* a deletion timestamp, and also we've left the finalizer "metal3.io/capbm" on there. So if we make sure we clean that up, it goes away as expected.
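For anyone hitting this, the hidden fields can be inspected and the stuck finalizer cleared roughly as follows (node name is illustrative; this is a sketch, not an official procedure):

```shell
# Show the fields the default printer hides (deletionTimestamp, finalizers):
#   oc get node worker-1 -o yaml
# Drop the stuck finalizer so the pending deletion can complete:
#   oc patch node worker-1 --type=json \
#     -p '[{"op": "remove", "path": "/metadata/finalizers"}]'
# Offline illustration of what that patch changes, on an inlined manifest:
manifest='metadata:
  name: worker-1
  deletionTimestamp: "2020-08-18T18:05:15Z"
  finalizers:
  - metal3.io/capbm'
printf '%s\n' "$manifest" | grep -v -e 'finalizers:' -e 'metal3.io/capbm'
```

With the finalizer list gone the apiserver can finish honouring the DeletionTimestamp, which matches the behaviour observed above.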
Verified with:
Client Version: 4.6.0-0.nightly-2020-08-22-221008
Server Version: 4.6.0-0.nightly-2020-08-22-221008
Kubernetes Version: v1.19.0-rc.2+3e083ac-dirty
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.