Description of problem:
After scaling down a worker machineset, the machine is deleted from the machine list and the machineset counters are reduced, but the node of the deleted machine is stuck in NotReady,SchedulingDisabled state, even after the corresponding BMH is deleted.

Version-Release number of the following components:
Client Version: 4.6.0-0.nightly-2020-08-16-072105
Server Version: 4.6.0-0.nightly-2020-08-16-072105
Kubernetes Version: v1.19.0-rc.2+99cb93a-dirty

How reproducible:
always

Steps to Reproduce:
1. Annotate the machine to be deleted from the machineset:
$ oc annotate machine <name-CONSUMER> machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
2. Scale down the machineset:
$ oc scale machineset -n openshift-machine-api <name-MACHINESET> --replicas=<N-1>

Actual results:
The corresponding node is stuck in the node list in NotReady,SchedulingDisabled:

[kni@provisionhost-0-0 ~]$ oc get nodes
NAME         STATUS                        ROLES    AGE    VERSION
master-0-0   Ready                         master   157m   v1.19.0-rc.2+3d1d343-dirty
master-0-1   Ready                         master   157m   v1.19.0-rc.2+3d1d343-dirty
master-0-2   Ready                         master   157m   v1.19.0-rc.2+3d1d343-dirty
worker-0-0   Ready                         worker   130m   v1.19.0-rc.2+3d1d343-dirty
worker-0-1   Ready                         worker   130m   v1.19.0-rc.2+3d1d343-dirty
worker-0-2   NotReady,SchedulingDisabled   worker   130m   v1.19.0-rc.2+3d1d343-dirty

Expected results:
The corresponding node is deleted from the node list.

Additional info:
The node is still listed in NotReady,SchedulingDisabled even after the corresponding BMH is deleted. Some operators are degraded/progressing:

[kni@provisionhost-0-0 ~]$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-08-16-072105   True        False         False      13m
cloud-credential                           4.6.0-0.nightly-2020-08-16-072105   True        False         False      117m
cluster-autoscaler                         4.6.0-0.nightly-2020-08-16-072105   True        False         False      99m
config-operator                            4.6.0-0.nightly-2020-08-16-072105   True        False         False      106m
console                                    4.6.0-0.nightly-2020-08-16-072105   True        False         False      78m
csi-snapshot-controller                    4.6.0-0.nightly-2020-08-16-072105   True        False         False      82m
dns                                        4.6.0-0.nightly-2020-08-16-072105   True        True          False      105m
etcd                                       4.6.0-0.nightly-2020-08-16-072105   True        False         False      105m
image-registry                             4.6.0-0.nightly-2020-08-16-072105   True        False         False      100m
ingress                                    4.6.0-0.nightly-2020-08-16-072105   True        False         False      82m
insights                                   4.6.0-0.nightly-2020-08-16-072105   True        False         False      100m
kube-apiserver                             4.6.0-0.nightly-2020-08-16-072105   True        False         False      104m
kube-controller-manager                    4.6.0-0.nightly-2020-08-16-072105   True        False         False      102m
kube-scheduler                             4.6.0-0.nightly-2020-08-16-072105   True        False         False      103m
kube-storage-version-migrator              4.6.0-0.nightly-2020-08-16-072105   True        False         False      82m
machine-api                                4.6.0-0.nightly-2020-08-16-072105   True        False         False      91m
machine-approver                           4.6.0-0.nightly-2020-08-16-072105   True        False         False      102m
machine-config                             4.6.0-0.nightly-2020-08-16-072105   False       False         True       3m3s
marketplace                                4.6.0-0.nightly-2020-08-16-072105   True        False         False      99m
monitoring                                 4.6.0-0.nightly-2020-08-16-072105   False       True          True       5m5s
network                                    4.6.0-0.nightly-2020-08-16-072105   True        True          True       106m
node-tuning                                4.6.0-0.nightly-2020-08-16-072105   True        False         False      106m
openshift-apiserver                        4.6.0-0.nightly-2020-08-16-072105   True        False         False      34m
openshift-controller-manager               4.6.0-0.nightly-2020-08-16-072105   True        False         False      98m
openshift-samples                          4.6.0-0.nightly-2020-08-16-072105   True        False         False      91m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-08-16-072105   True        False         False      106m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-08-16-072105   True        False         False      106m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-08-16-072105   True        False         False      80m
service-ca                                 4.6.0-0.nightly-2020-08-16-072105   True        False         False      106m
storage                                    4.6.0-0.nightly-2020-08-16-072105   True        False         False      106m
Changing component, since this definitely has nothing to do with the installer.
Created attachment 1711777 [details]
Log of Machine object deletion

I reproduced the issue and attached the log from the machine-api-controller. The log looks about like what I'd expect: when the Machine gets deleted due to the scale-down, it drains the node, and once that has succeeded, tells the actuator to deprovision the Host and then removes the finalizer on the Machine, allowing it to go away.
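For illustration, that flow looks roughly like the following Go sketch. This is not the actual machine-api-operator code; reconcileDelete and drainNode are illustrative names, and the drain is stubbed out:

package sketch

import (
	"context"

	machinev1 "github.com/openshift/machine-api-operator/pkg/apis/machine/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Actuator is the slice of the machine actuator interface that matters here.
type Actuator interface {
	Delete(ctx context.Context, machine *machinev1.Machine) error
}

// drainNode stands in for the controller's drain logic (pod eviction).
func drainNode(ctx context.Context, m *machinev1.Machine) error { return nil }

// reconcileDelete sketches the flow described above: drain the Node, have
// the actuator deprovision the Host, then drop the Machine's finalizer so
// the pending deletion of the Machine object can complete.
func reconcileDelete(ctx context.Context, c client.Client, a Actuator, m *machinev1.Machine) error {
	if err := drainNode(ctx, m); err != nil {
		return err // requeue until the drain succeeds
	}
	if err := a.Delete(ctx, m); err != nil {
		return err // on baremetal this deprovisions the BareMetalHost
	}
	var kept []string
	for _, f := range m.Finalizers {
		if f != "machine.machine.openshift.io" { // the machine-api finalizer
			kept = append(kept, f)
		}
	}
	m.Finalizers = kept
	return c.Update(ctx, m)
}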
I'm not clear on exactly which actor is responsible for cleaning up the Node in response to the Machine being deleted in a scale-down. The Machine healthcheck controller definitely has different behaviour on baremetal deployments than on cloud deployments (it tries to remediate by restarting rather than deleting Machines, and remediation usually concludes with deleting the Node), so that is one possibility. If it's not that, then I don't know how this could be a baremetal-specific issue: the baremetal actuator implements Delete() the exact same way as all the other actuators.
Possibly relevant: at least in 4.5, the docs list different procedures for removing a Node on baremetal vs. other platforms:

https://docs.openshift.com/container-platform/4.5/nodes/nodes/nodes-nodes-working.html#nodes-nodes-working-deleting_nodes-nodes-working
https://docs.openshift.com/container-platform/4.5/nodes/nodes/nodes-nodes-working.html#nodes-nodes-working-deleting-bare-metal_nodes-nodes-working

That said, these don't make a lot of sense to me, because they do completely different things (and the command to manually delete the Node at the end has a copy-and-paste typo). But perhaps it's a clue that there is some expected difference in how baremetal platforms react to scaling down a MachineSet that requires manual intervention. Can anybody on the Cluster API side comment?
MHC shouldn't be involved in the orderly removal of a machine/node, so I don't think that's the source.
Typically in a cloud install, say on AWS, it is the cloud-controller-manager that is responsible for the removal of the Node. This is a controller that is built into kube-controller-manager at present, though it will eventually be split out. In a normal Kubernetes cluster, and within OpenShift, the cloud controller is expected to monitor the cloud provider to check whether the instance for each node exists, matching by provider ID. If the instance backing a Node is removed, this controller should notice that and delete the Node object. I'm not sure whether there is an analogous component in the baremetal case.
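To make that concrete, here is a rough Go sketch of the check that controller performs. It's a simplification, not the actual kube-controller-manager code, though InstanceExistsByProviderID is the real method on the cloud provider Instances interface:

package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	cloudprovider "k8s.io/cloud-provider"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcileNode sketches the cloud node lifecycle check: if the cloud
// instance backing a Node (matched by provider ID) no longer exists,
// delete the Node object.
func reconcileNode(ctx context.Context, c client.Client, instances cloudprovider.Instances, node *corev1.Node) error {
	exists, err := instances.InstanceExistsByProviderID(ctx, node.Spec.ProviderID)
	if err != nil {
		return err
	}
	if !exists {
		// The instance is gone; remove the Node object too.
		return c.Delete(ctx, node)
	}
	return nil
}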
So it appears to me that there is code in the Machine controller to delete the Node:

https://github.com/openshift/machine-api-operator/blob/master/pkg/controller/machine/controller.go#L248

And the logs show that this code is being run:

I0818 18:05:15.247313 1 controller.go:247] worker-1: deleting node "ostest-9lldd-worker-0-rpcmq" for machine

and yet the Node continues to exist. It has no DeletionTimestamp. The status is NodeStatusUnknown, indicating that the kubelet is no longer reporting status (and I verified that the baremetal operator had indeed successfully deprovisioned and powered down the host).

Subsequently running through the manual procedure from https://docs.openshift.com/container-platform/4.5/nodes/nodes/nodes-nodes-working.html#nodes-nodes-working-deleting-bare-metal_nodes-nodes-working has the same result:

$ oc adm cordon worker-1
node/worker-1 already cordoned
$ oc adm drain worker-1 --ignore-daemonsets
node/worker-1 already cordoned
WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-nssh4, openshift-dns/dns-default-mqx8v, openshift-image-registry/node-ca-6k8tb, openshift-machine-config-operator/machine-config-daemon-rrfxd, openshift-monitoring/node-exporter-bjgvp, openshift-multus/multus-njx6n, openshift-multus/network-metrics-daemon-9p6zb, openshift-sdn/ovs-8n7hx, openshift-sdn/sdn-metrics-fbmcd, openshift-sdn/sdn-n8hmz
node/worker-1 drained
$ oc delete node worker-1
node "worker-1" deleted

The delete command hangs forever: even though it reports that the Node is deleted, it never goes away.

I'm not familiar with the Node controller, and of course Node is a core resource, so it doesn't even necessarily operate like a CRD. My only guess is that we have changed something on the baremetal platform that is causing the Node to decide not to delete itself. The only obvious candidate is that since bug 1852959 we set the ProviderID in the Node from the machine actuator.
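For reference, the Node-deletion step in the controller linked above amounts to roughly the following (a simplified Go sketch with an illustrative name, not the exact upstream code). Note that a client delete only sets a DeletionTimestamp; the object is not actually removed until every finalizer on it has been cleared:

package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteNode issues a delete for the Node a Machine's NodeRef points at.
func deleteNode(ctx context.Context, c client.Client, name string) error {
	node := &corev1.Node{}
	if err := c.Get(ctx, client.ObjectKey{Name: name}, node); err != nil {
		if apierrors.IsNotFound(err) {
			return nil // already gone, nothing to do
		}
		return err
	}
	return c.Delete(ctx, node)
}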
No DeletionTimestamp... maybe a race condition where the Node was deleted but something recreated it afterwards?
> No DeletionTimestamp... maybe a race condition where the Node was deleted but something recreated it afterwards?

Normally, the only thing that should be creating a Node is the Kubelet on the machine itself; if the Machine is gone then I hope nothing's recreating it. You could check the creation timestamp to confirm (from the initial comment, I don't think it's that).

On some resources, the Kubernetes storage has special logic for the built-in types to prevent people doing certain things, for example deleting a namespace before it is empty. I had a look through the code for the node storage and it didn't appear there would be anything there blocking the Node deletion, but there could be something somewhere else that does this (e.g. a Node admission controller?).
(In reply to Joel Speed from comment #9)
> > No DeletionTimestamp... maybe a race condition where the Node was deleted but something recreated it afterwards?
>
> Normally, the only thing that should be creating a Node is the Kubelet on the machine itself; if the Machine is gone then I hope nothing's recreating it. You could check the creation timestamp to confirm (from the initial comment, I don't think it's that).

Yeah, that was my first thought too, but I reproduced and dumped the Node resource before and after, and indeed the CreationTimestamp does not change.

> On some resources, the Kubernetes storage has special logic for the built-in types to prevent people doing certain things, for example deleting a namespace before it is empty. I had a look through the code for the node storage and it didn't appear there would be anything there blocking the Node deletion, but there could be something somewhere else that does this (e.g. a Node admission controller?).

Maybe, but wouldn't you expect that to return an error rather than report success and just not do it?
> The only obvious candidate is that since bug 1852959 we set the ProviderID in the Node from the machine actuator.

I guess testing with a revert of that would be worthwhile to confirm - do you think we need to clear the ProviderID along with the ConsumerRef when deleting in the actuator?

https://github.com/openshift/cluster-api-provider-baremetal/blob/master/pkg/cloud/baremetal/actuators/machine/actuator.go#L221
(In reply to Steven Hardy from comment #11)
> > The only obvious candidate is that since bug 1852959 we set the ProviderID in the Node from the machine actuator.
>
> I guess testing with a revert of that would be worthwhile to confirm - do you think we need to clear the ProviderID along with the ConsumerRef when deleting in the actuator?

That was my second thought too, but it turns out there's a validation hook that prevents you from removing the ProviderID (you can only change it from "" -> something). So, no, nothing that easy :(
OK, I tried that again and it actually led me to the real problem: because Node is a core resource type, kubectl silently drops all of the fields that it doesn't think you ought to be interested in... such as DeletionTimestamp... and finalizers. If you do "oc edit node worker-1" then you can see that in fact there *is* a deletion timestamp, and also that we've left the finalizer "metal3.io/capbm" on there. So if we make sure we clean that up, the Node goes away as expected.
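A minimal sketch of that cleanup, assuming a controller-runtime client (removeCapbmFinalizer is an illustrative name, not something in the codebase):

package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// removeCapbmFinalizer strips the "metal3.io/capbm" finalizer from a Node
// so that its pending deletion (the hidden DeletionTimestamp) can complete.
func removeCapbmFinalizer(ctx context.Context, c client.Client, nodeName string) error {
	node := &corev1.Node{}
	if err := c.Get(ctx, client.ObjectKey{Name: nodeName}, node); err != nil {
		return err
	}
	var kept []string
	for _, f := range node.Finalizers {
		if f != "metal3.io/capbm" {
			kept = append(kept, f)
		}
	}
	node.Finalizers = kept
	return c.Update(ctx, node)
}

Fetching the full object (via oc edit, as above, or the raw API) shows metadata.deletionTimestamp and metadata.finalizers even when the default output hides them.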
Verified on:
Client Version: 4.6.0-0.nightly-2020-08-22-221008
Server Version: 4.6.0-0.nightly-2020-08-22-221008
Kubernetes Version: v1.19.0-rc.2+3e083ac-dirty
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196