Created attachment 1696124 [details] After attempt to scale up

Description of problem:
When scaling down a machineset, the node is not deleted:
- in the machine list it is listed in state Deleting
- in the node list it is listed as NotReady,SchedulingDisabled

It looks like the deletion process is stuck due to a failure while draining the node.
In addition, the monitoring cluster operator degrades and becomes unavailable.
In the bmh list the host moves to deprovisioning and then ready, as expected.
If we try to scale up the machineset, the machine in Deleting state is still listed and the new machine is shown in Provisioned state, but the new node is not added (see attachment).

Version-Release number of selected component (if applicable):
Client Version: 4.5.0-0.nightly-2020-06-05-214616
Server Version: 4.5.0-0.nightly-2020-06-07-080121
Kubernetes Version: v1.18.3+a637491

How reproducible:
Constantly

Steps to Reproduce:
1. Annotate the machine to be deleted from the machineset
$ oc annotate machine <name-CONSUMER> machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
2. Scale down the machineset
$ oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-worker-0 --replicas=<N-1>

Actual results:
The node is still listed:

$ oc get machine -o wide -n openshift-machine-api
NAME                                PHASE      TYPE   REGION   ZONE   AGE   NODE
ocp-edge-cluster-0-worker-0-ctb52   Deleting                          66m   worker-0-2

$ oc get nodes
NAME         STATUS                        ROLES    AGE   VERSION
worker-0-2   NotReady,SchedulingDisabled   worker   45m   v1.18.3+a637491

Expected results:
The node is not listed in either the machines or the nodes list:
$ oc get nodes
$ oc get machine -o wide -n openshift-machine-api

Additional info:
Attaching logs.

From machine-controller.log:
I0608 13:44:58.857259 1 controller.go:165] ocp-edge-cluster-0-worker-0-ctb52: reconciling Machine
I0608 13:44:58.857322 1 controller.go:420] ocp-edge-cluster-0-worker-0-ctb52: going into phase "Deleting"
I0608 13:44:58.869530 1 controller.go:205] ocp-edge-cluster-0-worker-0-ctb52: reconciling machine triggers delete
E0608 13:44:59.009028 1 controller.go:466] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-s2jc9, openshift-dns/dns-default-5trxc, openshift-image-registry/node-ca-8sj8h, openshift-machine-config-operator/machine-config-daemon-svk7k, openshift-monitoring/node-exporter-57fcz, openshift-multus/multus-v4zd9, openshift-ovn-kubernetes/ovnkube-node-xw8sq, openshift-ovn-kubernetes/ovs-node-462hd
I0608 13:44:59.029124 1 controller.go:466] evicting pod openshift-monitoring/alertmanager-main-0
I0608 13:44:59.029167 1 controller.go:466] evicting pod openshift-monitoring/grafana-755f49d779-kjrtm
I0608 13:44:59.029530 1 controller.go:466] evicting pod openshift-monitoring/thanos-querier-6bc8cc988f-bl9gf
I0608 13:44:59.029700 1 controller.go:466] evicting pod openshift-ingress/router-default-5776858d7b-g566h
I0608 13:45:02.319958 1 controller.go:354] Evicted pod from Nodepodthanos-querier-6bc8cc988f-bl9gf/openshift-monitoring
I0608 13:45:02.319961 1 controller.go:354] Evicted pod from Nodepodgrafana-755f49d779-kjrtm/openshift-monitoring
I0608 13:45:02.320030 1 controller.go:354] Evicted pod from Nodepodalertmanager-main-0/openshift-monitoring
W0608 13:45:19.234082 1 controller.go:375] drain failed for machine "ocp-edge-cluster-0-worker-0-ctb52": error when waiting for pod "router-default-5776858d7b-g566h" terminating: global timeout reached: 20s
E0608 13:45:19.234128 1 controller.go:213] ocp-edge-cluster-0-worker-0-ctb52: failed to drain node for machine: requeue in: 20s
I0608 13:45:19.234143 1 controller.go:401] Actuator returned requeue-after error: requeue in: 20s
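For context on the "requeue in: 20s" messages above, here is a rough, hypothetical sketch (not the actual machine-controller code) of how a controller-runtime reconciler typically turns a failed drain into a delayed retry; this is why the machine sits in the Deleting phase and the drain is re-attempted on every requeue. The drainNode function, the pod name, and reconcileDelete are placeholders invented for illustration.

// Illustrative sketch only: drain failure handled as a delayed retry.
package sketch

import (
	"fmt"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// drainNode stands in for the real drain logic; assume it returns an error
// such as "global timeout reached: 20s" when a pod will not evict in time.
func drainNode(nodeName string) error {
	return fmt.Errorf("error when waiting for pod %q terminating: global timeout reached: 20s",
		"router-default-5776858d7b-g566h")
}

// reconcileDelete shows the requeue-after pattern: instead of failing the
// reconcile permanently, ask the controller to retry the drain shortly.
func reconcileDelete(nodeName string) (ctrl.Result, error) {
	if err := drainNode(nodeName); err != nil {
		// Drain did not complete; retry in 20s, leaving the machine in "Deleting".
		return ctrl.Result{RequeueAfter: 20 * time.Second}, nil
	}
	// Drain succeeded; the actuator can now delete the backing instance and,
	// once it is gone, remove the Node object and the machine finalizer.
	return ctrl.Result{}, nil
}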
Created attachment 1696125 [details] machine-api-operator container log
Created attachment 1696126 [details] monitoring operator log
Created attachment 1696127 [details] machine controller log
Created attachment 1696128 [details] machineset controller log
Re-checked with Server Version: 4.5.0-0.nightly-2020-06-09-223121.
Draining now succeeds, but the node is still not deleted.
Attaching the updated machine-controller.log.
Created attachment 1696487 [details] updated machine controller log
From machine-controller: "can't proceed deleting machine while cloud instance is being terminated, requeuing"

What this means is that the bare metal provider is indicating that the instance/server still exists according to the bare metal operator. So, either the instance is never being successfully deleted by the bare metal operator, or, if the BMO is trying to re-use an instance name, the replacement already exists.

I have a feeling this might be the case due to the original case comment: "If we try to scale up the machineset, the machine in Deleting state is still listed, new machine shown in provisioned state". We can also see from the machine-controller logs that the node for this associated machine is unreachable, which indicates it was stopped at some point or otherwise isolated from the cluster (this does not happen during a normal drain; most likely this was caused by the BMO). Since the 'new machine' was stuck in Provisioned state in the original case comment, it gives me the indication that names are being reused in the infrastructure, and that's going to cause these problems.

One way the BMO could handle this is to keep track of instance/machine ID mappings. For example, when the BM provider queries the BMO for 'machine-1', it should associate that with some actual instance, 'instance-a'. The BMO needs to keep track of the lifecycle of 'instance-a' and ensure it's not associated with any new machine until it's completely deprovisioned. When the BM actuator inquires about 'machine-1' again, it should know that 'instance-a' is no longer associated with 'machine-1' and return 404. This would allow 'machine-2' to be associated with 'instance-a' internally to the BMO, so these types of races/conflicts can't happen.

Anyway, that's what I *think* is possibly happening; there's not really enough information in this BZ to confirm.
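To make the idea above concrete, here is a minimal, purely hypothetical sketch (not baremetal-operator code) of the kind of machine-to-instance bookkeeping described: a host that is still deprovisioning is never handed to a new Machine, and a lookup for a released Machine returns "not found" (the 404 mentioned above). All names here (tracker, host, Lookup, Release, Claim) are made up for illustration.

// Hypothetical sketch of machine -> instance lifecycle tracking.
package sketch

import "errors"

// ErrNotFound is what the actuator would see as a 404.
var ErrNotFound = errors.New("no instance associated with this machine")

type hostState int

const (
	stateProvisioned hostState = iota
	stateDeprovisioning
	stateAvailable
)

type host struct {
	id    string // e.g. "instance-a"
	state hostState
	owner string // Machine currently bound to this host, "" if none
}

type tracker struct {
	hosts map[string]*host // keyed by instance ID
}

// Lookup answers the actuator's "does machine X still have an instance?".
func (t *tracker) Lookup(machine string) (*host, error) {
	for _, h := range t.hosts {
		if h.owner == machine {
			return h, nil
		}
	}
	// Association was released (or never existed): report 404 so the
	// machine controller can finish deleting the Machine object.
	return nil, ErrNotFound
}

// Release detaches a Machine from its host; the host stays unavailable for
// reuse until deprovisioning completes and it is marked Available.
func (t *tracker) Release(machine string) {
	for _, h := range t.hosts {
		if h.owner == machine {
			h.owner = ""
			h.state = stateDeprovisioning
		}
	}
}

// Claim hands an Available host to a new Machine; a host that is still
// deprovisioning is never reused, which avoids the name-reuse race.
func (t *tracker) Claim(machine string) (*host, error) {
	for _, h := range t.hosts {
		if h.state == stateAvailable && h.owner == "" {
			h.owner = machine
			h.state = stateProvisioned
			return h, nil
		}
	}
	return nil, errors.New("no available host")
}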
I'm reasonably confident that this bug is the same as 1855823. The fix for that has already merged. If I am missing some detail about why they are not the same, please let me know.

*** This bug has been marked as a duplicate of bug 1855823 ***