Created attachment 1696124 [details] After attempt to scale up

Description of problem:
When scaling down a machineset, the node is not deleted:
- in the machine list it is listed in state Deleting
- in the node list it is listed as NotReady,SchedulingDisabled

It looks like the deletion process is stuck due to a failure while draining the node.
In addition, the monitoring cluster operator degrades and becomes unavailable.
In the bmh list the host moves to deprovisioning and then ready, as expected.
If we try to scale up the machineset, the machine in Deleting state is still listed and the new machine is shown in Provisioned state, but the new node is not added (see attachment).

Version-Release number of selected component (if applicable):
Client Version: 4.5.0-0.nightly-2020-06-05-214616
Server Version: 4.5.0-0.nightly-2020-06-07-080121
Kubernetes Version: v1.18.3+a637491

How reproducible:
Constantly

Steps to Reproduce:
1. Annotate the machine to be deleted from the machineset
$ oc annotate machine <name-CONSUMER> machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
2. Scale down the machineset
$ oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-worker-0 --replicas=<N-1>

Actual results:
The node is still listed:

$ oc get machine -o wide -n openshift-machine-api
NAME                                PHASE      TYPE   REGION   ZONE   AGE   NODE
ocp-edge-cluster-0-worker-0-ctb52   Deleting                          66m   worker-0-2

$ oc get nodes
NAME         STATUS                        ROLES    AGE   VERSION
worker-0-2   NotReady,SchedulingDisabled   worker   45m   v1.18.3+a637491

Expected results:
The node is not listed in either the machines or the nodes list:
$ oc get nodes
$ oc get machine -o wide -n openshift-machine-api

Additional info:
Attaching logs.

From machine-controller.log:
I0608 13:44:58.857259 1 controller.go:165] ocp-edge-cluster-0-worker-0-ctb52: reconciling Machine
I0608 13:44:58.857322 1 controller.go:420] ocp-edge-cluster-0-worker-0-ctb52: going into phase "Deleting"
I0608 13:44:58.869530 1 controller.go:205] ocp-edge-cluster-0-worker-0-ctb52: reconciling machine triggers delete
E0608 13:44:59.009028 1 controller.go:466] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-s2jc9, openshift-dns/dns-default-5trxc, openshift-image-registry/node-ca-8sj8h, openshift-machine-config-operator/machine-config-daemon-svk7k, openshift-monitoring/node-exporter-57fcz, openshift-multus/multus-v4zd9, openshift-ovn-kubernetes/ovnkube-node-xw8sq, openshift-ovn-kubernetes/ovs-node-462hd
I0608 13:44:59.029124 1 controller.go:466] evicting pod openshift-monitoring/alertmanager-main-0
I0608 13:44:59.029167 1 controller.go:466] evicting pod openshift-monitoring/grafana-755f49d779-kjrtm
I0608 13:44:59.029530 1 controller.go:466] evicting pod openshift-monitoring/thanos-querier-6bc8cc988f-bl9gf
I0608 13:44:59.029700 1 controller.go:466] evicting pod openshift-ingress/router-default-5776858d7b-g566h
I0608 13:45:02.319958 1 controller.go:354] Evicted pod from Nodepodthanos-querier-6bc8cc988f-bl9gf/openshift-monitoring
I0608 13:45:02.319961 1 controller.go:354] Evicted pod from Nodepodgrafana-755f49d779-kjrtm/openshift-monitoring
I0608 13:45:02.320030 1 controller.go:354] Evicted pod from Nodepodalertmanager-main-0/openshift-monitoring
W0608 13:45:19.234082 1 controller.go:375] drain failed for machine "ocp-edge-cluster-0-worker-0-ctb52": error when waiting for pod "router-default-5776858d7b-g566h" terminating: global timeout reached: 20s
E0608 13:45:19.234128 1 controller.go:213] ocp-edge-cluster-0-worker-0-ctb52: failed to drain node for machine: requeue in: 20s
I0608 13:45:19.234143 1 controller.go:401] Actuator returned requeue-after error: requeue in: 20s
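For context on the "requeue in: 20s" messages above, here is a rough, hypothetical sketch (not the actual machine-controller code) of how a controller-runtime reconciler typically turns a failed drain into a delayed retry; this is why the machine sits in the Deleting phase and the drain is re-attempted on every requeue. The drainNode function, the pod name, and reconcileDelete are placeholders invented for illustration.

// Illustrative sketch only: drain failure handled as a delayed retry.
package sketch

import (
	"fmt"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// drainNode stands in for the real drain logic; assume it returns an error
// such as "global timeout reached: 20s" when a pod will not evict in time.
func drainNode(nodeName string) error {
	return fmt.Errorf("error when waiting for pod %q terminating: global timeout reached: 20s",
		"router-default-5776858d7b-g566h")
}

// reconcileDelete shows the requeue-after pattern: instead of failing the
// reconcile permanently, ask the controller to retry the drain shortly.
func reconcileDelete(nodeName string) (ctrl.Result, error) {
	if err := drainNode(nodeName); err != nil {
		// Drain did not complete; retry in 20s, leaving the machine in "Deleting".
		return ctrl.Result{RequeueAfter: 20 * time.Second}, nil
	}
	// Drain succeeded; the actuator can now delete the backing instance and,
	// once it is gone, remove the Node object and the machine finalizer.
	return ctrl.Result{}, nil
}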
Created attachment 1696125 [details] machine-api-operator container log
Created attachment 1696126 [details] monitoring operator log
Created attachment 1696127 [details] machine controller log
Created attachment 1696128 [details] machineset controller log
Re-checked with Server Version: 4.5.0-0.nightly-2020-06-09-223121.
Draining now succeeds, but the node is still not deleted.
Attaching the updated machine-controller.log.
Created attachment 1696487 [details] updated machine controller log
From machine-controller: "can't proceed deleting machine while cloud instance is being terminated, requeuing"

What this means is that the bare metal provider is indicating that the instance/server still exists according to the bare metal operator. So, either the instance is never being successfully deleted by the bare metal operator, or, if the BMO is trying to re-use an instance name, the replacement already exists.

I have a feeling this might be the case due to the original case comment: "If we try to scale up the machineset, the machine in Deleting state is still listed, new machine shown in provisioned state". We can also see from the machine-controller logs that the node for this associated machine is unreachable, which indicates it was stopped at some point or otherwise isolated from the cluster (this does not happen during a normal drain; most likely this was caused by the BMO). Since the 'new machine' was stuck in Provisioned state in the original case comment, it gives me the indication that names are being reused in the infrastructure, and that's going to cause these problems.

One way the BMO could handle this is to keep track of instance/machine ID mappings. For example, when the BM provider queries the BMO for 'machine-1', it should associate that with some actual instance, 'instance-a'. The BMO needs to keep track of the lifecycle of 'instance-a' and ensure it's not associated with any new machine until it's completely deprovisioned. When the BM actuator inquires about 'machine-1' again, it should know that 'instance-a' is no longer associated with 'machine-1' and return 404. This would allow 'machine-2' to be associated with 'instance-a' internally to the BMO, so these types of races/conflicts can't happen.

Anyway, that's what I *think* is possibly happening; there's not really enough information in this BZ to confirm.
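To make the idea above concrete, here is a minimal, purely hypothetical sketch (not baremetal-operator code) of the kind of machine-to-instance bookkeeping described: a host that is still deprovisioning is never handed to a new Machine, and a lookup for a released Machine returns "not found" (the 404 mentioned above). All names here (tracker, host, Lookup, Release, Claim) are made up for illustration.

// Hypothetical sketch of machine -> instance lifecycle tracking.
package sketch

import "errors"

// ErrNotFound is what the actuator would see as a 404.
var ErrNotFound = errors.New("no instance associated with this machine")

type hostState int

const (
	stateProvisioned hostState = iota
	stateDeprovisioning
	stateAvailable
)

type host struct {
	id    string // e.g. "instance-a"
	state hostState
	owner string // Machine currently bound to this host, "" if none
}

type tracker struct {
	hosts map[string]*host // keyed by instance ID
}

// Lookup answers the actuator's "does machine X still have an instance?".
func (t *tracker) Lookup(machine string) (*host, error) {
	for _, h := range t.hosts {
		if h.owner == machine {
			return h, nil
		}
	}
	// Association was released (or never existed): report 404 so the
	// machine controller can finish deleting the Machine object.
	return nil, ErrNotFound
}

// Release detaches a Machine from its host; the host stays unavailable for
// reuse until deprovisioning completes and it is marked Available.
func (t *tracker) Release(machine string) {
	for _, h := range t.hosts {
		if h.owner == machine {
			h.owner = ""
			h.state = stateDeprovisioning
		}
	}
}

// Claim hands an Available host to a new Machine; a host that is still
// deprovisioning is never reused, which avoids the name-reuse race.
func (t *tracker) Claim(machine string) (*host, error) {
	for _, h := range t.hosts {
		if h.state == stateAvailable && h.owner == "" {
			h.owner = machine
			h.state = stateProvisioned
			return h, nil
		}
	}
	return nil, errors.New("no available host")
}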
I'm reasonably confident that this bug is the same as 1855823. The fix for that has already merged. If I am missing some detail about why they are not the same, please let me know.

*** This bug has been marked as a duplicate of bug 1855823 ***