1828003 – Problem with deleting a node without draining it


Summary: Problem with deleting a node without draining it

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Alberto
QA Contact:	Daniel
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1840577
Depends On:
Blocks:

Reported:	2020-04-26 08:55 UTC by Daniel
Modified:	2020-07-13 17:31 UTC
CC List:	13 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-07-13 17:31:39 UTC
Target Upstream Version:
Embargoed:

Attachments
worker description (6.33 KB, text/plain) 2020-04-27 06:58 UTC, Lubov	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-api-provider-baremetal pull 69	0	None	closed	Bug 1828003: Change cluster-api to machine-api-operator API	2020-12-17 14:09:13 UTC
Red Hat Product Errata	RHBA-2020:2409	0	None	None	None	2020-07-13 17:31:56 UTC

Description Daniel 2020-04-26 08:55:06 UTC

Description of problem:
After creating a pod on a worker and then deleting the worker without draining it, the pod is successfully moved to another worker, but it also gets stuck in the 'Terminating' state on the deleted worker. The deleted worker is never actually removed: it is unreachable, yet it is still listed when checking the nodes.

If the node is then drained after the steps above, the running pod is moved to another worker but gets stuck in the 'Terminating' state on the current node. As a result, the node is not deleted properly.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Create a pod on a worker.

2. Delete the worker without draining it (that is, without running oc adm drain first):
$ oc annotate machine <worker-CONSUMER-name> machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
$ oc delete bmh <worker-name> -n openshift-machine-api

3. The pod is stuck in the Terminating state, so the worker node cannot be deleted. When trying to delete the worker node, its status becomes NotReady,SchedulingDisabled.
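Step 2 can be wrapped in a small shell function so that the same commands are replayed consistently between runs. This is a minimal sketch. The function name delete_worker_without_drain is hypothetical, and so are its two arguments, which stand in for the <worker-CONSUMER-name> and <worker-name> placeholders above. The annotate and delete commands are the ones from step 2. The final `oc get pods -o wide` is added only to show which node each pod in the current project is on, so that a pod stuck in Terminating is visible.

```shell
# Hypothetical helper replaying step 2: delete a worker WITHOUT draining it first.
# $1: Machine name (the BareMetalHost's consumer), $2: BareMetalHost name.
delete_worker_without_drain() {
  local machine=$1
  local bmh=$2
  local ns=openshift-machine-api

  # Mark the machine for deletion (deliberately no `oc adm drain` beforehand).
  oc annotate machine "$machine" machine.openshift.io/cluster-api-delete-machine=yes -n "$ns" || return 1
  # Delete the BareMetalHost backing the worker.
  oc delete bmh "$bmh" -n "$ns" || return 1
  # List pods with their nodes; the bug shows up as a pod left in Terminating on the old node.
  oc get pods -o wide
}
```

Usage: delete_worker_without_drain <worker-CONSUMER-name> <worker-name>. The function stops at the first oc command that fails, so a missing BareMetalHost does not go unnoticed. Comment 7 shows that failure on a cluster without bare metal hosts.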

Actual results:
$ oc get nodes
Status of the deleted worker node: NotReady,SchedulingDisabled
$ oc get machine -o wide -n openshift-machine-api
The deleted node is still listed.
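A node in the state described above can be found by filtering the `oc get nodes` output on its STATUS column. oc prints combined conditions such as NotReady,SchedulingDisabled there as a single comma-separated value. This is a minimal sketch. It assumes the default column layout, with NAME first and STATUS second. The function name not_ready_nodes is hypothetical.

```shell
# Hypothetical filter: print the names of nodes whose STATUS includes NotReady.
# Reads `oc get nodes` output on stdin and skips the header row.
not_ready_nodes() {
  awk 'NR > 1 && $2 ~ /NotReady/ { print $1 }'
}
```

Usage: oc get nodes | not_ready_nodes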

Expected results:
The node is successfully deleted and is not listed by any of:
$ oc get nodes
$ oc get bmh -n openshift-machine-api
$ oc get machine -o wide -n openshift-machine-api
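The expected result can be checked by running the three listings above and looking for a given name in any column of their data rows. This is a minimal sketch, and the helper names listed_in and check_removed are hypothetical. The Node, Machine and BareMetalHost for one worker may carry different names, and each listing prints different columns. If the names differ, run the check once per name.

```shell
# Hypothetical helper: succeed (exit 0) if NAME appears as a whole field in any
# data row of the `oc get` output read on stdin; the header row is skipped.
listed_in() {
  awk -v name="$1" 'NR > 1 { for (i = 1; i <= NF; i++) if ($i == name) found = 1 }
                    END { exit !found }'
}

# Run the three listings from the expected results; return 1 if NAME still shows up.
check_removed() {
  local name=$1 status=0
  if oc get nodes | listed_in "$name"; then
    echo "still listed by: oc get nodes"; status=1
  fi
  if oc get bmh -n openshift-machine-api | listed_in "$name"; then
    echo "still listed by: oc get bmh"; status=1
  fi
  if oc get machine -o wide -n openshift-machine-api | listed_in "$name"; then
    echo "still listed by: oc get machine"; status=1
  fi
  return "$status"
}
```

Matching whole fields rather than substrings avoids false hits between similar names such as worker-0 and worker-0-abc.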

Additional info:

Comment 1 Mrunal Patel 2020-04-26 15:09:46 UTC

Can you provide information on the crio and kubelet versions in use?

Comment 2 Mrunal Patel 2020-04-26 17:11:36 UTC

We also need the kubelet and crio logs, if you have them.

Comment 6 Lubov 2020-04-27 06:58:51 UTC

Created attachment 1682043
worker description

Comment 7 MinLi 2020-04-27 07:28:03 UTC

I cannot reproduce the above steps; step 2 fails:

$ oc delete bmh ip-10-0-168-12.us-east-2.compute.internal -n openshift-machine-api
Error from server (NotFound): baremetalhosts.metal3.io "ip-10-0-168-12.us-east-2.compute.internal" not found

It seems that baremetalhosts don't exist!

Comment 8 Stephen Cuppett 2020-04-27 12:38:14 UTC

Setting target release to current development version (4.5) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.

Comment 10 Michael Gugino 2020-05-19 01:58:05 UTC

The issue here is that the cluster-api-provider-baremetal revendored github.com/openshift/cluster-api.  In 4.4 and newer, the relevant machine-api code has moved to github.com/openshift/machine-api-operator and should be vendored instead.  github.com/openshift/cluster-api is no longer in development.
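One way to see which machine API library a provider repository depends on is to look for each module path in its go.mod. This is a minimal sketch. It assumes the repository uses Go modules, which may not match the dependency tooling in use at the time. The function name requires_module is hypothetical.

```shell
# Hypothetical check: succeed if go.mod file $1 names module path $2 in a
# require or replace directive (block form or single-line form).
requires_module() {
  awk -v mod="$2" '
    $1 == mod { found = 1 }                                          # line inside a require or replace block
    ($1 == "require" || $1 == "replace") && $2 == mod { found = 1 }  # single-line directive
    END { exit !found }
  ' "$1"
}
```

For example, requires_module go.mod github.com/openshift/cluster-api succeeds while the old module is still referenced. After the switch described in this comment, requires_module go.mod github.com/openshift/machine-api-operator should succeed instead.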

Comment 11 Honza Pokorny 2020-05-26 16:38:41 UTC

*** Bug 1837505 has been marked as a duplicate of this bug. ***

Comment 12 Stephen Benjamin 2020-05-28 13:00:45 UTC

*** Bug 1840577 has been marked as a duplicate of this bug. ***

Comment 16 errata-xmlrpc 2020-07-13 17:31:39 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
