+++ This bug was initially created as a clone of Bug #1729510 +++

The MCO is also affected by this bug, as we use the same library to drain nodes.

+++ This bug was initially created as a clone of Bug #1729243 +++

Description of problem:

When deleting a machine object, the machine-controller first attempts to cordon and drain the node. Unfortunately, a bug in the library github.com/openshift/kubernetes-drain prevents the machine-controller from waiting for a successful drain. The bug causes the library to believe a pod has been successfully evicted or deleted before the eviction/deletion has actually taken place.

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Delete a worker machine object on a 4.1 IPI cluster.

Actual results:
Drain reports completion before the pods are actually evicted or deleted.

Expected results:
Drain should wait until the pods are actually evicted or deleted, so that services aren't interrupted.

Additional info:
Logs from modified machine controller: https://gist.github.com/michaelgugino/bb8b4129094c683681d87cb63a4e5875
Modified machine-controller code: https://github.com/openshift/cluster-api-provider-aws/pull/234

--- Additional comment from Michael Gugino on 2019-07-11 16:51:38 UTC ---

PR for kubernetes-drain: https://github.com/openshift/kubernetes-drain/pull/1
PR: https://github.com/openshift/machine-config-operator/pull/962
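For context, here is a minimal, hypothetical sketch of the waiting behavior the drain logic needs: after requesting eviction/deletion, keep polling until the pod is actually gone (or has been recreated with a new UID) instead of returning as soon as the API accepts the request. The function name waitForDelete, the polling interval, and the use of client-go here are illustrative assumptions, not the actual kubernetes-drain implementation or its API.

package drainwait

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDelete polls until the named pod is gone, or has been replaced by a
// pod with a different UID (i.e. a controller recreated it), rather than
// returning as soon as the eviction/deletion request is accepted.
// Illustrative sketch only; not the kubernetes-drain API.
func waitForDelete(client kubernetes.Interface, namespace, name string, uid types.UID, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		pod, err := client.CoreV1().Pods(namespace).Get(context.TODO(), name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			// The pod object is really gone; drain can consider it evicted.
			return true, nil
		}
		if err != nil {
			return false, err
		}
		if pod.UID != uid {
			// Same name but a new UID: the original pod was deleted and recreated.
			return true, nil
		}
		fmt.Printf("still waiting for pod %s/%s to terminate\n", namespace, name)
		return false, nil
	})
}

The actual fix in the PRs above may differ in detail; the point is simply that drain has to block on the pod actually disappearing, not just on the eviction call returning.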
Verified on 4.2.0-0.nightly-2019-08-20-162755: pods are properly drained before the machine is deleted.

Steps:

1. Pick a worker node from:
   oc get node

2. Create 50 pods, replacing the kubernetes.io/hostname value with the hostname of the worker selected in step 1:
   for i in {1..50}; do oc run --restart=Never --overrides='{ "spec": { "nodeSelector": { "kubernetes.io/hostname": "ip-10-0-157-197" } } }' --image registry.fedoraproject.org/fedora:30 "foobar$i" -- sleep 1h; sleep .001; done

3. In a separate terminal, find the machine-api-controllers pod and follow the machine-controller logs:
   oc -n openshift-machine-api get pods
   oc -n openshift-machine-api logs -f pods/machine-api-controllers-654b499995-cjfdp -c machine-controller

4. Find the machine whose node matches the worker node selected in step 1 and delete it:
   oc -n openshift-machine-api get machine -o wide
   oc -n openshift-machine-api delete machine/foo-w8rzg-worker-us-west-2a-xf26r

5. Watch the logs in the terminal from step 3 to verify that the node drains all of its pods before the machine is deleted (a sketch of a programmatic check follows below).
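As an alternative to eyeballing the machine-controller logs in step 5, a rough client-go sketch like the following can list the pods still bound to the drained node. The kubeconfig handling and the hard-coded node name are assumptions for illustration; this is not part of any verification tooling.

package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Node to check, e.g. the worker picked in step 1 (illustrative value).
	node := "ip-10-0-157-197"

	// Build a client from the local kubeconfig (path taken from $KUBECONFIG here).
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List every pod still scheduled to the node, across all namespaces.
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(),
		metav1.ListOptions{FieldSelector: "spec.nodeName=" + node})
	if err != nil {
		panic(err)
	}

	// DaemonSet pods are not evicted by drain, so a handful remaining is expected.
	fmt.Printf("%d pods still on %s:\n", len(pods.Items), node)
	for _, p := range pods.Items {
		fmt.Printf("  %s/%s (%s)\n", p.Namespace, p.Name, p.Status.Phase)
	}
}

The same check can be done from the CLI with: oc get pods --all-namespaces --field-selector spec.nodeName=<node>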
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922