Bug 1743846

Summary:	MCD does not wait for nodes to drain
Product:	OpenShift Container Platform	Reporter:	Antonio Murdaca <amurdaca>
Component:	Machine Config Operator	Assignee:	Antonio Murdaca <amurdaca>
Status:	CLOSED ERRATA	QA Contact:	Micah Abbott <miabbott>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	4.2.0	CC:	agarcial, jchaloup, jhou, mgugino, miabbott, mnguyen
Target Milestone:	---
Target Release:	4.2.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1729510	Environment:
Last Closed:	2019-10-16 06:36:58 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1729512, 1737379
Bug Blocks:	1729510

Description Antonio Murdaca 2019-08-20 19:24:06 UTC

+++ This bug was initially created as a clone of Bug #1729510 +++

The MCO is also affected by this bug because it uses the same library (github.com/openshift/kubernetes-drain) to drain nodes.

+++ This bug was initially created as a clone of Bug #1729243 +++

Description of problem:
When deleting a machine object, the machine-controller first attempts to cordon and drain the node. Unfortunately, a bug in the library github.com/openshift/kubernetes-drain prevents the machine-controller from waiting for a successful drain. This bug causes the library to believe a pod has been successfully evicted or deleted before the eviction/deletion has actually taken place.
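
To illustrate the intended behavior, here is a minimal shell sketch of "wait until the pod is really gone". It is not the library's code (the library is written in Go). The function name wait_for_pod_gone and its arguments are hypothetical, and it assumes a pod counts as gone only when the lookup reports NotFound or when the name now belongs to a pod with a different UID. Any other lookup failure is treated as "unknown" and retried rather than as success.

```shell
# Sketch only: wait_for_pod_gone is a hypothetical helper, not part of oc or the drain library.
# Usage: wait_for_pod_gone NAMESPACE POD ORIGINAL_UID [ATTEMPTS] [INTERVAL_SECONDS]
wait_for_pod_gone() {
  local ns="$1" pod="$2" orig_uid="$3" attempts="${4:-30}" interval="${5:-2}"
  local i=0 out
  while [ "$i" -lt "$attempts" ]; do
    # stderr is merged into $out so the NotFound message can be inspected on failure.
    if out=$(oc -n "$ns" get pod "$pod" -o jsonpath='{.metadata.uid}' 2>&1); then
      # Lookup succeeded: a different UID means the original pod was replaced.
      if [ "$out" != "$orig_uid" ]; then
        return 0
      fi
    else
      case "$out" in
        *NotFound*) return 0 ;;  # the pod no longer exists
      esac
      # Any other error (for example an API outage) is NOT treated as deleted.
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  return 1  # pod still present after all attempts
}
```

A caller would record the pod's UID before requesting eviction and report the drain as complete only after this check returns 0 for every pod on the node.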

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Delete a worker machine object on a 4.1 IPI cluster.

Actual results:
Drain reports complete before pods are actually evicted or deleted.

Expected results:
The machine-controller should wait for the drain to actually finish so that services aren't interrupted.

Additional info:
Logs from modified machine controller: https://gist.github.com/michaelgugino/bb8b4129094c683681d87cb63a4e5875

Modified machine-controller code: https://github.com/openshift/cluster-api-provider-aws/pull/234

--- Additional comment from Michael Gugino on 2019-07-11 16:51:38 UTC ---

PR for kubernetes-drain: https://github.com/openshift/kubernetes-drain/pull/1

Comment 1 Antonio Murdaca 2019-08-20 19:25:10 UTC

PR: https://github.com/openshift/machine-config-operator/pull/962

Comment 3 Michael Nguyen 2019-08-21 20:41:00 UTC

Verified on 4.2.0-0.nightly-2019-08-20-162755
Pods are properly drained before the machine is deleted.

Steps
1. pick a worker node from
oc get node

2. create 50 pods, replacing the kubernetes.io/hostname value with the hostname of the worker picked in step 1
for i in {1..50}; do oc run --restart=Never --overrides='{ "spec": { "nodeSelector": { "kubernetes.io/hostname": "ip-10-0-157-197" } } }' --image registry.fedoraproject.org/fedora:30 "foobar$i" -- sleep 1h; sleep .001; done

3. in a separate terminal, get the name of the machine-api-controllers pod and follow the logs of its machine-controller container
oc -n openshift-machine-api get pods
oc -n openshift-machine-api logs -f pods/machine-api-controllers-654b499995-cjfdp -c machine-controller

4. find the machine whose node matches the worker node name selected in step 1 and delete it
oc -n openshift-machine-api get machine -o wide
oc -n openshift-machine-api delete machine/foo-w8rzg-worker-us-west-2a-xf26r

5. watch the logs in the terminal from step 3 to verify that the node drains all the pods before the machine is deleted
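
As a complement to reading the logs in step 5, the check can be scripted roughly as follows. This is only a sketch: count_test_pods and wait_for_test_pods_drained are hypothetical helper names, the foobar prefix comes from step 2, and the node argument is assumed to be the full node name as printed by oc get node. The pods are looked up in the current project, matching how they were created in step 2.

```shell
# Sketch only: hypothetical helpers for checking that the test pods left the node.

# Print how many foobarN pods are still scheduled on the given node.
count_test_pods() {
  local node="$1" names
  # Fail if the API call fails, so an error is never mistaken for "0 pods left".
  names=$(oc get pods --field-selector "spec.nodeName=$node" -o name) || return 1
  printf '%s\n' "$names" | grep -c '^pod/foobar' || true
}

# Poll until no foobarN pods remain on the node.
# Returns 0 when drained, 1 on timeout, 2 if the pod list could not be read.
wait_for_test_pods_drained() {
  local node="$1" attempts="${2:-60}" interval="${3:-5}" i=0 left
  while [ "$i" -lt "$attempts" ]; do
    left=$(count_test_pods "$node") || return 2
    echo "foobar pods still on $node: $left" >&2
    if [ "$left" -eq 0 ]; then
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  return 1
}
```

Running wait_for_test_pods_drained with the node name right after the delete in step 4 shows the count falling to zero, which can then be compared against the point in the machine-controller log where the drain is reported as complete.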

Comment 4 errata-xmlrpc 2019-10-16 06:36:58 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922