Bug 1743846

Summary: MCD does not wait for nodes to drain
Product: OpenShift Container Platform Reporter: Antonio Murdaca <amurdaca>
Component: Machine Config Operator Assignee: Antonio Murdaca <amurdaca>
Status: CLOSED ERRATA QA Contact: Micah Abbott <miabbott>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 4.2.0 CC: agarcial, jchaloup, jhou, mgugino, miabbott, mnguyen
Target Milestone: ---   
Target Release: 4.2.0   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1729510 Environment:
Last Closed: 2019-10-16 06:36:58 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On: 1729512, 1737379    
Bug Blocks: 1729510    

Description Antonio Murdaca 2019-08-20 19:24:06 UTC
+++ This bug was initially created as a clone of Bug #1729510 +++

The MCO is also affected by this bug as we use the library to drain nodes.

+++ This bug was initially created as a clone of Bug #1729243 +++

Description of problem:
When deleting a machine object, the machine-controller first attempts to cordon and drain the node. Unfortunately, a bug in the library github.com/openshift/kubernetes-drain prevents the machine-controller from waiting for a successful drain. This bug causes the library to believe the pod has been successfully evicted or deleted before such eviction/deletion has actually taken place.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Delete a worker machine object on a 4.1 IPI cluster.

Actual results:
Drain reports complete before pods are actually evicted or deleted.

Expected results:
We should wait to ensure services aren't interrupted.

Additional info:
Logs from modified machine controller: https://gist.github.com/michaelgugino/bb8b4129094c683681d87cb63a4e5875

Modified machine-controller code: https://github.com/openshift/cluster-api-provider-aws/pull/234

--- Additional comment from Michael Gugino on 2019-07-11 16:51:38 UTC ---

PR for kubernetes-drain: https://github.com/openshift/kubernetes-drain/pull/1

Comment 1 Antonio Murdaca 2019-08-20 19:25:10 UTC
PR: https://github.com/openshift/machine-config-operator/pull/962

Comment 3 Michael Nguyen 2019-08-21 20:41:00 UTC
Verified on 4.2.0-0.nightly-2019-08-20-162755.
Pods are properly drained before the machine is deleted.

1. pick a worker node from
oc get node

2. create 50 pods, replacing the kubernetes.io/hostname value with the hostname of the worker picked in step 1
for i in {1..50}; do oc run --restart=Never --overrides='{ "spec": { "nodeSelector": { "kubernetes.io/hostname": "ip-10-0-157-197" } } }' --image registry.fedoraproject.org/fedora:30 "foobar$i" -- sleep 1h; sleep .001; done

3. in a separate terminal, get the name of the machine-api-controllers pod and follow the logs of its machine-controller container
oc -n openshift-machine-api get pods
oc -n openshift-machine-api logs -f pods/machine-api-controllers-654b499995-cjfdp -c machine-controller

4. find the machine whose node matches the worker node picked in step 1 and delete it
oc -n openshift-machine-api get machine -o wide
oc -n openshift-machine-api delete machine/foo-w8rzg-worker-us-west-2a-xf26r

5. watch the logs in the terminal from step 3 to verify that all pods are drained from the node before the machine is deleted

Comment 4 errata-xmlrpc 2019-10-16 06:36:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.