Bug 1743846 - MCD does not wait for nodes to drain
Summary: MCD does not wait for nodes to drain
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.2.0
Assignee: Antonio Murdaca
QA Contact: Micah Abbott
URL:
Whiteboard:
Depends On: 1729512 1737379
Blocks: 1729510
 
Reported: 2019-08-20 19:24 UTC by Antonio Murdaca
Modified: 2019-10-16 06:37 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1729510
Environment:
Last Closed: 2019-10-16 06:36:58 UTC
Target Upstream Version:
Embargoed:




Links
System: Red Hat Product Errata    ID: RHBA-2019:2922    Last Updated: 2019-10-16 06:37:07 UTC

Description Antonio Murdaca 2019-08-20 19:24:06 UTC
+++ This bug was initially created as a clone of Bug #1729510 +++

The MCO is also affected by this bug as we use the library to drain nodes.

+++ This bug was initially created as a clone of Bug #1729243 +++

Description of problem:
When deleting a machine object, the machine-controller first attempts to cordon and drain the node.  Unfortunately, a bug in the library github.com/openshift/kubernetes-drain prevents the machine-controller from waiting for a successful drain.  The bug causes the library to consider a pod successfully evicted or deleted before the eviction/deletion has actually taken place.
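
For illustration only (this is not the library's actual code): a minimal sketch of the wait step a correct drain performs after requesting eviction or deletion, polling until the pod is actually gone or has been replaced, rather than returning as soon as the request is accepted. It assumes k8s.io/client-go and k8s.io/apimachinery; the helper name waitForPodGone is hypothetical.

package drain

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPodGone polls the API server until the named pod is deleted
// (NotFound) or has been replaced by a pod with a different UID, instead
// of treating the pod as evicted as soon as the request is issued.
func waitForPodGone(ctx context.Context, c kubernetes.Interface, namespace, name string, uid types.UID, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		pod, err := c.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // pod is gone
		}
		if err != nil {
			return false, err // stop polling and surface the API error
		}
		if pod.UID != uid {
			return true, nil // original pod deleted; name reused by a new pod
		}
		return false, nil // still terminating; keep waiting
	})
}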

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1.  Delete a worker machine object on a 4.1 IPI cluster.

Actual results:
Drain reports complete before pods are actually evicted or deleted.

Expected results:
The drain should wait until pods are actually evicted or deleted so that services are not interrupted.

Additional info:
Logs from modified machine controller: https://gist.github.com/michaelgugino/bb8b4129094c683681d87cb63a4e5875

Modified machine-controller code: https://github.com/openshift/cluster-api-provider-aws/pull/234

--- Additional comment from Michael Gugino on 2019-07-11 16:51:38 UTC ---

PR for kubernetes-drain: https://github.com/openshift/kubernetes-drain/pull/1

Comment 1 Antonio Murdaca 2019-08-20 19:25:10 UTC
PR: https://github.com/openshift/machine-config-operator/pull/962

Comment 3 Michael Nguyen 2019-08-21 20:41:00 UTC
Verified on 4.2.0-0.nightly-2019-08-20-162755.
Pods are properly drained before the machine is deleted.

Steps
1. Pick a worker node from
oc get node

2. Create 50 pods, replacing the kubernetes.io/hostname value with the hostname of the worker node selected in step 1
for i in {1..50}; do oc run --restart=Never --overrides='{ "spec": { "nodeSelector": { "kubernetes.io/hostname": "ip-10-0-157-197" } } }' --image registry.fedoraproject.org/fedora:30 "foobar$i" -- sleep 1h; sleep .001; done

3. In a separate terminal, get the name of the machine-api-controllers pod and follow the logs of its machine-controller container
oc -n openshift-machine-api get pods
oc -n openshift-machine-api logs -f pods/machine-api-controllers-654b499995-cjfdp -c machine-controller

4. Find the machine whose node matches the worker node selected in step 1 and delete it
oc -n openshift-machine-api get machine -o wide
oc -n openshift-machine-api delete machine/foo-w8rzg-worker-us-west-2a-xf26r

5. Watch the logs in the terminal from step 3 to verify that the node drains all of its pods before the machine is deleted.

Comment 4 errata-xmlrpc 2019-10-16 06:36:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

