+++ This bug was initially created as a clone of Bug #1729510 +++

The MCO is also affected by this bug, as we use the same library to drain nodes.

+++ This bug was initially created as a clone of Bug #1729243 +++

Description of problem:

When deleting a machine object, the machine-controller first attempts to cordon and drain the node. Unfortunately, a bug in the library github.com/openshift/kubernetes-drain prevents the machine-controller from waiting for a successful drain. The bug causes the library to believe a pod has been successfully evicted or deleted before the eviction/deletion has actually taken place.

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Delete a worker machine object on a 4.1 IPI cluster.

Actual results:
Drain reports completion before the pods are actually evicted or deleted.

Expected results:
Drain should wait until the pods are actually evicted or deleted, so that services aren't interrupted.

Additional info:
Logs from modified machine controller: https://gist.github.com/michaelgugino/bb8b4129094c683681d87cb63a4e5875
Modified machine-controller code: https://github.com/openshift/cluster-api-provider-aws/pull/234

--- Additional comment from Michael Gugino on 2019-07-11 16:51:38 UTC ---

PR for kubernetes-drain: https://github.com/openshift/kubernetes-drain/pull/1
PR: https://github.com/openshift/machine-config-operator/pull/962
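For context, here is a minimal, hypothetical sketch of the waiting behavior the drain logic needs: after requesting eviction/deletion, keep polling until the pod is actually gone (or has been recreated with a new UID) instead of returning as soon as the API accepts the request. The function name waitForDelete, the polling interval, and the use of client-go here are illustrative assumptions, not the actual kubernetes-drain implementation or its API.

package drainwait

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDelete polls until the named pod is gone, or has been replaced by a
// pod with a different UID (i.e. a controller recreated it), rather than
// returning as soon as the eviction/deletion request is accepted.
// Illustrative sketch only; not the kubernetes-drain API.
func waitForDelete(client kubernetes.Interface, namespace, name string, uid types.UID, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		pod, err := client.CoreV1().Pods(namespace).Get(context.TODO(), name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			// The pod object is really gone; drain can consider it evicted.
			return true, nil
		}
		if err != nil {
			return false, err
		}
		if pod.UID != uid {
			// Same name but a new UID: the original pod was deleted and recreated.
			return true, nil
		}
		fmt.Printf("still waiting for pod %s/%s to terminate\n", namespace, name)
		return false, nil
	})
}

The actual fix in the PRs above may differ in detail; the point is simply that drain has to block on the pod actually disappearing, not just on the eviction call returning.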
Verified on 4.2.0-0.nightly-2019-08-20-162755: pods are properly drained before the machine is deleted.

Steps:

1. Pick a worker node from:
   oc get node

2. Create 50 pods, replacing the kubernetes.io/hostname value with the hostname of the worker selected in step 1:
   for i in {1..50}; do oc run --restart=Never --overrides='{ "spec": { "nodeSelector": { "kubernetes.io/hostname": "ip-10-0-157-197" } } }' --image registry.fedoraproject.org/fedora:30 "foobar$i" -- sleep 1h; sleep .001; done

3. In a separate terminal, find the machine-api-controllers pod and follow the machine-controller logs:
   oc -n openshift-machine-api get pods
   oc -n openshift-machine-api logs -f pods/machine-api-controllers-654b499995-cjfdp -c machine-controller

4. Find the machine whose node matches the worker node selected in step 1 and delete it:
   oc -n openshift-machine-api get machine -o wide
   oc -n openshift-machine-api delete machine/foo-w8rzg-worker-us-west-2a-xf26r

5. Watch the logs in the terminal from step 3 to verify that the node drains all of its pods before the machine is deleted (a sketch of a programmatic check follows below).
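As an alternative to eyeballing the machine-controller logs in step 5, a rough client-go sketch like the following can list the pods still bound to the drained node. The kubeconfig handling and the hard-coded node name are assumptions for illustration; this is not part of any verification tooling.

package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Node to check, e.g. the worker picked in step 1 (illustrative value).
	node := "ip-10-0-157-197"

	// Build a client from the local kubeconfig (path taken from $KUBECONFIG here).
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List every pod still scheduled to the node, across all namespaces.
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(),
		metav1.ListOptions{FieldSelector: "spec.nodeName=" + node})
	if err != nil {
		panic(err)
	}

	// DaemonSet pods are not evicted by drain, so a handful remaining is expected.
	fmt.Printf("%d pods still on %s:\n", len(pods.Items), node)
	for _, p := range pods.Items {
		fmt.Printf("  %s/%s (%s)\n", p.Namespace, p.Name, p.Status.Phase)
	}
}

The same check can be done from the CLI with: oc get pods --all-namespaces --field-selector spec.nodeName=<node>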
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922