+++ This bug was initially created as a clone of Bug #1729243 +++

Description of problem:
When deleting a machine object, the machine-controller first attempts to cordon and drain the node. Unfortunately, a bug in the library github.com/openshift/kubernetes-drain prevents the machine-controller from waiting for a successful drain. This bug causes the library to believe the pod has been successfully evicted or deleted before the eviction or deletion has actually taken place.

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Delete a worker machine object on a 4.1 cluster on IPI.

Actual results:
Drain reports complete before pods are actually evicted or deleted.

Expected results:
The drain should wait until pods are actually evicted or deleted, to ensure services aren't interrupted.

Additional info:
Logs from modified machine controller: https://gist.github.com/michaelgugino/bb8b4129094c683681d87cb63a4e5875
Modified machine-controller code: https://github.com/openshift/cluster-api-provider-aws/pull/234

--- Additional comment from Michael Gugino on 2019-07-11 16:51:38 UTC ---

PR for kubernetes-drain: https://github.com/openshift/kubernetes-drain/pull/1

--- Additional comment from Michael Gugino on 2019-07-12 12:52:25 UTC ---

PR for kubernetes-drain: https://github.com/openshift/kubernetes-drain/pull/1 merged.
Need to distribute fix to machine-api libraries next.

PR to openshift/cluster-api in 4.2: https://github.com/openshift/cluster-api/pull/52
After merging, need to vendor that change into the cluster-api-provider-* libs.

cluster-api merged in 4.2.
cluster-api-provider-aws: https://github.com/openshift/cluster-api-provider-aws/pull/237
Outstanding: Libvirt, GCP, Azure, Baremetal

PR for cluster-api-provider-libvirt in 4.2: https://github.com/openshift/cluster-api-provider-libvirt/pull/162
Outstanding: GCP, Azure, Baremetal

PR for cluster-api-provider-gcp in 4.2: https://github.com/openshift/cluster-api-provider-gcp/pull/31
Outstanding: Azure, Baremetal

PR for azure: https://github.com/openshift/cluster-api-provider-azure/pull/52
Outstanding: Baremetal
Baremetal is tracking here: https://jira.coreos.com/browse/KNIDEPLOY-652

Everything except baremetal has merged in 4.2; baremetal will be tracked separately.

Bare metal PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/38

Created attachment 1591678: Pods deleted during node draining

Verified in 4.2.0-0.nightly-2019-07-17-115118 on AWS IPI. Log attached.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922