Summary:          machine-controller does not wait for nodes to drain (4.2)
Product:          OpenShift Container Platform
Component:        Cloud Compute
Version:          4.2.0
Status:           CLOSED ERRATA
Reporter:         Michael Gugino <mgugino>
Assignee:         Michael Gugino <mgugino>
QA Contact:       Jianwei Hou <jhou>
CC:               agarcial, dhellmann, eparis, jhou
Fixed In Version:
Doc Type:         If docs needed, set a value
Doc Text:
Story Points:     ---
Last Closed:      2019-10-16 06:29:43 UTC
Type:             ---
oVirt Team:       ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:  ---
Target Upstream Version:
Bug Depends On:
Bug Blocks:       1729243, 1729510, 1743846
Description Michael Gugino 2019-07-12 12:54:58 UTC
+++ This bug was initially created as a clone of Bug #1729243 +++

Description of problem:
When deleting a machine object, the machine-controller first attempts to cordon and drain the node. Unfortunately, a bug in the library github.com/openshift/kubernetes-drain prevents the machine-controller from waiting for a successful drain. The bug causes the library to believe a pod has been successfully evicted or deleted before the eviction/deletion has actually taken place.

Version-Release number of selected component (if applicable):

How reproducible: 100%

Steps to Reproduce:
1. Delete a worker machine object on a 4.1 IPI cluster.

Actual results:
Drain reports complete before pods are actually evicted or deleted.

Expected results:
Drain should wait for evictions/deletions to complete so that services aren't interrupted.

Additional info:
Logs from modified machine controller: https://gist.github.com/michaelgugino/bb8b4129094c683681d87cb63a4e5875
Modified machine-controller code: https://github.com/openshift/cluster-api-provider-aws/pull/234

--- Additional comment from Michael Gugino on 2019-07-11 16:51:38 UTC ---

PR for kubernetes-drain: https://github.com/openshift/kubernetes-drain/pull/1

--- Additional comment from Michael Gugino on 2019-07-12 12:52:25 UTC ---

PR for kubernetes-drain (https://github.com/openshift/kubernetes-drain/pull/1) merged. Need to distribute the fix to the machine-api libraries next.
Comment 1 Michael Gugino 2019-07-12 14:03:13 UTC
PR to openshift/cluster-api in 4.2: https://github.com/openshift/cluster-api/pull/52 After merging, need to vendor that change into the cluster-api-provider-* libs.
Comment 2 Michael Gugino 2019-07-12 14:14:19 UTC
cluster-api merged in 4.2. cluster-api-provider-aws: https://github.com/openshift/cluster-api-provider-aws/pull/237 Outstanding: Libvirt, GCP, Azure, Baremetal
Comment 3 Michael Gugino 2019-07-12 14:31:08 UTC
PR for cluster-api-provider-libvirt in 4.2: https://github.com/openshift/cluster-api-provider-libvirt/pull/162 Outstanding: GCP, Azure, Baremetal
Comment 4 Michael Gugino 2019-07-12 14:35:53 UTC
PR for cluster-api-provider-gcp in 4.2: https://github.com/openshift/cluster-api-provider-gcp/pull/31 Outstanding: Azure, Baremetal
Comment 5 Michael Gugino 2019-07-12 15:12:14 UTC
PR for azure: https://github.com/openshift/cluster-api-provider-azure/pull/52 Outstanding: Baremetal. Baremetal is being tracked here: https://jira.coreos.com/browse/KNIDEPLOY-652
Comment 6 Michael Gugino 2019-07-12 16:02:26 UTC
Everything except baremetal has merged in 4.2; baremetal will track separately.
Comment 8 Doug Hellmann 2019-07-12 18:12:20 UTC
Comment 9 Jianwei Hou 2019-07-18 06:13:49 UTC
Created attachment 1591678 [details]: Pods deleted during node draining

Verified in 4.2.0-0.nightly-2019-07-17-115118 on AWS IPI. Log attached.
Comment 10 errata-xmlrpc 2019-10-16 06:29:43 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922