Bug 1729512

Summary: machine-controller does not wait for nodes to drain (4.2)
Product: OpenShift Container Platform Reporter: Michael Gugino <mgugino>
Component: Cloud Compute Assignee: Michael Gugino <mgugino>
Status: CLOSED ERRATA QA Contact: Jianwei Hou <jhou>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 4.2.0 CC: agarcial, dhellmann, eparis, jhou
Target Milestone: ---   
Target Release: 4.2.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1729243 Environment:
Last Closed: 2019-10-16 06:29:43 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1729243, 1729510, 1743846    
Attachments:
Description Flags
Pods deleted during node draining none

Description Michael Gugino 2019-07-12 12:54:58 UTC
+++ This bug was initially created as a clone of Bug #1729243 +++

Description of problem:
When deleting a machine object, the machine-controller first attempts to cordon and drain the node.  Unfortunately, a bug in the library github.com/openshift/kubernetes-drain prevents the machine-controller from waiting for a successful drain.  This bug causes the library to believe a pod has been successfully evicted or deleted before the eviction/deletion has actually taken place.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1.  Delete a worker machine object on a 4.1 IPI cluster.

Actual results:
Drain reports complete before pods are actually evicted or deleted.

Expected results:
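The machine-controller should wait until the pods have actually been evicted or deleted before reporting the drain complete, so that services aren't interrupted.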

Additional info:
Logs from modified machine controller: https://gist.github.com/michaelgugino/bb8b4129094c683681d87cb63a4e5875

Modified machine-controller code: https://github.com/openshift/cluster-api-provider-aws/pull/234

--- Additional comment from Michael Gugino on 2019-07-11 16:51:38 UTC ---

PR for kubernetes-drain: https://github.com/openshift/kubernetes-drain/pull/1

--- Additional comment from Michael Gugino on 2019-07-12 12:52:25 UTC ---

PR for kubernetes-drain: https://github.com/openshift/kubernetes-drain/pull/1

merged.

Need to distribute fix to machine-api libraries next.

Comment 1 Michael Gugino 2019-07-12 14:03:13 UTC
PR to openshift/cluster-api in 4.2: https://github.com/openshift/cluster-api/pull/52

After merging, need to vendor that change into the cluster-api-provider-* libs.

Comment 2 Michael Gugino 2019-07-12 14:14:19 UTC
cluster-api merged in 4.2.

cluster-api-provider-aws: https://github.com/openshift/cluster-api-provider-aws/pull/237

Outstanding:  Libvirt, GCP, Azure, Baremetal

Comment 3 Michael Gugino 2019-07-12 14:31:08 UTC
PR for cluster-api-provider-libvirt in 4.2: https://github.com/openshift/cluster-api-provider-libvirt/pull/162

Outstanding: GCP, Azure, Baremetal

Comment 4 Michael Gugino 2019-07-12 14:35:53 UTC
PR for cluster-api-provider-gcp in 4.2: https://github.com/openshift/cluster-api-provider-gcp/pull/31

Outstanding: Azure, Baremetal

Comment 5 Michael Gugino 2019-07-12 15:12:14 UTC
PR for azure: https://github.com/openshift/cluster-api-provider-azure/pull/52

Outstanding: Baremetal

Baremetal is tracking here: https://jira.coreos.com/browse/KNIDEPLOY-652

Comment 6 Michael Gugino 2019-07-12 16:02:26 UTC
Everything except baremetal has merged in 4.2; baremetal will be tracked separately.

Comment 8 Doug Hellmann 2019-07-12 18:12:20 UTC
Bare metal PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/38

Comment 9 Jianwei Hou 2019-07-18 06:13:49 UTC
Created attachment 1591678
Pods deleted during node draining

Verified in 4.2.0-0.nightly-2019-07-17-115118 on AWS IPI. Log attached.

Comment 10 errata-xmlrpc 2019-10-16 06:29:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922