Bug 1729512 - machine-controller does not wait for nodes to drain (4.2)
Summary: machine-controller does not wait for nodes to drain (4.2)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.2.0
Assignee: Michael Gugino
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks: 1729243 1729510 1743846
Reported: 2019-07-12 12:54 UTC by Michael Gugino
Modified: 2019-10-16 06:29 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1729243
Environment:
Last Closed: 2019-10-16 06:29:43 UTC
Target Upstream Version:


Attachments
Pods deleted during node draining (10.45 KB, text/plain)
2019-07-18 06:13 UTC, Jianwei Hou


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:2922 None None None 2019-10-16 06:29:58 UTC

Description Michael Gugino 2019-07-12 12:54:58 UTC
+++ This bug was initially created as a clone of Bug #1729243 +++

Description of problem:
When deleting a machine object, the machine-controller first attempts to cordon and drain the node. Unfortunately, a bug in the library github.com/openshift/kubernetes-drain prevents the machine-controller from waiting for a successful drain. The bug causes the library to believe a pod has been successfully evicted or deleted before the eviction or deletion has actually taken place.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Delete a worker machine object on a 4.1 IPI cluster.

Actual results:
Drain reports complete before pods are actually evicted or deleted.

Expected results:
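The machine-controller should wait until the pods have actually been evicted or deleted before reporting the drain as complete, so that services aren't interrupted.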

Additional info:
Logs from modified machine controller: https://gist.github.com/michaelgugino/bb8b4129094c683681d87cb63a4e5875

Modified machine-controller code: https://github.com/openshift/cluster-api-provider-aws/pull/234

--- Additional comment from Michael Gugino on 2019-07-11 16:51:38 UTC ---

PR for kubernetes-drain: https://github.com/openshift/kubernetes-drain/pull/1

--- Additional comment from Michael Gugino on 2019-07-12 12:52:25 UTC ---

PR for kubernetes-drain: https://github.com/openshift/kubernetes-drain/pull/1

merged.

Need to distribute fix to machine-api libraries next.

Comment 1 Michael Gugino 2019-07-12 14:03:13 UTC
PR to openshift/cluster-api in 4.2: https://github.com/openshift/cluster-api/pull/52

After merging, need to vendor that change into the cluster-api-provider-* libs.

Comment 2 Michael Gugino 2019-07-12 14:14:19 UTC
cluster-api merged in 4.2.

cluster-api-provider-aws: https://github.com/openshift/cluster-api-provider-aws/pull/237

Outstanding:  Libvirt, GCP, Azure, Baremetal

Comment 3 Michael Gugino 2019-07-12 14:31:08 UTC
PR for cluster-api-provider-libvirt in 4.2: https://github.com/openshift/cluster-api-provider-libvirt/pull/162

Outstanding: GCP, Azure, Baremetal

Comment 4 Michael Gugino 2019-07-12 14:35:53 UTC
PR for cluster-api-provider-gcp in 4.2: https://github.com/openshift/cluster-api-provider-gcp/pull/31

Outstanding: Azure, Baremetal

Comment 5 Michael Gugino 2019-07-12 15:12:14 UTC
PR for azure: https://github.com/openshift/cluster-api-provider-azure/pull/52

Outstanding: Baremetal

Baremetal is tracking here: https://jira.coreos.com/browse/KNIDEPLOY-652

Comment 6 Michael Gugino 2019-07-12 16:02:26 UTC
Everything except baremetal has merged in 4.2; baremetal will be tracked separately.

Comment 8 Doug Hellmann 2019-07-12 18:12:20 UTC
Bare metal PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/38

Comment 9 Jianwei Hou 2019-07-18 06:13:49 UTC
Created attachment 1591678
Pods deleted during node draining

Verified in 4.2.0-0.nightly-2019-07-17-115118 on AWS IPI. Log attached.

Comment 10 errata-xmlrpc 2019-10-16 06:29:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

