1729512 – machine-controller does not wait for nodes to drain (4.2)

Bug 1729512 - machine-controller does not wait for nodes to drain (4.2)

Summary: machine-controller does not wait for nodes to drain (4.2)

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	4.2.0
Assignee:	Michael Gugino
QA Contact:	Jianwei Hou
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1729243 1729510 1743846
TreeView+	depends on / blocked

Reported:	2019-07-12 12:54 UTC by Michael Gugino
Modified:	2019-10-16 06:29 UTC (History)
CC List:	4 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1729243
Environment:
Last Closed:	2019-10-16 06:29:43 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
Pods deleted during node draining (10.45 KB, text/plain) 2019-07-18 06:13 UTC, Jianwei Hou	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:2922	0	None	None	None	2019-10-16 06:29:58 UTC

Description Michael Gugino 2019-07-12 12:54:58 UTC

+++ This bug was initially created as a clone of Bug #1729243 +++

Description of problem:
When deleting a machine object, the machine-controller first attempts to cordon and drain the node.  Unfortunately, a bug in the library github.com/openshift/kubernetes-drain prevents the machine-controller for waiting for a successful drain.  This bug causes the library to believe the pod has been successfully evicted or deleted before such eviction/deletion has actually taken place.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1.  Delete worker machine object on 4.1 cluster on IPI.

Actual results:
Drain reports complete before pods are actually evicted or deleted.

Expected results:
We should wait to ensure services aren't interrupted.

Additional info:
Logs from modified machine controller: https://gist.github.com/michaelgugino/bb8b4129094c683681d87cb63a4e5875

Modified machine-controller code: https://github.com/openshift/cluster-api-provider-aws/pull/234

--- Additional comment from Michael Gugino on 2019-07-11 16:51:38 UTC ---

PR for kubernetes-drain: https://github.com/openshift/kubernetes-drain/pull/1

--- Additional comment from Michael Gugino on 2019-07-12 12:52:25 UTC ---

PR for kubernetes-drain: https://github.com/openshift/kubernetes-drain/pull/1

merged.

Need to distribute fix to machine-api libraries next.

Comment 1 Michael Gugino 2019-07-12 14:03:13 UTC

PR to openshift/cluster-api in 4.2: https://github.com/openshift/cluster-api/pull/52

After merging, need to vendor that change into the cluster-api-provider-* libs.

Comment 2 Michael Gugino 2019-07-12 14:14:19 UTC

cluster-api merged in 4.2.

cluster-api-provider-aws: https://github.com/openshift/cluster-api-provider-aws/pull/237

Outstanding:  Libvirt, GCP, Azure, Baremetal

Comment 3 Michael Gugino 2019-07-12 14:31:08 UTC

PR for cluster-api-provider-libvirt in 4.2: https://github.com/openshift/cluster-api-provider-libvirt/pull/162

Outstanding: GCP, Azure, Baremetal

Comment 4 Michael Gugino 2019-07-12 14:35:53 UTC

PR for cluster-api-provider-gcp in 4.2: https://github.com/openshift/cluster-api-provider-gcp/pull/31

Outstanding: Azure, Baremetal

Comment 5 Michael Gugino 2019-07-12 15:12:14 UTC

PR for azure: https://github.com/openshift/cluster-api-provider-azure/pull/52

Outstanding: Baremetal

Baremetal is tracking here: https://jira.coreos.com/browse/KNIDEPLOY-652

Comment 6 Michael Gugino 2019-07-12 16:02:26 UTC

Everything except baremetal has merged in 4.2, they will track separately.

Comment 8 Doug Hellmann 2019-07-12 18:12:20 UTC

Bare metal PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/38

Comment 9 Jianwei Hou 2019-07-18 06:13:49 UTC

Created attachment 1591678 [details]
Pods deleted during node draining

Verified in 4.2.0-0.nightly-2019-07-17-115118 on AWS IPI. Log attached.

Comment 10 errata-xmlrpc 2019-10-16 06:29:43 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

Note You need to log in before you can comment on or make changes to this bug.