Bug 1729512

Summary:

machine-controller does not wait for nodes to drain (4.2)

Product:

OpenShift Container Platform

Reporter:

Michael Gugino <mgugino>

Component:

Cloud Compute

Assignee:

Michael Gugino <mgugino>

Status:

CLOSED ERRATA

QA Contact:

Jianwei Hou <jhou>

Severity:

unspecified

Docs Contact:

Priority:

unspecified

Version:

4.2.0

CC:

agarcial, dhellmann, eparis, jhou

Target Milestone:

---

Target Release:

4.2.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

1729243

Environment:

Last Closed:

2019-10-16 06:29:43 UTC

Type:

---

Regression:

---

Bug Depends On:

Bug Blocks:

1729243, 1729510, 1743846

Attachments:

Description	Flags
Pods deleted during node draining	none

Description Michael Gugino 2019-07-12 12:54:58 UTC

+++ This bug was initially created as a clone of Bug #1729243 +++

Description of problem:
When a machine object is deleted, the machine-controller first attempts to cordon and drain the node. Unfortunately, a bug in the library github.com/openshift/kubernetes-drain prevents the machine-controller from waiting for a successful drain. This bug causes the library to believe a pod has been successfully evicted or deleted before the eviction/deletion has actually taken place.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Delete a worker machine object on a 4.1 IPI cluster.

Actual results:
Drain reports complete before pods are actually evicted or deleted.

Expected results:
The machine-controller should wait until the pods have actually been evicted or deleted, so that services aren't interrupted.

Additional info:
Logs from modified machine controller: https://gist.github.com/michaelgugino/bb8b4129094c683681d87cb63a4e5875

Modified machine-controller code: https://github.com/openshift/cluster-api-provider-aws/pull/234

--- Additional comment from Michael Gugino on 2019-07-11 16:51:38 UTC ---

PR for kubernetes-drain: https://github.com/openshift/kubernetes-drain/pull/1

--- Additional comment from Michael Gugino on 2019-07-12 12:52:25 UTC ---

PR for kubernetes-drain: https://github.com/openshift/kubernetes-drain/pull/1

merged.

Need to distribute fix to machine-api libraries next.

Comment 1 Michael Gugino 2019-07-12 14:03:13 UTC

PR to openshift/cluster-api in 4.2: https://github.com/openshift/cluster-api/pull/52

After merging, need to vendor that change into the cluster-api-provider-* libs.

Comment 2 Michael Gugino 2019-07-12 14:14:19 UTC

cluster-api merged in 4.2.

cluster-api-provider-aws: https://github.com/openshift/cluster-api-provider-aws/pull/237

Outstanding: Libvirt, GCP, Azure, Baremetal

Comment 3 Michael Gugino 2019-07-12 14:31:08 UTC

PR for cluster-api-provider-libvirt in 4.2: https://github.com/openshift/cluster-api-provider-libvirt/pull/162

Outstanding: GCP, Azure, Baremetal

Comment 4 Michael Gugino 2019-07-12 14:35:53 UTC

PR for cluster-api-provider-gcp in 4.2: https://github.com/openshift/cluster-api-provider-gcp/pull/31

Outstanding: Azure, Baremetal

Comment 5 Michael Gugino 2019-07-12 15:12:14 UTC

PR for azure: https://github.com/openshift/cluster-api-provider-azure/pull/52

Outstanding: Baremetal

Baremetal is being tracked here: https://jira.coreos.com/browse/KNIDEPLOY-652

Comment 6 Michael Gugino 2019-07-12 16:02:26 UTC

Everything except baremetal has merged in 4.2; baremetal will be tracked separately.

Comment 8 Doug Hellmann 2019-07-12 18:12:20 UTC

Bare metal PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/38

Comment 9 Jianwei Hou 2019-07-18 06:13:49 UTC

Created attachment 1591678
Pods deleted during node draining

Verified in 4.2.0-0.nightly-2019-07-17-115118 on AWS IPI. Log attached.

Comment 10 errata-xmlrpc 2019-10-16 06:29:43 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922