Bug 1586120
Summary: [starter-ca-central-1] drain error due to namespace stuck in termination
Product: OpenShift Container Platform
Component: Node
Version: 3.10.0
Target Release: 3.11.0
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Keywords: TestCaseNeeded
Hardware: Unspecified
OS: Unspecified
Reporter: Justin Pierce <jupierce>
Assignee: Ryan Phillips <rphillips>
QA Contact: weiwei jiang <wjiang>
CC: aos-bugs, dma, jokerman, jupierce, kalexand, mmccomas, sjenning, xtian
Type: Bug
Last Closed: 2018-10-11 07:20:33 UTC

Doc Type: Bug Fix
Doc Text:
Cause: Upstream bug.
Consequence: "kubectl drain" hangs on an eviction error.
Fix: https://github.com/kubernetes/kubernetes/pull/64896
Result: kubectl no longer hangs if evicting a pod returns an error.
https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/kubectl/cmd/drain.go#L552
https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/staging/src/k8s.io/apiserver/pkg/admission/plugin/namespace/lifecycle/admission.go#L172

This error is the result of writing to the eviction endpoint for a pod in a namespace that is being deleted, so the error message is not indicative of the root cause. The terminating namespace should kill this pod. However, the error does cause the "oc adm drain" command to fail, which is unfortunate.

I think the fix should go in this area:
https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/kubectl/cmd/drain.go#L577

The errCh should be handled like the doneCh case, and should not return immediately if one of the evictions fails. Ideally we could treat this particular error (i.e. the namespace, and thus the pod, is already in the process of being deleted) as a success case. That said, it is only a matter of waiting until the pod terminates, and then "oc adm drain" should return success. Basically: "oc adm drain; if error, sleep 60 (2x the normal grace period), then oc adm drain again". That should work, since any pods in a terminating namespace should be cleaned up by then, and the node was already cordoned by the first drain attempt.

Since there seems to be a straightforward workaround for this, moving to 3.10.z.

Ryan, can you take a look?

See also https://bugzilla.redhat.com/show_bug.cgi?id=1479362

If the pod in the terminating namespace does not terminate within the grace period, this could be a "pod stuck terminating" issue.

PR and reproduction steps: https://github.com/kubernetes/kubernetes/pull/64896

Checked with:

```
# oc version
oc v3.11.0-0.25.0
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-wjiang-master-etcd-1:8443
openshift v3.11.0-0.25.0
kubernetes v1.11.0+d4cacc0
```

Errors are now returned instead of hanging, so verified.
```
# oc adm drain qe-wjiang-node-1 --ignore-daemonsets=true --delete-local-data
node/qe-wjiang-node-1 cordoned
WARNING: Ignoring DaemonSet-managed pods: dockergc-pwd2r, node-exporter-5wqsc, sync-4l6zx, ovs-n2c7p, sdn-wl9h4; Deleting pods with local storage: mongodb-1-k4jqq, mongodb-1-qhr7p
pod/mongodb-1-qhr7p evicted
pod/nodejs-mongodb-example-1-kj274 evicted
pod/mongodb-1-k4jqq evicted
pod/nodejs-mongodb-example-1-5dq7w evicted
WARNING: Ignoring DaemonSet-managed pods: dockergc-pwd2r, node-exporter-5wqsc, sync-4l6zx, ovs-n2c7p, sdn-wl9h4
There are pending pods in node "qe-wjiang-node-1" when an error occurred: error when evicting pod "h-2-b77fh": pods "h-2-b77fh" is forbidden: unable to create new content in namespace wjiang because it is being terminated
error: unable to drain node "qe-wjiang-node-1", aborting command...

There are pending nodes to be drained:
qe-wjiang-node-1
error: error when evicting pod "h-2-b77fh": pods "h-2-b77fh" is forbidden: unable to create new content in namespace wjiang because it is being terminated
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652
Created attachment 1447874 [details]
Listing showing the current age of the termination

Description of problem:
During a 3.9->3.10 upgrade of starter-ca-central-1, one particular node could not be drained due to the following error:

```
There are pending nodes to be drained:
ip-172-31-26-72.ca-central-1.compute.internal
error: error when evicting pod "arecocla-1-g9dwv": pods "arecocla-1-g9dwv" is forbidden: unable to create new content in namespace arecocla because it is being terminated
```

Version-Release number of selected component (if applicable):
v3.10.0-0.54.0 (master)
v3.9.14 (ip-172-31-26-72.ca-central-1.compute.internal)