Created attachment 1447874 [details]
Listing showing the current age of the termination

Description of problem:
During a 3.9 -> 3.10 upgrade of starter-ca-central-1, one particular node could not be drained due to the following error:

There are pending nodes to be drained: ip-172-31-26-72.ca-central-1.compute.internal
error: error when evicting pod "arecocla-1-g9dwv": pods "arecocla-1-g9dwv" is forbidden: unable to create new content in namespace arecocla because it is being terminated

Version-Release number of selected component (if applicable):
v3.10.0-0.54.0 (master)
v3.9.14 (ip-172-31-26-72.ca-central-1.compute.internal)
https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/kubectl/cmd/drain.go#L552
https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/staging/src/k8s.io/apiserver/pkg/admission/plugin/namespace/lifecycle/admission.go#L172

This error is the result of writing to the eviction endpoint for a pod in a namespace that is being deleted, so the error message is not indicative of the root cause: the terminating namespace will kill this pod anyway. However, the error does cause "oc adm drain" to fail, which is unfortunate.

I think the fix should go in this area:
https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/kubectl/cmd/drain.go#L577

errCh should be handled like the doneCh case rather than returning immediately when one of the evictions fails. Ideally we could treat this particular error (i.e. the namespace, and thus the pod, is already in the process of being deleted) as a success case.
That being said, it is only a matter of waiting until the pod terminates, after which "oc adm drain" should return success. Basically: "oc adm drain; if error, sleep 60 (2x the normal grace period); oc adm drain". That should work, since any pods in a terminating namespace should be cleaned up by then, and the node was already cordoned by the first drain attempt. Since there seems to be a straightforward workaround for this, moving to 3.10.z.
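The two-step workaround above can be sketched as a small shell helper (the retry_after_grace name and the GRACE variable are illustrative, not part of oc):

```shell
#!/bin/sh
# Illustrative workaround wrapper: run the drain once, and if it fails,
# wait out 2x the normal 30s termination grace period and retry. The first
# attempt cordons the node, and pods in a terminating namespace are
# cleaned up by the namespace controller in the meantime.
GRACE="${GRACE:-60}"

retry_after_grace() {
    "$@" && return 0
    sleep "$GRACE"
    "$@"
}

# Usage (node name is an example):
#   retry_after_grace oc adm drain qe-wjiang-node-1 --ignore-daemonsets --delete-local-data
```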
Ryan, can you take a look?
see also https://bugzilla.redhat.com/show_bug.cgi?id=1479362

If the pod in the terminating namespace does not terminate within the grace period, this could be a "pod stuck terminating" issue.
PR and reproduction steps: https://github.com/kubernetes/kubernetes/pull/64896
Checked with:

# oc version
oc v3.11.0-0.25.0
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://qe-wjiang-master-etcd-1:8443
openshift v3.11.0-0.25.0
kubernetes v1.11.0+d4cacc0

And errors will be returned, so verified.

# oc adm drain qe-wjiang-node-1 --ignore-daemonsets=true --delete-local-data
node/qe-wjiang-node-1 cordoned
WARNING: Ignoring DaemonSet-managed pods: dockergc-pwd2r, node-exporter-5wqsc, sync-4l6zx, ovs-n2c7p, sdn-wl9h4; Deleting pods with local storage: mongodb-1-k4jqq, mongodb-1-qhr7p
pod/mongodb-1-qhr7p evicted
pod/nodejs-mongodb-example-1-kj274 evicted
pod/mongodb-1-k4jqq evicted
pod/nodejs-mongodb-example-1-5dq7w evicted
WARNING: Ignoring DaemonSet-managed pods: dockergc-pwd2r, node-exporter-5wqsc, sync-4l6zx, ovs-n2c7p, sdn-wl9h4
There are pending pods in node "qe-wjiang-node-1" when an error occurred: error when evicting pod "h-2-b77fh": pods "h-2-b77fh" is forbidden: unable to create new content in namespace wjiang because it is being terminated
error: unable to drain node "qe-wjiang-node-1", aborting command...

There are pending nodes to be drained:
qe-wjiang-node-1
error: error when evicting pod "h-2-b77fh": pods "h-2-b77fh" is forbidden: unable to create new content in namespace wjiang because it is being terminated
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652