Bug 1586120
Summary: [starter-ca-central-1] drain error due to namespace stuck in termination
Product: OpenShift Container Platform
Component: Node
Version: 3.10.0
Target Release: 3.11.0
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Keywords: TestCaseNeeded
Hardware: Unspecified
OS: Unspecified
Reporter: Justin Pierce <jupierce>
Assignee: Ryan Phillips <rphillips>
QA Contact: weiwei jiang <wjiang>
CC: aos-bugs, dma, jokerman, jupierce, kalexand, mmccomas, sjenning, xtian
Type: Bug
Last Closed: 2018-10-11 07:20:33 UTC

Doc Type: Bug Fix
Doc Text:
Cause: Upstream bug.
Consequence: "kubectl drain" hangs on an eviction error.
Fix: https://github.com/kubernetes/kubernetes/pull/64896
Result: kubectl no longer hangs if evicting a pod returns an error.
https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/kubectl/cmd/drain.go#L552
https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/staging/src/k8s.io/apiserver/pkg/admission/plugin/namespace/lifecycle/admission.go#L172

This error is the result of writing to the eviction endpoint for a pod in a namespace that is being deleted, so the error message is not indicative of the root cause. The terminating namespace should kill this pod. However, the error does cause the "oc adm drain" command to fail, which is unfortunate.

I think the fix should go in this area:
https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/kubectl/cmd/drain.go#L577

The errCh should be handled like the doneCh case, and should not return immediately if one of the evictions fails. Ideally we could treat this particular error (i.e. the namespace, and thus the pod, is already in the process of being deleted) as a success case. That said, it is only a matter of waiting until the pod terminates, and then "oc adm drain" should return success. Basically: "oc adm drain; if error, sleep 60 (2x the normal grace period), then oc adm drain again". That should work, since any pods in a terminating namespace should be cleaned up by then, and the node was already cordoned by the first drain attempt.

Since there seems to be a straightforward workaround for this, moving to 3.10.z.

Ryan, can you take a look?

See also https://bugzilla.redhat.com/show_bug.cgi?id=1479362

If the pod in the terminating namespace does not terminate within the grace period, this could be a "pod stuck terminating" issue.

PR and reproduction steps: https://github.com/kubernetes/kubernetes/pull/64896

Checked with:

```
# oc version
oc v3.11.0-0.25.0
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-wjiang-master-etcd-1:8443
openshift v3.11.0-0.25.0
kubernetes v1.11.0+d4cacc0
```

Errors are now returned instead of hanging, so verified.
```
# oc adm drain qe-wjiang-node-1 --ignore-daemonsets=true --delete-local-data
node/qe-wjiang-node-1 cordoned
WARNING: Ignoring DaemonSet-managed pods: dockergc-pwd2r, node-exporter-5wqsc, sync-4l6zx, ovs-n2c7p, sdn-wl9h4; Deleting pods with local storage: mongodb-1-k4jqq, mongodb-1-qhr7p
pod/mongodb-1-qhr7p evicted
pod/nodejs-mongodb-example-1-kj274 evicted
pod/mongodb-1-k4jqq evicted
pod/nodejs-mongodb-example-1-5dq7w evicted
WARNING: Ignoring DaemonSet-managed pods: dockergc-pwd2r, node-exporter-5wqsc, sync-4l6zx, ovs-n2c7p, sdn-wl9h4
There are pending pods in node "qe-wjiang-node-1" when an error occurred: error when evicting pod "h-2-b77fh": pods "h-2-b77fh" is forbidden: unable to create new content in namespace wjiang because it is being terminated
error: unable to drain node "qe-wjiang-node-1", aborting command...

There are pending nodes to be drained:
qe-wjiang-node-1
error: error when evicting pod "h-2-b77fh": pods "h-2-b77fh" is forbidden: unable to create new content in namespace wjiang because it is being terminated
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652
Created attachment 1447874 [details]
Listing showing the current age of the termination

Description of problem:
During a 3.9->3.10 upgrade of starter-ca-central-1, one particular node could not be drained due to the following error:

```
There are pending nodes to be drained:
ip-172-31-26-72.ca-central-1.compute.internal
error: error when evicting pod "arecocla-1-g9dwv": pods "arecocla-1-g9dwv" is forbidden: unable to create new content in namespace arecocla because it is being terminated
```

Version-Release number of selected component (if applicable):
v3.10.0-0.54.0 (master)
v3.9.14 (ip-172-31-26-72.ca-central-1.compute.internal)