Created attachment 1447874 [details]
Listing showing the current age of the termination

Description of problem:
During a 3.9 -> 3.10 upgrade of starter-ca-central-1, one particular node could not be drained due to the following error:

There are pending nodes to be drained: ip-172-31-26-72.ca-central-1.compute.internal
error: error when evicting pod "arecocla-1-g9dwv": pods "arecocla-1-g9dwv" is forbidden: unable to create new content in namespace arecocla because it is being terminated

Version-Release number of selected component (if applicable):
v3.10.0-0.54.0 (master)
v3.9.14 (ip-172-31-26-72.ca-central-1.compute.internal)
https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/kubectl/cmd/drain.go#L552
https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/staging/src/k8s.io/apiserver/pkg/admission/plugin/namespace/lifecycle/admission.go#L172

This error is the result of writing to the eviction endpoint for a pod in a namespace that is being deleted, so the error message is not indicative of the root cause: the terminating namespace will kill this pod anyway. However, the error does cause "oc adm drain" to fail, which is unfortunate.

I think the fix should go in this area:
https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/kubectl/cmd/drain.go#L577

errCh should be handled like the doneCh case rather than returning immediately when one of the evictions fails. Ideally we could treat this particular error (i.e. the namespace, and thus the pod, is already in the process of being deleted) as a success case.
That being said, it is only a matter of waiting until the pod terminates, after which "oc adm drain" should return success. Basically: "oc adm drain; if error, sleep 60 (2x the normal grace period); oc adm drain". That should work, since any pods in a terminating namespace should be cleaned up by then, and the node was already cordoned by the first drain attempt. Since there seems to be a straightforward workaround for this, moving to 3.10.z.
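The two-step workaround above can be sketched as a small shell helper (the retry_after_grace name and the GRACE variable are illustrative, not part of oc):

```shell
#!/bin/sh
# Illustrative workaround wrapper: run the drain once, and if it fails,
# wait out 2x the normal 30s termination grace period and retry. The first
# attempt cordons the node, and pods in a terminating namespace are
# cleaned up by the namespace controller in the meantime.
GRACE="${GRACE:-60}"

retry_after_grace() {
    "$@" && return 0
    sleep "$GRACE"
    "$@"
}

# Usage (node name is an example):
#   retry_after_grace oc adm drain qe-wjiang-node-1 --ignore-daemonsets --delete-local-data
```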
Ryan, can you take a look?
see also https://bugzilla.redhat.com/show_bug.cgi?id=1479362

If the pod in the terminating namespace does not terminate within the grace period, this could be a "pod stuck terminating" issue.
PR and reproduction steps: https://github.com/kubernetes/kubernetes/pull/64896
Checked with:

# oc version
oc v3.11.0-0.25.0
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://qe-wjiang-master-etcd-1:8443
openshift v3.11.0-0.25.0
kubernetes v1.11.0+d4cacc0

And errors will be returned, so verified.

# oc adm drain qe-wjiang-node-1 --ignore-daemonsets=true --delete-local-data
node/qe-wjiang-node-1 cordoned
WARNING: Ignoring DaemonSet-managed pods: dockergc-pwd2r, node-exporter-5wqsc, sync-4l6zx, ovs-n2c7p, sdn-wl9h4; Deleting pods with local storage: mongodb-1-k4jqq, mongodb-1-qhr7p
pod/mongodb-1-qhr7p evicted
pod/nodejs-mongodb-example-1-kj274 evicted
pod/mongodb-1-k4jqq evicted
pod/nodejs-mongodb-example-1-5dq7w evicted
WARNING: Ignoring DaemonSet-managed pods: dockergc-pwd2r, node-exporter-5wqsc, sync-4l6zx, ovs-n2c7p, sdn-wl9h4
There are pending pods in node "qe-wjiang-node-1" when an error occurred: error when evicting pod "h-2-b77fh": pods "h-2-b77fh" is forbidden: unable to create new content in namespace wjiang because it is being terminated
error: unable to drain node "qe-wjiang-node-1", aborting command...

There are pending nodes to be drained:
qe-wjiang-node-1
error: error when evicting pod "h-2-b77fh": pods "h-2-b77fh" is forbidden: unable to create new content in namespace wjiang because it is being terminated
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652