Description of problem:

A pod got stuck in the Terminating state (see https://bugzilla.redhat.com/show_bug.cgi?id=1364176). Docker was not responding on that node, so I rebooted it. The node is now not coming back, possibly because of https://bugzilla.redhat.com/show_bug.cgi?id=1362109. Since the node is now in the NotReady state, the pod should be rescheduled to another node, but that is not happening.

root@300node-support-2: ~/svt/openshift_scalability # oc get pods --all-namespaces -o wide
NAMESPACE           NAME                          READY     STATUS        RESTARTS   AGE       IP           NODE
clusterproject266   deploymentconfig2v0-1-9us8s   1/1       Terminating   0          1d        172.21.5.5   192.1.1.63

root@300node-support-2: ~/svt/openshift_scalability # oc get nodes | grep 192.1.1.63
192.1.1.63   NotReady   6d

Version-Release number of selected component (if applicable):
openshift v3.3.0.10
kubernetes v1.3.0+57fb9ac
etcd 2.3.0+git

How reproducible:

Steps to Reproduce:
1. A pod is in the Terminating state when its node becomes NotReady.

Actual results:
The pod is stuck in the Terminating state and the project does not get deleted.

Expected results:
The pod should be rescheduled to another Ready node.

Additional info:
If you wait > 5 minutes, does the DeploymentConfig create a new pod on another node?
No, this Terminating pod has been stuck for a day. The node has been NotReady for a few hours now, and a replacement pod was still not created on another node.
Derek, would you mind looking at this? I think this may reproduce on a multi-node cluster by just stopping Docker on one node and waiting >5 minutes to see if the NodeController evicts the pods on the NotReady node.
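For reference, a rough sketch of that reproduction (assuming a systemd-managed docker service; the node name is a placeholder):

# On the chosen node, stop the container runtime
$ sudo systemctl stop docker

# From a master, watch the node transition to NotReady (this takes about
# 5 minutes), then check whether the node controller evicts its pods
$ oc get nodes -w
$ oc get pods --all-namespaces -o wide | grep <node-name>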
I do want to clarify that pods never get rescheduled. If you have a scalable resource (replication controller, deployment config), that will attempt to create new pods to replace failed ones, but a pod by itself is never moved or rescheduled. Just wanted to make sure that's clear :-)
*** Bug 1365657 has been marked as a duplicate of this bug. ***
To summarize the full set of discussion topics in this thread:

1. The kubelet will wait 5 minutes before transitioning from the Ready to the NotReady state if the kubelet's container runtime goes down. I think this time is too long, and it is not tunable by operators since it is hard-coded. See the upstream issue to try to come to a consensus: https://github.com/kubernetes/kubernetes/issues/30534

2. The node controller does not evict a pod if it is in the Terminating state and it is the ONLY pod scheduled to that node that requires eviction. This is because the node controller identifies that the pods on the node should be evicted, but because it is the only pod on the node and it has a TerminationGracePeriodSeconds, the current logic skips the delete on it, and it never goes into the terminating evictor queue. See the upstream issue to try to determine how to refactor: https://github.com/kubernetes/kubernetes/issues/30536

The operator can forcefully delete the pod in question by doing:

$ oc delete pods <pod> --grace-period=0

Given this is an edge case and its fix requires a larger refactor, I am marking this UPCOMING_RELEASE and hope to get fixes into Kubernetes 1.4 to be picked up by OpenShift upon that rebase.
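For illustration, applied to the stuck pod from the original report (pod and project names taken from the oc output above), the workaround would look roughly like this:

# Confirm the pod is still stuck in Terminating on the NotReady node
$ oc get pods -n clusterproject266 -o wide | grep Terminating

# Force the delete; --grace-period=0 tells the API server not to wait for
# the unreachable kubelet to confirm graceful termination
$ oc delete pod deploymentconfig2v0-1-9us8s -n clusterproject266 --grace-period=0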
*** Bug 1343157 has been marked as a duplicate of this bug. ***
Upstream PR for node controller not removing terminating pods from a node if it was the only pod on the node: https://github.com/kubernetes/kubernetes/pull/30624
Origin PR https://github.com/openshift/origin/pull/10503
This should be fixed, as the requisite Origin PR above has merged.
Tested with the following scenario:
- Created a two-node cluster
- Created projects with pods on both nodes
- Stopped docker on one of the nodes
- Deleted the projects immediately
- The node became NotReady and the pods stayed in Terminating state (this is where they previously got stuck)
- After a few minutes the pods were gone, while the node was still NotReady
- Started docker back on that node; the node became Ready and everything was good
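A rough outline of those verification steps as commands, after stopping docker on one node as in the reproduction sketch above (project name is a placeholder):

# Immediately delete the projects whose pods run on the stopped node
$ oc delete project <project>

# The node goes NotReady and its pods sit in Terminating; after a few
# minutes the pods should be removed even though the node stays NotReady
$ oc get nodes
$ oc get pods --all-namespaces -o wide | grep Terminating

# Bring the node back
$ sudo systemctl start docker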
Verified in the following versions:
openshift v3.4.0.16+cc70b72
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0066