Description of problem: We have recently seen pods stuck in the Pending state; when the pod is described, no reason is shown, and the pod stays Pending indefinitely.
Two causes have been identified so far, though there can be many reasons for this (a quick check for both is sketched after this list):
- the Docker storage pool is exhausted
- the node is overcommitted
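For reference, one way to check both conditions from the node. This is only a sketch: it assumes the devicemapper storage driver (as on RHEL), and <node-name> is a placeholder:

    # Docker thin-pool usage on a devicemapper system;
    # compare Data Space Used against Data Space Total
    docker info | grep 'Data Space'

    # requests vs. capacity for the node;
    # see the Allocated resources section of the output
    oc describe node <node-name>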
Actual results: Nothing in the pod description or its events explains why the pod is Pending, and the node logs provide no information either.
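The checks that come up empty are along these lines (pod and namespace names are placeholders):

    oc describe pod <pod-name> -n <namespace>
    oc get events -n <namespace>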
Expected results: The pod's status should be more descriptive. Also, if the node is overcommitted or the Docker pool is already exhausted, the scheduler should take that into account and place the pod on a node that can actually run it.
Either way, users should be given more information about what is wrong while the pod is in the Pending state.
Typically, when pods are stuck Pending it's because the node has asked Docker to pull an image and the pull hangs for some reason. And because image pulls are performed serially, a single hung pull blocks all subsequent attempts to run pods on that node; those pods stay Pending until the pull finishes.
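For what it's worth, the serialization is controlled by the upstream kubelet flag --serialize-image-pulls (default true). A sketch of relaxing it, assuming a Docker version and storage driver that handle concurrent pulls safely:

    # let the kubelet pull images in parallel, so one hung pull
    # does not block every other pod scheduled to the node
    kubelet ... --serialize-image-pulls=false

In OpenShift this would be passed through kubeletArguments in the node configuration rather than directly on the command line.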
One improvement could be to add an event when a request to pull an image is queued. That way you would at least know that the pull was the pod's last operation, and it would be easy to tell that the pull was hanging.
Re the docker pool: OSE 3.2 added support for correctly reporting docker pool usage on devicemapper systems (i.e., RHEL), so I would expect to see improvements in 3.2 that aren't in 3.1.
We are also working on proactively evicting pods from nodes when the node determines that it's running low on memory or disk.
Is there anything else you're looking for?
OCP 3.4 has rebased onto Kube 1.4, which includes support for the disk eviction policies we added upstream (see http://kubernetes.io/docs/admin/out-of-resource/)
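As a rough illustration, the disk signals are configured through the kubelet eviction flags; the thresholds below are made-up example values, not recommendations:

    # evict pods when memory or node-local disk runs low
    kubelet ... --eviction-hard=memory.available<100Mi,nodefs.available<10%,imagefs.available<15%

In OpenShift these would be set via kubeletArguments in node-config.yaml.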
Moving this to ON_QA
Tested on openshift v18.104.22.168+9c963ec; disk pressure works as expected.
Details are in the card: https://trello.com/c/3LvGAHr3/371-5-kubelet-evicts-pods-when-low-on-disk-node-reliability
Verifying this bug.
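For anyone re-verifying, the checks are along these lines (node name is a placeholder):

    # the node should report a DiskPressure=True condition while under pressure
    oc describe node <node-name> | grep -i DiskPressure

    # evicted pods show up with status Evicted
    oc get pods --all-namespaces | grep Evicted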
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.