Bug 1337470

Summary: Pods pending on Node with unknown reason
Product: OpenShift Container Platform
Reporter: Jaspreet Kaur <jkaur>
Component: Node
Assignee: Derek Carr <decarr>
Status: CLOSED ERRATA
QA Contact: DeShuai Ma <dma>
Severity: high
Docs Contact:
Priority: high
Version: 3.1.0
CC: agoldste, aos-bugs, bleanhar, jkaur, jokerman, mmccomas, tdawson, wmeng
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: Ability to define eviction thresholds for imagefs Reason: Evicts pods when node is running low on disk Result: Disk is reclaimed and node remains stable.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-01-18 12:41:02 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Jaspreet Kaur 2016-05-19 09:29:33 UTC
Description of problem: We have recently seen pods stuck in the Pending state indefinitely. Describing the pod does not show any reason.

Two causes were found recently, though there may be many others:
- When the docker pool is exhausted
- When the node is overcommitted


Actual results: Nothing in the pod description or events explains why the pod is in the Pending state. The node logs do not provide any information either.


Expected results: The reason should be reported more descriptively. Also, if the node is overcommitted or the docker pool is already exhausted, the scheduler should be more active so that it should schedule the pod on that node.

Either way, users should be given more information about what is wrong while the pod is in the Pending state.


Additional info:

Comment 1 Andy Goldstein 2016-05-19 15:31:55 UTC
Typically when pods are stuck pending, it's because the node has asked Docker to pull an image, and the pull "hangs" for some reason. And because image pulling is done in serial, if a pull hangs, then all subsequent attempts to run pods on that node will be stuck Pending until the pull finishes.
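
The blocking effect of serial pulls can be illustrated with a small model. This is only a sketch in Python, not kubelet code: the function names are hypothetical, and a hung pull is modeled as a pull with infinite duration.

```python
import math


def serial_pull_start_times(pull_durations):
    """Return when each queued pull starts if pulls run strictly one at a time.

    pull_durations: seconds each pull takes; math.inf models a hung pull.
    """
    start_times = []
    clock = 0.0
    for duration in pull_durations:
        start_times.append(clock)
        clock += duration  # the next pull cannot begin until this one ends
    return start_times


def stuck_pods(pod_names, pull_durations):
    """Names of pods whose image pull never completes (they stay Pending)."""
    starts = serial_pull_start_times(pull_durations)
    return [
        name
        for name, start, duration in zip(pod_names, starts, pull_durations)
        if math.isinf(start + duration)
    ]


# Pod "b" hangs, so "c" never gets its turn even though its own pull is quick.
print(stuck_pods(["a", "b", "c"], [5, math.inf, 3]))
```

In this model every pod queued behind the hung pull is reported as stuck, which matches the symptom of several pods sitting in Pending on one node with no reason given.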

One improvement could be to add an event when a request to pull an image is queued. That way, you would at least know that was the last operation for the pod, and it would be easy to tell that pulling was hanging.

Re the docker pool, OSE 3.2 added support for correctly reporting the docker pool usage on devicemapper systems (i.e. RHEL), so I would expect to see improvements in 3.2 that aren't in 3.1.

We are also working on proactively evicting pods from nodes when the node determines that it's running low on memory or disk.

Is there anything else you're looking for?

Comment 12 Derek Carr 2016-10-25 19:33:59 UTC
OCP 3.4 has rebased on Kube 1.4 which has the support for disk eviction policies we added upstream (see http://kubernetes.io/docs/admin/out-of-resource/)
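
For readers unfamiliar with eviction thresholds, the upstream document describes them as expressions such as `imagefs.available<10%`. The following is a rough Python sketch of how such an expression could be evaluated. It is not the kubelet's implementation; the function names and the limited set of unit suffixes are assumptions made for illustration.

```python
# Binary unit suffixes handled by this sketch (an assumed, partial set).
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def threshold_bytes(quantity, capacity_bytes):
    """Convert a threshold quantity ("10%", "500Mi", "1Gi") to bytes."""
    if quantity.endswith("%"):
        return capacity_bytes * float(quantity[:-1]) / 100.0
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    return float(quantity)  # plain number of bytes


def under_pressure(threshold, available_bytes, capacity_bytes):
    """True if the available amount has fallen below the threshold."""
    signal, sep, quantity = threshold.partition("<")
    if not sep:
        raise ValueError("expected an expression like 'imagefs.available<10%'")
    return available_bytes < threshold_bytes(quantity.strip(), capacity_bytes)


GI = 2**30
# 5Gi free on a 100Gi image filesystem is below a 10% threshold.
print(under_pressure("imagefs.available<10%", 5 * GI, 100 * GI))
# 2Gi free is still above an absolute 1Gi threshold.
print(under_pressure("nodefs.available<1Gi", 2 * GI, 100 * GI))
```

When a check like this reports pressure, the node can evict pods to reclaim disk instead of leaving new pods Pending with no explanation.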

Moving this to ON_QA

Comment 13 DeShuai Ma 2016-10-26 05:55:39 UTC
Tested on openshift v3.4.0.15+9c963ec; disk pressure works as expected.
Details are in the card: https://trello.com/c/3LvGAHr3/371-5-kubelet-evicts-pods-when-low-on-disk-node-reliability

Verifying this bug.

Comment 15 errata-xmlrpc 2017-01-18 12:41:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066