Bug 1337470 - Pods pending on Node with unknown reason
Summary: Pods pending on Node with unknown reason
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Derek Carr
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-05-19 09:29 UTC by Jaspreet Kaur
Modified: 2019-12-16 05:48 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: Ability to define eviction thresholds for imagefs.
Reason: Evicts pods when the node is running low on disk.
Result: Disk is reclaimed and the node remains stable.
Clone Of:
Environment:
Last Closed: 2017-01-18 12:41:02 UTC
Target Upstream Version:
Embargoed:


Attachments:


Links:
Red Hat Product Errata RHBA-2017:0066 (normal, SHIPPED_LIVE): Red Hat OpenShift Container Platform 3.4 RPM Release Advisory, last updated 2017-01-18 17:23:26 UTC

Description Jaspreet Kaur 2016-05-19 09:29:33 UTC
Description of problem: We have recently seen pods stuck in the Pending state; describing the pod shows no reason, and the pod hangs in that state indefinitely.

Two causes have been identified recently, though there can be many other reasons:

- The docker storage pool is exhausted
- The node is overcommitted


Actual results: Neither the pod description nor its events explain why the pod is in the Pending state, and the node logs do not provide any information either.


Expected results: The reason for the Pending state should be reported more descriptively. Also, if the node is overcommitted or the docker pool is already exhausted, the scheduler should take that into account when deciding whether to place pods on that node.

Either way, users should be given more information about what is wrong while the pod is in the Pending state.


Additional info:

Comment 1 Andy Goldstein 2016-05-19 15:31:55 UTC
Typically when pods are stuck Pending, it's because the node has asked Docker to pull an image and the pull "hangs" for some reason. Because image pulls are performed serially, if one pull hangs, all subsequent attempts to run pods on that node will be stuck Pending until that pull finishes.

One improvement could be to add an event when a request to pull an image is queued. That way, you would at least know that was the last operation for the pod, and it would be easy to tell that pulling was hanging.
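
In the meantime, a rough way to spot a hung pull is to check the pod's events and the docker daemon on the node directly, for example (the commands and names below are illustrative):

  $ oc describe pod <pod-name> -n <project>                 # check the most recent event, e.g. a "Pulling image ..." line
  $ oc get events -n <project> --sort-by='.lastTimestamp'
  # on the node itself, watch whether a pull is still in flight:
  # journalctl -u docker -f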

Re the docker pool, OSE 3.2 added support for correctly reporting the docker pool usage on devicemapper systems (i.e. RHEL), so I would expect to see improvements in 3.2 that aren't in 3.1.
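
On a devicemapper node, the pool usage can also be checked directly on the host, e.g. (field names may vary by docker version):

  # docker info | grep 'Data Space'
  (shows Data Space Used / Data Space Total / Data Space Available for the thin pool)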

We are also working on proactively evicting pods from nodes when the node determines that it's running low on memory or disk.

Is there anything else you're looking for?

Comment 12 Derek Carr 2016-10-25 19:33:59 UTC
OCP 3.4 has rebased onto Kubernetes 1.4, which includes the disk-based eviction policies we added upstream (see http://kubernetes.io/docs/admin/out-of-resource/).
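
As a minimal sketch (the thresholds below are illustrative values, not recommendations), the eviction signals can be enabled on an OCP 3.4 node via kubeletArguments in node-config.yaml, roughly like this:

  kubeletArguments:
    eviction-hard:
    - "memory.available<100Mi"
    - "nodefs.available<10%"
    - "imagefs.available<15%"

  # then restart the node service, e.g.
  # systemctl restart atomic-openshift-node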

Moving this to ON_QA

Comment 13 DeShuai Ma 2016-10-26 05:55:39 UTC
Tested on openshift v3.4.0.15+9c963ec; disk pressure works as expected.
Details are in the card: https://trello.com/c/3LvGAHr3/371-5-kubelet-evicts-pods-when-low-on-disk-node-reliability
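
For reference, the behavior can be observed on a node under disk pressure with something like:

  $ oc describe node <node-name> | grep -A1 DiskPressure    # the DiskPressure condition turns True
  $ oc get pods --all-namespaces | grep Evicted             # evicted pods report status Evicted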

Verify this bug.

Comment 15 errata-xmlrpc 2017-01-18 12:41:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066

