1337470 – Pods pending on Node with unknown reason

Summary: Pods pending on Node with unknown reason

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	3.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Derek Carr
QA Contact:	DeShuai Ma
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-05-19 09:29 UTC by Jaspreet Kaur
Modified:	2019-12-16 05:48 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	Enhancement
Doc Text:	Feature: Ability to define eviction thresholds for imagefs Reason: Evicts pods when node is running low on disk Result: Disk is reclaimed and node remains stable.
Clone Of:
Environment:
Last Closed:	2017-01-18 12:41:02 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2017:0066	0	normal	SHIPPED_LIVE	Red Hat OpenShift Container Platform 3.4 RPM Release Advisory	2017-01-18 17:23:26 UTC

Description Jaspreet Kaur 2016-05-19 09:29:33 UTC

Description of problem: We have recently seen pods stuck in the Pending state. Describing the pod does not show any reason, and the pod stays pending indefinitely.

Two causes were found recently, although there may be many others:
- The Docker pool is exhausted
- The node is overcommitted


Actual results: Neither the pod description nor the events explain why the pod is in the Pending state. The node logs do not provide any information either.


Expected results: The reason should be reported more descriptively. Also, if the node is overcommitted or the Docker pool is already exhausted, the scheduler should take that into account when deciding whether to schedule the pod on that node.

Either way, users should be given more information about what is wrong while the pod is in the Pending state.


Additional info:

Comment 1 Andy Goldstein 2016-05-19 15:31:55 UTC

Typically when pods are stuck pending, it's because the node has asked Docker to pull an image, and the pull "hangs" for some reason. And because image pulling is done in serial, if a pull hangs, then all subsequent attempts to run pods on that node will be stuck Pending until the pull finishes.
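
To illustrate the serial-pull behavior described above, here is a minimal Python sketch. It is not kubelet code: the function name, the pod names, and the "Running"/"Pending" labels are assumptions made for the example. It models a single pull queue in which one hanging pull keeps every later pod Pending.

```python
from typing import Optional


def simulate_serial_pulls(
    pulls: list[tuple[str, Optional[float]]], deadline: float
) -> dict[str, str]:
    """Model one node that pulls images strictly one at a time.

    pulls: (pod name, pull duration in seconds), in queue order.
           A duration of None stands for a pull that never finishes.
    deadline: the moment (in seconds) at which pod states are observed.
    """
    status: dict[str, str] = {}
    clock = 0.0
    blocked = False  # True once a pull is still unfinished at the deadline
    for pod, duration in pulls:
        if blocked or duration is None or clock + duration > deadline:
            # This pull either never started or has not finished, so every
            # later pod in the queue is stuck behind it as well.
            status[pod] = "Pending"
            blocked = True
        else:
            clock += duration
            status[pod] = "Running"
    return status


if __name__ == "__main__":
    queue = [("pod-a", 5.0), ("pod-b", None), ("pod-c", 1.0)]
    for pod, state in simulate_serial_pulls(queue, deadline=60.0).items():
        print(pod, state)
```

In this model pod-c would need only one second to pull, yet it stays Pending because it is queued behind the hanging pull for pod-b.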

One improvement could be to add an event when a request to pull an image is queued. That way, you would at least know that was the last operation for the pod, and it would be easy to tell that pulling was hanging.
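
As a rough sketch of how such an event could help, the Python snippet below inspects the last event recorded for a pod. The `Event` class, the `diagnose_pending` helper, and the "PullQueued" reason are hypothetical names introduced for illustration only; they are not an existing OpenShift or Kubernetes API.

```python
from dataclasses import dataclass


@dataclass
class Event:
    pod: str
    reason: str  # e.g. the hypothetical "PullQueued", or "Pulling", "Pulled"


def diagnose_pending(events: list[Event], pod: str) -> str:
    """Return a hint based on the last event recorded for the given pod."""
    reasons = [e.reason for e in events if e.pod == pod]
    if not reasons:
        return "no events recorded"
    last = reasons[-1]
    if last == "PullQueued":
        # The pull was requested but never started: something ahead of it
        # in the queue is probably hanging.
        return "waiting in the image pull queue; an earlier pull may be hanging"
    if last == "Pulling":
        return "image pull in progress; this pull may be hanging"
    return f"last event: {last}"


if __name__ == "__main__":
    log = [
        Event("pod-b", "Pulling"),
        Event("pod-c", "PullQueued"),
    ]
    print(diagnose_pending(log, "pod-c"))
```

With a queue event in place, a pod that shows nothing after "PullQueued" points at the pull queue rather than at the pod itself.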

Re the docker pool, OSE 3.2 added support for correctly reporting the docker pool usage on devicemapper systems (i.e. RHEL), so I would expect to see improvements in 3.2 that aren't in 3.1.

We are also working on proactively evicting pods from nodes when the node determines that it's running low on memory or disk.

Is there anything else you're looking for?

Comment 12 Derek Carr 2016-10-25 19:33:59 UTC

OCP 3.4 has been rebased on Kube 1.4, which includes the support for disk eviction policies that we added upstream (see http://kubernetes.io/docs/admin/out-of-resource/).

Moving this to ON_QA
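
For readers unfamiliar with the eviction policies mentioned above, the following Python sketch shows how a threshold string in the `signal<value` form (for example `imagefs.available<10%`) could be compared against observed node statistics. It is an illustration only and not the kubelet's implementation. The helper names, the sample numbers, and the supported value formats (percentages and Ki/Mi/Gi suffixes) are assumptions of the example.

```python
UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_thresholds(spec: str) -> dict[str, str]:
    """Split 'sig1<val1,sig2<val2' into {'sig1': 'val1', 'sig2': 'val2'}."""
    thresholds = {}
    for item in spec.split(","):
        signal, value = item.strip().split("<", 1)
        thresholds[signal] = value
    return thresholds


def threshold_bytes(value: str, capacity: int) -> int:
    """Convert a threshold value to bytes; percentages are of capacity."""
    if value.endswith("%"):
        return int(capacity * float(value[:-1]) / 100)
    for suffix, factor in UNITS.items():
        if value.endswith(suffix):
            return int(float(value[: -len(suffix)]) * factor)
    return int(value)  # plain byte count


def signals_under_pressure(
    spec: str, observed: dict[str, tuple[int, int]]
) -> list[str]:
    """observed maps signal -> (available bytes, capacity bytes)."""
    crossed = []
    for signal, value in parse_thresholds(spec).items():
        available, capacity = observed[signal]
        if available < threshold_bytes(value, capacity):
            crossed.append(signal)
    return crossed


if __name__ == "__main__":
    GiB = 1024**3
    MiB = 1024**2
    stats = {
        "imagefs.available": (5 * GiB, 100 * GiB),   # 5% of image storage free
        "memory.available": (200 * MiB, 8 * GiB),
    }
    print(signals_under_pressure("imagefs.available<10%,memory.available<100Mi", stats))
```

When a signal such as `imagefs.available` falls below its threshold, the node is considered under pressure, which is the condition that triggers pod eviction in the policies referenced above.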

Comment 13 DeShuai Ma 2016-10-26 05:55:39 UTC

Tested on openshift v3.4.0.15+9c963ec; disk pressure works as expected.
Details are in the card: https://trello.com/c/3LvGAHr3/371-5-kubelet-evicts-pods-when-low-on-disk-node-reliability

Marking this bug as verified.

Comment 15 errata-xmlrpc 2017-01-18 12:41:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066
