Bug 1292964 - OpenShift doesn't notice that Docker Storage is, or is reaching that state of being, full
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Pod
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: medium
Assigned To: Derek Carr
QA Contact: DeShuai Ma
Depends On:
Blocks: 1292845 1267746
 
Reported: 2015-12-18 16:23 EST by Eric Jones
Modified: 2017-03-08 13 EST

Doc Type: Bug Fix
Environment: OpenShift Enterprise 3.1. Current issue found on a node hosted in OpenStack
Last Closed: 2017-01-18 07:38:53 EST
Type: Bug


Attachments: None
Description Eric Jones 2015-12-18 16:23:38 EST
Description of problem:
OpenShift doesn't notice that Docker storage is full or approaching full. This allows the storage to fill up without anyone being the wiser, and then the docker service can, and likely will, fail, preventing further use of the node.

Version-Release number of selected component (if applicable):
Found on OSE 3.0, but cannot find evidence that it is fixed in 3.1

How reproducible:
100%

Steps to Reproduce:
1. Spin up a normal OSE 3.0 environment (the infrastructure of the environment does not appear to affect the issue)
2. Use it as normal: pull images and fill up storage

Actual results:
Either the docker service fails completely, or it appears to work but does not properly pull images into pods, leaving the pods stuck in the Pending state.

Expected results:
As images are pulled, the node reports storage levels, so it is clear when Docker storage is approaching full.

Additional info:
This is only useful if you can also clear up the Docker storage, hence the dependency on bug 1292845.
Comment 1 Paul Weil 2015-12-23 10:23:40 EST
GH issue: https://github.com/openshift/origin/issues/6350
Comment 2 Boris Kurktchiev 2016-01-04 16:29:59 EST
So I reported this originally, and as pointed out it depends on the image cleanup. I had to manually run the cleanup steps (specifically the image ones) from the OSE 3 docs in order to get it to actually clean up after itself. I have not tried running the Ansible playbooks, but overall it seems like the cleanup process should happen automatically, on at least a semi-regular schedule, which in my case did not seem to happen.
Comment 3 Ryan Howe 2016-03-29 17:35:35 EDT
Has this been fixed in 3.1 with this PR? 

https://github.com/openshift/origin/pull/5599
Comment 4 Michal Minar 2016-03-30 02:37:03 EDT
No, we have no fix yet.
Comment 7 Michal Minar 2016-05-12 03:25:59 EDT
This work is being done upstream. According to a proposal [1], everything we need (volume accounting) will be covered. As of now, only the volume interface [2] is in place. Unfortunately, accounting for host_path volumes has recently been disabled [3] due to high CPU load. Neither NFS nor AWS nor GCE is supported yet.

[1] https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/disk-accounting.md#introduction
[2] https://github.com/kubernetes/kubernetes/pull/18232
[3] https://github.com/kubernetes/kubernetes/pull/23446
Comment 8 Derek Carr 2016-08-12 11:17:52 EDT
This is a new feature in Kubernetes 1.4 that just got merged.

You can read the feature description here:

https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/kubelet-eviction.md

Users will be able to set thresholds for both the rootfs (i.e. nodefs) and the imagefs (i.e. Docker storage). If those thresholds are crossed, the node will report disk pressure, perform image garbage collection, and evict pods on the node to reduce disk pressure to a stable state. While the node reports disk pressure, no additional pods are admitted to the node for execution.
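
For illustration, each threshold is written as a signal, an operator, and a quantity. A minimal sketch of the kubelet eviction settings, assuming the Kubernetes 1.4 flag syntax (the numbers below are illustrative examples, not recommendations):

    # Hard thresholds: reclaim and evict as soon as the threshold is crossed.
    eviction-hard: "memory.available<100Mi,nodefs.available<10%,imagefs.available<15%"
    # Soft thresholds: act only after the condition has held for the grace period.
    eviction-soft: "imagefs.available<20%"
    eviction-soft-grace-period: "imagefs.available=1m"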

Marking this for the upcoming release.
Comment 9 Boris Kurktchiev 2016-08-12 11:40:44 EDT
I am assuming these are going to be exposed in some way in the node configs?
Comment 10 Derek Carr 2016-09-30 10:22:35 EDT
Boris - correct, in 3.4, users will be able to configure the values in node-config.
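
For example, a minimal node-config.yaml sketch, assuming the eviction settings are passed through as kubelet arguments via the kubeletArguments stanza (the values are illustrative, not recommendations):

    kubeletArguments:
      eviction-hard:
        - "nodefs.available<10%,imagefs.available<15%"
      eviction-soft:
        - "imagefs.available<20%"
      eviction-soft-grace-period:
        - "imagefs.available=1m"

A restart of the node service is needed to pick up node-config changes.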
Comment 11 Derek Carr 2016-10-25 12:38:22 EDT
OCP 3.4 has added support to handle disk pressure based on the work we did in upstream Kubernetes 1.4.

For details:
http://kubernetes.io/docs/admin/out-of-resource/

I am moving this to ON_QA as a result.
Comment 12 DeShuai Ma 2016-10-26 01:53:10 EDT
Tested on openshift v3.4.0.15+9c963ec; disk pressure works as expected.
Details are in the card: https://trello.com/c/3LvGAHr3/371-5-kubelet-evicts-pods-when-low-on-disk-node-reliability

Verifying this bug.
Comment 14 errata-xmlrpc 2017-01-18 07:38:53 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066
