Bug 1297521 - Scaling up pod causes loop with Node is out of disk
Summary: Scaling up pod causes loop with Node is out of disk
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Andy Goldstein
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks: 1267746
 
Reported: 2016-01-11 18:45 UTC by Ryan Howe
Modified: 2019-10-10 10:51 UTC
CC: 9 users

Fixed In Version: atomic-openshift-3.1.1.900-1.git.1.bacd67f.el7
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-12 16:26:35 UTC
Target Upstream Version:
Embargoed:


Attachments
GUI images of issue (194.42 KB, image/gif), 2016-01-11 18:45 UTC, Ryan Howe


Links
Red Hat Product Errata RHSA-2016:1064 (normal, SHIPPED_LIVE): Important: Red Hat OpenShift Enterprise 3.2 security, bug fix, and enhancement update (last updated 2016-05-12 20:19:17 UTC)

Description Ryan Howe 2016-01-11 18:45:35 UTC
Created attachment 1113648 [details]
GUI images of issue

Description of problem:
 When a pod is manually scaled up, OpenShift tries to place it on a node that has no disk space left and then gets stuck in a loop of deployment attempts.


Version-Release number of selected component (if applicable):
3.1 


Steps to Reproduce:
1. Schedule a pod to a node
2. Fill up the disk space on that node
3. Manually scale the pod up via the GUI or CLI (a CLI sketch follows below)
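
A minimal CLI sketch of the reproduction (the rc name is a placeholder taken from the pod listing below, and the file path is arbitrary; dd simply writes until the filesystem is full):

# on the target node: exhaust the root filesystem
dd if=/dev/zero of=/largefile bs=1M
# from a master: scale up, then list pods and watch the failures accumulate
oc scale rc/logging-fluentd-1 --replicas=2
oc get pods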

Actual results:
The replication controller keeps creating new pods, all of which fail with the same OutOfDisk error.

Expected results:
Fail once with an error instead of repeatedly creating failing pods.

Additional info:
https://lists.openshift.redhat.com/openshift-archives/users/2016-January/msg00033.html

# oc get pods
....
logging-fluentd-1-x5h0i   0/1       OutOfDisk          0          1s
logging-fluentd-1-xl4hz   0/1       OutOfDisk          0          12s
logging-fluentd-1-xqhul   0/1       OutOfDisk          0          10s
logging-fluentd-1-ykpku   0/1       OutOfDisk          0          13s
logging-fluentd-1-z2map   0/1       OutOfDisk          0          7s

[root@master-001 ~]# oc get pods | wc -l
116
[root@master-001 ~]# oc get pods | wc -l
119
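
A possible cleanup for the failed pods the loop leaves behind (a sketch; it assumes the STATUS column reads OutOfDisk as in the listing above):

# oc get pods | grep OutOfDisk | awk '{print $1}' | xargs oc delete pod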

Comment 1 Andy Goldstein 2016-01-11 19:11:56 UTC
This should be resolved with the next rebase into origin. The following upstream PRs add the ability to prevent scheduling to nodes that are out of disk:

https://github.com/kubernetes/kubernetes/pull/16178
https://github.com/kubernetes/kubernetes/pull/16179
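
A quick way to inspect the node's disk condition once the fixed kubelet is running (the node name is taken from the verification below; the exact condition name, OutOfDisk, is an assumption based on the upstream changes):

# oc get node openshift-136.lab.sjc.redhat.com -o yaml | grep -A 3 OutOfDisk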

Comment 2 Andy Goldstein 2016-01-12 14:21:59 UTC
Not a 3.1.1 blocker

Comment 3 Eric Paris 2016-02-02 16:32:17 UTC
Upstream fixes merged Oct 29 and Nov 2. Fixed when the rebase lands.

Comment 4 Derek Carr 2016-02-03 16:28:47 UTC
The upstream PRs have landed in the openshift/origin repository.

Comment 5 DeShuai Ma 2016-02-24 05:07:35 UTC
Verified on openshift v3.1.1.905

steps:
1. Get the node
[root@openshift-115 dma]# oc get node
NAME                               STATUS                     AGE
openshift-115.lab.sjc.redhat.com   Ready,SchedulingDisabled   1d
openshift-136.lab.sjc.redhat.com   Ready                      1d

2. Create an rc and scale the pod to replicas=0
[root@openshift-115 dma]# oc get rc -n dma
CONTROLLER   CONTAINER(S)   IMAGE(S)                                                                           SELECTOR                                               REPLICAS   AGE
mysql-1      mysql          brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhscl/mysql-56-rhel7:latest   deployment=mysql-1,deploymentconfig=mysql,name=mysql   0          18m

3. Create a large file on the node to fill the disk to 100% usage (one way to do this is sketched after the df output below)
[root@openshift-136 ~]# df -lh
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/rhel72-root   10G   10G   20K 100% /
devtmpfs                 1.9G     0  1.9G   0% /dev
tmpfs                    1.9G     0  1.9G   0% /dev/shm
tmpfs                    1.9G  190M  1.7G  11% /run
tmpfs                    1.9G     0  1.9G   0% /sys/fs/cgroup
/dev/vda1                497M  197M  300M  40% /boot
tmpfs                    380M     0  380M   0% /run/user/0
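
One way to produce the fill shown above (the file path is arbitrary; dd writes until the filesystem is full and then exits with a "No space left on device" error):

[root@openshift-136 ~]# dd if=/dev/zero of=/largefile bs=1M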

4. Scale the rc to replicas=3
# oc scale rc/mysql-1 --replicas=3 -n dma

5. Check the pod status
[root@openshift-115 dma]# oc get pod -n dma
NAME            READY     STATUS    RESTARTS   AGE
mysql-1-8ss17   0/1       Pending   0          1m
mysql-1-aj620   0/1       Pending   0          1m
mysql-1-ufryk   0/1       Pending   0          1m
[root@openshift-115 dma]# oc describe pod/mysql-1-8ss17 -n dma|grep FailedScheduling
  1m		33s		7	{default-scheduler }			Warning		FailedScheduling	no nodes available to schedule pods
[root@openshift-115 dma]# oc describe pod/mysql-1-aj620 -n dma|grep FailedScheduling
  2m		11s		12	{default-scheduler }			Warning		FailedScheduling	no nodes available to schedule pods
[root@openshift-115 dma]# oc describe pod/mysql-1-ufryk -n dma|grep FailedScheduling
  2m		14s		13	{default-scheduler }			Warning		FailedScheduling	no nodes available to schedule pods
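
As a follow-up check (not part of the original verification, so treat it as a sketch): freeing the space on the node should let the Pending pods schedule instead of looping.

[root@openshift-136 ~]# rm -f /largefile
[root@openshift-115 dma]# oc get pod -n dma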

Comment 8 errata-xmlrpc 2016-05-12 16:26:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1064

