Description of problem:
I left a cluster of openshift v3 up for a week or two without checking on it. When I came back, I noticed that some of the services in my application were not functioning. In particular, my web UI, which attaches to a database, was showing an error about not being able to connect to it. I noticed that /var/log/messages had filled up /var on both of my nodes' file systems. When I checked `oc get pods`, I saw that the master had spun up 5000 pods of the mysql instance, all in the Error state with reason OutOfDisk, and the same had happened to the other pods running on that machine.

<snip>
NAME            READY     REASON      RESTARTS   AGE
mysql-1-00bg6   0/1       OutOfDisk   0          8h
mysql-1-01579   0/1       OutOfDisk   0          8h
mysql-1-01d0m   0/1       OutOfDisk   0          6h
mysql-1-02293   0/1       OutOfDisk   0          11h
...
<snip>

Version-Release number of selected component (if applicable):
openshift-master-3.0.1.0-0.git.205.2c9a9b0.el7ose.x86_64
openshift-3.0.1.0-0.git.205.2c9a9b0.el7ose.x86_64
openshift-sdn-ovs-3.0.1.0-0.git.205.2c9a9b0.el7ose.x86_64
openshift-node-3.0.1.0-0.git.205.2c9a9b0.el7ose.x86_64
tuned-profiles-openshift-node-3.0.1.0-0.git.205.2c9a9b0.el7ose.x86_64

How reproducible:
Unsure.

Steps to Reproduce:
1. Install an application.
2. Fill up /var.
3. Verify that openshift-master attempts to create lots of pods.

Actual results:
openshift-master attempted to create new pods after the disk space had run out.

Expected results:
openshift-master should recognize that the node(s) are in a bad state and not schedule any more pod creations.

Additional info:
Docker is using direct LVM on a separate 100 GB disk (xvdb). /var is on xvda3 and is 8 GB.
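For anyone hitting this before a fix lands, a rough cleanup of the accumulated failed pods is sketched below. This is only a sketch, not an official remediation: it assumes the OutOfDisk pods are safe to delete, that their names appear in the first column of `oc get pods` as in the output above, and that the node has free space again so the rc does not simply recreate them in the same state.

# Count how many pods are stuck with the OutOfDisk reason
oc get pods | grep -c OutOfDisk

# Delete every pod whose row mentions OutOfDisk; the pod name is the first column
oc get pods | grep OutOfDisk | awk '{print $1}' | xargs oc delete pod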
*** Bug 1248662 has been marked as a duplicate of this bug. ***
The fix for this should be in master now that the latest rebase has landed; please retest.
Verified on openshift v3.1.1.905.

Steps:

1. Get the nodes:
[root@openshift-115 dma]# oc get node
NAME                               STATUS                     AGE
openshift-115.lab.sjc.redhat.com   Ready,SchedulingDisabled   1d
openshift-136.lab.sjc.redhat.com   Ready                      1d

2. Create an rc and scale it to replicas=0:
[root@openshift-115 dma]# oc get rc -n dma
CONTROLLER   CONTAINER(S)   IMAGE(S)                                                                            SELECTOR                                               REPLICAS   AGE
mysql-1      mysql          brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhscl/mysql-56-rhel7:latest   deployment=mysql-1,deploymentconfig=mysql,name=mysql   0          18m

3. Create a large file to fill the disk on the schedulable node to 100% usage (one possible way to do this is sketched after these steps):
[root@openshift-136 ~]# df -lh
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/rhel72-root   10G   10G   20K 100% /
devtmpfs                 1.9G     0  1.9G   0% /dev
tmpfs                    1.9G     0  1.9G   0% /dev/shm
tmpfs                    1.9G  190M  1.7G  11% /run
tmpfs                    1.9G     0  1.9G   0% /sys/fs/cgroup
/dev/vda1                497M  197M  300M  40% /boot
tmpfs                    380M     0  380M   0% /run/user/0

4. Scale the rc to replicas=3:
# oc scale rc/mysql-1 --replicas=3 -n dma

5. Check the pod status:
[root@openshift-115 dma]# oc get pod -n dma
NAME            READY     STATUS    RESTARTS   AGE
mysql-1-8ss17   0/1       Pending   0          1m
mysql-1-aj620   0/1       Pending   0          1m
mysql-1-ufryk   0/1       Pending   0          1m

[root@openshift-115 dma]# oc describe pod/mysql-1-8ss17 -n dma | grep FailedScheduling
  1m   33s   7    {default-scheduler }   Warning   FailedScheduling   no nodes available to schedule pods
[root@openshift-115 dma]# oc describe pod/mysql-1-aj620 -n dma | grep FailedScheduling
  2m   11s   12   {default-scheduler }   Warning   FailedScheduling   no nodes available to schedule pods
[root@openshift-115 dma]# oc describe pod/mysql-1-ufryk -n dma | grep FailedScheduling
  2m   14s   13   {default-scheduler }   Warning   FailedScheduling   no nodes available to schedule pods

The new pods stay Pending with FailedScheduling instead of piling up in an error state on the full node, which matches the expected behavior.
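For completeness, one possible way to carry out step 3 and to confirm the node-side condition is sketched below. Assumptions: /var/tmp sits on the root filesystem being filled, fallocate is available, and the node condition shows up as OutOfDisk in `oc describe node` on this version; exact names and sizes may differ in your environment.

# On openshift-136: fill the root filesystem; fall back to dd if fallocate
# is not supported by the filesystem
fallocate -l 9G /var/tmp/fill || dd if=/dev/zero of=/var/tmp/fill bs=1M
df -lh /

# On the master: check whether the node reports an out-of-disk condition
oc describe node openshift-136.lab.sjc.redhat.com | grep -i -A2 OutOfDisk

# Clean up after the test
rm -f /var/tmp/fill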
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2016:1064