Description of problem:
Cluster services fail over if /tmp gets full

Version-Release number of selected component (if applicable):
rgmanager-2.0.52-37.el5.x86_64

How reproducible:
Chris Henderson was able to reproduce this. I will copy his update on that into the first BZ comment.

Actual results:
Services failed over from one node to another when /tmp was full.

Expected results:
Service should not fail over.

Additional info:
I reproduced this on a 2 node cluster I just built.

password=redhat

chenders~:{1026}% rssh 10.10.178.33
[255] root.178.33's password:
Last login: Wed Aug 14 13:13:20 2013 from 10.3.113.132
[root@vm33 ~]# clustat
Cluster Status for amdocstest @ Wed Aug 14 13:16:05 2013
Member Status: Quorate

 Member Name                      ID   Status
 ------ ----                      ---- ------
 vm33.gsslab.rdu2.redhat.com      1    Online, Local, rgmanager
 dhcp95.gsslab.rdu2.redhat.com    2    Online, rgmanager

 Service Name           Owner (Last)                    State
 ------- ----           ----- ------                    -----
 service:testservice    dhcp95.gsslab.rdu2.redhat.com   started
[root@vm33 ~]# clusvcadm -r testservice -m vm33.gsslab.rdu2.redhat.com
Trying to relocate service:testservice to vm33.gsslab.rdu2.redhat.com...Success
service:testservice is now running on vm33.gsslab.rdu2.redhat.com
[root@vm33 ~]# clustat
Cluster Status for amdocstest @ Wed Aug 14 13:16:56 2013
Member Status: Quorate

 Member Name                      ID   Status
 ------ ----                      ---- ------
 vm33.gsslab.rdu2.redhat.com      1    Online, Local, rgmanager
 dhcp95.gsslab.rdu2.redhat.com    2    Online, rgmanager

 Service Name           Owner (Last)                    State
 ------- ----           ----- ------                    -----
 service:testservice    vm33.gsslab.rdu2.redhat.com     started
[root@vm33 ~]# df /tmp
Filesystem                 1K-blocks  Used Available Use% Mounted on
/dev/mapper/testvg-templv      99150  5669     88361   7% /tmp
[root@vm33 ~]# dd if=/dev/zero of=/tmp/bigfile
dd: writing to `/tmp/bigfile': No space left on device
186217+0 records in
186216+0 records out
95342592 bytes (95 MB) copied, 1.73068 seconds, 55.1 MB/s
[root@vm33 ~]# df /tmp
Filesystem                 1K-blocks  Used Available Use% Mounted on
/dev/mapper/testvg-templv      99150 99145         0 100% /tmp
[root@vm33 ~]# tail /var/log/messages
Aug 14 13:19:25 vm33 clurgmgrd: [2871]: <err> fs:test: /dev/mapper/testvg-testlv is not mounted on /test
Aug 14 13:19:25 vm33 clurgmgrd[2871]: <notice> status on fs "test" returned 1 (generic error)
Aug 14 13:19:25 vm33 clurgmgrd[2871]: <notice> Stopping service service:testservice
Aug 14 13:19:26 vm33 clurgmgrd[2871]: <notice> Service service:testservice is recovering
Aug 14 13:19:26 vm33 clurgmgrd[2871]: <notice> Service service:testservice is now running on member 2
[root@vm33 ~]# clustat
Cluster Status for amdocstest @ Wed Aug 14 13:20:35 2013
Member Status: Quorate

 Member Name                      ID   Status
 ------ ----                      ---- ------
 vm33.gsslab.rdu2.redhat.com      1    Online, Local, rgmanager
 dhcp95.gsslab.rdu2.redhat.com    2    Online, rgmanager

 Service Name           Owner (Last)                    State
 ------- ----           ----- ------                    -----
 service:testservice    dhcp95.gsslab.rdu2.redhat.com   started
Based on a quick look at /usr/share/cluster/utils/fs-lib.sh, I believe RHEL6 has this problem as well.

	typeset proc_mounts=$(mktemp /tmp/fs.proc.mounts.XXXXXX)
	cat /proc/mounts > $proc_mounts

	while read -r tmp_dev tmp_mp junka junkb junkc junkd; do
		# XXX fork/clone warning XXX
		if [ "${tmp_dev:0:1}" != "-" ]; then
			tmp_dev="$(printf "$tmp_dev")"
		fi

		if [ -n "$tmp_dev" -a "$tmp_dev" = "$dev" ]; then
			case $OCF_RESKEY_fstype in
			cifs|nfs|nfs4)
				;;
			*)
				return $YES
				;;
			esac
		fi

		# Mountpoint from /proc/mounts containing spaces will
		# have spaces represented in octal. printf takes care
		# of this for us.
		tmp_mp="$(printf "$tmp_mp")"

		if [ -n "$tmp_mp" -a "$tmp_mp" = "$mp" ]; then
			return $YES
		fi
	done < $proc_mounts
	rm -f $proc_mounts
(In reply to Chris Henderson from comment #4)
> Based on a quick look at /usr/share/cluster/utils/fs-lib.sh, I believe RHEL6
> has this problem as well.

Yeah. It's the same code in both places. I cloned the bug against RHEL6 at https://bugzilla.redhat.com/show_bug.cgi?id=998012
The practice of copying /proc/mounts into /tmp has already been removed upstream, and the fix is already in the latest 6.5 build. The commit below introduced the fix upstream.

https://github.com/ClusterLabs/resource-agents/commit/4d57e9cf453cbb34761d8b2e546dc4a71ba91c3c#L0R389
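
For illustration only, here is a minimal sketch of a mount check that reads the mounts table directly instead of staging a copy under /tmp. It is not the upstream commit. The function name is_mounted and its optional third argument are hypothetical, and the cifs/nfs special-casing from fs-lib.sh is left out for brevity. The motivation is an assumption drawn from the log above: with /tmp full, the staged copy presumably ends up empty or missing, so the loop never finds a match and the status check reports the filesystem as not mounted.

```shell
#!/bin/bash
# Hypothetical helper, not the actual fs-lib.sh code.
# Returns 0 if device $1 or mount point $2 appears in the mounts table,
# 1 otherwise. $3 optionally names the table (default: /proc/mounts).
is_mounted() {
    local dev=$1 mp=$2 mounts=${3:-/proc/mounts}
    local tmp_dev tmp_mp rest

    # Read the table in place; nothing is written to /tmp.
    while read -r tmp_dev tmp_mp rest; do
        # /proc/mounts encodes spaces as \040; %b decodes such escapes.
        tmp_dev=$(printf '%b' "$tmp_dev")
        tmp_mp=$(printf '%b' "$tmp_mp")

        if [ -n "$tmp_dev" ] && [ "$tmp_dev" = "$dev" ]; then
            return 0
        fi
        if [ -n "$tmp_mp" ] && [ "$tmp_mp" = "$mp" ]; then
            return 0
        fi
    done < "$mounts"

    return 1
}
```

Called as `is_mounted /dev/mapper/testvg-testlv /test`, the check no longer depends on free space in /tmp, since the loop's input comes straight from the kernel-provided file.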
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-1207.html