Description of problem:

When stopping a service that contains a filesystem resource managed by fs.sh, the stop operation can kill a process that is not located on that mount point. For example, if the service is defined as:

<service name="demo1" recovery="disable">
  <fs device="/dev/sda1" force_fsck="0" force_unmount="1" fsid="7880" fstype="ext3" mountpoint="/media/demo1" name="demo1fs" options="" self_fence="0"/>
</service>

there were a couple of scenarios in which stopping the service killed a process that was not on that mount point.

These processes should not have been killed:
  $ less /tmp/media/demo1/tmp.txt
  $ less /tmp/test\ /media/demo1/tmp.txt

This process was killed and should have been killed:
  $ less /media/demo1/tmp.txt

As you can see in the logs, 3 processes were killed when only 1 should have been:

Jan 15 21:03:03 rh4node-single clurgmgrd: [18978]: <notice> Forcefully unmounting /media/demo1
Jan 15 21:03:03 rh4node-single clurgmgrd: [18978]: <warning> killing process 19428 (root less /media/demo1)
Jan 15 21:03:03 rh4node-single clurgmgrd: [18978]: <warning> killing process 19444 (root less /media/demo1)
Jan 15 21:03:03 rh4node-single clurgmgrd: [18978]: <warning> killing process 19463 (root less /media/demo1)
Jan 15 21:03:08 rh4node-single clurgmgrd: [18978]: <info> unmounting /media/demo1
Jan 15 21:03:08 rh4node-single clurgmgrd[18978]: <notice> Service demo1 is disabled

I will attach a patch for fs.sh, and an fs.sh with the patch applied, that fixes this issue. The following scenarios have been tested with the patch, and only the process that should have been killed was killed:

  $ less /tmp/media/demo1/tmp.txt
  $ less /media/demo1\ 2/tmp.txt
  $ less /tmp/test\ /media/demo1/tmp.txt

Version-Release number of selected component (if applicable):
rgmanager-1.9.87-1.el4

How reproducible:
Every time

Steps to Reproduce:
1. Create a service like the one defined above.
2. Start the service:
   $ clusvcadm -e demo1
3. Open a couple of terminals, create the following files, and open each one with less:
   $ mkdir -p /tmp/media/demo1
   $ mkdir -p /tmp/test\ /media/demo1
   $ echo "test" > /tmp/media/demo1/tmp.txt
   $ echo "test" > /tmp/test\ /media/demo1/tmp.txt
   $ echo "test" > /media/demo1/tmp.txt
   $ less /tmp/media/demo1/tmp.txt
   $ less /tmp/test\ /media/demo1/tmp.txt
   $ less /media/demo1/tmp.txt
4. Stop the service:
   $ clusvcadm -d demo1

Actual results:
After the service is stopped, 2 processes that were not on the stopped mount point are killed.

Expected results:
Only processes on the mount point should be killed.

Additional info:
This bz should be cloned to RHEL5 because it will have the same bug.
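For what it's worth, the misbehavior looks like an unanchored substring match against the paths reported by lsof: "/media/demo1" is a substring of both "/tmp/media/demo1/tmp.txt" and "/tmp/test /media/demo1/tmp.txt", so those processes match as well. A minimal sketch of the difference, using hypothetical matching logic rather than the actual fs.sh code:

  # Illustration only: hypothetical matching logic, not the real fs.sh code.
  mp="/media/demo1"    # mount point being force-unmounted

  for path in /media/demo1/tmp.txt /tmp/media/demo1/tmp.txt "/tmp/test /media/demo1/tmp.txt"; do
      # Unanchored substring test: all three paths contain "$mp",
      # so all three processes would be killed (the reported bug).
      case "$path" in
          *"$mp"*) substr="kill" ;;
          *)       substr="keep" ;;
      esac
      # Anchored test: only the mount point itself or paths below it match.
      case "$path" in
          "$mp"|"$mp"/*) anchored="kill" ;;
          *)             anchored="keep" ;;
      esac
      printf '%-35s substring=%s anchored=%s\n' "$path" "$substr" "$anchored"
  done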
Created attachment 384713 [details]
Patch to fix killing the incorrect processes
Created attachment 384714 [details]
fs.sh with the patch applied, for RHEL4

This patched fs.sh was only tested on RHEL4, so I am not sure about RHEL5, but it should be close to the same.
I tested the reproducer outlined in the summary of this BZ on RHEL5 with the patched fs.sh that lon gave me for RHEL5, and it PASSED.

I tested the reproducer outlined in the summary of this BZ on RHEL4 with the patched fs.sh that lon gave me for RHEL4, and it PASSED.

--sbradley
http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=2d0323010f7110452625bc71c9898459a2ca7e85
This patch has an additional side effect: it no longer kills processes running directly on the service mount point. Try this:

<rm>
  <resources>
    <clusterfs device="/dev/vedder/vedder0" force_unmount="1" self_fence="0" fstype="gfs" mountpoint="/mnt/vedder0" name="vedderfs" options=""/>
  </resources>
  <service autostart="1" name="jkservice">
    <clusterfs ref="vedderfs"/>
  </service>
</rm>

Then run bash with its working directory on /mnt/vedder0 and make it ignore the signal:

  trap "" SIGTERM

Now the service migration will fail:

Apr 13 12:29:26 z2 clurgmgrd: [27562]: <notice> Forcefully unmounting /mnt/vedder0
Apr 13 12:29:30 z2 clurgmgrd: [27562]: <notice> Forcefully unmounting /mnt/vedder0
Apr 13 12:29:31 z2 clurgmgrd: [27562]: <err> 'umount /mnt/vedder0' failed, error=0
Apr 13 12:29:31 z2 clurgmgrd[27562]: <notice> stop on clusterfs "vedderfs" returned 2 (invalid argument(s))
Apr 13 12:29:31 z2 clurgmgrd[27562]: <crit> #12: service:jkservice failed to stop; intervention required
Apr 13 12:29:31 z2 clurgmgrd[27562]: <notice> Service jkservice is failed

The root cause is probably the regex at line 733, which expects every process in the lsof output to have the mount point listed with a trailing "/"; that is not the case for processes running directly on the mount point.

Tested on the 4.8.z version (rgmanager-1.9.87-1.el4_8.3), but other versions are probably affected as well.
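For illustration, the shape of the problem (the pattern below is only a stand-in for the one at line 733, not the real code): a pattern that requires a trailing "/" after the mount point never matches a process whose working directory is the mount point itself, while a pattern that also accepts the bare mount point catches both cases.

  mp="/mnt/vedder0"

  # Requiring a trailing "/" misses a process sitting directly on the mount point:
  echo "/mnt/vedder0"           | grep -E "^${mp}/"         # no match -> process survives
  # Accepting the bare mount point as well catches both cases:
  echo "/mnt/vedder0"           | grep -E "^${mp}(/.*)?\$"   # matches
  echo "/mnt/vedder0/some/file" | grep -E "^${mp}(/.*)?\$"   # matches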
I think maybe a better approach to this whole thing is to drop lsof support and just use 'fuser -kvm'.
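For reference, a rough sketch of what the fuser-based approach looks like (the exact invocation and error handling in the actual patch may differ):

  # Kill (-k) every process using the filesystem mounted at the given
  # mount point (-m), listing what was killed (-v). With -k, fuser sends
  # SIGKILL by default; a different signal can be chosen with -SIGNAL
  # (e.g. -TERM).
  mp="/mnt/vedder0"
  fuser -kvm "$mp"

Because fuser matches by filesystem rather than by parsing paths, this sidesteps both problems above: look-alike paths such as /tmp/media/demo1 are not matched, and a process sitting directly on the mount point is.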
Created attachment 406649 [details] Patch to use fuser instead.
Created attachment 406650 [details] Patched fs.sh
Created attachment 406652 [details]
Automatic test case.

This test case requires:
- gcc
- fs.sh

Copy it into /usr/share/cluster, then:
  cd /usr/share/cluster
  ./555901-test.sh
Updated build addresses clusterfs/netfs.sh force_unmount holes.
Tested on clusterfs and fs. All processes accessing the mount points were killed, and none of the others (including those running in the example given by the reporter) were killed.
Reverting the state back. It's working, but there is no errata yet. Wrong bz# :/.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Previously, when force_unmount was enabled, the file system agent could kill a process whose application was using a mount point with a name similar to a mount point managed by rgmanager. With this update, the file system agent kills only the processes that access the mount point managed by rgmanager.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0264.html