Bug 555901

Summary: fs.sh can kill processes that are not on the mount point which is being unmounted
Product: [Retired] Red Hat Cluster Suite
Reporter: Shane Bradley <sbradley>
Component: rgmanager
Assignee: Lon Hohberger <lhh>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: urgent
Priority: urgent
Version: 4
CC: bmr, cluster-maint, djansa, fnadge, iannis, jkortus, jwest, rbinkhor, rrajaram, tao, tdunnon
Target Milestone: rc
Keywords: ZStream
Hardware: All
OS: Linux
Fixed In Version: rgmanager-1.9.87-1.4.el4
Doc Type: Bug Fix
Doc Text:
Previously, the file system agent could kill a process when an application used a mount point with a similar name to a mount point managed by rgmanager using force_unmount. With this update, the file system agent kills only the processes that access the mount point managed by rgmanager.
Cloned To: 582754 (view as bug list)
Bug Blocks: 485811, 572246, 572248, 582754
Last Closed: 2011-02-16 15:08:24 UTC
Attachments:
  Patch to fix the killing incorrect process
  fs.sh patch applied for RHEL4
  Patch to use fuser instead.
  Patched fs.sh
  Automatic test case.

Description Shane Bradley 2010-01-15 21:12:56 UTC
Description of problem:

When stopping a service that contains a filesystem resource managed
by fs.sh, the stop operation can kill processes that are not located
on the mount point being unmounted.

For example, if the service is defined as follows:
<service name="demo1" recovery="disable">
  <fs device="/dev/sda1" force_fsck="0" force_unmount="1"
  fsid="7880" fstype="ext3" mountpoint="/media/demo1"
  name="demo1fs" options="" self_fence="0"/>
</service>

There were a couple of scenarios in which a process that was not on
that mount point would be killed when the service was stopped:

These processes should not have been killed:
$ less /tmp/media/demo1/tmp.txt
$ less /tmp/test\ /media/demo1/tmp.txt

This process was, and should have been, killed:
$ less /media/demo1/tmp.txt

As you can see in the logs, 3 processes were killed when only 1 should have been:
Jan 15 21:03:03 rh4node-single clurgmgrd: [18978]: <notice> Forcefully unmounting /media/demo1
Jan 15 21:03:03 rh4node-single clurgmgrd: [18978]: <warning> killing process 19428 (root less /media/demo1)
Jan 15 21:03:03 rh4node-single clurgmgrd: [18978]: <warning> killing process 19444 (root less /media/demo1)
Jan 15 21:03:03 rh4node-single clurgmgrd: [18978]: <warning> killing process 19463 (root less /media/demo1)
Jan 15 21:03:08 rh4node-single clurgmgrd: [18978]: <info> unmounting /media/demo1
Jan 15 21:03:08 rh4node-single clurgmgrd[18978]: <notice> Service demo1 is disabled
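
For reference, this kind of over-match happens when the agent scans
the lsof output for the mountpoint string without anchoring it; a
minimal sketch of the failure mode (hypothetical, not the literal
fs.sh code):

  mp=/media/demo1
  # An unanchored substring match flags /tmp/media/demo1 and
  # "/tmp/test /media/demo1" just as readily as the real /media/demo1:
  lsof 2>/dev/null | grep "$mp" | awk '{ print $2 }'   # PIDs that would be killed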


I will attach a patch for fs.sh, along with a copy of fs.sh with the
patch applied, that fixes this issue. The following scenarios have
been tested, and only the process that should have been killed was
killed (a sketch of the anchored-match idea follows the list):
  $ less /tmp/media/demo1/tmp.txt
  $ less /media/demo1\ 2/tmp.txt
  $ less /tmp/test\ /media/demo1/tmp.txt
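
The gist of the fix is to anchor the match: a path counts only when
it is exactly the mountpoint or begins with the mountpoint followed
by "/". A rough sketch of that rule, assuming lsof output with NAME
as the last field (the attached patch is the authoritative change):

  mp=/media/demo1
  # Accept NAME fields equal to $mp or beginning with "$mp/"; a path
  # that merely contains "$mp" somewhere no longer matches.
  # Caveat: NAME fields with embedded spaces (e.g. "/tmp/test /media/...")
  # defeat naive field splitting, which is why the real patch has to be
  # more careful than this one-liner.
  lsof 2>/dev/null | awk -v mp="$mp" '$NF == mp || index($NF, mp "/") == 1 { print $2 }'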


Version-Release number of selected component (if applicable):
rgmanager-1.9.87-1.el4

How reproducible:
Every time

Steps to Reproduce:
1. Create a service like the one defined above.
2. Start the service: $ clusvcadm -e demo1
3. Create the following files, then open each of the three tmp.txt files with less in a separate terminal:
   $ mkdir -p /tmp/media/demo1
   $ mkdir -p /tmp/test\ /media/demo1
   $ echo "test" > /tmp/media/demo1/tmp.txt
   $ echo "test" > /tmp/test\ /media/demo1/tmp.txt
   $ echo "test" > /media/demo1/tmp.txt
   $ less /tmp/media/demo1/tmp.txt
   $ less /tmp/test\ /media/demo1/tmp.txt
   $ less /media/demo1/tmp.txt
4. Stop the service: $ clusvcadm -d demo1

Actual results:
After the service is stopped, 2 processes that were not on the mount
point being unmounted are killed.

Expected results:
Only processes on the mount point should be killed.

Additional info:
This bz should be cloned to RHEL5 because it is going to have the same
bug.

Comment 1 Shane Bradley 2010-01-15 21:13:49 UTC
Created attachment 384713 [details]
Patch to fix the killing incorrect process

Comment 2 Shane Bradley 2010-01-15 21:14:46 UTC
Created attachment 384714 [details]
fs.sh patch applied for RHEL4

This patched fs.sh was only tested on RHEL4, so I am not sure about RHEL5, but it should be close to the same.

Comment 7 Shane Bradley 2010-02-16 17:54:57 UTC
I tested the reproducer outlined in the summary of this BZ on RHEL5
with the patched fs.sh that Lon gave me for RHEL5, and it PASSED.

I tested the reproducer outlined in the summary of this BZ on RHEL4
with the patched fs.sh that Lon gave me for RHEL4, and it PASSED.

--sbradley

Comment 18 Jaroslav Kortus 2010-04-14 09:56:44 UTC
This patch has an additional side effect: it does not kill processes that are directly on the service mountpoint.

Try this:
  <rm>
    <resources>
      <clusterfs device="/dev/vedder/vedder0" force_unmount="1" self_fence="0" fstype="gfs" mountpoint="/mnt/vedder0" name="vedderfs" options=""/>
    </resources>
    <service autostart="1" name="jkservice">
      <clusterfs ref="vedderfs"/>
    </service>
  </rm>

Then run a bash on /mnt/vedder0 and make it ignore the signal:
trap "" SIGTERM

Now the service migration will fail:
Apr 13 12:29:26 z2 clurgmgrd: [27562]: <notice> Forcefully unmounting /mnt/vedder0
Apr 13 12:29:30 z2 clurgmgrd: [27562]: <notice> Forcefully unmounting /mnt/vedder0
Apr 13 12:29:31 z2 clurgmgrd: [27562]: <err> 'umount /mnt/vedder0' failed, error=0
Apr 13 12:29:31 z2 clurgmgrd[27562]: <notice> stop on clusterfs "vedderfs" returned 2 (invalid argument(s))
Apr 13 12:29:31 z2 clurgmgrd[27562]: <crit> #12: service:jkservice failed to stop; intervention required
Apr 13 12:29:31 z2 clurgmgrd[27562]: <notice> Service jkservice is failed

The root cause is probably the regex at line 733, which expects every process in the lsof output to have the mountpoint listed with a trailing "/"; that is not the case for processes running directly on the mountpoint.
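
In other words, a pattern along these lines only catches paths under
the mountpoint, never a process whose NAME is the bare mountpoint
itself (a hypothetical sketch of the pattern, not the literal line 733):

  # Misses a bash whose cwd is exactly /mnt/vedder0, since lsof reports
  # that NAME without a trailing slash:
  lsof 2>/dev/null | grep "/mnt/vedder0/"
  # It would also need to accept the bare mountpoint, e.g.:
  lsof 2>/dev/null | grep -E "/mnt/vedder0(/|$)"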

Tested on the 4.8.z version (rgmanager-1.9.87-1.el4_8.3), but other versions are probably affected as well.

Comment 19 Lon Hohberger 2010-04-14 20:45:23 UTC
I think maybe a better approach to this whole thing is to drop lsof support and just use 'fuser -kvm'.
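
Since fuser -m selects every process using the filesystem mounted on
the given path, the string-matching pitfalls above go away entirely.
A minimal sketch of the idea, assuming the standard psmisc fuser (the
actual patch is attached below):

  mp=/mnt/vedder0
  # -m: processes using the filesystem mounted on $mp
  # -k: kill them (SIGKILL unless another signal is given)
  # -v: list what was matched
  fuser -TERM -kvm "$mp"     # polite first pass with SIGTERM
  sleep 2
  fuser -kvm "$mp"           # SIGKILL anything that ignored it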

Comment 20 Lon Hohberger 2010-04-14 21:36:07 UTC
Created attachment 406649 [details]
Patch to use fuser instead.

Comment 21 Lon Hohberger 2010-04-14 21:36:54 UTC
Created attachment 406650 [details]
Patched fs.sh

Comment 22 Lon Hohberger 2010-04-14 21:38:37 UTC
Created attachment 406652 [details]
Automatic test case.

This test case requires:

- gcc
- fs.sh

Copy it into /usr/share/cluster, then run it:
  cd /usr/share/cluster
  ./555901-test.sh

Comment 23 Lon Hohberger 2010-06-28 18:37:04 UTC
The updated build also addresses force_unmount holes in clusterfs.sh and netfs.sh.

Comment 24 Jaroslav Kortus 2010-07-16 12:53:02 UTC
Tested on clusterfs and fs. All processes accessing the mountpoints were killed, and none of the others (including those running in the example given by the reporter) were killed.

Comment 25 Jaroslav Kortus 2010-07-16 12:56:03 UTC
Reverting the state back. It's working, but there is no errata yet. Wrong bz# :/

Comment 28 Florian Nadge 2011-01-03 14:06:41 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, the file system agent could kill a process when an application used a mount point with a similar name to a mount point managed by rgmanager using force_unmount. With this update, the file system agent kills only the processes that access the mount point managed by rgmanager.

Comment 29 errata-xmlrpc 2011-02-16 15:08:24 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0264.html