Bug 555901

Summary: fs.sh can kill processes that are not on the mount point which is being unmounted
Product: [Retired] Red Hat Cluster Suite
Reporter: Shane Bradley <sbradley>
Component: rgmanager
Assignee: Lon Hohberger <lhh>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: urgent
Priority: urgent
Version: 4
CC: bmr, cluster-maint, djansa, fnadge, iannis, jkortus, jwest, rbinkhor, rrajaram, tao, tdunnon
Target Milestone: rc
Keywords: ZStream
Hardware: All
OS: Linux
Fixed In Version: rgmanager-1.9.87-1.4.el4
Doc Type: Bug Fix
Doc Text:
Previously, the file system agent could kill a process when an application used a mount point with a similar name to a mount point managed by rgmanager using force_unmount. With this update, the file system agent kills only the processes that access the mount point managed by rgmanager.
Cloned To: 582754 (view as bug list)
Bug Blocks: 485811, 572246, 572248, 582754
Last Closed: 2011-02-16 15:08:24 UTC
Attachments:
  Patch to fix the killing incorrect process
  fs.sh patch applied for RHEL4
  Patch to use fuser instead.
  Patched fs.sh
  Automatic test case.

Description Shane Bradley 2010-01-15 21:12:56 UTC
Description of problem:

When stopping a service that contains a filesystem resource managed
by fs.sh, the stop operation can kill processes that are not located
on the mount point being unmounted.

For example, if the service is defined as follows:
<service name="demo1" recovery="disable">
  <fs device="/dev/sda1" force_fsck="0" force_unmount="1"
  fsid="7880" fstype="ext3" mountpoint="/media/demo1"
  name="demo1fs" options="" self_fence="0"/>
</service>

There were a couple of scenarios in which a process that was not on
that mount point would be killed when the service was stopped:

These processes should not have been killed:
$ less /tmp/media/demo1/tmp.txt
$ less /tmp/test\ /media/demo1/tmp.txt

This process was, and should have been, killed:
$ less /media/demo1/tmp.txt

As you can see in the logs, 3 processes were killed when only 1 should have been:
Jan 15 21:03:03 rh4node-single clurgmgrd: [18978]: <notice> Forcefully unmounting /media/demo1
Jan 15 21:03:03 rh4node-single clurgmgrd: [18978]: <warning> killing process 19428 (root less /media/demo1)
Jan 15 21:03:03 rh4node-single clurgmgrd: [18978]: <warning> killing process 19444 (root less /media/demo1)
Jan 15 21:03:03 rh4node-single clurgmgrd: [18978]: <warning> killing process 19463 (root less /media/demo1)
Jan 15 21:03:08 rh4node-single clurgmgrd: [18978]: <info> unmounting /media/demo1
Jan 15 21:03:08 rh4node-single clurgmgrd[18978]: <notice> Service demo1 is disabled
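
For reference, this kind of over-match happens when the agent scans
the lsof output for the mountpoint string without anchoring it; a
minimal sketch of the failure mode (hypothetical, not the literal
fs.sh code):

  mp=/media/demo1
  # An unanchored substring match flags /tmp/media/demo1 and
  # "/tmp/test /media/demo1" just as readily as the real /media/demo1:
  lsof 2>/dev/null | grep "$mp" | awk '{ print $2 }'   # PIDs that would be killed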


I will attach a patch for fs.sh, along with a copy of fs.sh with the
patch applied, that fixes this issue. The following scenarios have
been tested, and only the process that should have been killed was
killed (a sketch of the anchored-match idea follows the list):
  $ less /tmp/media/demo1/tmp.txt
  $ less /media/demo1\ 2/tmp.txt
  $ less /tmp/test\ /media/demo1/tmp.txt
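
The gist of the fix is to anchor the match: a path counts only when
it is exactly the mountpoint or begins with the mountpoint followed
by "/". A rough sketch of that rule, assuming lsof output with NAME
as the last field (the attached patch is the authoritative change):

  mp=/media/demo1
  # Accept NAME fields equal to $mp or beginning with "$mp/"; a path
  # that merely contains "$mp" somewhere no longer matches.
  # Caveat: NAME fields with embedded spaces (e.g. "/tmp/test /media/...")
  # defeat naive field splitting, which is why the real patch has to be
  # more careful than this one-liner.
  lsof 2>/dev/null | awk -v mp="$mp" '$NF == mp || index($NF, mp "/") == 1 { print $2 }'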


Version-Release number of selected component (if applicable):
rgmanager-1.9.87-1.el4

How reproducible:
Every time

Steps to Reproduce:
1. Create a service like the one defined above.
2. Start the service: $ clusvcadm -e demo1
3. Create the following files, then open each of the three tmp.txt files with less in a separate terminal:
   $ mkdir -p /tmp/media/demo1
   $ mkdir -p /tmp/test\ /media/demo1
   $ echo "test" > /tmp/media/demo1/tmp.txt
   $ echo "test" > /tmp/test\ /media/demo1/tmp.txt
   $ echo "test" > /media/demo1/tmp.txt
   $ less /tmp/media/demo1/tmp.txt
   $ less /tmp/test\ /media/demo1/tmp.txt
   $ less /media/demo1/tmp.txt
4. Stop the service: $ clusvcadm -d demo1

Actual results:
After the service is stopped, 2 processes that were not on the mount
point being unmounted are killed.

Expected results:
Only processes on the mount point should be killed.

Additional info:
This bz should be cloned to RHEL5 because it is going to have the same
bug.

Comment 1 Shane Bradley 2010-01-15 21:13:49 UTC
Created attachment 384713 [details]
Patch to fix the killing incorrect process

Comment 2 Shane Bradley 2010-01-15 21:14:46 UTC
Created attachment 384714 [details]
fs.sh patch applied for RHEL4

This patched fs.sh was only tested on RHEL4, so I am not sure about RHEL5, but it should be close to the same.

Comment 7 Shane Bradley 2010-02-16 17:54:57 UTC
I tested the reproducer outlined in the summary of this BZ on RHEL5
with the patched fs.sh that Lon gave me for RHEL5, and it PASSED.

I tested the reproducer outlined in the summary of this BZ on RHEL4
with the patched fs.sh that Lon gave me for RHEL4, and it PASSED.

--sbradley

Comment 18 Jaroslav Kortus 2010-04-14 09:56:44 UTC
This patch has an additional side effect: it does not kill processes that are directly on the service mountpoint.

Try this:
  <rm>
    <resources>
      <clusterfs device="/dev/vedder/vedder0" force_unmount="1" self_fence="0" fstype="gfs" mountpoint="/mnt/vedder0" name="vedderfs" options=""/>
    </resources>
    <service autostart="1" name="jkservice">
      <clusterfs ref="vedderfs"/>
    </service>
  </rm>

Then run a bash on /mnt/vedder0 and make it ignore the signal:
trap "" SIGTERM

Now the service migration will fail:
Apr 13 12:29:26 z2 clurgmgrd: [27562]: <notice> Forcefully unmounting /mnt/vedder0
Apr 13 12:29:30 z2 clurgmgrd: [27562]: <notice> Forcefully unmounting /mnt/vedder0
Apr 13 12:29:31 z2 clurgmgrd: [27562]: <err> 'umount /mnt/vedder0' failed, error=0
Apr 13 12:29:31 z2 clurgmgrd[27562]: <notice> stop on clusterfs "vedderfs" returned 2 (invalid argument(s))
Apr 13 12:29:31 z2 clurgmgrd[27562]: <crit> #12: service:jkservice failed to stop; intervention required
Apr 13 12:29:31 z2 clurgmgrd[27562]: <notice> Service jkservice is failed

The root cause is probably the regex at line 733, which expects every process in the lsof output to have the mountpoint listed with a trailing "/"; that is not the case for processes running directly on the mountpoint.
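
In other words, a pattern along these lines only catches paths under
the mountpoint, never a process whose NAME is the bare mountpoint
itself (a hypothetical sketch of the pattern, not the literal line 733):

  # Misses a bash whose cwd is exactly /mnt/vedder0, since lsof reports
  # that NAME without a trailing slash:
  lsof 2>/dev/null | grep "/mnt/vedder0/"
  # It would also need to accept the bare mountpoint, e.g.:
  lsof 2>/dev/null | grep -E "/mnt/vedder0(/|$)"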

Tested on the 4.8.z version (rgmanager-1.9.87-1.el4_8.3), but other versions are probably affected as well.

Comment 19 Lon Hohberger 2010-04-14 20:45:23 UTC
I think maybe a better approach to this whole thing is to drop lsof support and just use 'fuser -kvm'.
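
Since fuser -m selects every process using the filesystem mounted on
the given path, the string-matching pitfalls above go away entirely.
A minimal sketch of the idea, assuming the standard psmisc fuser (the
actual patch is attached below):

  mp=/mnt/vedder0
  # -m: processes using the filesystem mounted on $mp
  # -k: kill them (SIGKILL unless another signal is given)
  # -v: list what was matched
  fuser -TERM -kvm "$mp"     # polite first pass with SIGTERM
  sleep 2
  fuser -kvm "$mp"           # SIGKILL anything that ignored it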

Comment 20 Lon Hohberger 2010-04-14 21:36:07 UTC
Created attachment 406649 [details]
Patch to use fuser instead.

Comment 21 Lon Hohberger 2010-04-14 21:36:54 UTC
Created attachment 406650 [details]
Patched fs.sh

Comment 22 Lon Hohberger 2010-04-14 21:38:37 UTC
Created attachment 406652 [details]
Automatic test case.

This test case requires:

- gcc
- fs.sh

Copy it into /usr/share/cluster, then run it:
  cd /usr/share/cluster
  ./555901-test.sh

Comment 23 Lon Hohberger 2010-06-28 18:37:04 UTC
The updated build also addresses force_unmount holes in clusterfs.sh and netfs.sh.

Comment 24 Jaroslav Kortus 2010-07-16 12:53:02 UTC
Tested on clusterfs and fs. All processes accessing the mountpoints were killed, and none of the others (including those running in the example given by the reporter) were killed.

Comment 25 Jaroslav Kortus 2010-07-16 12:56:03 UTC
Reverting the state back. It's working, but there is no errata yet. Wrong bz# :/

Comment 28 Florian Nadge 2011-01-03 14:06:41 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, the file system agent could kill a process when an application used a mount point with a similar name to a mount point managed by rgmanager using force_unmount. With this update, the file system agent kills only the processes that access the mount point managed by rgmanager.

Comment 29 errata-xmlrpc 2011-02-16 15:08:24 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0264.html