Bug 1089004 - Cluster FS resource force unmount and linked shared objects.
Summary: Cluster FS resource force unmount and linked shared objects.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: resource-agents
Version: 6.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: David Vossel
QA Contact: Cluster QE
URL:
Whiteboard:
Duplicates: 1118358
Depends On:
Blocks:
Reported: 2014-04-17 16:05 UTC by John Boero
Modified: 2016-01-27 01:48 UTC (History)

Fixed In Version: resource-agents-3.9.5-9.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-10-14 05:00:52 UTC
Target Upstream Version:
Embargoed:


Attachments
unit test (2.07 KB, application/x-shellscript)
2014-06-20 15:38 UTC, David Vossel


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) | 1462793 | 0 | None | None | None | Never
Red Hat Product Errata | RHBA-2014:1428 | 0 | normal | SHIPPED_LIVE | resource-agents bug fix and enhancement update | 2014-10-14 01:06:18 UTC

Description John Boero 2014-04-17 16:05:55 UTC
Description of problem:
Customers with shared filesystems see occasional fencing that reboots a node during a service move.  Forced unmount appears to be problematic at several customers, and I've noticed that the script that performs it has incomplete kill logic.  fs-lib.sh has a kill_procs_using_mount function that kills all processes with open files on the mount point it is trying to unmount.  It finds them by searching the symlinks under /proc/*, which do not cover shared objects/libraries that a process has mapped.  If a process running from another mount point is linked to a *.so on the shared storage, that library stays open and the unmount fails.  Then the node fences and restarts.

If we change one line in fs-lib.sh to use lsof instead of searching /proc/*, the problem goes away and all processes are successfully killed before the unmount.  lsof shows all open files, including linked shared libraries.

In our case a DBA user was running firefox on our primary node with $LD_LIBRARY_PATH set by the Oracle environment.  Firefox linked to libraries inside the shared Oracle mount without the user being aware of it.

from rgmanager version EL6_5.6 (current for RHEL 6.5)
OLD:
fs-lib.sh:278:  procs=$(find /proc/[0-9]*/ -type l -lname "${mp}/*" -or -lname "${mp}" 2>/dev/null | awk -F/ '{print $3}' | uniq)

FIX:
fs-lib.sh:278:  procs=$(lsof -t $mp | uniq)
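
A minimal sketch of the gap described above, assuming GNU find and awk. The helper names and the proc_root parameter are hypothetical and not part of fs-lib.sh; proc_root exists only so the two lookups can be tried against a fake tree instead of the live /proc. The first helper mirrors the symlink search in the OLD line, the second reads each process's maps file, which is where mapped shared libraries show up.

```shell
# Hypothetical helpers for illustration only; not taken from fs-lib.sh.

# PIDs that reference the mount point through a symlink (fd, cwd, exe, root).
pids_by_symlink() {
    local mp=$1 proc_root=${2:-/proc} dir
    for dir in "$proc_root"/[0-9]*; do
        if find "$dir" -type l \( -lname "$mp" -o -lname "$mp/*" \) 2>/dev/null | grep -q .; then
            basename "$dir"
        fi
    done
}

# PIDs that have a file under the mount point mapped into memory.
# Field 6 of a maps line is the mapped path (paths with spaces are not handled).
pids_by_maps() {
    local mp=$1 proc_root=${2:-/proc} dir
    for dir in "$proc_root"/[0-9]*; do
        if awk -v mp="$mp" '$6 == mp || index($6, mp "/") == 1 { found = 1 }
                            END { exit !found }' "$dir/maps" 2>/dev/null; then
            basename "$dir"
        fi
    done
}
```

On a live node, comparing 'pids_by_symlink /SHARED/PATH' with 'pids_by_maps /SHARED/PATH' shows the difference: a PID that appears only in the second list holds the mount purely through mapped libraries.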

Version-Release number of selected component (if applicable):
RHEL 6.5 current as of report

How reproducible:
ALWAYS

Steps to Reproduce:
1. Dynamically link to a shared object on clustered storage.  To force this, copy some libraries to active clustered storage:  cp -r /usr/lib64 /SHARED/PATH
2. export LD_LIBRARY_PATH=/SHARED/PATH
3. From a location outside of the clustered storage, run bash or some other process that requires a .so from /usr/lib64.
4. From another session, try to stop or move the storage in the cluster.

Actual results:
Forced unmount (which should be enabled) fails and the node fences.  Files on the mount are still open: the shared objects used by the bash process.


Expected results:
The outside bash process should get a kill -15 or kill -9 to exit and free all its shared object files.  Then the cluster should gracefully unmount storage.
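
The expected escalation can be sketched as follows. This is an illustration under stated assumptions, not the agent's actual code: the function name term_then_kill and the grace-period parameter are hypothetical.

```shell
# Hypothetical helper: ask a process to exit, then force it after a grace period.
term_then_kill() {
    local pid=$1 grace=${2:-5} waited=0
    # kill -15 first; if the process is already gone there is nothing to do.
    kill -TERM "$pid" 2>/dev/null || return 0
    # Poll until the process disappears or the grace period runs out.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$grace" ]; then
            kill -KILL "$pid" 2>/dev/null   # kill -9 as a last resort
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

Sending SIGTERM first gives the process a chance to close its files cleanly; SIGKILL is reserved for processes that ignore it.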


Additional info:
This is a one-line fix, but it may add lsof to the HA dependencies.  If that's OK, I suggest the logic be changed as above.  In testing with lsof, we have no problem: we have our user bring up firefox linked to shared storage, we move the storage, lsof finds their firefox process using a library on the shared storage, and the script gracefully sends it a kill -15.  Then our storage gracefully unmounts.  Note that lsof may block if network storage is slow or down.  Also note that the customer's current RHEL 6.5 has lsof version 4.82, which allows 'lsof -t /path' to get all PIDs using /path.  Later versions of lsof (4.87) require 'lsof -t +d /path' for the same functionality.

Please research.

Thanks!
John Boero
Platform Consultant NA

Comment 1 John Boero 2014-04-17 16:10:09 UTC
Note that this problem seems to have been introduced in RHEL 6.5.  RHEL 6.4 appears to use fuser instead of find /proc.

Comment 3 David Vossel 2014-04-17 17:56:45 UTC
This logic has gone through several iterations for various reasons. We'll have to go back through the history to see why we ended up scanning the /proc/ directory before we can determine whether 'lsof' usage is safe or not.

-- Vossel

Comment 4 David Vossel 2014-05-08 19:53:56 UTC
We can't use lsof or fuser because of this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1014298

Those tools use functions that can potentially block the file system resource-agents when NFS is in use somewhere on the system.  There is an lsof '-b' option that apparently tries to reduce the potential for blocking, but after doing an strace I'm not convinced that blocking will not occur.

As an alternative we can continue to use our custom shell logic and expand it to handle the use case where the process is using shared libraries that exist on the mount point being unmounted.

This is the logic I've introduced.

grep " ${mp}" /proc/[0-9]*/maps | awk -F/ '{print $3}' | uniq

The full patch can be found in this upstream pull request.

https://github.com/ClusterLabs/resource-agents/pull/422
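
For readers following the one-liner above: when grep is given several maps files it prefixes each hit with the file name, so a hit looks like "/proc/<pid>/maps:<mapping line>", and splitting on "/" leaves the PID in the third field. The sketch below parses the same kind of lines. It is an illustration, not the code in the pull request; the function name is hypothetical, and the stricter prefix check (so that /shared does not also match /shared2) is an assumption added for the example.

```shell
# Hypothetical helper: read grep-style hits such as
#   /proc/123/maps:7f00-7f01 r-xp 00000000 fd:01 11 /shared/lib64/libc.so.6
# on stdin and print each PID whose mapped path is the mount point or below it.
pids_from_maps_lines() {
    local mp=$1
    awk -v mp="$mp" '
        {
            path = $NF                      # mapped file path is the last field
            if (path == mp || index(path, mp "/") == 1) {
                split($1, parts, "/")       # $1 is /proc/<pid>/maps:<addr-range>
                print parts[3]
            }
        }' | uniq                           # hits for one PID are adjacent
}
```

It could be fed with something like: grep " ${mp}" /proc/[0-9]*/maps | pids_from_maps_lines "${mp}"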

-- Vossel

Comment 7 David Vossel 2014-06-20 15:38:33 UTC
Created attachment 910804
unit test

I've attached a unit test. This test requires the bind-mount agent that was introduced in this bugzilla.

https://bugzilla.redhat.com/show_bug.cgi?id=1094789

Make sure you are using a build that has that new agent in it. This test is going to copy shared libraries over to a bind mount then execute commands that use those shared libraries.


# ./bind-test_v3.sh
.
.
.
.
PASSED

Comment 8 David Vossel 2014-07-14 15:20:01 UTC
*** Bug 1118358 has been marked as a duplicate of this bug. ***

Comment 9 michal novacek 2014-07-29 12:39:51 UTC
I have verified with the unit test from comment 7 that processes blocking the unmount of the file system are correctly identified and killed with resource-agents-3.9.5-11.el6.x86_64.

-----
# rpm -q resource-agents 
resource-agents-3.9.5-11.el6.x86_64

# ./bind-test_v3.sh 
umount: /root/testsrc: not mounted
umount: /root/target: not mounted
<debug>  Checking fs "testrsc", Level 
[bind-mount.sh] Checking fs "testrsc", Level 
<err>    default: /root/testsrc is not mounted on /root/target
[bind-mount.sh] default: /root/testsrc is not mounted on /root/target
<info>   /root/testsrc is not mounted
[bind-mount.sh] /root/testsrc is not mounted
<info>   mounting /root/testsrc on /root/target
[bind-mount.sh] mounting /root/testsrc on /root/target
<debug>  mount  -o bind /root/testsrc /root/target
[bind-mount.sh] mount  -o bind /root/testsrc /root/target
<debug>  Checking fs "testrsc", Level 
[bind-mount.sh] Checking fs "testrsc", Level 
<debug>  /root/testsrc already mounted
[bind-mount.sh] /root/testsrc already mounted
<info>   unmounting /root/target
[bind-mount.sh] unmounting /root/target
<info>   /root/testsrc is not mounted
[bind-mount.sh] /root/testsrc is not mounted
<debug>  Checking fs "testrsc", Level 
[bind-mount.sh] Checking fs "testrsc", Level 
<err>    default: /root/testsrc is not mounted on /root/target
[bind-mount.sh] default: /root/testsrc is not mounted on /root/target
<info>   mounting /root/testsrc on /root/target
[bind-mount.sh] mounting /root/testsrc on /root/target
<debug>  mount  -o bind /root/testsrc /root/target
[bind-mount.sh] mount  -o bind /root/testsrc /root/target
starting bg process to hold target file open
this is test file
starting bg process to hold target file open
this is test file
<debug>  Checking fs "testrsc", Level 
[bind-mount.sh] Checking fs "testrsc", Level 
<info>   unmounting /root/target
[bind-mount.sh] unmounting /root/target
umount: /root/target: device is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))
<debug>  umount failed: 1
[bind-mount.sh] umount failed: 1
<warning>Sending SIGTERM to processes on /root/target
[bind-mount.sh] Sending SIGTERM to processes on /root/target
<info>   unmounting /root/target
[bind-mount.sh] unmounting /root/target
./bind-test_v3.sh: line 31: 17793 Terminated              tail -f "$target/testfile"
./bind-test_v3.sh: line 31: 17798 Terminated              tail -f "$tmpfile"
./bind-test_v3.sh: line 31: 17799 Terminated              tail -f "$tmpfile"
./bind-test_v3.sh: line 31: 17800 Terminated              tail -f "$tmpfile"
./bind-test_v3.sh: line 31: 17801 Terminated              tail -f "$target/testfile"
<debug>  Checking fs "testrsc", Level 
[bind-mount.sh] Checking fs "testrsc", Level 
<err>    default: /root/testsrc is not mounted on /root/target
[bind-mount.sh] default: /root/testsrc is not mounted on /root/target
PASSED

Comment 10 errata-xmlrpc 2014-10-14 05:00:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1428.html

