Bug 254111 - rgmanager umounts shared partition gfs although its used by another service
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: rgmanager
Version: 5.0
Hardware: All
OS: Linux
Priority: low
Severity: low
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Reported: 2007-08-24 03:46 EDT by Herbert L. Plankl
Modified: 2009-04-16 18:55 EDT
CC List: 1 user

Fixed In Version: RHBA-2008-0353
Doc Type: Bug Fix
Last Closed: 2008-05-21 10:30:32 EDT


Attachments
Preliminary patch (5.48 KB, patch)
2007-09-18 16:46 EDT, Lon Hohberger
Adds reference count support to rgmanager & refcount checking to clusterfs.sh (4.03 KB, patch)
2007-12-19 15:59 EST, Lon Hohberger

Description Herbert L. Plankl 2007-08-24 03:46:51 EDT
Description of problem:
I have a two-node cluster and 2 services using the same shared GFS partition at
the same mountpoint. Both services are running on one member. The
force_unmount="1" option is set for this resource. If I relocate one
service, the GFS partition gets unmounted although the other service running on
this member uses the partition too. After a few seconds, rgmanager recognizes
that the service running on this node has lost its GFS partition and remounts it.

The log-output is:
Aug 24 09:31:50 rhel5n1 clurgmgrd[6337]: <notice> Stopping service
service:testscript1
Aug 24 09:31:51 rhel5n1 avahi-daemon[2400]: Withdrawing address record for
192.168.111.221 on bond0.
Aug 24 09:32:02 rhel5n1 clurgmgrd[6337]: <notice> Service service:testscript1 is
stopped
Aug 24 09:32:05 rhel5n1 clurgmgrd[6337]: <notice> Service service:testscript1 is
now running on member 2
Aug 24 09:32:55 rhel5n1 clurgmgrd[6337]: <notice> status on clusterfs
"opt_icoserve" returned 1 (generic error)
Aug 24 09:32:55 rhel5n1 clurgmgrd[6337]: <warning> Some independent resources in
service:testscript2 failed; Attempting inline recovery
Aug 24 09:32:56 rhel5n1 avahi-daemon[2400]: Withdrawing address record for
192.168.111.222 on bond0.
Aug 24 09:33:10 rhel5n1 kernel: Trying to join cluster "lock_dlm",
"forty_two:opt_icoserve"
Aug 24 09:33:10 rhel5n1 kernel: Joined cluster. Now mounting FS...
Aug 24 09:33:10 rhel5n1 kernel: GFS: fsid=forty_two:opt_icoserve.1: jid=1:
Trying to acquire journal lock...
Aug 24 09:33:10 rhel5n1 kernel: GFS: fsid=forty_two:opt_icoserve.1: jid=1:
Looking at journal...
Aug 24 09:33:10 rhel5n1 kernel: GFS: fsid=forty_two:opt_icoserve.1: jid=1: Done
Aug 24 09:33:10 rhel5n1 kernel: GFS: fsid=forty_two:opt_icoserve.1: Scanning for
log elements...
Aug 24 09:33:12 rhel5n1 kernel: GFS: fsid=forty_two:opt_icoserve.1: Found 0
unlinked inodes
Aug 24 09:33:12 rhel5n1 kernel: GFS: fsid=forty_two:opt_icoserve.1: Found quota
changes for 0 IDs
Aug 24 09:33:12 rhel5n1 kernel: GFS: fsid=forty_two:opt_icoserve.1: Done
Aug 24 09:33:13 rhel5n1 avahi-daemon[2400]: Registering new address record for
192.168.111.222 on bond0.
Aug 24 09:33:14 rhel5n1 clurgmgrd[6337]: <notice> Inline recovery of
service:testscript2 succeeded


Version-Release number of selected component (if applicable):
cman-2.0.70-1.el5
rgmanager-2.0.28-1.el5
system-config-cluster-1.0.50-1.2
ricci-0.10.0-2.el5
luci-0.10.0-2.el5


How reproducible:
* 2-node cluster
* 2 services running on the same node using the same shared GFS partition.
* force_unmount="1" set
* cluster.conf:
                <resources>
                        <clusterfs device="/dev/icoserve/opt_icoserve" force_unmount="1"
                                   fsid="63088" fstype="gfs" mountpoint="/opt/icoserve"
                                   name="opt_icoserve" options=""/>
                </resources>
                <service autostart="1" domain="n1_n2" exclusive="0"
                         name="testscript1" recovery="restart">
                        <ip address="192.168.111.221" monitor_link="1"/>
                        <script file="/etc/init.d/testscript1" name="testscript1"/>
                        <clusterfs ref="opt_icoserve"/>
                </service>
                <service autostart="1" domain="n1_n2" exclusive="0"
                         name="testscript2" recovery="restart">
                        <ip address="192.168.111.222" monitor_link="1"/>
                        <script file="/etc/init.d/testscript2" name="testscript2"/>
                        <clusterfs ref="opt_icoserve"/>
                </service>


Steps to Reproduce:
1. relocate one service
2. watch the logfile
3. watch output of "mount"
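
For reference, relocation can be triggered and observed with the standard rgmanager
tools, for example (the target member name below is made up):

    # relocate testscript1 to the other cluster member
    clusvcadm -r testscript1 -m rhel5n2

    # on the node the service left, watch the log and the mount table
    tail -f /var/log/messages
    mount | grep /opt/icoserve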
  
Actual results:
Partition gets unmounted although another service on this node is using it.

Expected results:
The partition gets unmounted only if no other service on this node is using it. It's
a shared resource, not a private one.

Additional info: This is important for us, because we (will) have a few services
using the same partition. But if no service is running on a node while the
partition remains mounted, clvmd cannot be stopped, and therefore cman cannot be
stopped either. So we have to use force_unmount="1".
Comment 1 Lon Hohberger 2007-09-18 16:10:44 EDT
The simplest way to fix this is to make the resource agent maintain a reference
count after mounting occurs.
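
A minimal sketch of that idea in plain shell (helper names and the reference count
store location are assumptions for illustration, not the attached patch):

    # illustrative only - the real logic belongs in clusterfs.sh
    REFDIR=/var/run/cluster/mounts
    key=$(echo "$OCF_RESKEY_mountpoint" | tr / _)

    ref_inc() {
        # start path: record one more user of the mount point after mounting
        mkdir -p "$REFDIR"
        echo "$OCF_RESKEY_name" >> "$REFDIR/$key"
    }

    ref_dec() {
        # stop path: drop one reference and print how many remain
        sed -i '1d' "$REFDIR/$key" 2>/dev/null
        wc -l < "$REFDIR/$key" 2>/dev/null || echo 0
    }

    stop_mount() {
        if [ "$(ref_dec)" -gt 0 ]; then
            # another service still references the mount point - leave it mounted
            return 0
        fi
        umount "$OCF_RESKEY_mountpoint"
    }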
Comment 2 Lon Hohberger 2007-09-18 16:46:16 EDT
Created attachment 198841 [details]
Preliminary patch

This allows clusterfs.sh to maintain a reference count after mount/before
unmount.  The caveat which applies here is that the services using a shared
clusterfs resource should be disabled if mount options change.
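
In practice that caveat means disabling the affected services before changing the
mount options and re-enabling them afterwards, for example (service names taken
from this report; the propagation command is just one option):

    clusvcadm -d testscript1
    clusvcadm -d testscript2
    # edit the clusterfs options in /etc/cluster/cluster.conf and propagate the
    # new config, e.g. with: ccs_tool update /etc/cluster/cluster.conf
    clusvcadm -e testscript1
    clusvcadm -e testscript2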
Comment 3 Lon Hohberger 2007-09-20 09:42:59 EDT
Specifying force_unmount should not be required if reference counts are used. 
This patch still requires force_unmount, so it is a bit inadequate.
Comment 4 Lon Hohberger 2007-09-24 16:31:50 EDT
Does this script solve the problem for you?
Comment 5 Herbert L. Plankl 2007-09-25 03:41:14 EDT
I'm sorry - how can I test that? I'm using rgmanager-2.0.28-1.el5. The files
referenced in the patch are not available on my system, and no source package is
available on RHN. Can you upload the scripts instead of the patch?
Comment 6 Lon Hohberger 2007-09-25 09:28:45 EDT
The reference count store location needs to be configurable so that the changes
to /etc/init.d/rgmanager are not needed.

I'll fix that and build test packages.
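
One way that could look in the agent, sketched with an assumed variable name (not
the final implementation):

    # let the reference count store be overridden without touching /etc/init.d/rgmanager
    REFDIR=${RGMGR_REFCNT_DIR:-/var/run/cluster/mounts}
    mkdir -p "$REFDIR"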
Comment 7 RHEL Product and Program Management 2007-10-15 23:44:32 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 8 Kiersten (Kerri) Anderson 2007-11-19 14:57:26 EST
All Cluster version 5 defects should be reported under the Red Hat Enterprise Linux
5 product name, not Cluster Suite.
Comment 9 Lon Hohberger 2007-12-19 15:59:34 EST
Created attachment 290066 [details]
Adds reference count support to rgmanager & refcount checking to clusterfs.sh
Comment 12 errata-xmlrpc 2008-05-21 10:30:32 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0353.html
