Bug 208656 - nfs service relocation failed, umount suspected
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: rgmanager
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Lon Hohberger
QA Contact: Cluster QE
URL:
Whiteboard:
Keywords:
Depends On:
Blocks:
Reported: 2006-09-29 21:47 UTC by Nate Straz
Modified: 2009-07-22 19:27 UTC
Clone Of:
Last Closed: 2006-10-11 16:44:31 UTC


Attachments
Adds killing of lockd to clusterfs teardown when nfslock=1 (645 bytes, patch)
2006-10-05 20:41 UTC, Lon Hohberger


External Trackers
Tracker: Red Hat Product Errata
ID: RHBA-2006:0712
Priority: normal
Status: SHIPPED_LIVE
Summary: rgmanager bug fix update
Last Updated: 2006-10-11 16:44:28 UTC

Description Nate Straz 2006-09-29 21:47:32 UTC
Description of problem:

While running a relocation test case (derringer), service relocation from
link-14 to link-13 failed because the service could not be stopped on link-14.

Version-Release number of selected component (if applicable):
rgmanager-1.9.53-0

How reproducible:
20%

Steps to Reproduce:
1. Set up a GFS file system as an NFS cluster resource
2. Relocate the service between nodes
3. Wait
4. Go to step 2
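
The loop in the steps above could be scripted roughly as follows. This is only a sketch, not the derringer test case itself: the iteration count, the delay, and the function and variable names are assumptions made for illustration, while the service and node names are taken from this report. It relies on `clusvcadm -r <service> -m <member>` to request each relocation.

```shell
#!/bin/sh
# Hypothetical reproduction loop (not the actual derringer test).
# Every default below is an assumption and can be overridden from the environment.
SERVICE=${SERVICE:-nfs_service}   # service name as seen in this report
NODE_A=${NODE_A:-link-13}         # first relocation target
NODE_B=${NODE_B:-link-14}         # second relocation target
ITERATIONS=${ITERATIONS:-50}      # assumed; the report gives no count
DELAY=${DELAY:-30}                # assumed wait in seconds between relocations
RELOCATE=${RELOCATE:-clusvcadm}   # command used to request the relocation

relocate_loop() {
    i=0
    target=$NODE_A
    while [ "$i" -lt "$ITERATIONS" ]; do
        echo "iteration $i: relocating $SERVICE to $target"
        # Stop at the first failed relocation so the logs can be inspected.
        if ! $RELOCATE -r "$SERVICE" -m "$target"; then
            echo "relocation to $target failed on iteration $i" >&2
            return 1
        fi
        sleep "$DELAY"
        # Alternate between the two nodes.
        if [ "$target" = "$NODE_A" ]; then
            target=$NODE_B
        else
            target=$NODE_A
        fi
        i=$((i + 1))
    done
}
```

After sourcing the file on a cluster member, calling `relocate_loop` runs until the first failure. With a failure rate around 20%, a few dozen iterations would normally be enough to hit the problem.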
  
Actual results:

Here is the /var/log/messages output from link-14, where the service was being
stopped.  It looks like rgmanager is trying to umount the file system twice.

Sep 29 16:38:53 link-14 clurgmgrd[7302]: <notice> Stopping service nfs_service
Sep 29 16:38:53 link-14 clurgmgrd: [7302]: <info> Removing IPv4 address 10.15.89.200 from eth1
Sep 29 16:39:03 link-14 clurgmgrd: [7302]: <info> Removing export: *:/mnt/gfs1
Sep 29 16:39:03 link-14 clurgmgrd: [7302]: <warning> Dropping node-wide NFS locks
Sep 29 16:39:04 link-14 clurgmgrd: [7302]: <info> Sending reclaim notifications via link-14
Sep 29 16:39:04 link-14 rpc.statd[6447]: Version 1.0.6 Starting
Sep 29 16:39:04 link-14 rpc.statd[6447]: Flags: No-Daemon Notify-Only
Sep 29 16:39:04 link-14 rpc.statd[6447]: statd running as root. chown /tmp/statd-link-14.6400/sm to choose different user
Sep 29 16:39:07 link-14 rpc.statd[6447]: Caught signal 15, un-registering and exiting.
Sep 29 16:39:07 link-14 clurgmgrd: [7302]: <info> unmounting /dev/mapper/link_ia64-link_ia640 (/mnt/gfs1)
Sep 29 16:39:07 link-14 clurgmgrd: [7302]: <notice> Forcefully unmounting /mnt/gfs1
Sep 29 16:39:12 link-14 clurgmgrd: [7302]: <info> unmounting /dev/mapper/link_ia64-link_ia640 (/mnt/gfs1)
Sep 29 16:39:12 link-14 clurgmgrd: [7302]: <notice> Forcefully unmounting /mnt/gfs1
Sep 29 16:39:12 link-14 clurgmgrd: [7302]: <err> 'umount /dev/mapper/link_ia64-link_ia640' failed (/mnt/gfs1), error=0
Sep 29 16:39:12 link-14 clurgmgrd[7302]: <notice> stop on clusterfs "link_ia640" returned 2 (invalid argument(s))
Sep 29 16:39:12 link-14 clurgmgrd: [7302]: <info> Removing export: *:/mnt/ext3
Sep 29 16:39:12 link-14 clurgmgrd: [7302]: <info> unmounting /mnt/ext3
Sep 29 16:39:12 link-14 clurgmgrd[7302]: <crit> #12: RG nfs_service failed to stop; intervention required
Sep 29 16:39:12 link-14 clurgmgrd[7302]: <notice> Service nfs_service is failed
Sep 29 16:39:13 link-14 clurgmgrd[7302]: <alert> #2: Service nfs_service returned failure code.  Last Owner: link-14
Sep 29 16:39:13 link-14 clurgmgrd[7302]: <alert> #4: Administrator intervention required.
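
To check whether other failed relocations show the same repeated unmount, the forced unmount attempts in the log can be counted per mount point. The helper below is a sketch for illustration only; the function name is hypothetical, and it assumes the clurgmgrd message format shown in the log above, where the mount point is the last field of each "Forcefully unmounting" line.

```shell
#!/bin/sh
# Hypothetical helper: count forced unmount attempts per mount point.
# Assumes the clurgmgrd message format quoted in this report.
count_forced_umounts() {
    # $1: path to the syslog file (defaults to /var/log/messages)
    awk '/clurgmgrd/ && /Forcefully unmounting/ { n[$NF]++ }
         END { for (mp in n) print n[mp], mp }' "${1:-/var/log/messages}"
}
```

A count greater than one for a mount point within a single service stop suggests the first forced unmount did not succeed.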

Expected results:
The file system should umount and relocate to link-13 as expected.

Additional info:

Comment 1 Lon Hohberger 2006-10-05 20:37:00 UTC
On a second pass, it looks like killing lockd isn't dropping all the locks ...
It could be just a consistency issue between fs.sh and clusterfs.sh; I'll look
into that first.  If it isn't, chances are good that there's an open reference
on the file system, preventing umount from succeeding.



Comment 2 Lon Hohberger 2006-10-05 20:40:32 UTC
We're not killing lockd during umount of the cluster file system.

Comment 3 Lon Hohberger 2006-10-05 20:41:36 UTC
Created attachment 137860
Adds killing of lockd to clusterfs teardown when nfslock=1
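
For readers without access to the attachment, the idea can be sketched as below. This is not the attached patch, whose contents are not reproduced in this report. The function name, the overridable `KILL_LOCKD` variable, and the use of `OCF_RESKEY_nfslock` to carry the nfslock setting are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sketch only; NOT the actual patch from attachment 137860.
# Assumption: the resource's nfslock attribute reaches the agent as
# OCF_RESKEY_nfslock. KILL_LOCKD is overridable so the logic can be
# exercised without touching a real lockd.
KILL_LOCKD=${KILL_LOCKD:-"pkill -KILL -x lockd"}

drop_nfs_locks() {
    case "${OCF_RESKEY_nfslock:-}" in
        1|yes)
            echo "Dropping node-wide NFS locks"
            # Signal lockd so it releases the locks it holds; ignore the
            # non-zero status pkill returns when no process matched.
            $KILL_LOCKD || true
            ;;
        *)
            # nfslock not enabled: nothing to do.
            return 0
            ;;
    esac
}
```

In a teardown path, a call to `drop_nfs_locks` would sit just before the unmount attempt, so that locks held through lockd no longer keep the file system busy.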

Comment 4 Lon Hohberger 2006-10-05 20:42:34 UTC
Can you apply this to clusterfs.sh on your test cluster and see if it fixes the
problem?  It should.

Comment 6 Lon Hohberger 2006-10-05 21:59:46 UTC
Fixes in CVS; awaiting rebuild.

Comment 10 Red Hat Bugzilla 2006-10-11 16:44:31 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0712.html


