Bug 209544 - nfs service relocation failed, umount suspected
Summary: nfs service relocation failed, umount suspected
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: rgmanager
Version: 5.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Lon Hohberger
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2006-10-05 22:07 UTC by Lon Hohberger
Modified: 2009-04-16 22:36 UTC
1 user

Fixed In Version: 5.0.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-11-28 21:26:41 UTC
Target Upstream Version:
Embargoed:



Description Lon Hohberger 2006-10-05 22:07:43 UTC
+++ This bug was initially created as a clone of Bug #208656 +++

Description of problem:

While running a relocation test case (derringer), service relocation from
link-14 to link-13 failed because the service could not be stopped on link-14.

Version-Release number of selected component (if applicable):
rgmanager-1.9.53-0

How reproducible:
20%

Steps to Reproduce:
1. Set up a GFS file system as an NFS cluster resource.
2. Relocate the service between nodes.
3. Wait.
4. Go to step 2.
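Steps 2-4 can be scripted as a loop. This is a minimal sketch that assumes rgmanager's `clusvcadm` is on the PATH and that `clusvcadm -r <service> -m <member>` is used to relocate. The iteration count, the 30-second wait and the `dry_run` switch are assumptions and are not part of the original report. With `dry_run=True` the script only builds the commands and executes nothing.

```python
import subprocess
import time
from typing import List


def relocate_cmd(service: str, target: str) -> List[str]:
    """Build a clusvcadm command that relocates `service` to `target`."""
    # clusvcadm -r <service> -m <member> asks rgmanager to relocate the service.
    return ["clusvcadm", "-r", service, "-m", target]


def relocation_loop(service: str, nodes: List[str], iterations: int,
                    wait_s: float = 30.0, dry_run: bool = True) -> List[List[str]]:
    """Relocate `service` back and forth between `nodes` (steps 2-4).

    Returns the commands issued. With dry_run=True nothing is executed,
    so the loop can be inspected without a cluster.
    """
    issued: List[List[str]] = []
    for i in range(iterations):
        target = nodes[i % len(nodes)]  # alternate between the nodes
        cmd = relocate_cmd(service, target)
        issued.append(cmd)
        if dry_run:
            continue
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # A failed stop leaves the service in the "failed" state;
            # stop here so the node's logs can be examined.
            print(f"relocation to {target} failed (exit {result.returncode})")
            break
        time.sleep(wait_s)  # step 3: wait (assumed interval)
    return issued


if __name__ == "__main__":
    for cmd in relocation_loop("nfs_service", ["link-13", "link-14"], 4):
        print(" ".join(cmd))
```

Stopping at the first non-zero exit status keeps the failing node in the state described under "Actual results", which is what you want for collecting `/var/log/messages`.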
  
Actual results:

Here is the /var/log/messages output from link-14, where the service was being
stopped.  It looks like rgmanager is trying to umount the file system twice.

Sep 29 16:38:53 link-14 clurgmgrd[7302]: <notice> Stopping service nfs_service
Sep 29 16:38:53 link-14 clurgmgrd: [7302]: <info> Removing IPv4 address 10.15.89.200 from eth1
Sep 29 16:39:03 link-14 clurgmgrd: [7302]: <info> Removing export: *:/mnt/gfs1
Sep 29 16:39:03 link-14 clurgmgrd: [7302]: <warning> Dropping node-wide NFS locks
Sep 29 16:39:04 link-14 clurgmgrd: [7302]: <info> Sending reclaim notifications via link-14
Sep 29 16:39:04 link-14 rpc.statd[6447]: Version 1.0.6 Starting
Sep 29 16:39:04 link-14 rpc.statd[6447]: Flags: No-Daemon Notify-Only
Sep 29 16:39:04 link-14 rpc.statd[6447]: statd running as root. chown /tmp/statd-link-14.6400/sm to choose different user
Sep 29 16:39:07 link-14 rpc.statd[6447]: Caught signal 15, un-registering and exiting.
Sep 29 16:39:07 link-14 clurgmgrd: [7302]: <info> unmounting /dev/mapper/link_ia64-link_ia640 (/mnt/gfs1)
Sep 29 16:39:07 link-14 clurgmgrd: [7302]: <notice> Forcefully unmounting /mnt/gfs1
Sep 29 16:39:12 link-14 clurgmgrd: [7302]: <info> unmounting /dev/mapper/link_ia64-link_ia640 (/mnt/gfs1)
Sep 29 16:39:12 link-14 clurgmgrd: [7302]: <notice> Forcefully unmounting /mnt/gfs1
Sep 29 16:39:12 link-14 clurgmgrd: [7302]: <err> 'umount /dev/mapper/link_ia64-link_ia640' failed (/mnt/gfs1), error=0
Sep 29 16:39:12 link-14 clurgmgrd[7302]: <notice> stop on clusterfs "link_ia640" returned 2 (invalid argument(s))
Sep 29 16:39:12 link-14 clurgmgrd: [7302]: <info> Removing export: *:/mnt/ext3
Sep 29 16:39:12 link-14 clurgmgrd: [7302]: <info> unmounting /mnt/ext3
Sep 29 16:39:12 link-14 clurgmgrd[7302]: <crit> #12: RG nfs_service failed to stop; intervention required
Sep 29 16:39:12 link-14 clurgmgrd[7302]: <notice> Service nfs_service is failed
Sep 29 16:39:13 link-14 clurgmgrd[7302]: <alert> #2: Service nfs_service returned failure code.  Last Owner: link-14
Sep 29 16:39:13 link-14 clurgmgrd[7302]: <alert> #4: Administrator intervention required.
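Repeated unmount attempts like the two for /mnt/gfs1 above can be spotted mechanically. The following is a small sketch that counts clurgmgrd `<info> unmounting` lines per mount point. It handles both line shapes seen in this log: the clusterfs form "unmounting <device> (<mountpoint>)" and the fs form "unmounting <mountpoint>". The regex and function name are illustrative only and are not part of rgmanager.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Matches "<info> unmounting <dev> (<mountpoint>)" and
# "<info> unmounting <mountpoint>", as they appear in the log above.
UNMOUNT_RE = re.compile(r"<info> unmounting (\S+)(?: \((\S+)\))?\s*$")


def unmount_attempts(lines: Iterable[str]) -> Dict[str, int]:
    """Count clurgmgrd unmount attempts per mount point."""
    counts: Counter = Counter()
    for line in lines:
        m = UNMOUNT_RE.search(line)
        if m:
            # Prefer the mount point in parentheses when the line has one;
            # otherwise the single argument is the mount point itself.
            mountpoint = m.group(2) or m.group(1)
            counts[mountpoint] += 1
    return dict(counts)
```

For example, `{mp: n for mp, n in unmount_attempts(open("/var/log/messages")).items() if n > 1}` lists mount points that needed more than one attempt. "Forcefully unmounting" notices are deliberately not counted, so each attempt is counted once.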

Expected results:
The file system should unmount, and the service should relocate to link-13.

Additional info:

-- Additional comment from lhh on 2006-10-05 16:37 EST --
On a second pass, it looks like killing lockd isn't dropping all the locks ...
It could be just a consistency issue between fs.sh and clusterfs.sh; I'll look
into that first.  If it isn't, chances are good that there's an open reference
on the file system, preventing umount from succeeding.



-- Additional comment from lhh on 2006-10-05 16:40 EST --
We're not killing lockd during umount of the cluster file system.

-- Additional comment from lhh on 2006-10-05 16:41 EST --
Created an attachment (id=137860)
Adds killing of lockd to clusterfs teardown when nfslock=1
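The following toy model in Python (not the actual clusterfs.sh shell code) shows why the attached change matters. Locks that lockd still holds on /mnt/gfs1 are treated as an open reference, so umount keeps failing no matter how often it is retried. Stopping lockd first when nfslock=1 clears that reference. `MountedFs`, `stop_clusterfs` and the `kill_lockd` switch are hypothetical names. Modeling lockd's locks as a single holder is an assumption based on the comments above. The two-attempt retry only mirrors the log and is not the script's real retry logic.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class MountedFs:
    """Toy model of a mounted file system and whatever holds it busy (hypothetical)."""
    path: str
    holders: Set[str] = field(default_factory=set)

    def umount(self) -> bool:
        # A real umount typically fails with EBUSY while references remain.
        return not self.holders


def stop_clusterfs(fs: MountedFs, nfslock: bool, kill_lockd: bool,
                   log: List[str]) -> bool:
    """Sketch of a clusterfs stop sequence (not the real clusterfs.sh)."""
    if nfslock and kill_lockd:
        # The fix: stop lockd so the NFS locks it holds are dropped first.
        fs.holders.discard("lockd")
        log.append("killed lockd")
    for attempt in (1, 2):  # two attempts, mirroring the log in the description
        log.append(f"unmounting {fs.path} (attempt {attempt})")
        if fs.umount():
            return True
    log.append(f"umount {fs.path} failed")
    return False
```

With `kill_lockd=False`, both attempts fail as in the reported log. With `kill_lockd=True` and `nfslock=True`, the first attempt succeeds.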

Comment 2 Kiersten (Kerri) Anderson 2006-10-10 16:07:31 UTC
Devel ACK for 5.0.0.

Comment 3 RHEL Program Management 2006-10-10 16:17:27 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux release.  Product Management has requested further review
of this request by Red Hat Engineering.  This request is not yet committed for
inclusion in release.

Comment 4 Lon Hohberger 2006-10-16 14:40:14 UTC
Copied clusterfs.sh over from the RHEL4 branch, which addresses the problem.

Comment 6 Nate Straz 2007-12-13 17:18:46 UTC
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5 which never existed.

