208656 – nfs service relocation failed, umount suspected

Summary: nfs service relocation failed, umount suspected

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	rgmanager
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Lon Hohberger
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2006-09-29 21:47 UTC by Nate Straz
Modified:	2009-07-22 19:27 UTC
CC List:	1 user
Fixed In Version:	RHBA-2006-0712
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-10-11 16:44:31 UTC
Embargoed:

Attachments
Adds killing of lockd to clusterfs teardown when nfslock=1 (645 bytes, patch) 2006-10-05 20:41 UTC, Lon Hohberger

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2006:0712	0	normal	SHIPPED_LIVE	rgmanager bug fix update	2006-10-11 16:44:28 UTC

Description Nate Straz 2006-09-29 21:47:32 UTC

Description of problem:

While running a relocation test case (derringer), service relocation from
link-14 to link-13 failed because the service could not be stopped on link-14.

Version-Release number of selected component (if applicable):
rgmanager-1.9.53-0

How reproducible:
20%

Steps to Reproduce:
1. Set up a GFS file system as an NFS cluster resource
2. Relocate the service between nodes
3. Wait
4. Go to step 2
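
The loop in the steps above can be sketched as a small shell function. This is only an illustration, not the derringer test case: the function name relocate_loop, the RELOCATE variable, the iteration count, and the pause length are all assumptions. By default it prints the clusvcadm commands instead of running them; set RELOCATE=clusvcadm to run them on a cluster node.

```shell
#!/bin/sh
# Dry-run sketch of the relocation loop (hypothetical helper, not from the report).
# RELOCATE is an illustrative hook: by default the commands are only printed.
RELOCATE="${RELOCATE:-echo clusvcadm}"

# relocate_loop SERVICE ITERATIONS PAUSE_SECONDS NODE...
relocate_loop() {
    service=$1; iterations=$2; pause=$3
    shift 3
    i=0
    while [ "$i" -lt "$iterations" ]; do
        for node in "$@"; do
            # -r relocates the service, -m names the target member
            $RELOCATE -r "$service" -m "$node" || return 1
            # Give the service time to settle before the next relocation
            sleep "$pause"
        done
        i=$((i + 1))
    done
}
```

For example, relocate_loop nfs_service 10 30 link-13 link-14 would bounce the service between the two nodes ten times with a 30 second wait after each move. With a failure rate around 20%, a handful of iterations may be needed before the stop failure shows up.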
  
Actual results:

Here is the /var/log/messages output from link-14, where the service was being
stopped.  It looks like rgmanager is trying to umount the file system twice.

Sep 29 16:38:53 link-14 clurgmgrd[7302]: <notice> Stopping service nfs_service
Sep 29 16:38:53 link-14 clurgmgrd: [7302]: <info> Removing IPv4 address 10.15.89.200 from eth1
Sep 29 16:39:03 link-14 clurgmgrd: [7302]: <info> Removing export: *:/mnt/gfs1
Sep 29 16:39:03 link-14 clurgmgrd: [7302]: <warning> Dropping node-wide NFS locks
Sep 29 16:39:04 link-14 clurgmgrd: [7302]: <info> Sending reclaim notifications via link-14
Sep 29 16:39:04 link-14 rpc.statd[6447]: Version 1.0.6 Starting
Sep 29 16:39:04 link-14 rpc.statd[6447]: Flags: No-Daemon Notify-Only
Sep 29 16:39:04 link-14 rpc.statd[6447]: statd running as root. chown /tmp/statd-link-14.6400/sm to choose different user
Sep 29 16:39:07 link-14 rpc.statd[6447]: Caught signal 15, un-registering and exiting.
Sep 29 16:39:07 link-14 clurgmgrd: [7302]: <info> unmounting /dev/mapper/link_ia64-link_ia640 (/mnt/gfs1)
Sep 29 16:39:07 link-14 clurgmgrd: [7302]: <notice> Forcefully unmounting /mnt/gfs1
Sep 29 16:39:12 link-14 clurgmgrd: [7302]: <info> unmounting /dev/mapper/link_ia64-link_ia640 (/mnt/gfs1)
Sep 29 16:39:12 link-14 clurgmgrd: [7302]: <notice> Forcefully unmounting /mnt/gfs1
Sep 29 16:39:12 link-14 clurgmgrd: [7302]: <err> 'umount /dev/mapper/link_ia64-link_ia640' failed (/mnt/gfs1), error=0
Sep 29 16:39:12 link-14 clurgmgrd[7302]: <notice> stop on clusterfs "link_ia640" returned 2 (invalid argument(s))
Sep 29 16:39:12 link-14 clurgmgrd: [7302]: <info> Removing export: *:/mnt/ext3
Sep 29 16:39:12 link-14 clurgmgrd: [7302]: <info> unmounting /mnt/ext3
Sep 29 16:39:12 link-14 clurgmgrd[7302]: <crit> #12: RG nfs_service failed to stop; intervention required
Sep 29 16:39:12 link-14 clurgmgrd[7302]: <notice> Service nfs_service is failed
Sep 29 16:39:13 link-14 clurgmgrd[7302]: <alert> #2: Service nfs_service returned failure code.  Last Owner: link-14
Sep 29 16:39:13 link-14 clurgmgrd[7302]: <alert> #4: Administrator intervention required.
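
To check a saved log for the repeated forced umount seen above, a filter along these lines may help. It is a minimal sketch: the function name forced_umount_counts is made up for illustration, and it assumes the mount point is the last field of each "Forcefully unmounting" line, as in the excerpt.

```shell
#!/bin/sh
# forced_umount_counts < logfile
# Hypothetical helper: counts "Forcefully unmounting" messages per mount point.
# Assumes the mount point is the last whitespace-separated field on the line.
forced_umount_counts() {
    awk '/Forcefully unmounting/ { n[$NF]++ }
         END { for (m in n) print m, n[m] }'
}
```

Run as forced_umount_counts < /var/log/messages; a count above 1 for a mount point within a single stop operation matches the pattern reported here.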

Expected results:
The file system should unmount on link-14 and the service should relocate to link-13.

Additional info:

Comment 1 Lon Hohberger 2006-10-05 20:37:00 UTC

On a second pass, it looks like killing lockd isn't dropping all the locks ...
It could be just a consistency issue between fs.sh and clusterfs.sh; I'll look
into that first.  If it isn't, chances are good that there's an open reference
on the file system, preventing umount from succeeding.

Comment 2 Lon Hohberger 2006-10-05 20:40:32 UTC

We're not killing lockd during umount of the cluster file system.

Comment 3 Lon Hohberger 2006-10-05 20:41:36 UTC

Created attachment 137860
Adds killing of lockd to clusterfs teardown when nfslock=1
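
The idea described by the attachment title can be sketched roughly as follows. This is not the attached patch and not the actual clusterfs.sh code: the function name drop_nfs_locks and the KILL_LOCKD variable are hypothetical, and the nfslock value is passed in as a plain argument. By default the sketch only prints the pkill command so it can be tried without signalling a real lockd.

```shell
#!/bin/sh
# Hypothetical sketch only; NOT the patch attached to this bug.
# KILL_LOCKD is an illustrative hook; by default the kill command is printed,
# not executed. Set KILL_LOCKD="pkill -KILL -x lockd" to signal lockd for real.
KILL_LOCKD="${KILL_LOCKD:-echo pkill -KILL -x lockd}"

# drop_nfs_locks NFSLOCK
# Kill lockd before the umount attempt, but only when nfslock is set to 1.
drop_nfs_locks() {
    case "$1" in
        1)
            echo "Dropping node-wide NFS locks"
            $KILL_LOCKD
            ;;
        *)
            # nfslock unset or 0: leave lockd alone
            return 0
            ;;
    esac
}
```

In a teardown path, a call such as drop_nfs_locks "$nfslock" would sit just before the umount, so that locks held by lockd no longer keep the file system busy.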

Comment 4 Lon Hohberger 2006-10-05 20:42:34 UTC

Can you apply this to clusterfs.sh on your test cluster and see if it fixes the
problem?  It should.

Comment 6 Lon Hohberger 2006-10-05 21:59:46 UTC

Fixes in CVS; awaiting rebuild.

Comment 10 Red Hat Bugzilla 2006-10-11 16:44:31 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0712.html
