Bug 208656

Summary: nfs service relocation failed, umount suspected
Product: [Retired] Red Hat Cluster Suite
Component: rgmanager
Version: 4
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Nate Straz <nstraz>
Assignee: Lon Hohberger <lhh>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint
Fixed In Version: RHBA-2006-0712
Doc Type: Bug Fix
Last Closed: 2006-10-11 16:44:31 UTC
Attachments:
Adds killing of lockd to clusterfs teardown when nfslock=1 (attachment 137860)

Description Nate Straz 2006-09-29 21:47:32 UTC
Description of problem:

While running a relocation test case (derringer), service relocation from
link-14 to link-13 failed because the service could not be stopped on link-14.

Version-Release number of selected component (if applicable):
rgmanager-1.9.53-0

How reproducible:
20%

Steps to Reproduce:
1. Set up a GFS file system as an NFS cluster resource
2. Relocate the service between nodes (a loop for this is sketched below)
3. Wait
4. Go to step 2
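
For reference, the relocation cycle is roughly the following; this is only a
sketch, not the actual derringer test, and the service/node names are the ones
from the log below (clusvcadm -r relocates a service to the member named with -m):

    # Sketch of the relocation cycle (not the actual derringer test case)
    while true; do
        clusvcadm -r nfs_service -m link-13 || break
        sleep 60
        clusvcadm -r nfs_service -m link-14 || break
        sleep 60
    done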
  
Actual results:

Here is the /var/log/messages output from link-14, where the service was being
stopped.  It looks like rgmanager is trying to umount the file system twice.

Sep 29 16:38:53 link-14 clurgmgrd[7302]: <notice> Stopping service nfs_service 
Sep 29 16:38:53 link-14 clurgmgrd: [7302]: <info> Removing IPv4 address 10.15.89.200 from eth1 
Sep 29 16:39:03 link-14 clurgmgrd: [7302]: <info> Removing export: *:/mnt/gfs1 
Sep 29 16:39:03 link-14 clurgmgrd: [7302]: <warning> Dropping node-wide NFS locks 
Sep 29 16:39:04 link-14 clurgmgrd: [7302]: <info> Sending reclaim notifications via link-14 
Sep 29 16:39:04 link-14 rpc.statd[6447]: Version 1.0.6 Starting
Sep 29 16:39:04 link-14 rpc.statd[6447]: Flags: No-Daemon Notify-Only 
Sep 29 16:39:04 link-14 rpc.statd[6447]: statd running as root. chown /tmp/statd-link-14.6400/sm to choose different user 
Sep 29 16:39:07 link-14 rpc.statd[6447]: Caught signal 15, un-registering and exiting.
Sep 29 16:39:07 link-14 clurgmgrd: [7302]: <info> unmounting /dev/mapper/link_ia64-link_ia640 (/mnt/gfs1) 
Sep 29 16:39:07 link-14 clurgmgrd: [7302]: <notice> Forcefully unmounting /mnt/gfs1 
Sep 29 16:39:12 link-14 clurgmgrd: [7302]: <info> unmounting /dev/mapper/link_ia64-link_ia640 (/mnt/gfs1) 
Sep 29 16:39:12 link-14 clurgmgrd: [7302]: <notice> Forcefully unmounting /mnt/gfs1 
Sep 29 16:39:12 link-14 clurgmgrd: [7302]: <err> 'umount /dev/mapper/link_ia64-link_ia640' failed (/mnt/gfs1), error=0 
Sep 29 16:39:12 link-14 clurgmgrd[7302]: <notice> stop on clusterfs "link_ia640" returned 2 (invalid argument(s)) 
Sep 29 16:39:12 link-14 clurgmgrd: [7302]: <info> Removing export: *:/mnt/ext3 
Sep 29 16:39:12 link-14 clurgmgrd: [7302]: <info> unmounting /mnt/ext3 
Sep 29 16:39:12 link-14 clurgmgrd[7302]: <crit> #12: RG nfs_service failed to stop; intervention required 
Sep 29 16:39:12 link-14 clurgmgrd[7302]: <notice> Service nfs_service is failed 
Sep 29 16:39:13 link-14 clurgmgrd[7302]: <alert> #2: Service nfs_service returned failure code.  Last Owner: link-14 
Sep 29 16:39:13 link-14 clurgmgrd[7302]: <alert> #4: Administrator intervention required. 

Expected results:
The file system should unmount cleanly and the service should relocate to link-13.

Additional info:

Comment 1 Lon Hohberger 2006-10-05 20:37:00 UTC
On a second pass, it looks like killing lockd isn't dropping all the locks ...
It could be just a consistency issue between fs.sh and clusterfs.sh; I'll look
into that first.  If it isn't, chances are good that there's an open reference
on the file system preventing umount from succeeding.
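
A quick way to check for an open reference like that is something along these
lines (just a suggested check against the mount point from the log above, not
part of any fix):

    fuser -vm /mnt/gfs1       # processes with files open on the mounted file system
    lsof /mnt/gfs1            # same information via lsof; the argument is the mount point
    grep gfs1 /proc/mounts    # confirm whether the file system is in fact still mounted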



Comment 2 Lon Hohberger 2006-10-05 20:40:32 UTC
We're not killing lockd during umount of the cluster file system.

Comment 3 Lon Hohberger 2006-10-05 20:41:36 UTC
Created attachment 137860 [details]
Adds killing of lockd to clusterfs teardown when nfslock=1
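
Roughly, the change amounts to something like the following in the clusterfs.sh
stop path; this is a sketch only, assuming the usual OCF_RESKEY_ parameter
handling and the same nfslock behavior fs.sh already has (the attachment is the
real patch):

    # Sketch only, not the actual attachment 137860.  Before the forced
    # umount, drop NFS locks when the resource has nfslock="1".  Sending
    # SIGKILL to the lockd kernel thread is supposed to make it release
    # its locks without the thread itself exiting.
    if [ "$OCF_RESKEY_nfslock" = "yes" ] || [ "$OCF_RESKEY_nfslock" = "1" ]; then
        pkill -KILL -x lockd
    fi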

Comment 4 Lon Hohberger 2006-10-05 20:42:34 UTC
Can you apply this to clusterfs.sh on your test cluster and see if it fixes the
problem?  It should.
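
Something like this should work for testing it on one node; /usr/share/cluster
is where the rgmanager resource agents live, and the patch file name below is
just a placeholder for the downloaded attachment:

    cd /usr/share/cluster                # rgmanager resource agents
    cp clusterfs.sh clusterfs.sh.orig    # keep a copy of the shipped script
    patch clusterfs.sh < /tmp/attachment-137860.patch   # placeholder file name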

Comment 6 Lon Hohberger 2006-10-05 21:59:46 UTC
Fixes in CVS; awaiting rebuild.

Comment 10 Red Hat Bugzilla 2006-10-11 16:44:31 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0712.html