Bug 208656

Summary: nfs service relocation failed, umount suspected
Product: [Retired] Red Hat Cluster Suite
Component: rgmanager
Version: 4
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Nate Straz <nstraz>
Assignee: Lon Hohberger <lhh>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint
Fixed In Version: RHBA-2006-0712
Doc Type: Bug Fix
Last Closed: 2006-10-11 16:44:31 UTC
Attachments:
Adds killing of lockd to clusterfs teardown when nfslock=1 (attachment 137860)

Description Nate Straz 2006-09-29 21:47:32 UTC
Description of problem:

While running a relocation test case (derringer), service relocation from
link-14 to link-13 failed because the service could not be stopped on link-14.

Version-Release number of selected component (if applicable):
rgmanager-1.9.53-0

How reproducible:
20%

Steps to Reproduce:
1. Set up a GFS file system as an NFS cluster resource
2. Relocate the service between nodes (a loop for this is sketched below)
3. Wait
4. Go to step 2
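
For reference, the relocation cycle is roughly the following; this is only a
sketch, not the actual derringer test, and the service/node names are the ones
from the log below (clusvcadm -r relocates a service to the member named with -m):

    # Sketch of the relocation cycle (not the actual derringer test case)
    while true; do
        clusvcadm -r nfs_service -m link-13 || break
        sleep 60
        clusvcadm -r nfs_service -m link-14 || break
        sleep 60
    done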
  
Actual results:

Here is the /var/log/messages output from link-14, where the service was being
stopped.  It looks like rgmanager is trying to umount the file system twice.

Sep 29 16:38:53 link-14 clurgmgrd[7302]: <notice> Stopping service nfs_service 
Sep 29 16:38:53 link-14 clurgmgrd: [7302]: <info> Removing IPv4 address 10.15.89.200 from eth1 
Sep 29 16:39:03 link-14 clurgmgrd: [7302]: <info> Removing export: *:/mnt/gfs1 
Sep 29 16:39:03 link-14 clurgmgrd: [7302]: <warning> Dropping node-wide NFS locks 
Sep 29 16:39:04 link-14 clurgmgrd: [7302]: <info> Sending reclaim notifications via link-14 
Sep 29 16:39:04 link-14 rpc.statd[6447]: Version 1.0.6 Starting
Sep 29 16:39:04 link-14 rpc.statd[6447]: Flags: No-Daemon Notify-Only 
Sep 29 16:39:04 link-14 rpc.statd[6447]: statd running as root. chown /tmp/statd-link-14.6400/sm to choose different user 
Sep 29 16:39:07 link-14 rpc.statd[6447]: Caught signal 15, un-registering and exiting.
Sep 29 16:39:07 link-14 clurgmgrd: [7302]: <info> unmounting /dev/mapper/link_ia64-link_ia640 (/mnt/gfs1) 
Sep 29 16:39:07 link-14 clurgmgrd: [7302]: <notice> Forcefully unmounting /mnt/gfs1 
Sep 29 16:39:12 link-14 clurgmgrd: [7302]: <info> unmounting /dev/mapper/link_ia64-link_ia640 (/mnt/gfs1) 
Sep 29 16:39:12 link-14 clurgmgrd: [7302]: <notice> Forcefully unmounting /mnt/gfs1 
Sep 29 16:39:12 link-14 clurgmgrd: [7302]: <err> 'umount /dev/mapper/link_ia64-link_ia640' failed (/mnt/gfs1), error=0 
Sep 29 16:39:12 link-14 clurgmgrd[7302]: <notice> stop on clusterfs "link_ia640" returned 2 (invalid argument(s)) 
Sep 29 16:39:12 link-14 clurgmgrd: [7302]: <info> Removing export: *:/mnt/ext3 
Sep 29 16:39:12 link-14 clurgmgrd: [7302]: <info> unmounting /mnt/ext3 
Sep 29 16:39:12 link-14 clurgmgrd[7302]: <crit> #12: RG nfs_service failed to stop; intervention required 
Sep 29 16:39:12 link-14 clurgmgrd[7302]: <notice> Service nfs_service is failed 
Sep 29 16:39:13 link-14 clurgmgrd[7302]: <alert> #2: Service nfs_service returned failure code.  Last Owner: link-14 
Sep 29 16:39:13 link-14 clurgmgrd[7302]: <alert> #4: Administrator intervention required. 

Expected results:
The file system should unmount cleanly and the service should relocate to link-13.

Additional info:

Comment 1 Lon Hohberger 2006-10-05 20:37:00 UTC
On a second pass, it looks like killing lockd isn't dropping all the locks ...
It could be just a consistency issue between fs.sh and clusterfs.sh; I'll look
into that first.  If it isn't, chances are good that there's an open reference
on the file system preventing umount from succeeding.
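
A quick way to check for an open reference like that is something along these
lines (just a suggested check against the mount point from the log above, not
part of any fix):

    fuser -vm /mnt/gfs1       # processes with files open on the mounted file system
    lsof /mnt/gfs1            # same information via lsof; the argument is the mount point
    grep gfs1 /proc/mounts    # confirm whether the file system is in fact still mounted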



Comment 2 Lon Hohberger 2006-10-05 20:40:32 UTC
We're not killing lockd during umount of the cluster file system.

Comment 3 Lon Hohberger 2006-10-05 20:41:36 UTC
Created attachment 137860 [details]
Adds killing of lockd to clusterfs teardown when nfslock=1
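
Roughly, the change amounts to something like the following in the clusterfs.sh
stop path; this is a sketch only, assuming the usual OCF_RESKEY_ parameter
handling and the same nfslock behavior fs.sh already has (the attachment is the
real patch):

    # Sketch only, not the actual attachment 137860.  Before the forced
    # umount, drop NFS locks when the resource has nfslock="1".  Sending
    # SIGKILL to the lockd kernel thread is supposed to make it release
    # its locks without the thread itself exiting.
    if [ "$OCF_RESKEY_nfslock" = "yes" ] || [ "$OCF_RESKEY_nfslock" = "1" ]; then
        pkill -KILL -x lockd
    fi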

Comment 4 Lon Hohberger 2006-10-05 20:42:34 UTC
Can you apply this to clusterfs.sh on your test cluster and see if it fixes the
problem?  It should.
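
Something like this should work for testing it on one node; /usr/share/cluster
is where the rgmanager resource agents live, and the patch file name below is
just a placeholder for the downloaded attachment:

    cd /usr/share/cluster                # rgmanager resource agents
    cp clusterfs.sh clusterfs.sh.orig    # keep a copy of the shipped script
    patch clusterfs.sh < /tmp/attachment-137860.patch   # placeholder file name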

Comment 6 Lon Hohberger 2006-10-05 21:59:46 UTC
Fixes in CVS; awaiting rebuild.

Comment 10 Red Hat Bugzilla 2006-10-11 16:44:31 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0712.html