Bug 129193 - cluster can hang if lock_gulmd logs out on mounted client
Summary: cluster can hang if lock_gulmd logs out on mounted client
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: gfs
Version: 3
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: michael conrad tadpol tilstra
QA Contact: GFS Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2004-08-04 22:38 UTC by Adam "mantis" Manthei
Modified: 2010-01-12 02:55 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-05-25 16:41:08 UTC
Embargoed:


Attachments
Drop all lock holds on node logout. (2.68 KB, patch)
2004-08-09 17:56 UTC, michael conrad tadpol tilstra


Links
Red Hat Product Errata RHBA-2005:466 (normal, SHIPPED_LIVE): GFS bug fix update, last updated 2005-05-25 04:00:00 UTC

Description Adam "mantis" Manthei 2004-08-04 22:38:07 UTC
Description of problem:
lock_gulmd can log out of the master on a client node while that
client has GFS mounted.  The result is that the client's locks are
still in the locktable, which can cause the cluster to hang while
waiting for the logged-out client to release them.

Mike Tilstra had mentioned that a possible solution for this problem
would be a locktable sweep that cleans up a node's locks on shutdown.

Currently, the workaround for this problem is to start lock_gulmd on
the client node and then force it to expire.  (lock_gulmd cannot
force expire a node that is not logged in.)


Version-Release number of selected component (if applicable):
GFS-modules-smp-6.0.0-1.2; GFS-6.0.0-1.2


How reproducible:
always

Steps to Reproduce:
1. start lock servers
2. mount clients
3. gulm_tool shutdown client1
4. cluster can now hang because locks are still in the locktable for
client1
  
Actual results:
If the node tries to mount after it has logged out cleanly from the
lock_gulmd master AND rebooted, it will produce the following error:

lock_gulm: ERROR On lock 0x47040000000000181f587472696e312e67667300   
  Got a drop lcok request for a lock that we don't know of. state:0x3


Expected results:


Additional info:

Comment 1 Adam "mantis" Manthei 2004-08-05 13:44:25 UTC
As an aside, lock_gulmd logs out cleanly when it receives SIGTERM.
Given that "bad things" can happen when the lock server logs out
cleanly while there are active resources (e.g. GFS or GNBD), it might
be justifiable to remove the possibility of accidentally shooting
yourself in the foot with SIGTERM by simply ignoring it altogether.

Part of the original justification for having lock_gulmd shut down
cleanly on SIGTERM was to handle machines shutting down and receiving
SIGTERM from killall.  Since that is generally run *after* networking
has been shut down, the node will be fenced anyway, since it will not
be able to issue a clean shutdown over the downed interface.  Now that
there are init.d scripts for lock_gulmd, it makes even more sense to
ignore SIGTERM.



Comment 2 michael conrad tadpol tilstra 2004-08-09 17:56:41 UTC
Created attachment 102529 [details]
Drop all lock holds on node logout.

This implements the droplocks-on-logout idea.  Not fully sure what
all of the side effects of this are yet.  So give it a whirl, see
what breaks.
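
(Illustrative only: a minimal C sketch of what such a logout-time
sweep could look like.  The struct and function names here are
hypothetical; the attached patch is the real change.)

    #include <stdlib.h>
    #include <string.h>

    struct lock_entry {
        struct lock_entry *next;
        char              *holder;   /* node that owns this hold */
    };

    /* On node logout, walk the locktable and free every hold owned
     * by the departing node so nobody waits on a dead client. */
    void drop_locks_for_node(struct lock_entry **table, const char *node)
    {
        struct lock_entry **pp = table;

        while (*pp) {
            if (strcmp((*pp)->holder, node) == 0) {
                struct lock_entry *dead = *pp;
                *pp = dead->next;        /* unlink the stale hold */
                free(dead->holder);
                free(dead);
            } else {
                pp = &(*pp)->next;
            }
        }
    }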

Comment 3 michael conrad tadpol tilstra 2004-10-14 15:47:40 UTC
CVS head ignores SIGTERM now.


Comment 4 Adam "mantis" Manthei 2004-10-14 15:51:32 UTC
OK, but what about RHEL3?

Comment 5 michael conrad tadpol tilstra 2004-10-14 17:58:53 UTC
SIG_IGN on SIGTERM in the 6.0 sources now too.


Comment 6 michael conrad tadpol tilstra 2004-10-21 21:57:42 UTC
in RHEL3 now too.


Comment 7 michael conrad tadpol tilstra 2004-10-29 21:25:55 UTC
In CVS head, core now gets locked by the gulm kernel module.  Until
the module logs out, core will ignore shutdown requests.
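
(Illustrative only: a minimal C sketch of that core-locking scheme,
with hypothetical names; not the actual gulm sources.)

    /* Count of holds the kernel module has on core. */
    static int module_holds;

    static void do_clean_logout(void)
    {
        /* log out of the master cleanly here */
    }

    void module_login(void)  { module_holds++; }
    void module_logout(void) { module_holds--; }

    /* Returns 0 if the shutdown proceeded, -1 if it was refused. */
    int handle_shutdown_req(void)
    {
        if (module_holds > 0)
            return -1;   /* GFS still mounted: ignore the request */
        do_clean_logout();
        return 0;
    }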

Comment 8 michael conrad tadpol tilstra 2004-12-01 21:26:23 UTC
SIGTERM ignoring and core locking in the 6.0.* sources now too.

Comment 9 Jay Turner 2005-05-25 16:41:09 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-466.html


