Bug 160494

Summary:        cluster hangs during randomized recovery tests -- drop_expired not being called
Product:        [Retired] Red Hat Cluster Suite
Component:      gulm
Version:        3
Hardware:       All
OS:             Linux
Reporter:       Adam "mantis" Manthei <amanthei>
Assignee:       Chris Feist <cfeist>
QA Contact:     Cluster QE <mspqa-list>
CC:             cfeist, cluster-maint
Status:         CLOSED CURRENTRELEASE
Severity:       high
Priority:       medium
Doc Type:       Bug Fix
Last Closed:    2009-12-22 20:39:12 UTC
Bug Depends On: 153321
Attachments:
  logs of test run for comment #3 (flags: none)
  fixes race in gulm_jid.c (flags: none)
  add callback queue (flags: none)

Description Adam "mantis" Manthei 2005-06-15 14:05:37 UTC
Description of problem:
I'm seeing cluster hangs during randomized recovery tests where only GFS
clients are being crashed and recovered.  The setup that has produced the
problem for me the quickest is an 8 node GFS setup with 3 RLM servers.  All the
nodes mount the same filesystem and run fsstress to generate load (fsstress -u8
-p10 -d /gfs/fs1).  6 nodes are set up to automatically "reboot -fn" between 1
and 300 seconds after booting.  After a while, the nodes become unable to mount
because none of their expired locks were dropped.


Version-Release number of selected component (if applicable):
[root@trin-10 root]# rpm -qa | grep -iE "(gfs|kernel|iscsi)" | sort
GFS-6.0.2.20-1
GFS-modules-6.0.2.20-1 (includes the fix for bug #153321)
iscsi-initiator-utils-3.6.2-7
kernel-2.4.21-32.EL
kernel-pcmcia-cs-3.1.31-13
kernel-utils-2.4-8.37


How reproducible:
Running with 8 nodes (6 of which crash), I can see the problem in about an hour.
I ran with 4 nodes overnight (3 of which crash) and have yet to see the hang
after 13 hours.


Steps to Reproduce:
1.  Set the cluster up to automatically mount GFS on boot.  Automated fencing
    is required.  (I use manual fencing with a manual unfencing script that
    sshes into the lock server and runs fence_ack_manual prior to
    starting lock_gulmd.)
2.  Set the cluster up so that all client nodes run a load generator out of
    /etc/rc3.d/S99testload.  Some nodes should reboot after a random sleep
    interval.
3.  Wait for nodes to hang on boot.
  

Actual results:
Cluster hangs.  High expired count in lt getstats while no nodes are expired in
nodelist.


Expected results:
Cluster should not hang.  The expired count in lt getstats should be 0 when
there are no expired nodes in the nodelist.


Additional info:
My setup is using iscsi (Netapp) for storage.  I've not yet tried reproducing
the bug with fibre-attached storage to see if that makes a difference.

I've also been able, on occasion, to get the cluster unstuck by rebooting nodes
that were alive and mounted when the cluster got hung.  If this doesn't get the
cluster unstuck, then a cluster restart may be necessary.

Another way of detecting whether a node isn't having its expired locks dropped
is to look at the LT's getstats output.  If there are no expired nodes in core's
nodelist, there should be no expired locks in the LT.  When I've encountered
this bug, I've seen tens to thousands of expired locks with no expired nodes in
the nodelist.  This isn't always accurate, though: after repeated recovery test
iterations, the expired counter gradually increments even when a lock dump shows
that there are no expired locks in the system.

Comment 2 Adam "mantis" Manthei 2005-06-15 14:20:28 UTC
I think that the critical component to this test is having nodes that can crash
while they are in the process of recovering for other nodes.  Perhaps more
breakpoint recovery testing needs to be done here (i.e. instrument the locking
and module code to crash nodes at specific, controlled breakpoints throughout
the recovery process).

Comment 3 Adam "mantis" Manthei 2005-06-15 17:26:07 UTC
I just got the lockup on the 4-node (3 crashers) setup with 3 RLM servers.
Logs will be posted shortly.

Here are some potentially interesting things of note:

From the master:
Jun 15 08:40:58 blade-01 lock_gulmd_core[21419]: Client
(trin-11.lab.msp.redhat.com) expired 
Jun 15 08:41:20 blade-01 lock_gulmd_core[21419]: Client
(trin-12.lab.msp.redhat.com) expired 
Jun 15 08:41:28 blade-01 lock_gulmd_core[21419]: Client
(trin-13.lab.msp.redhat.com) expired 
Jun 15 08:43:28 blade-01 lock_gulmd_core[21419]: Client
(trin-12.lab.msp.redhat.com) expired 
Jun 15 08:43:43 blade-01 lock_gulmd_core[21419]: Client
(trin-11.lab.msp.redhat.com) expired 
Jun 15 08:44:05 blade-01 lock_gulmd_core[21419]: Client
(trin-13.lab.msp.redhat.com) expired 

From the slave:
Jun 15 08:42:03 blade-03 lock_gulmd_LT000[3067]: Dropped expired locks for
trin-11.lab.msp.redhat.com 
Jun 15 08:42:03 blade-03 lock_gulmd_LT000[3067]: Dropped expired locks for
trin-13.lab.msp.redhat.com 
Jun 15 08:42:04 blade-03 lock_gulmd_LT000[3067]: Dropped expired locks for
trin-12.lab.msp.redhat.com 
Jun 15 08:42:06 blade-03 lock_gulmd_LT000[3067]: Dropped expired locks for
trin-11.lab.msp.redhat.com 
Jun 15 08:42:24 blade-03 lock_gulmd_LT000[3067]: Dropped expired locks for
trin-12.lab.msp.redhat.com 
Jun 15 08:42:34 blade-03 lock_gulmd_LT000[3067]: Dropped expired locks for
trin-12.lab.msp.redhat.com 
Jun 15 08:42:36 blade-03 lock_gulmd_LT000[3067]: Dropped expired locks for
trin-13.lab.msp.redhat.com 
Jun 15 08:42:42 blade-03 lock_gulmd_LT000[3067]: Dropped expired locks for
trin-13.lab.msp.redhat.com 
Jun 15 08:44:35 blade-03 lock_gulmd_LT000[3067]: Dropped expired locks for
trin-12.lab.msp.redhat.com 
Jun 15 08:44:35 blade-03 lock_gulmd_LT000[3067]: Dropped expired locks for
trin-13.lab.msp.redhat.com 

Note 1: I'm not sure why there are no lock_gulmd_LT000 messages on the master
for the "Dropped expired" event.  Are they not being printed?  Not logged?

Note 2: The Core "Client expired" messages do not line up with the LT
"Dropped expired" messages.  Should they?  Are the "Dropped expired" messages
not properly ordered?


Here are the console logs for the nodes.  Something worth noting is that trin-10
tries to access trin-12's journal lock two times immediately following the
replay.  Seems a bit odd to me.

Jun 15 08:42:20 trin-10 kernel: lock_gulm: Checking for journals for node
"trin-12.lab.msp.redhat.com"
Jun 15 08:42:20 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=1: Trying to acquire
journal lock...
Jun 15 08:42:20 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=1: Looking at journal...
Jun 15 08:42:23 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=1: Done

Jun 15 08:42:24 trin-11 kernel: lock_gulm: Checking for journals for node
"trin-12.lab.msp.redhat.com"
Jun 15 08:42:24 trin-11 kernel: GFS: fsid=monty:gfs1.2: jid=1: Trying to acquire
journal lock...
Jun 15 08:42:24 trin-11 kernel: GFS: fsid=monty:gfs1.2: jid=1: Busy

Jun 15 08:42:32 trin-12 kernel: GFS 6.0.2.20 (built May 12 2005 12:28:13) installed

Jun 15 08:42:32 trin-10 kernel: lock_gulm: Checking for journals for node
"trin-13.lab.msp.redhat.com"
Jun 15 08:42:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=3: Trying to acquire
journal lock...
Jun 15 08:42:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=3: Busy

Jun 15 08:42:34 trin-12 kernel: GFS: fsid=monty:gfs1.1: Joined cluster. Now
mounting FS...
Jun 15 08:42:34 trin-12 kernel: GFS: fsid=monty:gfs1.1: jid=1: Trying to acquire
journal lock...
Jun 15 08:42:34 trin-12 kernel: GFS: fsid=monty:gfs1.1: jid=1: Looking at journal...
Jun 15 08:42:35 trin-12 kernel: lock_gulm: Checking for journals for node
"trin-13.lab.msp.redhat.com"

Jun 15 08:42:35 trin-11 kernel: lock_gulm: Checking for journals for node
"trin-13.lab.msp.redhat.com"
Jun 15 08:42:36 trin-11 kernel: GFS: fsid=monty:gfs1.2: jid=3: Trying to acquire
journal lock...
Jun 15 08:42:36 trin-11 kernel: GFS: fsid=monty:gfs1.2: jid=3: Looking at journal...

Jun 15 08:42:36 trin-12 kernel: GFS: fsid=monty:gfs1.1: jid=1: Done
Jun 15 08:42:36 trin-12 kernel: GFS: fsid=monty:gfs1.1: Scanning for log elements...
Jun 15 08:42:36 trin-12 kernel: GFS: fsid=monty:gfs1.1: Found 0 unlinked inodes
Jun 15 08:42:36 trin-12 kernel: GFS: fsid=monty:gfs1.1: Found quota changes for
4 IDs
Jun 15 08:42:36 trin-12 kernel: GFS: fsid=monty:gfs1.1: Done
Jun 15 08:42:36 trin-12 kernel: GFS: fsid=monty:gfs1.1: jid=3: Trying to acquire
journal lock...
Jun 15 08:42:36 trin-12 kernel: GFS: fsid=monty:gfs1.1: jid=3: Busy

Jun 15 08:42:36 trin-12 gfs: Mounting GFS filesystems:  succeeded

Jun 15 08:42:37 trin-11 kernel: GFS: fsid=monty:gfs1.2: jid=3: Done

Jun 15 08:42:40 trin-13 kernel: GFS 6.0.2.20 (built May 12 2005 12:28:13) installed

Jun 15 08:42:42 trin-13 kernel: GFS: fsid=monty:gfs1.3: Joined cluster. Now
mounting FS...
Jun 15 08:42:42 trin-13 kernel: GFS: fsid=monty:gfs1.3: jid=3: Trying to acquire
journal lock...
Jun 15 08:42:42 trin-13 kernel: GFS: fsid=monty:gfs1.3: jid=3: Looking at journal...
Jun 15 08:42:43 trin-13 kernel: GFS: fsid=monty:gfs1.3: jid=3: Done
Jun 15 08:42:43 trin-13 kernel: GFS: fsid=monty:gfs1.3: Scanning for log elements...
Jun 15 08:42:43 trin-13 kernel: GFS: fsid=monty:gfs1.3: Found 0 unlinked inodes
Jun 15 08:42:43 trin-13 kernel: GFS: fsid=monty:gfs1.3: Found quota changes for
0 IDs
Jun 15 08:42:43 trin-13 kernel: GFS: fsid=monty:gfs1.3: Done

Jun 15 08:42:44 trin-13 gfs: Mounting GFS filesystems:  succeeded

Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=1: Trying to acquire
journal lock...
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=1: Looking at journal...
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=1: Acquiring the
transaction lock...
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=1: Replaying journal...
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=1: Replayed 0 of 9
blocks
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=1: replays = 0,
skips = 0, sames = 9
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=1: Journal replayed
in 1s
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=1: Done

Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=3: Trying to acquire
journal lock...
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=3: Looking at journal...
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=3: Acquiring the
transaction lock...
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=3: Replaying journal...
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=3: Replayed 0 of 0
blocks
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=3: replays = 0,
skips = 0, sames = 0
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=3: Journal replayed
in 1s
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=3: Done

Jun 15 08:44:31 trin-10 kernel: lock_gulm: Checking for journals for node
"trin-12.lab.msp.redhat.com"
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=2: Trying to acquire
journal lock...
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=2: Looking at journal...
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=2: Acquiring the
transaction lock...
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=2: Replaying journal...
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=2: Replayed 3 of 5
blocks
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=2: replays = 3,
skips = 0, sames = 2
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=2: Journal replayed
in 1s
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=2: Done
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=2: Trying to acquire
journal lock...
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=2: Busy
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=2: Trying to acquire
journal lock...
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=2: Busy

Jun 15 08:44:38 trin-12 kernel: GFS 6.0.2.20 (built May 12 2005 12:28:13) installed

Jun 15 08:44:40 trin-10 kernel: lock_gulm: Checking for journals for node
"trin-11.lab.msp.redhat.com"

Jun 15 08:44:43 trin-12 kernel: lock_gulm: Checking for journals for node
"trin-11.lab.msp.redhat.com"

Jun 15 08:44:47 trin-11 kernel: GFS 6.0.2.20 (built May 12 2005 12:28:13) installed

Jun 15 08:45:16 trin-10 kernel: lock_gulm: Checking for journals for node
"trin-13.lab.msp.redhat.com"

Jun 15 08:45:20 trin-11 kernel: lock_gulm: Checking for journals for node
"trin-13.lab.msp.redhat.com"

Jun 15 08:45:20 trin-12 kernel: lock_gulm: Checking for journals for node
"trin-13.lab.msp.redhat.com"

Jun 15 08:45:24 trin-13 kernel: GFS 6.0.2.20 (built May 12 2005 12:28:13) installed

Comment 4 Adam "mantis" Manthei 2005-06-15 17:33:11 UTC
Created attachment 115493 [details]
logs of test run for comment #3

logs of test run for comment #3

Comment 5 Adam "mantis" Manthei 2005-06-15 18:47:37 UTC
In the test run in comment #3, the node that wasn't crashing was rebooted when
the cluster got hung.  When this node was marked as expired, the cluster was
able to resume activity again.

Comment 6 michael conrad tadpol tilstra 2005-06-20 15:00:26 UTC
Is the master server bouncing durring this?  are slaves?

Regarding note 2 in comment #3: drop expired is called *after* a journal is
finished being replayed.  So: node expires, node is fenced, another node replays
the journal, expired locks for the replayed journal are dropped.
So yes, they don't line up, and they shouldn't.



Comment 7 michael conrad tadpol tilstra 2005-06-20 15:26:42 UTC
Based on the fel.out from the dumps in attachment #115493 [details], there is a
deadlock in the jid mapping scan.  Scans are grabbed in order from 0 and then
up: whichever node holds jid map 0 should be the only node holding any maps >0.
The fel dump has trin-10 holding jid map 2, and trin-12 holding 0 and 1.  This
shouldn't happen.  Bleh.  This might be the problem; it needs to be fixed in any
case.


Comment 8 michael conrad tadpol tilstra 2005-06-20 15:34:16 UTC
*sigh* Except, you know, there are a couple of cases where that could happen,
and it shouldn't be a problem.  So comment #7 is probably ignorable.

Comment 9 Adam "mantis" Manthei 2005-06-20 18:12:44 UTC
No LT servers (master or slaves) are being crashed in the attempts to reproduce
this bug.

Comment 10 Adam "mantis" Manthei 2005-06-20 22:52:52 UTC
I got the bug to happen again.  I grabbed a lock dump and ran it through fel.rb.
The first three journal locks are listed below.  Both trin-11 and trin-10
were holding journal locks exclusive, so I force expired node trin-10 and the
cluster resumed immediately.

This could be incorporated into a script that detects a hung cluster affected
by this bug and executes the appropriate forceexpire to unhang the cluster as a
workaround.


#=================
---
key            : 'R0ZTIEoEZ2ZzMQAJTQAAAAAAAAAAAA=='
ExK            : GFS , J, 4, gfs1, 9, M, 0
state          : gio_lck_st_Exclusive
LVBlen         : 28
LVB            : 'AnRyaW4tMTAubGFiLm1zcC5yZWRoYXQuY29tAA=='
Elb            : 
HolderCount    : 1
Holders        : [ trin-11.lab.msp.redhat.com ]
LVBHolderCount : 5
LVBHolders     : [ trin-12.lab.msp.redhat.com, trin-13.lab.msp.redhat.com,
trin-11.lab.msp.redhat.com, trin-08.lab.msp.redhat.com, trin-10.
lab.msp.redhat.com ]
ExpiredCount   : 0
ExpiredHolders : 
reply_waiter   :
Waiters        :
High_Waiters   :
 - key         : 'R0ZTIEoEZ2ZzMQAJTQAAAAAAAAAAAA=='
   ExK         : GFS , J, 4, gfs1, 9, M, 0
   name        : trin-10.lab.msp.redhat.com
   state       : gio_lck_st_Exclusive
   flags       : hasLVB Cachable 
   LVB         :
'AGAh4MAqEMIDAAAAACYh4NwzGdEAAAAABwAAAAAAAABsoh7gnGSa0gAAAAAAYDbfAQAAANwzGdHcMxnRAAAAAA=='
   Elb         : 
   Slave_rply  : 0x0
   Slave_sent  : 0x0
   idx         : 7
 - key         : 'R0ZTIEoEZ2ZzMQAJTQAAAAAAAAAAAA=='
   ExK         : GFS , J, 4, gfs1, 9, M, 0
   name        : trin-13.lab.msp.redhat.com
   state       : gio_lck_st_Exclusive
   flags       : hasLVB Cachable 
   LVB         :
'AAAAAAAAAAAAAAAAAAAAAAAAAACAqnPfkJhd3aeqc9+ArHPfBAYy4ICqc98SAAAAjphd3YYCAACYmF3dNCwy4A=='
   Elb         : 
   Slave_rply  : 0x0
   Slave_sent  : 0x0
   idx         : 6
 - key         : 'R0ZTIEoEZ2ZzMQAJTQAAAAAAAAAAAA=='
   ExK         : GFS , J, 4, gfs1, 9, M, 0
   name        : trin-12.lab.msp.redhat.com
   state       : gio_lck_st_Exclusive
   flags       : hasLVB Cachable 
   LVB         :
'AAAAAAAAAAAAAAAAAAAAAAAAAACAVVPfkJhB3adVU9+AWFPfBAYy4IBVU98SAAAAjphB3YYCAACYmEHdNCwy4A=='
   Elb         : 
   Slave_rply  : 0x0
   Slave_sent  : 0x0
   idx         : 8
Action_Waiters :
State_Waiters  :
#=================
---
key            : 'R0ZTIEoEZ2ZzMQAJTQEAAAAAAAAAAA=='
ExK            : GFS , J, 4, gfs1, 9, M, 1
state          : gio_lck_st_Exclusive
LVBlen         : 28
LVB            : 'AHRyaW4tMTIubGFiLm1zcC5yZWRoYXQuY29tAA=='
Elb            : 
HolderCount    : 1
Holders        : [ trin-10.lab.msp.redhat.com ]
LVBHolderCount : 5
LVBHolders     : [ trin-12.lab.msp.redhat.com, trin-13.lab.msp.redhat.com,
trin-11.lab.msp.redhat.com, trin-08.lab.msp.redhat.com, trin-10.
lab.msp.redhat.com ]
ExpiredCount   : 0
ExpiredHolders : 
reply_waiter   :
Waiters        :
High_Waiters   :
 - key         : 'R0ZTIEoEZ2ZzMQAJTQEAAAAAAAAAAA=='
   ExK         : GFS , J, 4, gfs1, 9, M, 1
   name        : trin-08.lab.msp.redhat.com
   state       : gio_lck_st_Exclusive
   flags       : hasLVB Cachable 
   LVB         :
'AnRyaW4tMTAubGFiLm1zcC5yZWRoYXQuY29tAAAAAAAA4JXdAQAAAIYAAADI/5XdoNYy4AEAAAAcHljdhgIAAA=='
   Elb         : 
   Slave_rply  : 0x0
   Slave_sent  : 0x0
   idx         : 11
 - key         : 'R0ZTIEoEZ2ZzMQAJTQEAAAAAAAAAAA=='
   ExK         : GFS , J, 4, gfs1, 9, M, 1
   name        : trin-11.lab.msp.redhat.com
   state       : gio_lck_st_Exclusive
   flags       : hasLVB Cachable 
   LVB         :
'AnRyaW4tMTAubGFiLm1zcC5yZWRoYXQuY29tAKc7i8GAPIvBBAYy4IA7i8ESAAAAjhhI3YYCAACYGEjdNCwy4A=='
   Elb         : 
   Slave_rply  : 0x0
   Slave_sent  : 0x0
   idx         : 5
Action_Waiters :
State_Waiters  :
#=================
---
key            : 'R0ZTIEoEZ2ZzMQAJTQIAAAAAAAAAAA=='
ExK            : GFS , J, 4, gfs1, 9, M, 2
state          : gio_lck_st_Unlock
LVBlen         : 28
LVB            : 'AXRyaW4tMTEubGFiLm1zcC5yZWRoYXQuY29tAA=='
Elb            : 
HolderCount    : 0
Holders        : 
LVBHolderCount : 5
LVBHolders     : [ trin-12.lab.msp.redhat.com, trin-13.lab.msp.redhat.com,
trin-11.lab.msp.redhat.com, trin-08.lab.msp.redhat.com, trin-10.
lab.msp.redhat.com ]
ExpiredCount   : 0
ExpiredHolders : 
reply_waiter   :
Waiters        :
High_Waiters   :
Action_Waiters :
State_Waiters  :
#=================


Comment 11 michael conrad tadpol tilstra 2005-06-27 19:14:26 UTC
Current thoughts are that the callback handler threads are getting filled with
requests that are waiting on items at the tail of the callback handler
queue.

Currently, reply requests are pushed onto the callback handlers since they need
to do lock reqs to look up the node<->jid mappings.  With enough nodes and
other traffic, things might just be getting overlapped enough that they get stuck.

Just a possibility.
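
The hypothesis above is a classic fixed-size-pool self-deadlock.  Purely as an
illustration -- this is not gulm code, and every name below is hypothetical --
here is a minimal user-space sketch of the pattern: all of the handler threads
end up blocked inside requests that can only complete when a queue entry sitting
*behind* them is processed, so the queue never drains and the pool wedges.

/* Build with: cc -pthread pool_wedge.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NWORKERS 2              /* fixed number of callback-handler threads */

enum task_kind { TASK_WAIT_FOR_REPLY, TASK_DELIVER_REPLY };

struct task {
    enum task_kind kind;
    struct task *next;
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  queue_cv = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  reply_cv = PTHREAD_COND_INITIALIZER;
static struct task *head, *tail;
static int reply_delivered, tasks_done;

static void enqueue(enum task_kind kind)
{
    struct task *t = calloc(1, sizeof(*t));
    t->kind = kind;
    pthread_mutex_lock(&lock);
    if (tail) tail->next = t; else head = t;
    tail = t;
    pthread_cond_signal(&queue_cv);
    pthread_mutex_unlock(&lock);
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (!head)
            pthread_cond_wait(&queue_cv, &lock);
        struct task *t = head;
        head = t->next;
        if (!head)
            tail = NULL;
        if (t->kind == TASK_WAIT_FOR_REPLY) {
            /* The handler blocks until a reply arrives, but the reply can
             * only be delivered by another handler thread picking up a
             * TASK_DELIVER_REPLY entry queued *behind* this one. */
            while (!reply_delivered)
                pthread_cond_wait(&reply_cv, &lock);
        } else {
            reply_delivered = 1;
            pthread_cond_broadcast(&reply_cv);
        }
        tasks_done++;
        pthread_mutex_unlock(&lock);
        free(t);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);

    /* Fill the pool with blocking requests, then queue the one entry that
     * could wake them -- with no worker left to run it. */
    for (int i = 0; i < NWORKERS; i++)
        enqueue(TASK_WAIT_FOR_REPLY);
    enqueue(TASK_DELIVER_REPLY);

    sleep(2);
    pthread_mutex_lock(&lock);
    printf("completed %d of %d queued items -- the pool is wedged\n",
           tasks_done, NWORKERS + 1);
    pthread_mutex_unlock(&lock);
    return 0;                   /* exiting tears down the blocked workers */
}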

Comment 12 Adam "mantis" Manthei 2005-08-03 20:11:57 UTC
A test run with 8 clients, 5 of which were crashing, and a single dedicated SLM
server also demonstrates this bug.

Checking to see if it is in 6.1 (cluster-RHEL4)...

Comment 13 Adam "mantis" Manthei 2005-08-10 19:51:10 UTC
I've been able to verify that this issue is present in the cluster-RHEL4 branch
as well, using a single dedicated SLM server and 5 other machines.  Each
machine was running `fsstress -p 10`.  Three of the nodes were set to crash
randomly; the other 2 remained always on and running.

RPM versions tested:

[root]# rpm -qa | grep -iE "(gfs|gulm|ccs|lvm)" | sort
ccs-1.0.2-0
ccs-devel-1.0.2-0
GFS-6.1.2-0
GFS-kernel-2.6.9-37.0
GFS-kernel-smp-2.6.9-37.0
GFS-kernheaders-2.6.9-37.0
gulm-1.0.4-0
gulm-devel-1.0.4-0
lvm2-2.01.14-1.0.RHEL4
lvm2-cluster-2.01.09-5.0.RHEL4
system-config-lvm-0.9.18-1.0

Comment 14 Adam "mantis" Manthei 2005-09-16 19:47:43 UTC
I believe that I have identified the problem(s) with a fairly high level of
certainty.  There are two issues:

1. There is a race condition in the lock_gulm.o module between
lookup_name_by_jid() and check_for_stale_expires() that results in
lookup_name_by_jid() returning a name with garbage in it.  This results in
drop_expired() being called on a node using the wrong name (leaving its
expired locks in the lock table).
expired locks in the lock table)

2. The second issue is that the xdr_send() function in the lock_gulmd daemon
does not always report errors.  If a node crashes using `reboot -fn`, xdr_send()
will report that it successfully sent a message to that node.  If that message
is a callback, it is possible for the callback to never get reissued and cluster
will hang until either: 
  1) lock_gulmd is restarted or 
  2) `gulm_tool rerunqueues`  is called on the master lock server. 
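
For issue 1, the failure mode is the familiar pattern of handing a caller a
pointer into a shared table entry that a concurrent scan can recycle.  The
sketch below is purely illustrative -- a user-space example with hypothetical,
simplified structures, not the actual lock_gulm code or the attached patch:
the racy variant returns a name that aliases the live table entry, which a
check_for_stale_expires()-style cleanup can clobber mid-read, while the safer
variant copies the name out while the table lock is held.

/* Build with: cc -pthread jid_lookup.c */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define NUM_JIDS 16

struct jid_map {
    char name[64];              /* node currently mapped to this journal id */
    int  in_use;
};

static struct jid_map jid_table[NUM_JIDS];
static pthread_mutex_t jid_lock = PTHREAD_MUTEX_INITIALIZER;

/* Racy pattern: the returned pointer aliases the live table entry, so a
 * concurrent scan that clears or reuses the entry can leave the caller
 * reading a half-overwritten ("garbage") name and expiring the wrong node. */
static const char *lookup_name_by_jid_racy(int jid)
{
    return jid_table[jid].in_use ? jid_table[jid].name : NULL;
}

/* Safer pattern: snapshot the name into a caller-owned buffer under the
 * table lock, so later table updates cannot corrupt what the caller uses. */
static int lookup_name_by_jid_copy(int jid, char *buf, size_t len)
{
    int found = 0;
    pthread_mutex_lock(&jid_lock);
    if (jid >= 0 && jid < NUM_JIDS && jid_table[jid].in_use) {
        strncpy(buf, jid_table[jid].name, len - 1);
        buf[len - 1] = '\0';
        found = 1;
    }
    pthread_mutex_unlock(&jid_lock);
    return found;
}

int main(void)
{
    char buf[64];

    /* Populate one mapping and read it back both ways. */
    strcpy(jid_table[1].name, "trin-12.lab.msp.redhat.com");
    jid_table[1].in_use = 1;

    printf("aliased: %s\n", lookup_name_by_jid_racy(1));
    if (lookup_name_by_jid_copy(1, buf, sizeof(buf)))
        printf("copied : %s\n", buf);
    return 0;
}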

Comment 15 Adam "mantis" Manthei 2005-09-16 19:50:19 UTC
Created attachment 118904 [details]
fixes race in gulm_jid.c

The attached patch fixes a race condition that could result in a garbage string
being passed to the drop_expire() function (see issue 1 of comment #14).

Comment 16 Adam "mantis" Manthei 2005-09-16 19:54:17 UTC
Created attachment 118905 [details]
add callback queue

The attached patch adds a queue for keeping track of sent callbacks.  Every
send_drp_lock() request will now add a copy of the lock request to an internal
queue that is rerun after a configurable amount of time (default is 15
seconds).  While I believe this fixes the problem, I have not been able to test
it adequately enough to feel confident it can be checked in at this time.
I'm allowing my tests to run over the weekend.  If they survive, I will check
it in.
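
A minimal sketch of the general idea -- hypothetical names and structures, not
the attached patch itself: each callback is remembered when it is sent,
forgotten when the peer acknowledges it, and re-issued by a periodic pass if it
has been outstanding longer than the timeout, so a send that was silently lost
(e.g. the peer died mid-"reboot -fn") is eventually retried.

/* Build with: cc callback_queue.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define RESEND_TIMEOUT 15       /* seconds; configurable in the real fix */

struct pending_cb {
    char key[64];               /* identifies the lock request */
    time_t sent_at;
    struct pending_cb *next;
};

static struct pending_cb *pending;

/* Remember a callback at the moment it is sent. */
static void remember_callback(const char *key)
{
    struct pending_cb *p = calloc(1, sizeof(*p));
    strncpy(p->key, key, sizeof(p->key) - 1);
    p->sent_at = time(NULL);
    p->next = pending;
    pending = p;
}

/* Drop the entry once the peer acknowledges the callback. */
static void ack_callback(const char *key)
{
    for (struct pending_cb **pp = &pending; *pp; pp = &(*pp)->next) {
        if (strcmp((*pp)->key, key) == 0) {
            struct pending_cb *dead = *pp;
            *pp = dead->next;
            free(dead);
            return;
        }
    }
}

/* Run periodically: re-send anything still unacknowledged after the timeout. */
static void rerun_stale_callbacks(void (*resend)(const char *key))
{
    time_t now = time(NULL);
    for (struct pending_cb *p = pending; p; p = p->next) {
        if (now - p->sent_at >= RESEND_TIMEOUT) {
            resend(p->key);
            p->sent_at = now;   /* restart the clock for this entry */
        }
    }
}

static void resend_stub(const char *key)
{
    printf("re-sending callback for %s\n", key);
}

int main(void)
{
    remember_callback("gfs1:jid2:drop");
    rerun_stale_callbacks(resend_stub);   /* nothing stale yet */
    pending->sent_at -= RESEND_TIMEOUT;   /* pretend the ack never came */
    rerun_stale_callbacks(resend_stub);   /* now it gets re-issued */
    ack_callback("gfs1:jid2:drop");
    return 0;
}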

Comment 17 Adam "mantis" Manthei 2005-09-20 19:37:57 UTC
Code has been committed to CVS to resolve this issue for RHEL3.  Similar code
still needs to be added to the RHEL4 branch.

Comment 18 Lon Hohberger 2009-12-22 20:39:12 UTC
Cleanup.

This bugzilla should be resolved with the current packages from RHN for Red Hat Global File System for Red Hat Enterprise Linux 3.

GFS-6.0.2.36-9