Bug 160494
| Field | Value |
| --- | --- |
| Summary | cluster hangs during randomized recovery tests -- drop_expired not being called |
| Product | [Retired] Red Hat Cluster Suite |
| Component | gulm |
| Version | 3 |
| Status | CLOSED CURRENTRELEASE |
| Severity | high |
| Priority | medium |
| Hardware | All |
| OS | Linux |
| Reporter | Adam "mantis" Manthei <amanthei> |
| Assignee | Chris Feist <cfeist> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | cfeist, cluster-maint |
| Doc Type | Bug Fix |
| Last Closed | 2009-12-22 20:39:12 UTC |
| Bug Depends On | 153321 |
Description
Adam "mantis" Manthei
2005-06-15 14:05:37 UTC
I think that the critical component to this test is having nodes that can crash while they are in the process of recovering for other nodes. Perhaps more breakpoint recovery testing needs to be done here (i.e. instrument the locking and module code to crash nodes at specific, controlled breakpoints throughout the recovery process). I just hit the lockup on the 4-node (3 crashers) setup with 3 RLM servers. Logs will be posted shortly. Here are some potentially interesting things of note:

From the master:

Jun 15 08:40:58 blade-01 lock_gulmd_core[21419]: Client (trin-11.lab.msp.redhat.com) expired
Jun 15 08:41:20 blade-01 lock_gulmd_core[21419]: Client (trin-12.lab.msp.redhat.com) expired
Jun 15 08:41:28 blade-01 lock_gulmd_core[21419]: Client (trin-13.lab.msp.redhat.com) expired
Jun 15 08:43:28 blade-01 lock_gulmd_core[21419]: Client (trin-12.lab.msp.redhat.com) expired
Jun 15 08:43:43 blade-01 lock_gulmd_core[21419]: Client (trin-11.lab.msp.redhat.com) expired
Jun 15 08:44:05 blade-01 lock_gulmd_core[21419]: Client (trin-13.lab.msp.redhat.com) expired

From the slave:

Jun 15 08:42:03 blade-03 lock_gulmd_LT000[3067]: Dropped expired locks for trin-11.lab.msp.redhat.com
Jun 15 08:42:03 blade-03 lock_gulmd_LT000[3067]: Dropped expired locks for trin-13.lab.msp.redhat.com
Jun 15 08:42:04 blade-03 lock_gulmd_LT000[3067]: Dropped expired locks for trin-12.lab.msp.redhat.com
Jun 15 08:42:06 blade-03 lock_gulmd_LT000[3067]: Dropped expired locks for trin-11.lab.msp.redhat.com
Jun 15 08:42:24 blade-03 lock_gulmd_LT000[3067]: Dropped expired locks for trin-12.lab.msp.redhat.com
Jun 15 08:42:34 blade-03 lock_gulmd_LT000[3067]: Dropped expired locks for trin-12.lab.msp.redhat.com
Jun 15 08:42:36 blade-03 lock_gulmd_LT000[3067]: Dropped expired locks for trin-13.lab.msp.redhat.com
Jun 15 08:42:42 blade-03 lock_gulmd_LT000[3067]: Dropped expired locks for trin-13.lab.msp.redhat.com
Jun 15 08:44:35 blade-03 lock_gulmd_LT000[3067]: Dropped expired locks for trin-12.lab.msp.redhat.com
Jun 15 08:44:35 blade-03 lock_gulmd_LT000[3067]: Dropped expired locks for trin-13.lab.msp.redhat.com

Note 1: I'm not sure why there are no lock_gulmd_LT000 messages on the master for the "Dropped expired" event. Are they not being printed? Not being logged?

Note 2: The core "Client expired" messages do not line up with the LT "Dropped expired" messages. Should they? Are the "Dropped expired" messages not properly ordered?

Here are the console logs for the nodes. Something worth noting is that trin-10 tries to acquire trin-12's journal lock twice immediately following the replay. Seems a bit odd to me.

Jun 15 08:42:20 trin-10 kernel: lock_gulm: Checking for journals for node "trin-12.lab.msp.redhat.com"
Jun 15 08:42:20 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=1: Trying to acquire journal lock...
Jun 15 08:42:20 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=1: Looking at journal...
Jun 15 08:42:23 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=1: Done
Jun 15 08:42:24 trin-11 kernel: lock_gulm: Checking for journals for node "trin-12.lab.msp.redhat.com"
Jun 15 08:42:24 trin-11 kernel: GFS: fsid=monty:gfs1.2: jid=1: Trying to acquire journal lock...
Jun 15 08:42:24 trin-11 kernel: GFS: fsid=monty:gfs1.2: jid=1: Busy
Jun 15 08:42:32 trin-12 kernel: GFS 6.0.2.20 (built May 12 2005 12:28:13) installed
Jun 15 08:42:32 trin-10 kernel: lock_gulm: Checking for journals for node "trin-13.lab.msp.redhat.com"
Jun 15 08:42:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=3: Trying to acquire journal lock...
Jun 15 08:42:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=3: Busy
Jun 15 08:42:34 trin-12 kernel: GFS: fsid=monty:gfs1.1: Joined cluster. Now mounting FS...
Jun 15 08:42:34 trin-12 kernel: GFS: fsid=monty:gfs1.1: jid=1: Trying to acquire journal lock...
Jun 15 08:42:34 trin-12 kernel: GFS: fsid=monty:gfs1.1: jid=1: Looking at journal...
Jun 15 08:42:35 trin-12 kernel: lock_gulm: Checking for journals for node "trin-13.lab.msp.redhat.com"
Jun 15 08:42:35 trin-11 kernel: lock_gulm: Checking for journals for node "trin-13.lab.msp.redhat.com"
Jun 15 08:42:36 trin-11 kernel: GFS: fsid=monty:gfs1.2: jid=3: Trying to acquire journal lock...
Jun 15 08:42:36 trin-11 kernel: GFS: fsid=monty:gfs1.2: jid=3: Looking at journal...
Jun 15 08:42:36 trin-12 kernel: GFS: fsid=monty:gfs1.1: jid=1: Done
Jun 15 08:42:36 trin-12 kernel: GFS: fsid=monty:gfs1.1: Scanning for log elements...
Jun 15 08:42:36 trin-12 kernel: GFS: fsid=monty:gfs1.1: Found 0 unlinked inodes
Jun 15 08:42:36 trin-12 kernel: GFS: fsid=monty:gfs1.1: Found quota changes for 4 IDs
Jun 15 08:42:36 trin-12 kernel: GFS: fsid=monty:gfs1.1: Done
Jun 15 08:42:36 trin-12 kernel: GFS: fsid=monty:gfs1.1: jid=3: Trying to acquire journal lock...
Jun 15 08:42:36 trin-12 kernel: GFS: fsid=monty:gfs1.1: jid=3: Busy
Jun 15 08:42:36 trin-12 gfs: Mounting GFS filesystems: succeeded
Jun 15 08:42:37 trin-11 kernel: GFS: fsid=monty:gfs1.2: jid=3: Done
Jun 15 08:42:40 trin-13 kernel: GFS 6.0.2.20 (built May 12 2005 12:28:13) installed
Jun 15 08:42:42 trin-13 kernel: GFS: fsid=monty:gfs1.3: Joined cluster. Now mounting FS...
Jun 15 08:42:42 trin-13 kernel: GFS: fsid=monty:gfs1.3: jid=3: Trying to acquire journal lock...
Jun 15 08:42:42 trin-13 kernel: GFS: fsid=monty:gfs1.3: jid=3: Looking at journal...
Jun 15 08:42:43 trin-13 kernel: GFS: fsid=monty:gfs1.3: jid=3: Done
Jun 15 08:42:43 trin-13 kernel: GFS: fsid=monty:gfs1.3: Scanning for log elements...
Jun 15 08:42:43 trin-13 kernel: GFS: fsid=monty:gfs1.3: Found 0 unlinked inodes
Jun 15 08:42:43 trin-13 kernel: GFS: fsid=monty:gfs1.3: Found quota changes for 0 IDs
Jun 15 08:42:43 trin-13 kernel: GFS: fsid=monty:gfs1.3: Done
Jun 15 08:42:44 trin-13 gfs: Mounting GFS filesystems: succeeded
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=1: Trying to acquire journal lock...
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=1: Looking at journal...
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=1: Acquiring the transaction lock...
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=1: Replaying journal...
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=1: Replayed 0 of 9 blocks
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=1: replays = 0, skips = 0, sames = 9
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=1: Journal replayed in 1s
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=1: Done
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=3: Trying to acquire journal lock...
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=3: Looking at journal...
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=3: Acquiring the transaction lock...
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=3: Replaying journal...
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=3: Replayed 0 of 0 blocks
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=3: replays = 0, skips = 0, sames = 0
Jun 15 08:44:31 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=3: Journal replayed in 1s
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=3: Done
Jun 15 08:44:31 trin-10 kernel: lock_gulm: Checking for journals for node "trin-12.lab.msp.redhat.com"
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=2: Trying to acquire journal lock...
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=2: Looking at journal...
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=2: Acquiring the transaction lock...
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=2: Replaying journal...
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=2: Replayed 3 of 5 blocks
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=2: replays = 3, skips = 0, sames = 2
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=2: Journal replayed in 1s
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=2: Done
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=2: Trying to acquire journal lock...
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=2: Busy
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=2: Trying to acquire journal lock...
Jun 15 08:44:32 trin-10 kernel: GFS: fsid=monty:gfs1.0: jid=2: Busy
Jun 15 08:44:38 trin-12 kernel: GFS 6.0.2.20 (built May 12 2005 12:28:13) installed
Jun 15 08:44:40 trin-10 kernel: lock_gulm: Checking for journals for node "trin-11.lab.msp.redhat.com"
Jun 15 08:44:43 trin-12 kernel: lock_gulm: Checking for journals for node "trin-11.lab.msp.redhat.com"
Jun 15 08:44:47 trin-11 kernel: GFS 6.0.2.20 (built May 12 2005 12:28:13) installed
Jun 15 08:45:16 trin-10 kernel: lock_gulm: Checking for journals for node "trin-13.lab.msp.redhat.com"
Jun 15 08:45:20 trin-11 kernel: lock_gulm: Checking for journals for node "trin-13.lab.msp.redhat.com"
Jun 15 08:45:20 trin-12 kernel: lock_gulm: Checking for journals for node "trin-13.lab.msp.redhat.com"
Jun 15 08:45:24 trin-13 kernel: GFS 6.0.2.20 (built May 12 2005 12:28:13) installed

Created attachment 115493 [details]
logs of test run for comment #3

In the test run in comment #3, the node that wasn't crashing was rebooted when the cluster got hung. When this node was marked as expired, the cluster was able to resume activity again.

Is the master server bouncing during this? Are the slaves?

In comment #3, note 2: drop expired is called *after* a journal is finished being replayed. So: node expires, node is fenced, another node replays the journal, expired locks for the replayed journal are dropped. So yes, they don't line up and shouldn't.

Based on the fel.out from the dumps in attachment #115493 [details], there is a deadlock in the jid mapping scan. Scans are grabbed in order from 0 up; whichever node holds jid map 0 should be the only node holding any maps >0. The fel dump has trin-10 holding jid map 2 and trin-12 holding 0 and 1. This shouldn't happen. Bleh. This might be the problem; it needs to be fixed in any case.
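For reference, the ordering rule described in the previous paragraph can be checked mechanically against a lock dump. The sketch below only illustrates the rule and is not gulm code; the holders[] table is a hypothetical stand-in for the per-jid-map holders one would read out of a fel.rb dump.

```c
/*
 * Illustrative check of the jid-map ordering rule: whichever node holds
 * jid map 0 should be the only node holding any map > 0.  The holders[]
 * table is a hypothetical stand-in for data parsed from a fel.rb dump;
 * it is not a gulm API.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* jid map index -> holder ("" means unheld), as seen in the dump */
    const char *holders[] = {
        "trin-12.lab.msp.redhat.com",  /* map 0 */
        "trin-12.lab.msp.redhat.com",  /* map 1 */
        "trin-10.lab.msp.redhat.com",  /* map 2 -- violates the rule */
    };
    const int nmaps = sizeof(holders) / sizeof(holders[0]);
    const char *owner = holders[0];

    for (int i = 1; i < nmaps; i++) {
        if (holders[i][0] != '\0' && strcmp(holders[i], owner) != 0)
            printf("violation: map %d held by %s but map 0 held by %s\n",
                   i, holders[i], owner);
    }
    return 0;
}
```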
*sigh* Except, you know, there are a couple of cases where that could happen and it shouldn't be a problem, so comment #7 is probably ignorable.

There are no LT servers (master or slaves) being crashed in trying to reproduce this bug.

I got the bug to happen again. I grabbed a lock dump and ran it through fel.rb. The first three journal locks are listed below. Both trin-11 and trin-10 were holding journal locks exclusive, so I force expired node trin-10 and the cluster resumed immediately. This might be able to be incorporated into a script that could detect a hung cluster affected by this bug and execute the appropriate forceexpire to unhang the cluster as a workaround.

#=================
---
key : 'R0ZTIEoEZ2ZzMQAJTQAAAAAAAAAAAA=='
ExK : GFS , J, 4, gfs1, 9, M, 0
state : gio_lck_st_Exclusive
LVBlen : 28
LVB : 'AnRyaW4tMTAubGFiLm1zcC5yZWRoYXQuY29tAA=='
Elb :
HolderCount : 1
Holders : [ trin-11.lab.msp.redhat.com ]
LVBHolderCount : 5
LVBHolders : [ trin-12.lab.msp.redhat.com, trin-13.lab.msp.redhat.com, trin-11.lab.msp.redhat.com, trin-08.lab.msp.redhat.com, trin-10.lab.msp.redhat.com ]
ExpiredCount : 0
ExpiredHolders :
reply_waiter :
Waiters :
High_Waiters :
- key : 'R0ZTIEoEZ2ZzMQAJTQAAAAAAAAAAAA=='
  ExK : GFS , J, 4, gfs1, 9, M, 0
  name : trin-10.lab.msp.redhat.com
  state : gio_lck_st_Exclusive
  flags : hasLVB Cachable
  LVB : 'AGAh4MAqEMIDAAAAACYh4NwzGdEAAAAABwAAAAAAAABsoh7gnGSa0gAAAAAAYDbfAQAAANwzGdHcMxnRAAAAAA=='
  Elb :
  Slave_rply : 0x0
  Slave_sent : 0x0
  idx : 7
- key : 'R0ZTIEoEZ2ZzMQAJTQAAAAAAAAAAAA=='
  ExK : GFS , J, 4, gfs1, 9, M, 0
  name : trin-13.lab.msp.redhat.com
  state : gio_lck_st_Exclusive
  flags : hasLVB Cachable
  LVB : 'AAAAAAAAAAAAAAAAAAAAAAAAAACAqnPfkJhd3aeqc9+ArHPfBAYy4ICqc98SAAAAjphd3YYCAACYmF3dNCwy4A=='
  Elb :
  Slave_rply : 0x0
  Slave_sent : 0x0
  idx : 6
- key : 'R0ZTIEoEZ2ZzMQAJTQAAAAAAAAAAAA=='
  ExK : GFS , J, 4, gfs1, 9, M, 0
  name : trin-12.lab.msp.redhat.com
  state : gio_lck_st_Exclusive
  flags : hasLVB Cachable
  LVB : 'AAAAAAAAAAAAAAAAAAAAAAAAAACAVVPfkJhB3adVU9+AWFPfBAYy4IBVU98SAAAAjphB3YYCAACYmEHdNCwy4A=='
  Elb :
  Slave_rply : 0x0
  Slave_sent : 0x0
  idx : 8
Action_Waiters :
State_Waiters :
#=================
---
key : 'R0ZTIEoEZ2ZzMQAJTQEAAAAAAAAAAA=='
ExK : GFS , J, 4, gfs1, 9, M, 1
state : gio_lck_st_Exclusive
LVBlen : 28
LVB : 'AHRyaW4tMTIubGFiLm1zcC5yZWRoYXQuY29tAA=='
Elb :
HolderCount : 1
Holders : [ trin-10.lab.msp.redhat.com ]
LVBHolderCount : 5
LVBHolders : [ trin-12.lab.msp.redhat.com, trin-13.lab.msp.redhat.com, trin-11.lab.msp.redhat.com, trin-08.lab.msp.redhat.com, trin-10.lab.msp.redhat.com ]
ExpiredCount : 0
ExpiredHolders :
reply_waiter :
Waiters :
High_Waiters :
- key : 'R0ZTIEoEZ2ZzMQAJTQEAAAAAAAAAAA=='
  ExK : GFS , J, 4, gfs1, 9, M, 1
  name : trin-08.lab.msp.redhat.com
  state : gio_lck_st_Exclusive
  flags : hasLVB Cachable
  LVB : 'AnRyaW4tMTAubGFiLm1zcC5yZWRoYXQuY29tAAAAAAAA4JXdAQAAAIYAAADI/5XdoNYy4AEAAAAcHljdhgIAAA=='
  Elb :
  Slave_rply : 0x0
  Slave_sent : 0x0
  idx : 11
- key : 'R0ZTIEoEZ2ZzMQAJTQEAAAAAAAAAAA=='
  ExK : GFS , J, 4, gfs1, 9, M, 1
  name : trin-11.lab.msp.redhat.com
  state : gio_lck_st_Exclusive
  flags : hasLVB Cachable
  LVB : 'AnRyaW4tMTAubGFiLm1zcC5yZWRoYXQuY29tAKc7i8GAPIvBBAYy4IA7i8ESAAAAjhhI3YYCAACYGEjdNCwy4A=='
  Elb :
  Slave_rply : 0x0
  Slave_sent : 0x0
  idx : 5
Action_Waiters :
State_Waiters :
#=================
---
key : 'R0ZTIEoEZ2ZzMQAJTQIAAAAAAAAAAA=='
ExK : GFS , J, 4, gfs1, 9, M, 2
state : gio_lck_st_Unlock
LVBlen : 28
LVB : 'AXRyaW4tMTEubGFiLm1zcC5yZWRoYXQuY29tAA=='
Elb :
HolderCount : 0
Holders :
LVBHolderCount : 5
LVBHolders : [ trin-12.lab.msp.redhat.com, trin-13.lab.msp.redhat.com, trin-11.lab.msp.redhat.com, trin-08.lab.msp.redhat.com, trin-10.lab.msp.redhat.com ]
ExpiredCount : 0
ExpiredHolders :
reply_waiter :
Waiters :
High_Waiters :
Action_Waiters :
State_Waiters :
#=================

Current thoughts are that the callback handler threads are getting filled with processes that are waiting on items at the tail of the callback handler queue. Currently, reply requests are pushed onto the callback handlers since they need to do lock requests to look up the node<->jid mappings. With enough nodes and other traffic, things might just be getting overlapped enough that they get stuck. Just a possibility.

A test run with 8 clients, 5 of which were crashing, and a single dedicated SLM server also demonstrates this bug.

Checking to see if it is in 6.1 (cluster-RHEL4)...

I've been able to verify that this issue is present in the cluster-RHEL4 branch as well, using a single dedicated SLM server and 5 other machines. Each machine was running `fsstress -p 10`. Three of the nodes were set to crash randomly; the other 2 remained always on and running.

RPM versions tested:

[root]# rpm -qa | grep -iE "(gfs|gulm|ccs|lvm)" | sort
ccs-1.0.2-0
ccs-devel-1.0.2-0
GFS-6.1.2-0
GFS-kernel-2.6.9-37.0
GFS-kernel-smp-2.6.9-37.0
GFS-kernheaders-2.6.9-37.0
gulm-1.0.4-0
gulm-devel-1.0.4-0
lvm2-2.01.14-1.0.RHEL4
lvm2-cluster-2.01.09-5.0.RHEL4
system-config-lvm-0.9.18-1.0

I believe that I have identified the problem(s) with a fairly high level of certainty. There are two issues:

1. There is a race condition in the lock_gulm.o module between lookup_name_by_jid() and check_for_stale_expires() that results in lookup_name_by_jid() returning a name with garbage in it. This results in drop_expired() being called on a node using the wrong name (leaving its expired locks in the lock table).

2. The xdr_send() function in the lock_gulmd daemon does not always report errors. If a node crashes using `reboot -fn`, xdr_send() will report that it successfully sent a message to that node. If that message is a callback, it is possible for the callback to never get reissued, and the cluster will hang until either: 1) lock_gulmd is restarted, or 2) `gulm_tool rerunqueues` is called on the master lock server.

Created attachment 118904 [details]
fixes race in gulm_jid.c

The attached patch fixes a race condition that could result in a garbage string being passed to the drop_expire() function (see issue 1 of comment #14).

Created attachment 118905 [details]
add callback queue
The attached patch adds a queue for keeping track of sent callbacks. Every
send_drp_lock() request will now add a copy of the lock request to an internal
queue that is rerun after a configurable amount of time (default is 15
seconds). While I believe this fixes the problem, I have not been able to test
it adequately enough that I feel confident it can be checked in at this time.
I'm allowing my tests to run over the weekend. If they survive, I will check
it in.
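To make the mechanism concrete, here is a minimal sketch of a callback re-send queue of the kind described above, assuming hypothetical names (cb_entry, cb_queue_add, cb_queue_ack, cb_queue_rerun, resend_callback). It is not the attached patch, just an illustration of the design: remember every sent callback, drop it when it is answered, and re-issue anything still outstanding after the configured window (default 15 seconds).

```c
/*
 * Minimal sketch (not the actual patch) of a callback re-send queue.
 * All names here are hypothetical; resend_callback() stands in for
 * whatever the daemon's real send path is.
 */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

struct cb_entry {
    struct cb_entry *next;
    time_t           sent_at;     /* when the callback was last sent */
    char             node[256];   /* node the callback was sent to   */
    char             key[64];     /* copy of the lock key            */
};

static struct cb_entry *cb_head;
static pthread_mutex_t  cb_lock = PTHREAD_MUTEX_INITIALIZER;
static int              cb_resend_secs = 15;   /* configurable timeout */

/* hypothetical hook into the existing send path */
extern void resend_callback(const char *node, const char *key);

/* record a callback at the moment it is sent */
void cb_queue_add(const char *node, const char *key)
{
    struct cb_entry *e = calloc(1, sizeof(*e));
    if (!e)
        return;
    strncpy(e->node, node, sizeof(e->node) - 1);
    strncpy(e->key, key, sizeof(e->key) - 1);
    e->sent_at = time(NULL);

    pthread_mutex_lock(&cb_lock);
    e->next = cb_head;
    cb_head = e;
    pthread_mutex_unlock(&cb_lock);
}

/* drop the entry once the holder actually answers the callback */
void cb_queue_ack(const char *node, const char *key)
{
    struct cb_entry **pp, *e;

    pthread_mutex_lock(&cb_lock);
    for (pp = &cb_head; (e = *pp) != NULL; pp = &e->next) {
        if (strcmp(e->node, node) == 0 && strcmp(e->key, key) == 0) {
            *pp = e->next;
            free(e);
            break;
        }
    }
    pthread_mutex_unlock(&cb_lock);
}

/* worker thread: re-send anything that has been outstanding too long */
void *cb_queue_rerun(void *unused)
{
    struct cb_entry *e;

    (void)unused;
    for (;;) {
        sleep(1);
        pthread_mutex_lock(&cb_lock);
        for (e = cb_head; e; e = e->next) {
            if (time(NULL) - e->sent_at >= cb_resend_secs) {
                resend_callback(e->node, e->key);
                e->sent_at = time(NULL);
            }
        }
        pthread_mutex_unlock(&cb_lock);
    }
    return NULL;
}
```

A structure like this would also cover the xdr_send() failure mode from comment #14: a callback "sent" into a dead connection simply stays on the queue and is re-issued later, rather than being lost.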
Code has been committed to CVS to resolve this issue for RHEL3. Similar code still needs to be added to the RHEL4 branch.

Cleanup. This bugzilla should be resolved with the current packages from RHN for Red Hat Global File System for Red Hat Enterprise Linux 3: GFS-6.0.2.36-9