Bug 139738 - cman leaves cluster while gfs is in use
cman leaves cluster while gfs is in use
Status: CLOSED WORKSFORME
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: cman (Show other bugs)
4
i686 Linux
medium Severity medium
: ---
: ---
Assigned To: Christine Caulfield
Cluster QE
:
: 128420 142844 (view as bug list)
Depends On:
Blocks: 144795
  Show dependency treegraph
 
Reported: 2004-11-17 14:55 EST by Corey Marthaler
Modified: 2009-12-22 15:40 EST (History)
4 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-12-22 15:40:47 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Corey Marthaler 2004-11-17 14:55:26 EST
Description of problem:
This happened on morph-04 (with I/O running) when morph-01 and
morph-02 were coming back up and rejoining the cluster after having
been both shot.

Both morph-04 and morph-05 appeared to have hit bz133512 as they had
messages saying: CMAN: we are leaving the cluster. Reason is 5 after
morph-01 and morph-02 joined again.



Nov 17 13:47:36 morph-04 kernel: 2228 qc 2,667a067 3,5 id c005c sts 0 0
Nov 17 13:47:36 morph-04 kernel: 2300 un 2,667a00a 30389 3 0
Nov 17 13:47:36 morph-04 kernel:
Nov 17 13:47:36 morph-04 kernel: lock_dlm:  Assertion failed on line
342 of file /usr/src/c
luster/gfs-kernel/src/dlm/lock.c
Nov 17 13:47:36 morph-04 kernel: lock_dlm:  assertion:  "!error"
Nov 17 13:47:36 morph-04 kernel: lock_dlm:  time = 241736
Nov 17 13:47:36 morph-04 kernel: corey0: error=-22 num=2,667a00a
Nov 17 13:47:36 morph-04 kernel:
Nov 17 13:47:36 morph-04 kernel: ------------[ cut here ]------------
Nov 17 13:47:36 morph-04 kernel: kernel BUG at
/usr/src/cluster/gfs-kernel/src/dlm/lock.c:3
42!
Nov 17 13:47:36 morph-04 kernel: invalid operand: 0000 [#1]
Nov 17 13:47:36 morph-04 kernel: SMP
Nov 17 13:47:36 morph-04 kernel: Modules linked in: lock_gulm
lock_nolock gfs lock_dlm dlm
cman lock_harness lpfc ipv6 parport_pc lp parport autofs4 sunrpc e1000
microcode dm_mod uhc
i_hcd ehci_hcd button battery ac ext3 jbd qla2300 qla2xxx
scsi_transport_fc sd_mod scsi_mod
Nov 17 13:47:36 morph-04 kernel: CPU:    1
Nov 17 13:47:36 morph-04 kernel: EIP:    0060:[<f89f0a95>]    Not
tainted VLI
Nov 17 13:47:36 morph-04 kernel: EFLAGS: 00010282   (2.6.9)
Nov 17 13:47:36 morph-04 kernel: EIP is at do_dlm_unlock+0xf5/0x110
[lock_dlm]
Nov 17 13:47:36 morph-04 kernel: eax: 00000001   ebx: ffffffea   ecx:
f747ff00   edx: 00000292
Nov 17 13:47:36 morph-04 kernel: esi: f3f9d580   edi: f3f91b70   ebp:
f8aac000   esp: f747fefc
Nov 17 13:47:36 morph-04 kernel: ds: 007b   es: 007b   ss: 0068
Nov 17 13:47:36 morph-04 kernel: Process gfs_glockd (pid: 2300,
threadinfo=f747e000 task=c23681b0)
Nov 17 13:47:36 morph-04 kernel: Stack: f89f6106 f5819880 ffffffea
00000002 0667a00a 00000000 00000000 00000001
Nov 17 13:47:36 morph-04 kernel:        f3f91b8c f89f0e0f f8a312d3
f3f91b70 f8a5509c f8a61f60 f4b2dd40 f3f91b70
Nov 17 13:47:36 morph-04 kernel:        f8a61f60 f3f91b8c f8a306b5
f3f91b70 00000000 f3f91c00 f3f91b70 f8a307c9
Nov 17 13:47:36 morph-04 kernel: Call Trace:
Nov 17 13:47:36 morph-04 kernel:  [<f89f0e0f>] lm_dlm_unlock+0xf/0x20
[lock_dlm]
Nov 17 13:47:36 morph-04 kernel:  [<f8a312d3>]
gfs_glock_drop_th+0x83/0x160 [gfs]
Nov 17 13:47:36 morph-04 kernel:  [<f8a5509c>] gmalloc+0x1c/0x50 [gfs]
Nov 17 13:47:36 morph-04 kernel:  [<f8a306b5>] rq_demote+0xb5/0xd0 [gfs]
Nov 17 13:47:36 morph-04 kernel:  [<f8a307c9>] run_queue+0x99/0xd0 [gfs]
Nov 17 13:47:36 morph-04 kernel:  [<f8a30909>]
unlock_on_glock+0x29/0x40 [gfs]
Nov 17 13:47:36 morph-04 kernel:  [<f8a32ddb>]
gfs_reclaim_glock+0x13b/0x190 [gfs]
Nov 17 13:47:36 morph-04 kernel:  [<f8a24357>] gfs_glockd+0x107/0x120
[gfs]
Nov 17 13:47:36 morph-04 kernel:  [<c011f2d0>]
default_wake_function+0x0/0x10
Nov 17 13:47:36 morph-04 kernel:  [<c0105ec2>] ret_from_fork+0x6/0x14
Nov 17 13:47:36 morph-04 kernel:  [<c011f2d0>]
default_wake_function+0x0/0x10
Nov 17 13:47:36 morph-04 kernel:  [<f8a24250>] gfs_glockd+0x0/0x120 [gfs]
Nov 17 13:47:36 morph-04 kernel:  [<c01042b5>]
kernel_thread_helper+0x5/0x10
Nov 17 13:47:36 morph-04 kernel: Code: 46 0c 89 54 24 14 89 44 24 0c
8b 06 8b 40 18 c7 04 24 57 61 9f f8 89 44 24 04 e8 07 25 73 c7 c7 04
24 06 61 9f f8 e8 fb 24 73 c7 <0f> 0b 56 01 48 59 9f f8 c7 04 24 74 59
9f f8 e8 e7 1c 73 c7 8d
Nov 17 13:47:49 morph-04 kernel:  11         66770c6"
Nov 17 13:47:49 morph-04 kernel: corey1 (2429) un c0042 40000 0 0 "  
    7         66970ef"


Also both consoles had these messages streaming non stop:
 [<c0293239>] __kfree_skb+0xa9/0x140
 [<c010c4d0>] timer_interrupt+0xb0/0x120
 [<f894410e>] e1000_clean_tx_irq+0x14e/0x220 [e1000]
 [<f8943fa0>] e1000_clean+0xa0/0xc0 [e1000]
 [<c011bc40>] do_page_fault+0x0/0x54c
 [<c0106a19>] error_code+0x2d/0x38
 [<c0120e1c>] mm_release+0x3c/0xc0
 [<c012ac85>] del_timer_sync+0x85/0xc0
 [<c0124e71>] do_exit+0x81/0x400
 [<c01071f0>] do_divide_error+0x0/0x120
 [<c011bc40>] do_page_fault+0x0/0x54c
 [<c011bf34>] do_page_fault+0x2f4/0x54c
 [<f8944670>] e1000_alloc_rx_buffers+0x40/0xf0 [e1000]
 [<f8944374>] e1000_clean_rx_irq+0x194/0x450 [e1000]


Version-Release number of selected component (if applicable):
Lock_DLM (built Nov 16 2004 10:49:44) installed


How reproducible:
Didn't try
Comment 1 David Teigland 2004-11-17 22:39:45 EST
Where did "CMAN: we are leaving the cluster" appear in relation to
the logs above?  If it was before, then there's no problem with dlm
locking and the panic is not surprising.

Also, the other streaming errors make me suspicious that something
else had gone wrong and was the root cause of the lock_dlm assert
rather than there being a bug in the locking logic.

I see two lines from the dlm log dump at the very end.  Was there
more?  In this assert, when dlm_unlock returns -EINVAL, it also logs
a more specific error that should appear in the log dump.  I suspect
the error may have been "lockspace not found" which would be
consistent with the cluster suddenly shutting down from underneath
it (and consistent with bug 133512).
Comment 2 Corey Marthaler 2004-11-18 16:34:16 EST
The "CMAN: we are leaving the cluster" messages were before the 
assertion trip. Also if there were other lines at the very end of 
the log dump then they were cut off by the restart of the system 
because the next lines I see are just that: 
 
Nov 17 13:47:36 morph-04 kernel:  [<f8a24250>] gfs_glockd+0x0/0x120 
[gfs] 
Nov 17 13:47:36 morph-04 kernel:  [<c01042b5>] 
kernel_thread_helper+0x5/0x10 
Nov 17 13:47:36 morph-04 kernel: Code: 46 0c 89 54 24 14 89 44 24 0c 
8b 06 8b 40 18 c7 04 24 57 61 9f f8 89 44 24 04 e8 07 25 73 c7 c7 04 
24 06 61 9f f8 e8 fb 24 73 c7 <0f> 0b 56 01 48 59 9f f8 c7 04 24 74 
59 9f f8 e8 e7 1c 73 c7 8d 
Nov 17 13:47:49 morph-04 kernel:  11         66770c6" 
Nov 17 13:47:49 morph-04 kernel: corey1 (2429) un c0042 40000 0 0 "       
7         66970ef" 
Nov 17 14:31:31 morph-04 syslogd 1.4.1: restart. 
Nov 17 14:31:31 morph-04 syslog: syslogd startup succeeded 
Nov 17 14:31:31 morph-04 kernel: klogd 1.4.1, log source = 
/proc/kmsg started. 
 
Comment 3 David Teigland 2004-11-18 23:07:22 EST
OK, I'm quite certain this is not a dlm/locking bug.  It's
a side effect of the earlier bug which is the node suddenly
leaving the cluster while the dlm is running.  We should, of
course, find more graceful ways of failing than panicking,
but that's another project.
Comment 4 David Teigland 2005-01-04 22:30:42 EST
Cman deciding to leave the cluster while gfs/dlm are in use is
probably due to some lower root cause.  Patrick may be able to
classify this better.
Comment 5 Christine Caulfield 2005-01-05 09:34:07 EST
*** Bug 128420 has been marked as a duplicate of this bug. ***
Comment 6 David Teigland 2005-01-24 23:26:47 EST
*** Bug 142844 has been marked as a duplicate of this bug. ***
Comment 7 Christine Caulfield 2005-01-25 11:39:43 EST
Latest checkin sets the cman processes to run as realtime so they get
all the CPU they need (which isn't much). 

Seems to fix it for me, let me know how you get on ;-)
Comment 8 Corey Marthaler 2005-02-04 17:05:55 EST
I seemed to have hit a similar case where cman leaves the cluster
while gfs is in use and all hell breaks loose as a result. This was on
a 5 node cluster after 2 nodes (morph-01 and morph-05) had been shot
and then brought back up and a cman join attempt was made on both.


Feb  4 15:24:54 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3:
jid=0: Trying to acquire journal lock...
Feb  4 15:24:54 morph-04 kernel: GFS: fsid=morph-cluster:gfs3.3:
jid=0: Busy
Feb  4 15:24:54 morph-04 kernel: GFS: fsid=morph-cluster:gfs2.3:
jid=0: Busy
Feb  4 15:24:54 morph-04 kernel: GFS: fsid=morph-cluster:gfs1.3:
jid=0: Busy
Feb  4 15:24:54 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3:
jid=0: Busy
Feb  4 15:27:23 morph-04 kernel: CMAN: node morph-05 rejoining
Feb  4 15:27:47 morph-04 kernel: CMAN: node morph-04 has been removed
from the cluster : No response to messag
es
Feb  4 15:27:47 morph-04 kernel: CMAN: killed by NODEDOWN message
Feb  4 15:27:47 morph-04 kernel: CMAN: we are leaving the cluster.
Feb  4 15:27:47 morph-04 kernel: dlm: gfs2: remote_stage error -105 280371
Feb  4 15:27:47 morph-04 kernel: dlm: gfs2: remote_stage error -105 1f0292
Feb  4 15:27:47 morph-04 kernel: dlm: gfs3: remote_stage error -105 2b010f
Feb  4 15:27:47 morph-04 kernel: dlm: gfs1: remote_stage error -105 e02ba
Feb  4 15:27:47 morph-04 kernel: dlm: gfs1: remote_stage error -105 f025f
Feb  4 15:27:47 morph-04 kernel: dlm: gfs3: remote_stage error -105 2d033c
Feb  4 15:27:47 morph-04 kernel: dlm: gfs0: remote_stage error -105 1601df
Feb  4 15:27:49 morph-04 kernel: 123) rq 5 1d0022 "       7        
4daa00a"
Feb  4 15:27:49 morph-04 kernel: gfs0 (7123) cv 0 30282 "      11    
    4daa00a"
Feb  4 15:27:49 morph-04 kernel: gfs3 (7140) cv 5 10214 "      11    
    4daa006"
Feb  4 15:27:49 morph-04 kernel: gfs3 (7140) rq 5 3701dc "       7   
     4daa006"
Feb  4 15:27:49 morph-04 kernel: gfs3 (7140) cv 0 10214 "      11    
    4daa006"
Feb  4 15:27:49 morph-04 kernel: gfs3 (7140) cv 5 10214 "      11    
    4daa006"
Feb  4 15:27:49 morph-04 kernel: gfs3 (7140) un 3701dc 80000 0 0 "   
   7         4daa006"
Feb  4 15:27:49 morph-04 kernel: gfs3 (7140) cv 0 10214 "      11    
    4daa006"
Feb  4 15:27:49 morph-04 kernel: gfs3 (7140) cv 5 10214 "      11    
    4daa006"
Feb  4 15:27:49 morph-04 kernel: gfs3 (7140) rq 5 2e03f3 "       7   
     4daa006"
Feb  4 15:27:49 morph-04 kernel: gfs3 (7140) cv 0 10214 "      11    
    4daa006"
Feb  4 15:27:49 morph-04 kernel: gfs3 (7140) cv 5 10214 "      11    
    4daa006"
Feb  4 15:27:49 morph-04 kernel: gfs3 (7140) un 2e03f3 80000 0 0 "   
   7         4daa006"
Feb  4 15:27:49 morph-04 ccsd[5810]: Cluster manager shutdown. 
Attemping to reconnect...
Feb  4 15:27:49 morph-04 kernel: gfs3 (7140) cv 0 10214 "      11    
    4daa006"
Feb  4 15:27:49 morph-04 kernel: gfs0 (7123) cv 5 30282 "      11    
    4daa00a"
Feb  4 15:27:49 morph-04 kernel: gfs0 (7123) un 1d0022 80000 0 0 "   
   7         4daa00a"
.
.
[snip]
.
.
Feb  4 15:27:55 morph-04 kernel: 6511 qc 7,4daa00a -1,5 id 1a03c0 sts 0 0
Feb  4 15:27:55 morph-04 kernel: 6511 qc 11,4daa00a 5,0 id 30282 sts 0 0
Feb  4 15:27:55 morph-04 kernel: 7123 ex plock 0
Feb  4 15:27:55 morph-04 kernel: 7140 en plock 7,4daa006
Feb  4 15:27:55 morph-04 kernel: 7140 lk 11,4daa006 id 10214 0,5 4
Feb  4 15:27:55 morph-04 kernel:
Feb  4 15:27:55 morph-04 kernel: lock_dlm:  Assertion failed on line
410 of file
/usr/src/build/514108-i686/BUILD/gfs-kernel-2.6.9-16/src/dlm/lock.c
Feb  4 15:27:55 morph-04 kernel: lock_dlm:  assertion:  "!error"
Feb  4 15:27:55 morph-04 kernel: lock_dlm:  time = 670979
Feb  4 15:27:55 morph-04 kernel: gfs3: num=11,4daa006 err=-22 cur=0
req=5 lkf=4
Feb  4 15:27:55 morph-04 kernel:
Feb  4 15:27:55 morph-04 kernel: ------------[ cut here ]------------
Feb  4 15:27:55 morph-04 kernel: kernel BUG at
/usr/src/build/514108-i686/BUILD/gfs-kernel-2.6.9-16/src/dlm/lock.c:410!
Feb  4 15:27:55 morph-04 kernel: invalid operand: 0000 [#1]
Feb  4 15:27:55 morph-04 kernel: Modules linked in: gnbd(U)
lock_nolock(U) gfs(U) lock_dlm(U) dlm(U) cman(U) lock_harness(U) md5
ipv6 parport_pc lp parport autofs4 sunrpc uhci_hcd hw_random e1000
floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx
scsi_transport_fc sd_mod scsi_mod
Feb  4 15:27:55 morph-04 kernel: CPU:    0
Feb  4 15:27:55 morph-04 kernel: EIP:    0060:[<f8972ed9>]    Tainted:
GF     VLI
Feb  4 15:27:55 morph-04 kernel: EFLAGS: 00010246   (2.6.9-5.EL)
Feb  4 15:27:55 morph-04 kernel: EIP is at do_dlm_lock+0x14e/0x168
[lock_dlm]
Feb  4 15:27:55 morph-04 kernel: eax: 00000001   ebx: ffffffea   ecx:
f89785da   edx: f4abde18
Feb  4 15:27:56 morph-04 kernel: esi: f8972ef8   edi: f5127600   ebp:
f1651580   esp: f4abde14
Feb  4 15:27:56 morph-04 kernel: ds: 007b   es: 007b   ss: 0068
Feb  4 15:27:56 morph-04 kernel: Process doio (pid: 7140,
threadinfo=f4abd000 task=f167acd0)
Feb  4 15:27:56 morph-04 kernel: Stack: f89785da 20202020 31312020
20202020 20202020 61643420 36303061 00000018
Feb  4 15:27:56 morph-04 kernel:        f16515c0 f16515c0 f1651580
f4abde50 00000000 f8972f4d f16515bc 1d244b3c
Feb  4 15:27:56 morph-04 kernel:        00000000 0000000a f8978635
00000000 00000000 f1651580 00000001 f4abde88
Feb  4 15:27:56 morph-04 kernel: Call Trace:
Feb  4 15:27:56 morph-04 kernel:  [<f8972f4d>]
do_dlm_lock_sync+0x50/0x63 [lock_dlm]
Feb  4 15:27:56 morph-04 kernel:  [<f8975196>] lock_resource+0x6e/0x91
[lock_dlm]
Feb  4 15:27:56 morph-04 kernel:  [<f8976b01>]
lm_dlm_plock+0x13f/0x248 [lock_dlm]
Feb  4 15:27:56 morph-04 kernel:  [<f8b76606>] gfs_lm_plock+0x33/0x42
[gfs]
Feb  4 15:27:56 morph-04 kernel:  [<f8b8291d>] gfs_lock+0xe3/0xf1 [gfs]
Feb  4 15:27:56 morph-04 kernel:  [<f8b8283a>] gfs_lock+0x0/0xf1 [gfs]
Feb  4 15:27:56 morph-04 kernel:  [<c0179a32>] fcntl_setlk+0x156/0x291
Feb  4 15:27:56 morph-04 kernel:  [<f88800a7>]
qla2x00_done+0x403/0x420 [qla2xxx]
Feb  4 15:27:56 morph-04 kernel:  [<c0175714>] do_fcntl+0x194/0x1dd
Feb  4 15:27:56 morph-04 kernel:  [<c0175823>] sys_fcntl64+0x6c/0x7d
Feb  4 15:27:56 morph-04 kernel:  [<c0301bfb>] syscall_call+0x7/0xb
Feb  4 15:27:56 morph-04 kernel:  [<c030007b>]
interruptible_sleep_on_timeout+0x91/0x1da
Feb  4 15:27:56 morph-04 kernel: Code: 26 50 0f bf 45 24 50 53 ff 75
08 ff 75 04 ff 75 0c ff 77 18 68 35 87 97 f8 e8 69 d1 7a c7 83 c4 38
68 da 85 97 f8 e8 5c d1 7a c7 <0f> 0b 9a 01 6c 84 97 f8 68 dc 85 97 f8
e8 a8 c5 7a c7 83 c4 20
Feb  4 15:27:56 morph-04 kernel:  SM: 00000001 sm_stop: SG still joined
Feb  4 15:27:56 morph-04 kernel: SM: 01000003 sm_stop: SG still joined
Feb  4 15:27:56 morph-04 kernel: SM: 02000005 sm_stop: SG still joined
Feb  4 15:27:56 morph-04 kernel: 123) rq 5 1d0022 "       7        
4daa00a"
Feb  4 15:27:56 morph-04 kernel: gfs0 (7123) cv 0 30282 "      11    
    4daa00a"
Feb  4 15:27:56 morph-04 kernel: gfs3 (7140) cv 5 10214 "      11    
    4daa006"
Feb  4 15:27:56 morph-04 kernel: gfs3 (7140) rq 5 3701dc "       7   
     4daa006"
Feb  4 15:27:56 morph-04 kernel: gfs3 (7140) cv 0 10214 "      11    
    4daa006"
Comment 9 Corey Marthaler 2005-02-04 17:07:07 EST
marking this as failsqa unless you think this is a different bug.
Comment 10 Christine Caulfield 2005-02-08 10:14:02 EST
Internally this is a different cause but as the eternal symptoms are
the same I'll keep it in this bug.

The "all hell breaks loose" is seperate 
Comment 11 Christine Caulfield 2005-02-09 08:49:11 EST
I'm working a bit on hunches here but it looks like it could be the
DLM blocking cman while it does a /lot/ of recovery

grep dlm /proc/slabinfo|cut -b1-40

shows useful info for this sort of thing.

This mod should help greatly.
Checking in lowcomms.c;
/cvs/cluster/cluster/dlm-kernel/src/lowcomms.c,v  <--  lowcomms.c
new revision: 1.28; previous revision: 1.27
done
Checking in lowcomms.c;
/cvs/cluster/cluster/dlm-kernel/src/lowcomms.c,v  <--  lowcomms.c
new revision: 1.22.2.5; previous revision: 1.22.2.4
done

I've set it NEEDINFO rather than MODIFIED because I can't seem to
reproduce it. That might just be because I can't get enough locks up
on my 256M machines or it might be something else. 

I'm sure you'll keep me informed.
Comment 12 Christine Caulfield 2005-02-15 12:45:01 EST
If/when this happens again, please try to include the following from
the remaining nodes (if at all possible)

/proc/slabinfo
/proc/loadavg
/proc/meminfo
/proc/buddyinfo
/proc/vmstat
/proc/cluster/dlm_debug
/proc/cluster/dlm_locks (of all lockspaces if poss)

any possibly interesting syslog errors preceding the node leaving. The
stuff afterwards is not really germane to this bug.

Thanks.
Comment 13 David Teigland 2005-02-22 10:34:30 EST
Seems likely that a dlm thread is hogging the cpu during
some recoveries causing this.  Testing to see if additional
schedule's in dlm recovery will fix this.
Comment 14 Corey Marthaler 2005-02-22 12:21:21 EST
per Dave's comments, I'm adding the comments from Derek and myself
from bz 128318 to this bug and moving from NEEDINFO to REASSIGNED as
we are both seeing this.

comment #5:
I think this may be the same bug.  Was running a four node cluster
with link-08 running 'while :; do placemaker -d 7 -w 3; find; rm -rf
place_root; done'.  link-10 was running a mount/umount loop, and
link-12 was looping on bonnie++.  After a couple of hours link-08 hit
the assertion.  I will put full logfiles in ~danderso/bugs/128318.

Note: This is a mixed-arch cluster.  link-10,link-11,link12 are i686
and link-08 is an x86_64 opteron.

Note: The placemaker tool is in the sistina-test tree if wanted/needed.

lock_dlm:  Assertion failed on line 352 of file
/usr/src/build/522379-x86_64/BUILD/smp/src/dlm/lock.c
lock_dlm:  assertion:  "!error"
lock_dlm:  time = 4305526027
data1: error=-22 num=2,825c37

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:352
invalid operand: 0000 [1] SMP
CPU 1
Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) parport_pc lp
parport autofs4 dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket
pcmcia_core button battery ac ohci_hcd hw_random tg3 floppy
dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx
scsi_transport_fc mptscsih mptbase sd_mod scsi_mod
Pid: 3380, comm: gfs_glockd Tainted: G   M  2.6.9-5.ELsmp
RIP: 0010:[<ffffffffa0268804>]
<ffffffffa0268804>{:lock_dlm:do_dlm_unlock+189}
RSP: 0018:000001001e9f5de8  EFLAGS: 00010212
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 0000000100000000
RDX: ffffffff803c7508 RSI: 0000000000000246 RDI: ffffffff803c7500
RBP: 000001001cbb6dc0 R08: ffffffff803c7508 R09: 00000000ffffffea
R10: 0000000000000097 R11: 0000000000000097 R12: 000001001c637c9c
R13: ffffff000016a000 R14: ffffffffa0264d20 R15: 000001001c637c70
FS:  0000002a95563b00(0000) GS:ffffffff804bf380(0000)
knlGS:00000000f7ff06c0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a95557000 CR3: 000000001ffb2000 CR4: 00000000000006e0
Process gfs_glockd (pid: 3380, threadinfo 000001001e9f4000, task
000001001e507030)
Stack: 0000000000000000 ffffff000016a000 000001001c637c70 ffffffffa0268b7e
       0000000000000001 ffffffffa023051f 0000000000000001 ffffffffa022709c
       000001001fdd4500 000001001c637c70
Call Trace:<ffffffffa0268b7e>{:lock_dlm:lm_dlm_unlock+15}
<ffffffffa023051f>{:gfs:gfs_lm_unlock+41}
       <ffffffffa022709c>{:gfs:gfs_glock_drop_th+290}
<ffffffffa0225845>{:gfs:run_queue+314}
       <ffffffffa0225a9a>{:gfs:unlock_on_glock+37}
<ffffffffa0225b90>{:gfs:gfs_reclaim_glock+234}
       <ffffffffa021a61a>{:gfs:gfs_glockd+61}
<ffffffff8013176a>{default_wake_function+0}
       <ffffffff8013176a>{default_wake_function+0}
<ffffffff80110c23>{child_rip+8}
       <ffffffffa021a5dd>{:gfs:gfs_glockd+0}
<ffffffff80110c1b>{child_rip+0}


Code: 0f 0b 13 cf 26 a0 ff ff ff ff 60 01 48 c7 c7 18 cf 26 a0 31
RIP <ffffffffa0268804>{:lock_dlm:do_dlm_unlock+189} RSP <000001001e9f5de8>
 <0>Kernel panic - not syncing: Oops


comment #7:

 
I've seen this now too, here's everything that I could grab above the
assert:

dlm: dlm_unlock: lkid 50264 lockspace not found
ror -105 1a0019
gfs0 remote_stage error -105 1e0298
gfs1 remote_stage error -105 160189
gfs0 remote_stage error -105 25004d
gfs1 remote_stage error -105 1801e2
gfs5 remote_stage error -105 1900db
gfs3 remote_stage error -105 170275
gfs4 remote_stage error -105 1801ac
gfs4 remote_stage error -105 160310
gfs9 remote_stage error -105 1d01dd
gfs7 remote_stage error -105 1f00cb
gfs9 remote_stage error -105 140104
gfs0 remote_stage error -105 1f018f
gfs5 remote_stage error -105 1202a1
gfs3 remote_stage error -105 1703d5
gfs3 remote_stage error -105 1800c6
gfs5 remote_stage error -105 180295
gfs2 remote_stage error -105 170117
gfs6 remote_stage error -105 130274
gfs4 remote_stage error -105 1603cf
gfs7 remote_stage error -105 1503a0
gfs9 remote_stage error -105 160071
gfs1 remote_stage error -105 1c0216
gfs2 remote_stage error -105 1b00f7
gfs7 remote_stage error -105 140136
gfs2 remote_stage error -105 2403d5
gfs5 remote_stage error -105 1902e1
gfs3 remote_stage error -105 1401ce
gfs4 remote_stage error -105 13025d
3be sts 0 0
9293 ex punlock 0
9293 en plock 7,37
9293 lk 11,37 id 603be 0,5 4
7131 qc 11,37 0,5 id 603be sts 0 0
9293 req 7,37 ex 2ec187-2ed3d9 lkf 2000 wait 1
9293 lk 7,37 id 0 -1,5 2000
9293 lk 11,37 id 603be 5,0 4
7131 qc 7,37 -1,5 id 1f0358 sts 0 0
7131 qc 11,37 5,0 id 603be sts 0 0
9293 ex plock 0
9293 en punlock 7,37
9293 lk 11,37 id 603be 0,5 4
7131 qc 11,37 0,5 id 603be sts 0 0
9293 remove 7,37
9293 un 7,37 1f0358 5 0
7131 qc 7,37 5,5 id 1f0358 sts -65538 0
9293 lk 11,37 id 603be 5,0 4
7131 qc 11,37 5,0 id 603be sts 0 0
9293 ex punlock 0
9293 en plock 7,37
9293 lk 11,37 id 603be 0,5 4
7131 qc 11,37 0,5 id 603be sts 0 0
9293 req 7,37 ex 2ed3d9-2ed7dd lkf 2000 wait 1
9293 lk 7,37 id 0 -1,5 2000
9293 lk 11,37 id 603be 5,0 4
7131 qc 7,37 -1,5 id 2403a4 sts 0 0
7131 qc 11,37 5,0 id 603be sts 0 0
9293 ex plock 0
9293 en punlock 7,37
9293 lk 11,37 id 603be 0,5 4
7131 qc 11,37 0,5 id 603be sts 0 0
9293 remove 7,37
9293 un 7,37 2403a4 5 0
7131 qc 7,37 5,5 id 2403a4 sts -65538 0
9293 lk 11,37 id 603be 5,0 4
7131 qc 11,37 5,0 id 603be sts 0 0
9293 ex punlock 0
9293 en plock 7,37
9293 lk 11,37 id 603be 0,5 4
7131 qc 11,37 0,5 id 603be sts 0 0
9293 req 7,37 ex 2ed7de-2eddd4 lkf 2000 wait 1
9293 lk 7,37 id 0 -1,5 2000
9293 lk 11,37 id 603be 5,0 4
7131 qc 7,37 -1,5 id 1c018a sts 0 0
7131 qc 11,37 5,0 id 603be sts 0 0
9293 ex plock 0
9293 en punlock 7,37
9293 lk 11,37 id 603be 0,5 4
7131 qc 11,37 0,5 id 603be sts 0 0
9293 remove 7,37
9293 un 7,37 1c018a 5 0
7131 qc 7,37 5,5 id 1c018a sts -65538 0
9293 lk 11,37 id 603be 5,0 4
7131 qc 11,37 5,0 id 603be sts 0 0
9293 ex punlock 0
9293 en plock 7,37
9293 lk 11,37 id 603be 0,5 4
7131 qc 11,37 0,5 id 603be sts 0 0
9293 req 7,37 ex 2eddd5-2edffc lkf 2000 wait 1
9293 lk 7,37 id 0 -1,5 2000
9293 lk 11,37 id 603be 5,0 4
7131 qc 7,37 -1,5 id 250113 sts 0 0
7131 qc 11,37 5,0 id 603be sts 0 0
9293 ex plock 0
9293 en punlock 7,37
9293 lk 11,37 id 603be 0,5 4
7131 qc 11,37 0,5 id 603be sts 0 0
9293 remove 7,37
9293 un 7,37 250113 5 0
7131 qc 7,37 5,5 id 250113 sts -65538 0
9293 lk 11,37 id 603be 5,0 4
7131 qc 11,37 5,0 id 603be sts 0 0
9293 ex punlock 0
9293 en plock 7,37
9293 lk 11,37 id 603be 0,5 4
7131 qc 11,37 0,5 id 603be sts 0 0
9293 req 7,37 ex e12a9-26b17f lkf 2000 wait 1
9293 lk 7,37 id 0 -1,5 2000
9293 lk 11,37 id 603be 5,0 4
7131 qc 7,37 -1,5 id 200044 sts 0 0
7131 qc 11,37 5,0 id 603be sts 0 0
9293 ex plock 0
9334 en punlock 7,2d
9334 lk 11,2d id 201d9 0,5 4
7131 qc 11,2d 0,5 id 201d9 sts 0 0
9334 remove 7,2d
9334 un 7,2d 160147 5 0
7131 qc 7,2d 5,5 id 160147 sts -65538 0
9334 lk 11,2d id 201d9 5,0 4
7131 qc 11,2d 5,0 id 201d9 sts 0 0
9334 ex punlock 0
9334 en plock 7,2d
9334 lk 11,2d id 201d9 0,5 4
7131 qc 11,2d 0,5 id 201d9 sts 0 0
9334 req 7,2d ex 0-64da lkf 2000 wait 1
9334 lk 7,2d id 0 -1,5 2000
9334 lk 11,2d id 201d9 5,0 4
7131 qc 7,2d -1,5 id 14038e sts 0 0
7131 qc 11,2d 5,0 id 201d9 sts 0 0
9334 ex plock 0
9293 en punlock 7,37
9293 lk 11,37 id 603be 0,5 4
7131 qc 11,37 0,5 id 603be sts 0 0
9293 remove 7,37
9293 un 7,37 200044 5 0
7131 qc 7,37 5,5 id 200044 sts -65538 0
9293 lk 11,37 id 603be 5,0 4
7131 qc 11,37 5,0 id 603be sts 0 0
9293 ex punlock 0
9293 en plock 7,37
9293 lk 11,37 id 603be 0,5 4
7131 qc 11,37 0,5 id 603be sts 0 0
9293 req 7,37 ex 2ecc8b-2ed161 lkf 2000 wait 1
9293 lk 7,37 id 0 -1,5 2000
9293 lk 11,37 id 603be 5,0 4
7131 qc 7,37 -1,5 id 1e0271 sts 0 0
7131 qc 11,37 5,0 id 603be sts 0 0
9293 ex plock 0
9293 en punlock 7,37
9293 lk 11,37 id 603be 0,5 4
7131 qc 11,37 0,5 id 603be sts 0 0
9293 remove 7,37
9293 un 7,37 1e0271 5 0
7131 qc 7,37 5,5 id 1e0271 sts -65538 0
9293 lk 11,37 id 603be 5,0 4
7131 qc 11,37 5,0 id 603be sts 0 0
9293 ex punlock 0
9334 en punlock 7,2d
9334 lk 11,2d id 201d9 0,5 4
9336 en punlock 7,1ffe9
9336 lk 11,1ffe9 id 2006e 0,5 4
8494 un 2,20242 e012f 5 0
8506 un 8,0 100263 5 8
9339 lk 2,20282 id 0 -1,5 0
8429 lk 8,0 id 0 -1,5 8
9315 en punlock 7,10081
9315 lk 11,10081 id 10045 0,5 4
8186 un 2,4ff55 c03b5 5 0
8109 un 2,204a7 12028a 5 0
7964 un 2,400aa a03e0 5 0
8263 un 2,20144 c03c7 5 0
7887 un 2,100a9 200e0 5 0
8340 un 2,20281 50264 5 0

lock_dlm:  Assertion failed on line 352 of file
/usr/src/build/522381-i686/BUILD/gfs-kernel-2.6.9-23/src/dlm/lock.c
lock_dlm:  assertion:  "!error"
lock_dlm:  time = 1744300
gfs6: error=-22 num=2,20281

------------[ cut here ]------------
kernel BUG at
/usr/src/build/522381-i686/BUILD/gfs-kernel-2.6.9-23/src/dlm/lock.c:352!
invalid operand: 0000 [#13]
Modules linked in: gnbd(U) lock_nolock(U) gfs(U) lock_dlm(U) dlm(U)
cman(U) lock_harness(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc
button battery ac uhci_hcd hw_random e1000 floppy dm_snapshot dm_zero
dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod
scsi_mod
CPU:    0
EIP:    0060:[<f8962ce9>]    Not tainted VLI
EFLAGS: 00010246   (2.6.9-5.EL)
EIP is at do_dlm_unlock+0xa2/0xb7 [lock_dlm]
eax: 00000001   ebx: ffffffea   ecx: f896854a   edx: c4599f44
esi: f5e4b680   edi: f5e4b680   ebp: f8bc9000   esp: c4599f40
ds: 007b   es: 007b   ss: 0068
Process gfs_glockd (pid: 8340, threadinfo=c4599000 task=c3ffd320)
Stack: f896854a f8bc9000 00000001 f8962ff4 f8b964c0 c67cbd7c f8bc9000
f8bc6640
       f8b89447 c67cbd7c f8bc6640 c4599fb4 f8b87ef9 c67cbd7c 00000001
f8b880b6
       c67cbd7c c67cbd7c f8b8836f c67cbe20 f8b8b9b4 c4599000 c4599fc0
f8b7bbf2
Call Trace:
 [<f8962ff4>] lm_dlm_unlock+0xe/0x16 [lock_dlm]
 [<f8b964c0>] gfs_lm_unlock+0x2b/0x40 [gfs]
 [<f8b89447>] gfs_glock_drop_th+0x17a/0x1b0 [gfs]
 [<f8b87ef9>] rq_demote+0x15c/0x1da [gfs]
 [<f8b880b6>] run_queue+0x5a/0xc1 [gfs]
 [<f8b8836f>] unlock_on_glock+0x6e/0xc8 [gfs]
 [<f8b8b9b4>] gfs_reclaim_glock+0x257/0x2ae [gfs]
 [<f8b7bbf2>] gfs_glockd+0x38/0xde [gfs]
 [<c011b9ea>] default_wake_function+0x0/0xc
 [<c0301b1a>] ret_from_fork+0x6/0x14
 [<c011b9ea>] default_wake_function+0x0/0xc
 [<f8b7bbba>] gfs_glockd+0x0/0xde [gfs]
 [<c01041d9>] kernel_thread_helper+0x5/0xb
Code: e8 72 d3 7b c7 ff 76 08 8b 06 ff 76 04 ff 76 0c 53 ff 70 18 68
6a 86 96 f8 e8 59 d3 7b c7 83 c4 2c 68 4a 85 96 f8 e8 4c d3 7b c7 <0f>
0b 60 01 dc 83 96 f8 68 4c 85 96 f8 e8 98 c7 7b c7 5b 5e c3
Comment 15 Christine Caulfield 2005-02-28 10:00:51 EST
Dave spotted this one. the sequence counter eventually wrapped to
zero, whence the ACK got ignored.

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.9; previous revision: 1.42.2.8
done
Comment 16 David Teigland 2005-03-17 11:01:48 EST
From Corey:
reproduced this on morph-03 last night after 56 iterations of revolver:

From the SYSLOG:
Mar 16 14:15:08 morph-03 kernel: CMAN: node morph-04.lab.msp.redhat.com rejoining
Mar 16 14:15:15 morph-03 kernel: CMAN: node morph-03.lab.msp.redhat.com has been
removed from the cluster : No response to messages
Mar 16 14:15:15 morph-03 kernel: CMAN: killed by NODEDOWN message
Mar 16 14:15:15 morph-03 kernel: CMAN: we are leaving the cluster. No response
to messages

Corey, could you check quick to make sure the cman kernel module isn't
an old one?
Comment 17 David Teigland 2005-03-17 11:10:54 EST
Could you also please check the version of the cman_tool binary you
were using?  A recent cman_tool fix setting socket priorities could
also easily be related to this.
Comment 18 Corey Marthaler 2005-03-17 12:12:24 EST
I was using the rpms built on the 14th:

cman-1.0-0.pre29.0.i386.rpm
cman-kernheaders-2.6.9-28.0.i686.rpm
cman-kernel-hugemem-2.6.9-28.0.i686.rpm

My machines were just reimaged for 6.0 testing so I'm unable to find out what
version that is.
Comment 19 Corey Marthaler 2005-03-29 11:55:44 EST
There must be another cause here as well.

I had I/O running most of the night on a 6 node cluster and tank-06 ended up
kicking out tank-05, and then tank-01 did the fencing.

tank-06:
Mar 29 08:26:23 tank-06 kernel: CMAN: removing node tank-05.lab.msp.redhat.com
from the cluster : Missed too many heartbeats
Mar 29 08:26:29 tank-06 fenced[5563]: fencing deferred to tank-01.lab.msp.redhat.com
Mar 29 08:27:00 tank-06 kernel: GFS: fsid=tank-cluster:gfs1.5: jid=4: Trying to
acquire journal lock...
Mar 29 08:27:00 tank-06 kernel: GFS: fsid=tank-cluster:gfs0.5: jid=4: Trying to
acquire journal lock...
Mar 29 08:27:00 tank-06 kernel: GFS: fsid=tank-cluster:gfs1.5: jid=4: Busy
Mar 29 08:27:00 tank-06 kernel: GFS: fsid=tank-cluster:gfs0.5: jid=4: Busy


tank-01:
Mar 29 08:27:04 tank-01 kernel: CMAN: removing node tank-05.lab.msp.redhat.com
from the cluster : Shutdown
Mar 29 08:27:29 tank-01 fenced[5846]: tank-05.lab.msp.redhat.com not a cluster
member after 20 secpost_fail_delay
Mar 29 08:27:29 tank-01 fenced[5846]: fencing node "tank-05.lab.msp.redhat.com"
Mar 29 08:27:33 tank-01 kernel: lpfc 0000:02:01.0: 0:0263 Cannot block scsi
target. target ptr xf78ea6e0
Mar 29 08:27:34 tank-01 fenced[5846]: fence "tank-05.lab.msp.redhat.com" success
Mar 29 08:27:41 tank-01 kernel: GFS: fsid=tank-cluster:gfs1.0: jid=4: Trying to
acquire journal lock...
Mar 29 08:27:41 tank-01 kernel: GFS: fsid=tank-cluster:gfs0.0: jid=4: Trying to
acquire journal lock...
Mar 29 08:27:41 tank-01 kernel: GFS: fsid=tank-cluster:gfs0.0: jid=4: Busy
Mar 29 08:27:41 tank-01 kernel: GFS: fsid=tank-cluster:gfs1.0: jid=4: Busy


tank-05:
Mar 29 08:25:48 tank-05 kernel: CMAN: Being told to leave the cluster by node 4
Mar 29 08:25:48 tank-05 kernel: CMAN: we are leaving the cluster.
Mar 29 08:25:48 tank-05 ccsd[5655]: Cluster manager shutdown.  Attemping to
reconnect...
Mar 29 08:25:48 tank-05 kernel: SM: 00000001 sm_stop: SG still joined
Mar 29 08:25:48 tank-05 kernel: SM: 01000003 sm_stop: SG still joined
Mar 29 08:25:48 tank-05 kernel: SM: 02000005 sm_stop: SG still joined
Mar 29 08:25:49 tank-05 kernel: dlm: dlm_unlock: lkid 5250061 lockspace not found
Mar 29 08:25:49 tank-05 kernel: resend 3a013e lq 1 flg 200000 node -1/-1 "       7
Mar 29 08:25:49 tank-05 kernel: gfs0 resent 5 requests
Mar 29 08:25:49 tank-05 kernel: gfs0 recover event 27 finished
Mar 29 08:25:49 tank-05 kernel: clvmd mark waiting requests
Mar 29 08:25:49 tank-05 kernel: clvmd marked 0 requests
Mar 29 08:25:49 tank-05 kernel: clvmd purge locks of departed nodes
Mar 29 08:25:49 tank-05 kernel: clvmd purged 0 locks
Mar 29 08:25:49 tank-05 kernel: clvmd update remastered resources
Mar 29 08:25:49 tank-05 kernel: clvmd updated 0 resources
Mar 29 08:25:49 tank-05 kernel: clvmd rebuild locks
Mar 29 08:25:49 tank-05 kernel: clvmd rebuilt 0 locks
Mar 29 08:25:49 tank-05 kernel: clvmd recover event 27 done
Mar 29 08:25:49 tank-05 kernel: clvmd move flags 0,0,1 ids 22,27,27
Mar 29 08:25:49 tank-05 kernel: clvmd process held requests
Mar 29 08:25:49 tank-05 kernel: clvmd processed 0 requests
Mar 29 08:25:49 tank-05 kernel: clvmd resend marked requests
Mar 29 08:25:49 tank-05 kernel: clvmd resent 0 requests
Mar 29 08:25:49 tank-05 kernel: clvmd recover event 27 finished
Mar 29 08:25:49 tank-05 kernel: gfs1 mark waiting requests
Mar 29 08:25:49 tank-05 kernel: gfs1 mark 2c0152 lq 1 nodeid -1
Mar 29 08:25:49 tank-05 kernel: gfs1 marked 1 requests
Mar 29 08:25:49 tank-05 kernel: gfs1 purge locks of departed nodes
Mar 29 08:25:49 tank-05 kernel: gfs1 purged 1 locks
Mar 29 08:25:49 tank-05 kernel: gfs1 update remastered resources
Mar 29 08:25:49 tank-05 kernel: gfs1 updated 2 resources
Mar 29 08:25:49 tank-05 kernel: gfs1 rebuild locks
Mar 29 08:25:49 tank-05 kernel: gfs1 rebuilt 2 locks
Mar 29 08:25:49 tank-05 kernel: gfs1 recover event 27 done
Mar 29 08:25:49 tank-05 kernel: gfs1 move flags 0,0,1 ids 25,27,27
Mar 29 08:25:49 tank-05 kernel: gfs1 process held requests
Mar 29 08:25:49 tank-05 kernel: gfs1 processed 0 requests
Mar 29 08:25:49 tank-05 kernel: gfs1 resend marked requests
Mar 29 08:25:49 tank-05 kernel: gfs1 resend 2c0152 lq 1 flg 200000 node -1/-1 "
      7
Mar 29 08:25:49 tank-05 kernel: gfs1 resent 1 requests
Mar 29 08:25:49 tank-05 kernel: gfs1 recover event 27 finished
Mar 29 08:25:49 tank-05 kernel: emove 7,7307ded
Mar 29 08:25:49 tank-05 kernel: 6540 un 7,7307ded 65802f7 5 0
Mar 29 08:25:49 tank-05 kernel: 6163 qc 7,7307ded 5,5 id 65802f7 sts -65538 0
Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4
Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6540 ex punlock 0
Mar 29 08:25:49 tank-05 kernel: 6540 en plock 7,7307ded
Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4
Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 0,5 id 20365 sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6540 req 7,7307ded ex 16cad4-16d0f7 lkf 2000 wait 1
Mar 29 08:25:49 tank-05 kernel: 6540 lk 7,7307ded id 0 -1,5 2000
Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4
Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6163 qc 7,7307ded -1,5 id 65102fe sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6540 ex plock 0
Mar 29 08:25:49 tank-05 kernel: 6540 en punlock 7,7307ded
Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4
Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 0,5 id 20365 sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6540 remove 7,7307ded
Mar 29 08:25:49 tank-05 kernel: 6540 un 7,7307ded 65102fe 5 0
Mar 29 08:25:49 tank-05 kernel: 6163 qc 7,7307ded 5,5 id 65102fe sts -65538 0
Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4
Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6540 ex punlock 0
Mar 29 08:25:49 tank-05 kernel: 6540 en plock 7,7307ded
Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4
Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 0,5 id 20365 sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6540 req 7,7307ded ex 16d0f7-16f243 lkf 2000 wait 1
Mar 29 08:25:49 tank-05 kernel: 6540 lk 7,7307ded id 0 -1,5 2000
Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4
Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6163 qc 7,7307ded -1,5 id 69902a0 sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6540 ex plock 0
Mar 29 08:25:49 tank-05 kernel: 6540 en punlock 7,7307ded
Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4
Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 0,5 id 20365 sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6540 remove 7,7307ded
Mar 29 08:25:49 tank-05 kernel: 6540 un 7,7307ded 69902a0 5 0
Mar 29 08:25:49 tank-05 kernel: 6163 qc 7,7307ded 5,5 id 69902a0 sts -65538 0
Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4
Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6540 ex punlock 0
Mar 29 08:25:49 tank-05 kernel: 6540 en plock 7,7307ded
Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4
Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 0,5 id 20365 sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6540 req 7,7307ded ex 1747e3-174a23 lkf 2000 wait 1
Mar 29 08:25:49 tank-05 kernel: 6540 lk 7,7307ded id 0 -1,5 2000
Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4
Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6163 qc 7,7307ded -1,5 id 63b02ec sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6540 ex plock 0
Mar 29 08:25:49 tank-05 kernel: 6540 en punlock 7,7307ded
Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4
Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 0,5 id 20365 sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6540 remove 7,7307ded
Mar 29 08:25:49 tank-05 kernel: 6540 un 7,7307ded 63b02ec 5 0
Mar 29 08:25:49 tank-05 kernel: 6163 qc 7,7307ded 5,5 id 63b02ec sts -65538 0
Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4
Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6540 ex punlock 0
Mar 29 08:25:49 tank-05 kernel: 6540 en plock 7,7307ded
Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4
Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 0,5 id 20365 sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6540 req 7,7307ded ex 1761a5-176d9d lkf 2000 wait 1
Mar 29 08:25:49 tank-05 kernel: 6540 lk 7,7307ded id 0 -1,5 2000
Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4
Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6163 qc 7,7307ded -1,5 id 61401d2 sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6540 ex plock 0
Mar 29 08:25:49 tank-05 kernel: 6540 en punlock 7,7307ded
Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4
Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 0,5 id 20365 sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6540 remove 7,7307ded
Mar 29 08:25:49 tank-05 kernel: 6540 un 7,7307ded 61401d2 5 0
Mar 29 08:25:49 tank-05 kernel: 6163 qc 7,7307ded 5,5 id 61401d2 sts -65538 0
Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4
Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6540 ex punlock 0
Mar 29 08:25:49 tank-05 kernel: 6540 en plock 7,7307ded
Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4
Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 0,5 id 20365 sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6540 req 7,7307ded ex 0-156acf lkf 2000 wait 1
Mar 29 08:25:49 tank-05 kernel: 6540 lk 7,7307ded id 0 -1,5 2000
Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4
Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6163 qc 7,7307ded -1,5 id 69200eb sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6540 ex plock 0
Mar 29 08:25:49 tank-05 kernel: 6540 en punlock 7,7307ded
Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4
Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 0,5 id 20365 sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6540 remove 7,7307ded
Mar 29 08:25:49 tank-05 kernel: 6540 un 7,7307ded 69200eb 5 0
Mar 29 08:25:49 tank-05 kernel: 6163 qc 7,7307ded 5,5 id 69200eb sts -65538 0
Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4
Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0
Mar 29 08:25:49 tank-05 kernel: 6540 ex punlock 0
Mar 29 08:25:50 tank-05 kernel: 6540 en plock 7,7307ded
Mar 29 08:25:50 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4
Mar 29 08:25:50 tank-05 kernel: 6163 qc 11,7307ded 0,5 id 20365 sts 0 0
Mar 29 08:25:50 tank-05 kernel: 6540 req 7,7307ded ex 16a777-1761a7 lkf 2000 wait 1
Mar 29 08:25:50 tank-05 kernel: 6540 lk 7,7307ded id 0 -1,5 2000
Mar 29 08:25:50 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4
Mar 29 08:25:50 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0
Mar 29 08:25:50 tank-05 kernel: 6163 qc 7,7307ded -1,5 id 64e0201 sts 0 0
Mar 29 08:25:50 tank-05 kernel: 6540 ex plock 0
Mar 29 08:25:50 tank-05 kernel: 6540 en punlock 7,7307ded
Mar 29 08:25:50 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4
Mar 29 08:25:50 tank-05 kernel: 6163 qc 11,7307ded 0,5 id 20365 sts 0 0
Mar 29 08:25:50 tank-05 kernel: 6540 remove 7,7307ded
Mar 29 08:25:50 tank-05 kernel: 6540 un 7,7307ded 64e0201 5 0
Mar 29 08:25:50 tank-05 kernel: 6163 qc 7,7307ded 5,5 id 64e0201 sts -65538 0
Mar 29 08:25:50 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4
Mar 29 08:25:50 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0
Mar 29 08:25:50 tank-05 kernel: 6540 ex punlock 0
Mar 29 08:25:50 tank-05 kernel: 6540 en plock 7,7307ded
Mar 29 08:25:50 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4
Mar 29 08:25:50 tank-05 kernel: 6163 qc 11,7307ded 0,5 id 20365 sts 0 0
Mar 29 08:25:50 tank-05 kernel: 6540 req 7,7307ded ex 1761a7-176494 lkf 2000 wait 1
Mar 29 08:25:50 tank-05 kernel: 6540 lk 7,7307ded id 0 -1,5 2000
Mar 29 08:25:50 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4
Mar 29 08:25:50 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0
Mar 29 08:25:50 tank-05 kernel: 6306 un 5,9282610 5250061 3 0
Mar 29 08:25:50 tank-05 kernel:
Mar 29 08:25:50 tank-05 kernel: lock_dlm:  Assertion failed on line 353 of file
/usr/src/build/547226-i686/BUILD/hugemem/src/dlm/lock.c



[root@tank-01 ~]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    6   M   tank-01.lab.msp.redhat.com
   2    1    6   M   tank-04.lab.msp.redhat.com
   3    1    6   M   tank-02.lab.msp.redhat.com
   4    1    6   M   tank-06.lab.msp.redhat.com
   5    1    6   M   tank-03.lab.msp.redhat.com
   6    1    6   X   tank-05.lab.msp.redhat.com



Comment 20 Corey Marthaler 2005-03-30 10:48:40 EST
hit this again last night on tank-06 after 25 iterations of revolver:

Mar 29 19:51:04 tank-06 kernel: CMAN: node tank-01.lab.msp.redhat.com has been
removed from the cluster : Missed too many heartbeats
Mar 29 19:51:12 tank-06 kernel: CMAN: node tank-02.lab.msp.redhat.com has been
removed from the cluster : No response to messages
Mar 29 19:51:13 tank-06 fenced[5567]: fencing deferred to tank-03.lab.msp.redhat.com
Mar 29 19:51:21 tank-06 sshd(pam_unix)[5811]: session opened for user root by
(uid=0)
Mar 29 19:51:22 tank-06 sshd(pam_unix)[5830]: session opened for user root by
(uid=0)
Mar 29 19:52:07 tank-06 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=5: Trying to
acquire journal lock...
Mar 29 19:52:07 tank-06 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=5: Looking at
journal...
Mar 29 19:52:08 tank-06 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=5: Acquiring
the transaction lock...
Mar 29 19:52:09 tank-06 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=5: Replaying
journal...
Mar 29 19:52:09 tank-06 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=5: Replayed
172 of 187 blocks
Mar 29 19:52:09 tank-06 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=5: replays =
172, skips = 0, sames = 15
Mar 29 19:52:09 tank-06 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=5: Journal
replayed in 1s
Mar 29 19:52:09 tank-06 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=5: Done
Mar 29 19:52:09 tank-06 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=3: Trying to
acquire journal lock...
Mar 29 19:52:09 tank-06 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=3: Busy
Mar 29 19:54:55 tank-06 kernel: CMAN: node tank-01.lab.msp.redhat.com rejoining
Mar 29 19:55:00 tank-06 kernel: CMAN: node tank-02.lab.msp.redhat.com rejoining
Mar 29 19:55:04 tank-06 kernel: CMAN: node tank-06.lab.msp.redhat.com has been
removed from the cluster : No response to messages
Mar 29 19:55:04 tank-06 kernel: CMAN: killed by NODEDOWN message
Mar 29 19:55:04 tank-06 kernel: CMAN: we are leaving the cluster. No response to
messages
Mar 29 19:55:05 tank-06 kernel: dlm: gfs0: remote_stage error -105 3303be
Mar 29 19:55:06 tank-06 kernel: dlm: dlm_unlock: lkid 2c03ea lockspace not found
Mar 29 19:55:06 tank-06 kernel: 1 nodeid -1
Mar 29 19:55:06 tank-06 kernel: gfs0 marked 3 requests
Mar 29 19:55:06 tank-06 kernel: gfs0 purge locks of departed nodes
Mar 29 19:55:06 tank-06 kernel: gfs0 purged 2 locks
Mar 29 19:55:06 tank-06 kernel: gfs0 update remastered resources
Mar 29 19:55:06 tank-06 kernel: gfs0 updated 3129 resources
Mar 29 19:55:06 tank-06 kernel: gfs0 rebuild locks
Mar 29 19:55:06 tank-06 kernel: gfs0 rebuilt 3137 locks
Mar 29 19:55:06 tank-06 kernel: gfs0 recover event 5 done
Mar 29 19:55:06 tank-06 kernel: clvmd mark waiting requests
Mar 29 19:55:06 tank-06 ccsd[5455]: Cluster manager shutdown.  Attemping to
reconnect...


Not sure if the proc info from the other nodes left up is still valid but...

[root@tank-03 ~]# cat /proc/cluster/dlm_debug
rged 2 requests
clvmd mark waiting requests
clvmd marked 0 requests
clvmd purge locks of departed nodes
clvmd purged 1 locks
clvmd update remastered resources
clvmd updated 0 resources
clvmd rebuild locks
clvmd rebuilt 0 locks
clvmd recover event 10 done
gfs0 mark waiting requests
gfs0 mark 220138 lq 1 nodeid -1
gfs0 mark 230230 lq 1 nodeid -1
gfs0 marked 2 requests
gfs0 purge locks of departed nodes
gfs0 purged 3136 locks
gfs0 update remastered resources
gfs0 updated 1 resources
gfs0 rebuild locks
gfs0 rebuilt 0 locks
gfs0 recover event 10 done
gfs0 move flags 0,0,1 ids 9,10,10
gfs0 process held requests
gfs0 processed 0 requests
gfs0 resend marked requests
gfs0 resend 220138 lq 1 flg 200000 node -1/-1 "       2
gfs0 resend 230230 lq 1 flg 200000 node -1/-1 "       7
gfs0 resent 2 requests
gfs0 recover event 10 finished
clvmd move flags 0,0,1 ids 9,10,10
clvmd process held requests
clvmd processed 0 requests
clvmd resend marked requests
clvmd resent 0 requests
clvmd recover event 10 finished


[root@tank-04 ~]# cat /proc/cluster/dlm_debug
urged 0 requests
gfs0 mark waiting requests
gfs0 mark 9c027b lq 1 nodeid -1
gfs0 mark ae0235 lq 1 nodeid -1
gfs0 marked 2 requests
gfs0 purge locks of departed nodes
gfs0 purged 1 locks
gfs0 update remastered resources
gfs0 updated 1 resources
gfs0 rebuild locks
gfs0 rebuilt 1 locks
gfs0 recover event 20 done
gfs0 move flags 0,0,1 ids 19,20,20
gfs0 process held requests
gfs0 processed 0 requests
gfs0 resend marked requests
gfs0 resend 9c027b lq 1 flg 200000 node -1/-1 "       7
gfs0 resend ae0235 lq 1 flg 200000 node -1/-1 "       7
gfs0 resent 2 requests
gfs0 recover event 20 finished
clvmd mark waiting requests
clvmd marked 0 requests
clvmd purge locks of departed nodes
clvmd purged 0 locks
clvmd update remastered resources
clvmd updated 0 resources
clvmd rebuild locks
clvmd rebuilt 0 locks
clvmd recover event 20 done
clvmd move flags 0,0,1 ids 19,20,20
clvmd process held requests
clvmd processed 0 requests
clvmd resend marked requests
clvmd resent 0 requests
clvmd recover event 20 finished


[root@tank-05 ~]# cat /proc/cluster/dlm_debug
 locks
clvmd rebuilt 0 locks
clvmd recover event 58 done
gfs0 mark waiting requests
gfs0 mark 12f01e7 lq 1 nodeid -1
gfs0 mark 15f0142 lq 1 nodeid -1
gfs0 mark 15c038f lq 1 nodeid -1
gfs0 mark 1770141 lq 1 nodeid -1
gfs0 marked 4 requests
gfs0 purge locks of departed nodes
gfs0 purged 2472 locks
gfs0 update remastered resources
gfs0 updated 1 resources
gfs0 rebuild locks
gfs0 rebuilt 1 locks
gfs0 recover event 58 done
gfs0 move flags 0,0,1 ids 57,58,58
gfs0 process held requests
gfs0 processed 0 requests
gfs0 resend marked requests
gfs0 resend 12f01e7 lq 1 flg 200000 node -1/-1 "       5
gfs0 resend 15f0142 lq 1 flg 200000 node -1/-1 "       7
gfs0 resend 15c038f lq 1 flg 200000 node -1/-1 "       7
gfs0 resend 1770141 lq 1 flg 200000 node -1/-1 "       7
gfs0 resent 4 requests
gfs0 recover event 58 finished
clvmd move flags 0,0,1 ids 57,58,58
clvmd process held requests
clvmd processed 0 requests
clvmd resend marked requests
clvmd resent 0 requests
clvmd recover event 58 finished


[root@tank-03 ~]# grep dlm /proc/slabinfo|cut -b1-40
dlm_conn              10     35    112
dlm_lvb/range      10140  12971     32
dlm_resdir(s)       2663   2992     44
dlm_resdir(l)          1     49     80
dlm_lkb            12704  17640    132
dlm_rsb(large)         1     20    196
dlm_rsb(small)      5900   5900    160


[root@tank-04 ~]# grep dlm /proc/slabinfo|cut -b1-40
dlm_conn               8     35    112
dlm_lvb/range        129    238     32
dlm_resdir(s)       2753   3080     44
dlm_resdir(l)          0      0     80
dlm_lkb             5697   5880    132
dlm_rsb(large)         1     20    196
dlm_rsb(small)      5682   5750    160


[root@tank-05 ~]# grep dlm /proc/slabinfo|cut -b1-40
dlm_conn              10     35    112
dlm_lvb/range       7580   9996     32
dlm_resdir(s)       2698   4400     44
dlm_resdir(l)          0      0     80
dlm_lkb            14394  17070    132
dlm_rsb(large)         1     20    196
dlm_rsb(small)      7661   8600    160


Comment 21 Christine Caulfield 2005-03-30 11:10:12 EST
This is going to be a pain but can you tcpdump port 6809 (or whatever you're
running cman over) please?
Comment 22 Christine Caulfield 2005-04-05 10:26:22 EDT
This should be assigned to me
Comment 23 Corey Marthaler 2005-04-05 17:07:33 EDT
Here are the messages with the latest cman kernel. Still trying to get a tcpdump
though...


Apr  5 14:08:07 link-10 kernel: CMAN: removing node link-11.lab.msp.redhat.com
from the cluster : Missed too many heartbeats
Apr  5 14:08:11 link-10 sshd(pam_unix)[6368]: session opened for user root by
(uid=0)
Apr  5 14:08:28 link-10 fenced[5981]: link-11.lab.msp.redhat.com not a cluster
member after 20 sec post_fail_delay
Apr  5 14:08:28 link-10 fenced[5981]: fencing node "link-11.lab.msp.redhat.com"
Apr  5 14:08:32 link-10 fenced[5981]: fence "link-11.lab.msp.redhat.com" success
Apr  5 14:08:39 link-10 kernel: GFS: fsid=link-cluster:corey.1: jid=0: Trying to
acquire journal lock...
Apr  5 14:08:39 link-10 kernel: GFS: fsid=link-cluster:corey.1: jid=0: Busy
Apr  5 14:10:45 link-10 kernel: CMAN: node link-11.lab.msp.redhat.com rejoining
Apr  5 14:10:54 link-10 sshd(pam_unix)[6388]: session opened for user root by
(uid=0)
Apr  5 14:10:55 link-10 sshd(pam_unix)[6407]: session opened for user root by
(uid=0)
Apr  5 14:10:59 link-10 sshd(pam_unix)[6426]: session opened for user root by
(uid=0)
Apr  5 14:11:10 link-10 sshd(pam_unix)[6445]: session opened for user root by
(uid=0)
Apr  5 14:12:44 link-10 kernel: CMAN: removing node link-11.lab.msp.redhat.com
from the cluster : Missed too many heartbeats
Apr  5 14:12:49 link-10 sshd(pam_unix)[6464]: session opened for user root by
(uid=0)
Apr  5 14:12:52 link-10 kernel: CMAN: node link-10.lab.msp.redhat.com has been
removed from the cluster : No response to messages
Apr  5 14:12:52 link-10 kernel: CMAN: killed by NODEDOWN message
Apr  5 14:12:52 link-10 kernel: CMAN: we are leaving the cluster. No response to
messages
Apr  5 14:12:52 link-10 ccsd[3811]: Cluster manager shutdown.  Attemping to
reconnect...
Apr  5 14:12:52 link-10 kernel: SM: 00000001 sm_stop: SG still joined
Apr  5 14:12:52 link-10 kernel: SM: 01000002 sm_stop: SG still joined
Apr  5 14:12:52 link-10 kernel: SM: 02000004 sm_stop: SG still joined
Apr  5 14:12:53 link-10 kernel: ,1,0 ids 5,7,5
Apr  5 14:12:53 link-10 kernel: clvmd move use event 7
Apr  5 14:12:53 link-10 kernel: clvmd recover event 7
Apr  5 14:12:53 link-10 kernel: clvmd add node 3
Apr  5 14:12:53 link-10 kernel: clvmd total nodes 4
Apr  5 14:12:53 link-10 kernel: clvmd rebuild resource directory
Apr  5 14:12:53 link-10 kernel: clvmd rebuilt 0 resources
Apr  5 14:12:53 link-10 kernel: clvmd purge requests
Apr  5 14:12:54 link-10 kernel: clvmd purged 0 requests
Apr  5 14:12:54 link-10 kernel: clvmd mark waiting requests
Apr  5 14:12:54 link-10 kernel: clvmd marked 0 requests
Apr  5 14:12:54 link-10 kernel: clvmd recover event 7 done
Apr  5 14:12:54 link-10 kernel: clvmd move flags 0,0,1 ids 5,7,7
Apr  5 14:12:54 link-10 kernel: clvmd process held requests
Apr  5 14:12:54 link-10 kernel: clvmd processed 0 requests
Apr  5 14:12:54 link-10 kernel: clvmd resend marked requests
Apr  5 14:12:54 link-10 kernel: clvmd resent 0 requests
Apr  5 14:12:54 link-10 kernel: clvmd recover event 7 finished
Apr  5 14:12:54 link-10 kernel: corey move flags 1,0,0 ids 5,5,5
Apr  5 14:12:54 link-10 kernel: corey move flags 0,1,0 ids 5,8,5
Apr  5 14:12:54 link-10 kernel: corey move use event 8
Apr  5 14:12:54 link-10 kernel: corey recover event 8
Apr  5 14:12:54 link-10 kernel: corey add node 3
Apr  5 14:12:54 link-10 kernel: corey total nodes 4
Apr  5 14:12:54 link-10 kernel: corey rebuild resource directory
Apr  5 14:12:54 link-10 kernel: corey rebuilt 751 resources
Apr  5 14:12:54 link-10 kernel: corey purge requests
Apr  5 14:12:54 link-10 kernel: corey purged 0 requests
Apr  5 14:12:54 link-10 kernel: corey mark waiting requests
Apr  5 14:12:54 link-10 kernel: corey marked 0 requests
Apr  5 14:12:54 link-10 kernel: corey recover event 8 done
Apr  5 14:12:54 link-10 kernel: corey add_to_requestq cmd 9 fr 4
Apr  5 14:12:54 link-10 kernel: corey move flags 0,0,1 ids 5,8,8
Apr  5 14:12:54 link-10 kernel: corey process held requests
Apr  5 14:12:54 link-10 kernel: corey process_requestq cmd 9 fr 4
Apr  5 14:12:54 link-10 kernel: corey processed 1 requests
Apr  5 14:12:54 link-10 kernel: corey resend marked requests
Apr  5 14:12:54 link-10 kernel: corey resent 0 requests
Apr  5 14:12:54 link-10 kernel: corey recover event 8 finished

Apr  5 14:12:54 link-10 kernel: 5538 0
Apr  5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 5,0 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,55 5,0 id 300f1 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6350 ex punlock 0
Apr  5 14:12:54 link-10 kernel: 6350 en plock 7,55
Apr  5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 0,5 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,55 0,5 id 300f1 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6350 req 7,55 ex 6504-6974 lkf 2000 wait 1
Apr  5 14:12:54 link-10 kernel: 6350 lk 7,55 id 0 -1,5 2000
Apr  5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 5,0 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,55 5,0 id 300f1 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6143 qc 7,55 -1,5 id 600090 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6350 ex plock 0
Apr  5 14:12:54 link-10 kernel: 6143 qc 7,57 -1,5 id 7202fb sts 0 0
Apr  5 14:12:54 link-10 kernel: 6354 ex plock 0
Apr  5 14:12:54 link-10 kernel: 6350 en punlock 7,55
Apr  5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 0,5 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,55 0,5 id 300f1 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6350 remove 7,55
Apr  5 14:12:54 link-10 kernel: 6350 un 7,55 600090 5 0
Apr  5 14:12:54 link-10 kernel: 6143 qc 7,55 5,5 id 600090 sts -65538 0
Apr  5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 5,0 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,55 5,0 id 300f1 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6350 ex punlock 0
Apr  5 14:12:54 link-10 kernel: 6350 en plock 7,55
Apr  5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 0,5 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,55 0,5 id 300f1 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6350 req 7,55 ex 6975-735d lkf 2000 wait 1
Apr  5 14:12:54 link-10 kernel: 6350 lk 7,55 id 0 -1,5 2000
Apr  5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 5,0 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,55 5,0 id 300f1 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6143 qc 7,55 -1,5 id 5d030a sts 0 0
Apr  5 14:12:54 link-10 kernel: 6354 en punlock 7,57
Apr  5 14:12:54 link-10 kernel: 6354 lk 11,57 id 10144 0,5 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,57 0,5 id 10144 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6354 remove 7,57
Apr  5 14:12:54 link-10 kernel: 6354 un 7,57 7202fb 5 0
Apr  5 14:12:54 link-10 kernel: 6143 qc 7,57 5,5 id 7202fb sts -65538 0
Apr  5 14:12:54 link-10 kernel: 6354 lk 11,57 id 10144 5,0 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,57 5,0 id 10144 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6354 ex punlock 0
Apr  5 14:12:54 link-10 kernel: 6354 en plock 7,57
Apr  5 14:12:54 link-10 kernel: 6354 lk 11,57 id 10144 0,5 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,57 0,5 id 10144 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6354 req 7,57 ex 2e984f-2eb2f3 lkf 2000 wait 1
Apr  5 14:12:54 link-10 kernel: 6354 lk 7,57 id 0 -1,5 2000
Apr  5 14:12:54 link-10 kernel: 6354 lk 11,57 id 10144 5,0 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,57 5,0 id 10144 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6350 ex plock 0
Apr  5 14:12:54 link-10 kernel: 6350 en punlock 7,55
Apr  5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 0,5 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,55 0,5 id 300f1 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6350 remove 7,55
Apr  5 14:12:54 link-10 kernel: 6350 un 7,55 5d030a 5 0
Apr  5 14:12:54 link-10 kernel: 6143 qc 7,55 5,5 id 5d030a sts -65538 0
Apr  5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 5,0 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,55 5,0 id 300f1 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6350 ex punlock 0
Apr  5 14:12:54 link-10 kernel: 6350 en plock 7,55
Apr  5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 0,5 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,55 0,5 id 300f1 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6350 req 7,55 ex 0-18e5 lkf 2000 wait 1
Apr  5 14:12:54 link-10 kernel: 6350 lk 7,55 id 0 -1,5 2000
Apr  5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 5,0 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,55 5,0 id 300f1 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6143 qc 7,55 -1,5 id 5f0330 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6350 ex plock 0
Apr  5 14:12:54 link-10 kernel: 6350 en punlock 7,55
Apr  5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 0,5 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,55 0,5 id 300f1 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6350 remove 7,55
Apr  5 14:12:54 link-10 kernel: 6350 un 7,55 5f0330 5 0
Apr  5 14:12:54 link-10 kernel: 6143 qc 7,55 5,5 id 5f0330 sts -65538 0
Apr  5 14:12:54 link-10 kernel: 6143 qc 7,57 -1,5 id 57012f sts 0 0
Apr  5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 5,0 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,55 5,0 id 300f1 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6350 ex punlock 0
Apr  5 14:12:54 link-10 kernel: 6350 en plock 7,55
Apr  5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 0,5 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,55 0,5 id 300f1 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6350 req 7,55 ex 18e6-5b60 lkf 2000 wait 1
Apr  5 14:12:54 link-10 kernel: 6350 lk 7,55 id 0 -1,5 2000
Apr  5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 5,0 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,55 5,0 id 300f1 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6354 ex plock 0
Apr  5 14:12:54 link-10 kernel: 6143 qc 7,55 -1,5 id 630243 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6354 en punlock 7,57
Apr  5 14:12:54 link-10 kernel: 6354 lk 11,57 id 10144 0,5 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,57 0,5 id 10144 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6354 remove 7,57
Apr  5 14:12:54 link-10 kernel: 6354 un 7,57 57012f 5 0
Apr  5 14:12:54 link-10 kernel: 6143 qc 7,57 5,5 id 57012f sts -65538 0
Apr  5 14:12:54 link-10 kernel: 6354 lk 11,57 id 10144 5,0 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,57 5,0 id 10144 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6354 ex punlock 0
Apr  5 14:12:54 link-10 kernel: 6354 en plock 7,57
Apr  5 14:12:54 link-10 kernel: 6354 lk 11,57 id 10144 0,5 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,57 0,5 id 10144 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6354 req 7,57 ex 2ec9ec-2edddf lkf 2000 wait 1
Apr  5 14:12:54 link-10 kernel: 6354 lk 7,57 id 0 -1,5 2000
Apr  5 14:12:54 link-10 kernel: 6354 lk 11,57 id 10144 5,0 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,57 5,0 id 10144 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6350 ex plock 0
Apr  5 14:12:54 link-10 kernel: 6350 en punlock 7,55
Apr  5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 0,5 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,55 0,5 id 300f1 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6143 qc 7,57 -1,5 id 600278 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6350 remove 7,55
Apr  5 14:12:54 link-10 kernel: 6350 un 7,55 630243 5 0
Apr  5 14:12:54 link-10 kernel: 6143 qc 7,55 5,5 id 630243 sts -65538 0
Apr  5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 5,0 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,55 5,0 id 300f1 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6350 ex punlock 0
Apr  5 14:12:54 link-10 kernel: 6350 en plock 7,55
Apr  5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 0,5 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,55 0,5 id 300f1 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6350 req 7,55 ex 5b61-6f3f lkf 2000 wait 1
Apr  5 14:12:54 link-10 kernel: 6350 lk 7,55 id 0 -1,5 2000
Apr  5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 5,0 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,55 5,0 id 300f1 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6354 ex plock 0
Apr  5 14:12:54 link-10 kernel: 6354 en punlock 7,57
Apr  5 14:12:54 link-10 kernel: 6354 lk 11,57 id 10144 0,5 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,57 0,5 id 10144 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6354 remove 7,57
Apr  5 14:12:54 link-10 kernel: 6354 un 7,57 600278 5 0
Apr  5 14:12:54 link-10 kernel: 6143 qc 7,57 5,5 id 600278 sts -65538 0
Apr  5 14:12:54 link-10 kernel: 6143 qc 7,55 -1,5 id 5b00bc sts 0 0
Apr  5 14:12:54 link-10 kernel: 6354 lk 11,57 id 10144 5,0 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,57 5,0 id 10144 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6354 ex punlock 0
Apr  5 14:12:54 link-10 kernel: 6354 en plock 7,57
Apr  5 14:12:54 link-10 kernel: 6354 lk 11,57 id 10144 0,5 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,57 0,5 id 10144 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6354 req 7,57 ex 2edde0-2edfe9 lkf 2000 wait 1
Apr  5 14:12:54 link-10 kernel: 6354 lk 7,57 id 0 -1,5 2000
Apr  5 14:12:54 link-10 kernel: 6354 lk 11,57 id 10144 5,0 4
Apr  5 14:12:54 link-10 kernel: 6143 qc 11,57 5,0 id 10144 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6350 ex plock 0
Apr  5 14:12:54 link-10 kernel: 6143 qc 7,57 -1,5 id 610239 sts 0 0
Apr  5 14:12:54 link-10 kernel: 6354 ex plock 0
Apr  5 14:12:54 link-10 kernel: 6350 en punlock 7,55
Apr  5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 0,5 4
Apr  5 14:12:54 link-10 kernel:
Apr  5 14:12:54 link-10 kernel: lock_dlm:  Assertion failed on line 411 of file
/usr/src/build/548215-i686/BUILD/smp/src/dlm/lock.c
Apr  5 14:12:54 link-10 kernel: lock_dlm:  assertion:  "!error"
Apr  5 14:12:54 link-10 kernel: lock_dlm:  time = 180423
Apr  5 14:12:54 link-10 kernel: corey: num=11,55 err=-22 cur=0 req=5 lkf=4

Comment 24 Christine Caulfield 2005-04-06 10:34:19 EDT
I'm bordering on being stumped here. We know that the cman processes are all
running fine because the node receives and processes a NODEDOWN message after
the recovery.

The reason is "no response to messages" which tells us that it's ACKs that are
not being sent / received. But the ack processor is on the fast path of the cman
comms process and not going to deadlock with anything.

There are no sendmsg errors, so the ACKs are not failing to be sent, because of
eg. memory shortage

So, the only cause that I can see that makes sense (and if you have any other
ideas pipe up) that some other process is spinning during the crucial 4 seconds
that the node needs to be sending enough ACKs to convince a remote node that
it's still working. 4 seconds is a long time, of course. Maybe co-incidentally,
the node is responding again 4 seconds later on that last log, 8 on the previous.

Can you increase the retries and see if that helps?

echo "10" > /proc/cluster/config/cman/max_retries
   before running cman_tool join

Looking for a spinning thread is going to be hard work (still assuming that's
what it is). I doubt it's the DLM because we're into GFS recovery at this stage
(in both logs) and the lock counts don't look at all high anyway. So it could be
GFS recovery, or it could be the lpfc driver. I'm not pointing any fingers, just
speculating.

Almost needless to say, I've never seen this but I'm using iSCSI - my lpfc disks
are still not working for some reason. 
Comment 25 Christine Caulfield 2005-04-07 10:10:40 EDT
Reproduced it!
I'm not almost certain it's a looping kernel thread. The logs & accompanying
tcpdump show the killed node being totally silent for 5 seconds, despite there
being plenty of work for it to do. Now to find the culprit...
Comment 26 Christine Caulfield 2005-04-08 04:03:05 EDT
s/not/now/ in previous comment (I'm always doing that, sorry).

Courtesy of Dave, added some schedule()s to SM in places where it could spend a
lot of time. I've now had revolver running all night (178 iterations) with no
problems.

Checking in sm.h;
/cvs/cluster/cluster/cman-kernel/src/sm.h,v  <--  sm.h
new revision: 1.2.2.1; previous revision: 1.2
done
Checking in sm_membership.c;
/cvs/cluster/cluster/cman-kernel/src/sm_membership.c,v  <--  sm_membership.c
new revision: 1.1.2.3; previous revision: 1.1.2.2
done
Checking in sm_recover.c;
/cvs/cluster/cluster/cman-kernel/src/sm_recover.c,v  <--  sm_recover.c
new revision: 1.3.2.1; previous revision: 1.3
done
Comment 27 Corey Marthaler 2005-04-12 14:42:56 EDT
I hit this senario today with the latest build:

CMAN 2.6.9-33.0 (built Apr 11 2005 15:15:54) installed

[root@link-10 ~]# rpm -qa | grep cman
cman-1.0-0.pre32.0
cman-kernheaders-2.6.9-33.0
cman-kernel-smp-2.6.9-33.0

I didn't see any SM messages however this time:

Apr 12 11:07:35 link-10 kernel: CMAN: node link-10.lab.msp.redhat.com has been
removed from the cluster : No response to messages
Apr 12 11:07:35 link-10 kernel: CMAN: killed by NODEDOWN message
Apr 12 11:07:35 link-10 kernel: CMAN: we are leaving the cluster. No response to
messages
Apr 12 11:07:35 link-10 kernel: dlm: gfs2: remote_stage error -105 12034c
Apr 12 11:07:35 link-10 kernel: dlm: gfs0: remote_stage error -105 1e01d5
Apr 12 11:07:35 link-10 kernel: dlm: gfs3: remote_stage error -105 4d028f
Apr 12 11:07:35 link-10 kernel: dlm: gfs3: remote_stage error -105 50021d
Apr 12 11:07:36 link-10 kernel:  rebuild locks
Apr 12 11:07:36 link-10 kernel: clvmd rebuilt 0 locks
Apr 12 11:07:36 link-10 kernel: clvmd recover event 32 done
Apr 12 11:07:36 link-10 kernel: gfs2 add_to_requestq cmd 1 fr 4
Apr 12 11:07:36 link-10 kernel: gfs2 move flags 0,0,1 ids 28,32,32
Apr 12 11:07:36 link-10 kernel: gfs2 process held requests
Apr 12 11:07:36 link-10 kernel: gfs2 process_requestq cmd 1 fr 4
Apr 12 11:07:36 link-10 kernel: gfs2 processed 1 requests
Apr 12 11:07:36 link-10 kernel: gfs2 resend marked requests
Apr 12 11:07:36 link-10 ccsd[3971]: Cluster manager shutdown.  Attemping to
reconnect...
Apr 12 11:07:36 link-10 kernel: gfs2 resend 1103c2 lq 1 flg 200000 node -1/-1 "
      7
Apr 12 11:07:38 link-10 kernel: gfs2 resent 1 requests
Apr 12 11:07:38 link-10 kernel: gfs2 recover event 32 finished
Apr 12 11:07:38 link-10 kernel: gfs0 move flags 0,0,1 ids 24,32,32
Apr 12 11:07:38 link-10 kernel: gfs0 process held requests
Apr 12 11:07:38 link-10 kernel: gfs0 processed 0 requests
Apr 12 11:07:38 link-10 kernel: gfs0 resend marked requests
Apr 12 11:07:38 link-10 kernel: gfs0 resent 0 requests
Apr 12 11:07:38 link-10 kernel: gfs0 recover event 32 finished
Apr 12 11:07:38 link-10 kernel: clvmd move flags 0,0,1 ids 15,32,32
Apr 12 11:07:38 link-10 kernel: clvmd process held requests
Apr 12 11:07:38 link-10 kernel: clvmd processed 0 requests
Apr 12 11:07:38 link-10 kernel: clvmd resend marked requests
Apr 12 11:07:38 link-10 kernel: clvmd resent 0 requests
Apr 12 11:07:38 link-10 kernel: clvmd recover event 32 finished
Apr 12 11:07:38 link-10 kernel: gfs1 move flags 0,0,1 ids 26,32,32
Apr 12 11:07:38 link-10 kernel: gfs1 process held requests
Apr 12 11:07:38 link-10 kernel: gfs1 processed 0 requests
[...]


I'll try to gather more info and add to the bug.
Comment 28 Corey Marthaler 2005-04-12 14:55:57 EDT
This was a 4 node cluster (link-10, link-11, link-12, link-08), link-08 was shot
by revolver and caused link-10 to bail.

additional info in:
/home/msp/cmarthal/pub/bugs/139738
Comment 29 Christine Caulfield 2005-04-13 06:56:23 EDT
OK, I've crossed the border into stumped territory now. This certainly looks
like the same problem at least. Have you tried the "echo" I mentioned in comment
#24 ?

It's also helpful to get as much info from the killed node as possible (rather
than just the others) as that's where the problem seems to be. How about:

  tar -czf /tmp/proc.tgz  --exclude=*core* /proc

(you can tell I'm desperate now, can't you?)
Comment 30 Christine Caulfield 2005-05-27 03:51:30 EDT
Interestingly I his something that looked a lot like this last night, testing
for 155729. No GFS involved at all, but /lots/ of disk logging. I wonder if it's
just disk syncing??
Comment 31 Christine Caulfield 2005-05-31 10:26:36 EDT
Set max_retries to default to 5.

5 seems to be enough time for the system to recover.
Comment 32 Corey Marthaler 2005-08-04 11:12:13 EDT
I'm not sure if this is the same issue as this bug, but i saw these messages
with the latest build cluster-i686-2005-08-03-1549 last night while running
recovery tests to a two node cluster (link-10, link-11). Revolver had gone
though 207 iterations before seeing this on both nodes:

link-11:
Aug  3 21:53:32 link-11 kernel: CMAN: sendmsg failed: -101
Aug  3 21:53:32 link-11 kernel: CMAN: No functional network interfaces, leaving
cluster
Aug  3 21:53:32 link-11 kernel: CMAN: sendmsg failed: -101
Aug  3 21:53:32 link-11 kernel: CMAN: sendmsg failed: -101
Aug  3 21:53:32 link-11 kernel: CMAN: we are leaving the cluster.
Aug  3 21:53:32 link-11 kernel: Synchronizing SCSI cache for disk sdd:
Aug  3 21:53:32 link-11 clurgmgrd[2817]: <warning> #67: Shutting down uncleanly
Aug  3 21:53:32 link-11 kernel: SM: 00000001 sm_stop: SG still joined
Aug  3 21:53:32 link-11 kernel: SM: 01000002 sm_stop: SG still joined
Aug  3 21:53:32 link-11 kernel: SM: 02000005 sm_stop: SG still joined
Aug  3 21:53:32 link-11 kernel: SM: 03000003 sm_stop: SG still joined
Aug  3 21:53:32 link-11 ccsd[2278]: Cluster manager shutdown.  Attemping to
reconnect...
Aug  3 21:53:32 link-11 kernel: SM: send_nodeid_message error -107 to 1
Aug  3 21:53:33 link-11 kernel: sts
Aug  3 21:53:34 link-11 kernel: clvmd recover event 10 finished
Aug  3 21:53:34 link-11 kernel: gfs0 move flags 1,0,0 ids 8,8,8
Aug  3 21:53:34 link-11 kernel: gfs0 move flags 0,1,0 ids 8,11,8
Aug  3 21:53:34 link-11 kernel: gfs0 move use event 11
Aug  3 21:53:34 link-11 kernel: gfs0 recover event 11
Aug  3 21:53:34 link-11 kernel: gfs0 add node 1
Aug  3 21:53:34 link-11 kernel: gfs0 total nodes 2
Aug  3 21:53:34 link-11 kernel: gfs0 rebuild resource directory
Aug  3 21:53:34 link-11 kernel: gfs0 rebuilt 1163 resources
Aug  3 21:53:34 link-11 kernel: gfs0 purge requests
Aug  3 21:53:34 link-11 kernel: gfs0 purged 0 requests
[...]
eventually asserts


link-10:
Aug  3 22:21:04 link-10 kernel: CMAN: sendmsg failed: -101
Aug  3 22:21:04 link-10 kernel: CMAN: No functional network interfaces, leaving
cluster
Aug  3 22:21:04 link-10 kernel: CMAN: sendmsg failed: -101
Aug  3 22:21:04 link-10 kernel: CMAN: sendmsg failed: -101
Aug  3 22:21:04 link-10 kernel: CMAN: we are leaving the cluster.
Aug  3 22:21:04 link-10 clurgmgrd[2854]: <warning> #67: Shutting down uncleanly
Aug  3 22:21:04 link-10 kernel: SM: 00000001 sm_stop: SG still joined
Aug  3 22:21:04 link-10 kernel: SM: 01000002 sm_stop: SG still joined
Aug  3 22:21:04 link-10 kernel: SM: 02000005 sm_stop: SG still joined
Aug  3 22:21:04 link-10 kernel: SM: 03000003 sm_stop: SG still joined
Aug  3 22:21:04 link-10 ccsd[2279]: Cluster manager shutdown.  Attemping to
reconnect...
Aug  3 22:21:04 link-10 kernel: SM: send_nodeid_message error -107 to 2
Aug  3 22:21:05 link-10 kernel: ests
Aug  3 22:21:05 link-10 kernel: clvmd recover event 10 finished
Aug  3 22:21:05 link-10 kernel: gfs0 move flags 1,0,0 ids 8,8,8
Aug  3 22:21:05 link-10 kernel: gfs0 move flags 0,1,0 ids 8,11,8
Aug  3 22:21:05 link-10 kernel: gfs0 move use event 11
Aug  3 22:21:06 link-10 kernel: gfs0 recover event 11
Aug  3 22:21:06 link-10 kernel: gfs0 add node 2
Aug  3 22:21:06 link-10 kernel: gfs0 total nodes 2
Aug  3 22:21:06 link-10 kernel: gfs0 rebuild resource directory
Aug  3 22:21:06 link-10 kernel: gfs0 rebuilt 1139 resources
Aug  3 22:21:06 link-10 kernel: gfs0 purge requests
Aug  3 22:21:06 link-10 kernel: gfs0 purged 0 requests
Aug  3 22:21:06 link-10 kernel: gfs0 mark waiting requests
Aug  3 22:21:06 link-10 kernel: gfs0 marked 0 requests
[...]
also eventually asserts
Comment 33 Christine Caulfield 2005-08-04 13:07:47 EDT
No, it's not even slightly the same.

The message "no functional network interfaces" is the clue. It means that all
attempts to send HELLO messages failed (eg eth0 was downed) so cman has to quit.
It's a new message from bz #163687.

I suggest you look at what else was happening when this occurred. Was the
machine being rebooted without the cluster services being shut down first perhaps.
Comment 34 Corey Marthaler 2005-08-04 15:32:16 EDT
filed bz165160 for comments #32 and #33.
Comment 35 Corey Marthaler 2006-11-02 18:27:50 EST
Have not seen this bug in over 10 months, marking verified. Will reopen if seen
again.

Note You need to log in before you can comment on or make changes to this bug.