Description of problem:
This happened on morph-04 (with I/O running) when morph-01 and morph-02 were coming back up and rejoining the cluster after having been both shot. Both morph-04 and morph-05 appeared to have hit bz133512 as they had messages saying: CMAN: we are leaving the cluster. Reason is 5 after morph-01 and morph-02 joined again.

Nov 17 13:47:36 morph-04 kernel: 2228 qc 2,667a067 3,5 id c005c sts 0 0
Nov 17 13:47:36 morph-04 kernel: 2300 un 2,667a00a 30389 3 0
Nov 17 13:47:36 morph-04 kernel:
Nov 17 13:47:36 morph-04 kernel: lock_dlm: Assertion failed on line 342 of file /usr/src/cluster/gfs-kernel/src/dlm/lock.c
Nov 17 13:47:36 morph-04 kernel: lock_dlm: assertion: "!error"
Nov 17 13:47:36 morph-04 kernel: lock_dlm: time = 241736
Nov 17 13:47:36 morph-04 kernel: corey0: error=-22 num=2,667a00a
Nov 17 13:47:36 morph-04 kernel:
Nov 17 13:47:36 morph-04 kernel: ------------[ cut here ]------------
Nov 17 13:47:36 morph-04 kernel: kernel BUG at /usr/src/cluster/gfs-kernel/src/dlm/lock.c:342!
Nov 17 13:47:36 morph-04 kernel: invalid operand: 0000 [#1]
Nov 17 13:47:36 morph-04 kernel: SMP
Nov 17 13:47:36 morph-04 kernel: Modules linked in: lock_gulm lock_nolock gfs lock_dlm dlm cman lock_harness lpfc ipv6 parport_pc lp parport autofs4 sunrpc e1000 microcode dm_mod uhci_hcd ehci_hcd button battery ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
Nov 17 13:47:36 morph-04 kernel: CPU: 1
Nov 17 13:47:36 morph-04 kernel: EIP: 0060:[<f89f0a95>] Not tainted VLI
Nov 17 13:47:36 morph-04 kernel: EFLAGS: 00010282 (2.6.9)
Nov 17 13:47:36 morph-04 kernel: EIP is at do_dlm_unlock+0xf5/0x110 [lock_dlm]
Nov 17 13:47:36 morph-04 kernel: eax: 00000001 ebx: ffffffea ecx: f747ff00 edx: 00000292
Nov 17 13:47:36 morph-04 kernel: esi: f3f9d580 edi: f3f91b70 ebp: f8aac000 esp: f747fefc
Nov 17 13:47:36 morph-04 kernel: ds: 007b es: 007b ss: 0068
Nov 17 13:47:36 morph-04 kernel: Process gfs_glockd (pid: 2300, threadinfo=f747e000 task=c23681b0)
Nov 17 13:47:36 morph-04 kernel: Stack: f89f6106 f5819880 ffffffea 00000002 0667a00a 00000000 00000000 00000001
Nov 17 13:47:36 morph-04 kernel: f3f91b8c f89f0e0f f8a312d3 f3f91b70 f8a5509c f8a61f60 f4b2dd40 f3f91b70
Nov 17 13:47:36 morph-04 kernel: f8a61f60 f3f91b8c f8a306b5 f3f91b70 00000000 f3f91c00 f3f91b70 f8a307c9
Nov 17 13:47:36 morph-04 kernel: Call Trace:
Nov 17 13:47:36 morph-04 kernel: [<f89f0e0f>] lm_dlm_unlock+0xf/0x20 [lock_dlm]
Nov 17 13:47:36 morph-04 kernel: [<f8a312d3>] gfs_glock_drop_th+0x83/0x160 [gfs]
Nov 17 13:47:36 morph-04 kernel: [<f8a5509c>] gmalloc+0x1c/0x50 [gfs]
Nov 17 13:47:36 morph-04 kernel: [<f8a306b5>] rq_demote+0xb5/0xd0 [gfs]
Nov 17 13:47:36 morph-04 kernel: [<f8a307c9>] run_queue+0x99/0xd0 [gfs]
Nov 17 13:47:36 morph-04 kernel: [<f8a30909>] unlock_on_glock+0x29/0x40 [gfs]
Nov 17 13:47:36 morph-04 kernel: [<f8a32ddb>] gfs_reclaim_glock+0x13b/0x190 [gfs]
Nov 17 13:47:36 morph-04 kernel: [<f8a24357>] gfs_glockd+0x107/0x120 [gfs]
Nov 17 13:47:36 morph-04 kernel: [<c011f2d0>] default_wake_function+0x0/0x10
Nov 17 13:47:36 morph-04 kernel: [<c0105ec2>] ret_from_fork+0x6/0x14
Nov 17 13:47:36 morph-04 kernel: [<c011f2d0>] default_wake_function+0x0/0x10
Nov 17 13:47:36 morph-04 kernel: [<f8a24250>] gfs_glockd+0x0/0x120 [gfs]
Nov 17 13:47:36 morph-04 kernel: [<c01042b5>] kernel_thread_helper+0x5/0x10
Nov 17 13:47:36 morph-04 kernel: Code: 46 0c 89 54 24 14 89 44 24 0c 8b 06 8b 40 18 c7 04 24 57 61 9f f8 89 44 24 04 e8 07 25 73 c7 c7 04 24 06 61 9f f8 e8 fb 24 73 c7 <0f> 0b 56 01 48 59 9f f8 c7 04 24 74 59 9f f8 e8 e7 1c 73 c7 8d
Nov 17 13:47:49 morph-04 kernel: 11 66770c6"
Nov 17 13:47:49 morph-04 kernel: corey1 (2429) un c0042 40000 0 0 " 7 66970ef"

Also both consoles had these messages streaming non stop:

[<c0293239>] __kfree_skb+0xa9/0x140
[<c010c4d0>] timer_interrupt+0xb0/0x120
[<f894410e>] e1000_clean_tx_irq+0x14e/0x220 [e1000]
[<f8943fa0>] e1000_clean+0xa0/0xc0 [e1000]
[<c011bc40>] do_page_fault+0x0/0x54c
[<c0106a19>] error_code+0x2d/0x38
[<c0120e1c>] mm_release+0x3c/0xc0
[<c012ac85>] del_timer_sync+0x85/0xc0
[<c0124e71>] do_exit+0x81/0x400
[<c01071f0>] do_divide_error+0x0/0x120
[<c011bc40>] do_page_fault+0x0/0x54c
[<c011bf34>] do_page_fault+0x2f4/0x54c
[<f8944670>] e1000_alloc_rx_buffers+0x40/0xf0 [e1000]
[<f8944374>] e1000_clean_rx_irq+0x194/0x450 [e1000]

Version-Release number of selected component (if applicable):
Lock_DLM (built Nov 16 2004 10:49:44) installed

How reproducible:
Didn't try
Where did "CMAN: we are leaving the cluster" appear in relation to the logs above? If it was before, then there's no problem with dlm locking and the panic is not surprising. Also, the other streaming errors make me suspicious that something else had gone wrong and was the root cause of the lock_dlm assert rather than there being a bug in the locking logic. I see two lines from the dlm log dump at the very end. Was there more? In this assert, when dlm_unlock returns -EINVAL, it also logs a more specific error that should appear in the log dump. I suspect the error may have been "lockspace not found" which would be consistent with the cluster suddenly shutting down from underneath it (and consistent with bug 133512).
The "CMAN: we are leaving the cluster" messages were before the assertion trip. Also if there were other lines at the very end of the log dump then they were cut off by the restart of the system because the next lines I see are just that: Nov 17 13:47:36 morph-04 kernel: [<f8a24250>] gfs_glockd+0x0/0x120 [gfs] Nov 17 13:47:36 morph-04 kernel: [<c01042b5>] kernel_thread_helper+0x5/0x10 Nov 17 13:47:36 morph-04 kernel: Code: 46 0c 89 54 24 14 89 44 24 0c 8b 06 8b 40 18 c7 04 24 57 61 9f f8 89 44 24 04 e8 07 25 73 c7 c7 04 24 06 61 9f f8 e8 fb 24 73 c7 <0f> 0b 56 01 48 59 9f f8 c7 04 24 74 59 9f f8 e8 e7 1c 73 c7 8d Nov 17 13:47:49 morph-04 kernel: 11 66770c6" Nov 17 13:47:49 morph-04 kernel: corey1 (2429) un c0042 40000 0 0 " 7 66970ef" Nov 17 14:31:31 morph-04 syslogd 1.4.1: restart. Nov 17 14:31:31 morph-04 syslog: syslogd startup succeeded Nov 17 14:31:31 morph-04 kernel: klogd 1.4.1, log source = /proc/kmsg started.
OK, I'm quite certain this is not a dlm/locking bug. It's a side effect of the earlier bug which is the node suddenly leaving the cluster while the dlm is running. We should, of course, find more graceful ways of failing than panicking, but that's another project.
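To make the failure mode concrete: what is being described is lock_dlm calling dlm_unlock() after cman has already torn the lockspace down, getting an error back (e.g. "lockspace not found" / -EINVAL), and asserting on it. The sketch below is purely illustrative (hypothetical function name, simplified signature, dlm kernel headers assumed), not the actual lock.c code, but it shows the shape of the assert and what a "more graceful" failure could look like:

    /* Illustrative sketch only -- not the real lock_dlm source.
     * Assumes the dlm kernel headers for dlm_lockspace_t, dlm_lksb
     * and dlm_unlock(). */
    static int sketch_do_unlock(dlm_lockspace_t *ls, uint32_t lkid,
                                struct dlm_lksb *lksb, void *astarg)
    {
            int error = dlm_unlock(ls, lkid, 0, lksb, astarg);

            if (error) {
                    /* Current behaviour: assert(!error) -> kernel BUG().
                     * Graceful alternative: log it and hand the error
                     * back so GFS can withdraw instead of panicking. */
                    printk(KERN_ERR "lock_dlm: unlock %x failed %d\n",
                           lkid, error);
                    return error;
            }
            return 0;
    }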
Cman deciding to leave the cluster while gfs/dlm are in use is probably due to some lower-level root cause. Patrick may be able to classify this better.
*** Bug 128420 has been marked as a duplicate of this bug. ***
*** Bug 142844 has been marked as a duplicate of this bug. ***
Latest checkin sets the cman processes to run as realtime so they get all the CPU they need (which isn't much). Seems to fix it for me, let me know how you get on ;-)
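For anyone following along, "run as realtime" for a kernel thread generally boils down to something like the snippet below. This is only an illustrative sketch of the idea (SCHED_FIFO at a low realtime priority), not the actual cman checkin, which may set things up differently:

    #include <linux/sched.h>

    /* Illustrative sketch: give the calling cman kernel thread realtime
     * scheduling so heartbeat/ACK processing isn't starved by other
     * runnable tasks.  Priority 1 is an assumption, not the real value. */
    static void cman_set_realtime(void)
    {
            struct sched_param param = { .sched_priority = 1 };

            sched_setscheduler(current, SCHED_FIFO, &param);
    }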
I seemed to have hit a similar case where cman leaves the cluster while gfs is in use and all hell breaks loose as a result. This was on a 5 node cluster after 2 nodes (morph-01 and morph-05) had been shot and then brought back up and a cman join attempt was made on both. Feb 4 15:24:54 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: jid=0: Trying to acquire journal lock... Feb 4 15:24:54 morph-04 kernel: GFS: fsid=morph-cluster:gfs3.3: jid=0: Busy Feb 4 15:24:54 morph-04 kernel: GFS: fsid=morph-cluster:gfs2.3: jid=0: Busy Feb 4 15:24:54 morph-04 kernel: GFS: fsid=morph-cluster:gfs1.3: jid=0: Busy Feb 4 15:24:54 morph-04 kernel: GFS: fsid=morph-cluster:gfs0.3: jid=0: Busy Feb 4 15:27:23 morph-04 kernel: CMAN: node morph-05 rejoining Feb 4 15:27:47 morph-04 kernel: CMAN: node morph-04 has been removed from the cluster : No response to messag es Feb 4 15:27:47 morph-04 kernel: CMAN: killed by NODEDOWN message Feb 4 15:27:47 morph-04 kernel: CMAN: we are leaving the cluster. Feb 4 15:27:47 morph-04 kernel: dlm: gfs2: remote_stage error -105 280371 Feb 4 15:27:47 morph-04 kernel: dlm: gfs2: remote_stage error -105 1f0292 Feb 4 15:27:47 morph-04 kernel: dlm: gfs3: remote_stage error -105 2b010f Feb 4 15:27:47 morph-04 kernel: dlm: gfs1: remote_stage error -105 e02ba Feb 4 15:27:47 morph-04 kernel: dlm: gfs1: remote_stage error -105 f025f Feb 4 15:27:47 morph-04 kernel: dlm: gfs3: remote_stage error -105 2d033c Feb 4 15:27:47 morph-04 kernel: dlm: gfs0: remote_stage error -105 1601df Feb 4 15:27:49 morph-04 kernel: 123) rq 5 1d0022 " 7 4daa00a" Feb 4 15:27:49 morph-04 kernel: gfs0 (7123) cv 0 30282 " 11 4daa00a" Feb 4 15:27:49 morph-04 kernel: gfs3 (7140) cv 5 10214 " 11 4daa006" Feb 4 15:27:49 morph-04 kernel: gfs3 (7140) rq 5 3701dc " 7 4daa006" Feb 4 15:27:49 morph-04 kernel: gfs3 (7140) cv 0 10214 " 11 4daa006" Feb 4 15:27:49 morph-04 kernel: gfs3 (7140) cv 5 10214 " 11 4daa006" Feb 4 15:27:49 morph-04 kernel: gfs3 (7140) un 3701dc 80000 0 0 " 7 4daa006" Feb 4 15:27:49 morph-04 kernel: gfs3 (7140) cv 0 10214 " 11 4daa006" Feb 4 15:27:49 morph-04 kernel: gfs3 (7140) cv 5 10214 " 11 4daa006" Feb 4 15:27:49 morph-04 kernel: gfs3 (7140) rq 5 2e03f3 " 7 4daa006" Feb 4 15:27:49 morph-04 kernel: gfs3 (7140) cv 0 10214 " 11 4daa006" Feb 4 15:27:49 morph-04 kernel: gfs3 (7140) cv 5 10214 " 11 4daa006" Feb 4 15:27:49 morph-04 kernel: gfs3 (7140) un 2e03f3 80000 0 0 " 7 4daa006" Feb 4 15:27:49 morph-04 ccsd[5810]: Cluster manager shutdown. Attemping to reconnect... Feb 4 15:27:49 morph-04 kernel: gfs3 (7140) cv 0 10214 " 11 4daa006" Feb 4 15:27:49 morph-04 kernel: gfs0 (7123) cv 5 30282 " 11 4daa00a" Feb 4 15:27:49 morph-04 kernel: gfs0 (7123) un 1d0022 80000 0 0 " 7 4daa00a" . . [snip] . . 
Feb 4 15:27:55 morph-04 kernel: 6511 qc 7,4daa00a -1,5 id 1a03c0 sts 0 0 Feb 4 15:27:55 morph-04 kernel: 6511 qc 11,4daa00a 5,0 id 30282 sts 0 0 Feb 4 15:27:55 morph-04 kernel: 7123 ex plock 0 Feb 4 15:27:55 morph-04 kernel: 7140 en plock 7,4daa006 Feb 4 15:27:55 morph-04 kernel: 7140 lk 11,4daa006 id 10214 0,5 4 Feb 4 15:27:55 morph-04 kernel: Feb 4 15:27:55 morph-04 kernel: lock_dlm: Assertion failed on line 410 of file /usr/src/build/514108-i686/BUILD/gfs-kernel-2.6.9-16/src/dlm/lock.c Feb 4 15:27:55 morph-04 kernel: lock_dlm: assertion: "!error" Feb 4 15:27:55 morph-04 kernel: lock_dlm: time = 670979 Feb 4 15:27:55 morph-04 kernel: gfs3: num=11,4daa006 err=-22 cur=0 req=5 lkf=4 Feb 4 15:27:55 morph-04 kernel: Feb 4 15:27:55 morph-04 kernel: ------------[ cut here ]------------ Feb 4 15:27:55 morph-04 kernel: kernel BUG at /usr/src/build/514108-i686/BUILD/gfs-kernel-2.6.9-16/src/dlm/lock.c:410! Feb 4 15:27:55 morph-04 kernel: invalid operand: 0000 [#1] Feb 4 15:27:55 morph-04 kernel: Modules linked in: gnbd(U) lock_nolock(U) gfs(U) lock_dlm(U) dlm(U) cman(U) lock_harness(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc uhci_hcd hw_random e1000 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod Feb 4 15:27:55 morph-04 kernel: CPU: 0 Feb 4 15:27:55 morph-04 kernel: EIP: 0060:[<f8972ed9>] Tainted: GF VLI Feb 4 15:27:55 morph-04 kernel: EFLAGS: 00010246 (2.6.9-5.EL) Feb 4 15:27:55 morph-04 kernel: EIP is at do_dlm_lock+0x14e/0x168 [lock_dlm] Feb 4 15:27:55 morph-04 kernel: eax: 00000001 ebx: ffffffea ecx: f89785da edx: f4abde18 Feb 4 15:27:56 morph-04 kernel: esi: f8972ef8 edi: f5127600 ebp: f1651580 esp: f4abde14 Feb 4 15:27:56 morph-04 kernel: ds: 007b es: 007b ss: 0068 Feb 4 15:27:56 morph-04 kernel: Process doio (pid: 7140, threadinfo=f4abd000 task=f167acd0) Feb 4 15:27:56 morph-04 kernel: Stack: f89785da 20202020 31312020 20202020 20202020 61643420 36303061 00000018 Feb 4 15:27:56 morph-04 kernel: f16515c0 f16515c0 f1651580 f4abde50 00000000 f8972f4d f16515bc 1d244b3c Feb 4 15:27:56 morph-04 kernel: 00000000 0000000a f8978635 00000000 00000000 f1651580 00000001 f4abde88 Feb 4 15:27:56 morph-04 kernel: Call Trace: Feb 4 15:27:56 morph-04 kernel: [<f8972f4d>] do_dlm_lock_sync+0x50/0x63 [lock_dlm] Feb 4 15:27:56 morph-04 kernel: [<f8975196>] lock_resource+0x6e/0x91 [lock_dlm] Feb 4 15:27:56 morph-04 kernel: [<f8976b01>] lm_dlm_plock+0x13f/0x248 [lock_dlm] Feb 4 15:27:56 morph-04 kernel: [<f8b76606>] gfs_lm_plock+0x33/0x42 [gfs] Feb 4 15:27:56 morph-04 kernel: [<f8b8291d>] gfs_lock+0xe3/0xf1 [gfs] Feb 4 15:27:56 morph-04 kernel: [<f8b8283a>] gfs_lock+0x0/0xf1 [gfs] Feb 4 15:27:56 morph-04 kernel: [<c0179a32>] fcntl_setlk+0x156/0x291 Feb 4 15:27:56 morph-04 kernel: [<f88800a7>] qla2x00_done+0x403/0x420 [qla2xxx] Feb 4 15:27:56 morph-04 kernel: [<c0175714>] do_fcntl+0x194/0x1dd Feb 4 15:27:56 morph-04 kernel: [<c0175823>] sys_fcntl64+0x6c/0x7d Feb 4 15:27:56 morph-04 kernel: [<c0301bfb>] syscall_call+0x7/0xb Feb 4 15:27:56 morph-04 kernel: [<c030007b>] interruptible_sleep_on_timeout+0x91/0x1da Feb 4 15:27:56 morph-04 kernel: Code: 26 50 0f bf 45 24 50 53 ff 75 08 ff 75 04 ff 75 0c ff 77 18 68 35 87 97 f8 e8 69 d1 7a c7 83 c4 38 68 da 85 97 f8 e8 5c d1 7a c7 <0f> 0b 9a 01 6c 84 97 f8 68 dc 85 97 f8 e8 a8 c5 7a c7 83 c4 20 Feb 4 15:27:56 morph-04 kernel: SM: 00000001 sm_stop: SG still joined Feb 4 15:27:56 morph-04 kernel: SM: 01000003 sm_stop: SG still joined Feb 4 15:27:56 morph-04 kernel: SM: 02000005 sm_stop: SG still joined Feb 4 
15:27:56 morph-04 kernel: 123) rq 5 1d0022 " 7 4daa00a" Feb 4 15:27:56 morph-04 kernel: gfs0 (7123) cv 0 30282 " 11 4daa00a" Feb 4 15:27:56 morph-04 kernel: gfs3 (7140) cv 5 10214 " 11 4daa006" Feb 4 15:27:56 morph-04 kernel: gfs3 (7140) rq 5 3701dc " 7 4daa006" Feb 4 15:27:56 morph-04 kernel: gfs3 (7140) cv 0 10214 " 11 4daa006"
Marking this as failsqa unless you think this is a different bug.
Internally this is a different cause, but as the external symptoms are the same I'll keep it in this bug. The "all hell breaks loose" is separate.
I'm working a bit on hunches here, but it looks like it could be the DLM blocking cman while it does a /lot/ of recovery.

grep dlm /proc/slabinfo | cut -b1-40

shows useful info for this sort of thing. This mod should help greatly.

Checking in lowcomms.c;
/cvs/cluster/cluster/dlm-kernel/src/lowcomms.c,v <-- lowcomms.c
new revision: 1.28; previous revision: 1.27
done
Checking in lowcomms.c;
/cvs/cluster/cluster/dlm-kernel/src/lowcomms.c,v <-- lowcomms.c
new revision: 1.22.2.5; previous revision: 1.22.2.4
done

I've set it NEEDINFO rather than MODIFIED because I can't seem to reproduce it. That might just be because I can't get enough locks up on my 256M machines, or it might be something else. I'm sure you'll keep me informed.
If/when this happens again, please try to include the following from the remaining nodes (if at all possible):

/proc/slabinfo
/proc/loadavg
/proc/meminfo
/proc/buddyinfo
/proc/vmstat
/proc/cluster/dlm_debug
/proc/cluster/dlm_locks (of all lockspaces if poss)
any possibly interesting syslog errors preceding the node leaving

The stuff afterwards is not really germane to this bug. Thanks.
It seems likely that a dlm thread is hogging the CPU during some recoveries, causing this. Testing to see if additional schedule()s in dlm recovery will fix it.
per Dave's comments, I'm adding the comments from Derek and myself from bz 128318 to this bug and moving from NEEDINFO to REASSIGNED as we are both seeing this. comment #5: I think this may be the same bug. Was running a four node cluster with link-08 running 'while :; do placemaker -d 7 -w 3; find; rm -rf place_root; done'. link-10 was running a mount/umount loop, and link-12 was looping on bonnie++. After a couple of hours link-08 hit the assertion. I will put full logfiles in ~danderso/bugs/128318. Note: This is a mixed-arch cluster. link-10,link-11,link12 are i686 and link-08 is an x86_64 opteron. Note: The placemaker tool is in the sistina-test tree if wanted/needed. lock_dlm: Assertion failed on line 352 of file /usr/src/build/522379-x86_64/BUILD/smp/src/dlm/lock.c lock_dlm: assertion: "!error" lock_dlm: time = 4305526027 data1: error=-22 num=2,825c37 ----------- [cut here ] --------- [please bite here ] --------- Kernel BUG at lock:352 invalid operand: 0000 [1] SMP CPU 1 Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) parport_pc lp parport autofs4 dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket pcmcia_core button battery ac ohci_hcd hw_random tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc mptscsih mptbase sd_mod scsi_mod Pid: 3380, comm: gfs_glockd Tainted: G M 2.6.9-5.ELsmp RIP: 0010:[<ffffffffa0268804>] <ffffffffa0268804>{:lock_dlm:do_dlm_unlock+189} RSP: 0018:000001001e9f5de8 EFLAGS: 00010212 RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 0000000100000000 RDX: ffffffff803c7508 RSI: 0000000000000246 RDI: ffffffff803c7500 RBP: 000001001cbb6dc0 R08: ffffffff803c7508 R09: 00000000ffffffea R10: 0000000000000097 R11: 0000000000000097 R12: 000001001c637c9c R13: ffffff000016a000 R14: ffffffffa0264d20 R15: 000001001c637c70 FS: 0000002a95563b00(0000) GS:ffffffff804bf380(0000) knlGS:00000000f7ff06c0 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000002a95557000 CR3: 000000001ffb2000 CR4: 00000000000006e0 Process gfs_glockd (pid: 3380, threadinfo 000001001e9f4000, task 000001001e507030) Stack: 0000000000000000 ffffff000016a000 000001001c637c70 ffffffffa0268b7e 0000000000000001 ffffffffa023051f 0000000000000001 ffffffffa022709c 000001001fdd4500 000001001c637c70 Call Trace:<ffffffffa0268b7e>{:lock_dlm:lm_dlm_unlock+15} <ffffffffa023051f>{:gfs:gfs_lm_unlock+41} <ffffffffa022709c>{:gfs:gfs_glock_drop_th+290} <ffffffffa0225845>{:gfs:run_queue+314} <ffffffffa0225a9a>{:gfs:unlock_on_glock+37} <ffffffffa0225b90>{:gfs:gfs_reclaim_glock+234} <ffffffffa021a61a>{:gfs:gfs_glockd+61} <ffffffff8013176a>{default_wake_function+0} <ffffffff8013176a>{default_wake_function+0} <ffffffff80110c23>{child_rip+8} <ffffffffa021a5dd>{:gfs:gfs_glockd+0} <ffffffff80110c1b>{child_rip+0} Code: 0f 0b 13 cf 26 a0 ff ff ff ff 60 01 48 c7 c7 18 cf 26 a0 31 RIP <ffffffffa0268804>{:lock_dlm:do_dlm_unlock+189} RSP <000001001e9f5de8> <0>Kernel panic - not syncing: Oops comment #7: I've seen this now too, here's everything that I could grab above the assert: dlm: dlm_unlock: lkid 50264 lockspace not found ror -105 1a0019 gfs0 remote_stage error -105 1e0298 gfs1 remote_stage error -105 160189 gfs0 remote_stage error -105 25004d gfs1 remote_stage error -105 1801e2 gfs5 remote_stage error -105 1900db gfs3 remote_stage error -105 170275 gfs4 remote_stage error -105 1801ac gfs4 remote_stage error -105 160310 gfs9 remote_stage error -105 1d01dd gfs7 remote_stage error -105 1f00cb gfs9 remote_stage error -105 140104 gfs0 remote_stage error -105 1f018f gfs5 remote_stage error -105 
1202a1 gfs3 remote_stage error -105 1703d5 gfs3 remote_stage error -105 1800c6 gfs5 remote_stage error -105 180295 gfs2 remote_stage error -105 170117 gfs6 remote_stage error -105 130274 gfs4 remote_stage error -105 1603cf gfs7 remote_stage error -105 1503a0 gfs9 remote_stage error -105 160071 gfs1 remote_stage error -105 1c0216 gfs2 remote_stage error -105 1b00f7 gfs7 remote_stage error -105 140136 gfs2 remote_stage error -105 2403d5 gfs5 remote_stage error -105 1902e1 gfs3 remote_stage error -105 1401ce gfs4 remote_stage error -105 13025d 3be sts 0 0 9293 ex punlock 0 9293 en plock 7,37 9293 lk 11,37 id 603be 0,5 4 7131 qc 11,37 0,5 id 603be sts 0 0 9293 req 7,37 ex 2ec187-2ed3d9 lkf 2000 wait 1 9293 lk 7,37 id 0 -1,5 2000 9293 lk 11,37 id 603be 5,0 4 7131 qc 7,37 -1,5 id 1f0358 sts 0 0 7131 qc 11,37 5,0 id 603be sts 0 0 9293 ex plock 0 9293 en punlock 7,37 9293 lk 11,37 id 603be 0,5 4 7131 qc 11,37 0,5 id 603be sts 0 0 9293 remove 7,37 9293 un 7,37 1f0358 5 0 7131 qc 7,37 5,5 id 1f0358 sts -65538 0 9293 lk 11,37 id 603be 5,0 4 7131 qc 11,37 5,0 id 603be sts 0 0 9293 ex punlock 0 9293 en plock 7,37 9293 lk 11,37 id 603be 0,5 4 7131 qc 11,37 0,5 id 603be sts 0 0 9293 req 7,37 ex 2ed3d9-2ed7dd lkf 2000 wait 1 9293 lk 7,37 id 0 -1,5 2000 9293 lk 11,37 id 603be 5,0 4 7131 qc 7,37 -1,5 id 2403a4 sts 0 0 7131 qc 11,37 5,0 id 603be sts 0 0 9293 ex plock 0 9293 en punlock 7,37 9293 lk 11,37 id 603be 0,5 4 7131 qc 11,37 0,5 id 603be sts 0 0 9293 remove 7,37 9293 un 7,37 2403a4 5 0 7131 qc 7,37 5,5 id 2403a4 sts -65538 0 9293 lk 11,37 id 603be 5,0 4 7131 qc 11,37 5,0 id 603be sts 0 0 9293 ex punlock 0 9293 en plock 7,37 9293 lk 11,37 id 603be 0,5 4 7131 qc 11,37 0,5 id 603be sts 0 0 9293 req 7,37 ex 2ed7de-2eddd4 lkf 2000 wait 1 9293 lk 7,37 id 0 -1,5 2000 9293 lk 11,37 id 603be 5,0 4 7131 qc 7,37 -1,5 id 1c018a sts 0 0 7131 qc 11,37 5,0 id 603be sts 0 0 9293 ex plock 0 9293 en punlock 7,37 9293 lk 11,37 id 603be 0,5 4 7131 qc 11,37 0,5 id 603be sts 0 0 9293 remove 7,37 9293 un 7,37 1c018a 5 0 7131 qc 7,37 5,5 id 1c018a sts -65538 0 9293 lk 11,37 id 603be 5,0 4 7131 qc 11,37 5,0 id 603be sts 0 0 9293 ex punlock 0 9293 en plock 7,37 9293 lk 11,37 id 603be 0,5 4 7131 qc 11,37 0,5 id 603be sts 0 0 9293 req 7,37 ex 2eddd5-2edffc lkf 2000 wait 1 9293 lk 7,37 id 0 -1,5 2000 9293 lk 11,37 id 603be 5,0 4 7131 qc 7,37 -1,5 id 250113 sts 0 0 7131 qc 11,37 5,0 id 603be sts 0 0 9293 ex plock 0 9293 en punlock 7,37 9293 lk 11,37 id 603be 0,5 4 7131 qc 11,37 0,5 id 603be sts 0 0 9293 remove 7,37 9293 un 7,37 250113 5 0 7131 qc 7,37 5,5 id 250113 sts -65538 0 9293 lk 11,37 id 603be 5,0 4 7131 qc 11,37 5,0 id 603be sts 0 0 9293 ex punlock 0 9293 en plock 7,37 9293 lk 11,37 id 603be 0,5 4 7131 qc 11,37 0,5 id 603be sts 0 0 9293 req 7,37 ex e12a9-26b17f lkf 2000 wait 1 9293 lk 7,37 id 0 -1,5 2000 9293 lk 11,37 id 603be 5,0 4 7131 qc 7,37 -1,5 id 200044 sts 0 0 7131 qc 11,37 5,0 id 603be sts 0 0 9293 ex plock 0 9334 en punlock 7,2d 9334 lk 11,2d id 201d9 0,5 4 7131 qc 11,2d 0,5 id 201d9 sts 0 0 9334 remove 7,2d 9334 un 7,2d 160147 5 0 7131 qc 7,2d 5,5 id 160147 sts -65538 0 9334 lk 11,2d id 201d9 5,0 4 7131 qc 11,2d 5,0 id 201d9 sts 0 0 9334 ex punlock 0 9334 en plock 7,2d 9334 lk 11,2d id 201d9 0,5 4 7131 qc 11,2d 0,5 id 201d9 sts 0 0 9334 req 7,2d ex 0-64da lkf 2000 wait 1 9334 lk 7,2d id 0 -1,5 2000 9334 lk 11,2d id 201d9 5,0 4 7131 qc 7,2d -1,5 id 14038e sts 0 0 7131 qc 11,2d 5,0 id 201d9 sts 0 0 9334 ex plock 0 9293 en punlock 7,37 9293 lk 11,37 id 603be 0,5 4 7131 qc 11,37 0,5 id 603be sts 0 0 9293 remove 7,37 
9293 un 7,37 200044 5 0 7131 qc 7,37 5,5 id 200044 sts -65538 0 9293 lk 11,37 id 603be 5,0 4 7131 qc 11,37 5,0 id 603be sts 0 0 9293 ex punlock 0 9293 en plock 7,37 9293 lk 11,37 id 603be 0,5 4 7131 qc 11,37 0,5 id 603be sts 0 0 9293 req 7,37 ex 2ecc8b-2ed161 lkf 2000 wait 1 9293 lk 7,37 id 0 -1,5 2000 9293 lk 11,37 id 603be 5,0 4 7131 qc 7,37 -1,5 id 1e0271 sts 0 0 7131 qc 11,37 5,0 id 603be sts 0 0 9293 ex plock 0 9293 en punlock 7,37 9293 lk 11,37 id 603be 0,5 4 7131 qc 11,37 0,5 id 603be sts 0 0 9293 remove 7,37 9293 un 7,37 1e0271 5 0 7131 qc 7,37 5,5 id 1e0271 sts -65538 0 9293 lk 11,37 id 603be 5,0 4 7131 qc 11,37 5,0 id 603be sts 0 0 9293 ex punlock 0 9334 en punlock 7,2d 9334 lk 11,2d id 201d9 0,5 4 9336 en punlock 7,1ffe9 9336 lk 11,1ffe9 id 2006e 0,5 4 8494 un 2,20242 e012f 5 0 8506 un 8,0 100263 5 8 9339 lk 2,20282 id 0 -1,5 0 8429 lk 8,0 id 0 -1,5 8 9315 en punlock 7,10081 9315 lk 11,10081 id 10045 0,5 4 8186 un 2,4ff55 c03b5 5 0 8109 un 2,204a7 12028a 5 0 7964 un 2,400aa a03e0 5 0 8263 un 2,20144 c03c7 5 0 7887 un 2,100a9 200e0 5 0 8340 un 2,20281 50264 5 0 lock_dlm: Assertion failed on line 352 of file /usr/src/build/522381-i686/BUILD/gfs-kernel-2.6.9-23/src/dlm/lock.c lock_dlm: assertion: "!error" lock_dlm: time = 1744300 gfs6: error=-22 num=2,20281 ------------[ cut here ]------------ kernel BUG at /usr/src/build/522381-i686/BUILD/gfs-kernel-2.6.9-23/src/dlm/lock.c:352! invalid operand: 0000 [#13] Modules linked in: gnbd(U) lock_nolock(U) gfs(U) lock_dlm(U) dlm(U) cman(U) lock_harness(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc button battery ac uhci_hcd hw_random e1000 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod CPU: 0 EIP: 0060:[<f8962ce9>] Not tainted VLI EFLAGS: 00010246 (2.6.9-5.EL) EIP is at do_dlm_unlock+0xa2/0xb7 [lock_dlm] eax: 00000001 ebx: ffffffea ecx: f896854a edx: c4599f44 esi: f5e4b680 edi: f5e4b680 ebp: f8bc9000 esp: c4599f40 ds: 007b es: 007b ss: 0068 Process gfs_glockd (pid: 8340, threadinfo=c4599000 task=c3ffd320) Stack: f896854a f8bc9000 00000001 f8962ff4 f8b964c0 c67cbd7c f8bc9000 f8bc6640 f8b89447 c67cbd7c f8bc6640 c4599fb4 f8b87ef9 c67cbd7c 00000001 f8b880b6 c67cbd7c c67cbd7c f8b8836f c67cbe20 f8b8b9b4 c4599000 c4599fc0 f8b7bbf2 Call Trace: [<f8962ff4>] lm_dlm_unlock+0xe/0x16 [lock_dlm] [<f8b964c0>] gfs_lm_unlock+0x2b/0x40 [gfs] [<f8b89447>] gfs_glock_drop_th+0x17a/0x1b0 [gfs] [<f8b87ef9>] rq_demote+0x15c/0x1da [gfs] [<f8b880b6>] run_queue+0x5a/0xc1 [gfs] [<f8b8836f>] unlock_on_glock+0x6e/0xc8 [gfs] [<f8b8b9b4>] gfs_reclaim_glock+0x257/0x2ae [gfs] [<f8b7bbf2>] gfs_glockd+0x38/0xde [gfs] [<c011b9ea>] default_wake_function+0x0/0xc [<c0301b1a>] ret_from_fork+0x6/0x14 [<c011b9ea>] default_wake_function+0x0/0xc [<f8b7bbba>] gfs_glockd+0x0/0xde [gfs] [<c01041d9>] kernel_thread_helper+0x5/0xb Code: e8 72 d3 7b c7 ff 76 08 8b 06 ff 76 04 ff 76 0c 53 ff 70 18 68 6a 86 96 f8 e8 59 d3 7b c7 83 c4 2c 68 4a 85 96 f8 e8 4c d3 7b c7 <0f> 0b 60 01 dc 83 96 f8 68 4c 85 96 f8 e8 98 c7 7b c7 5b 5e c3
Dave spotted this one: the sequence counter eventually wrapped to zero, at which point the ACK got ignored.

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v <-- cnxman.c
new revision: 1.42.2.9; previous revision: 1.42.2.8
done
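For the record, the usual way a wrapping sequence counter causes ACKs to be thrown away is a comparison that isn't wrap-safe. The snippet below only illustrates that general class of bug, assuming a 16-bit counter; the actual cnxman.c change may be different (e.g. special-casing a sequence number of zero):

    /* Illustrative only -- not the real cnxman.c fix. */

    /* Broken: once the 16-bit counter wraps past zero, every new ACK
     * looks "older" than the last one seen and gets ignored. */
    static int ack_is_stale_broken(unsigned short seq, unsigned short last_seq)
    {
            return seq <= last_seq;
    }

    /* Wrap-safe: compare via signed distance modulo 2^16. */
    static int ack_is_stale(unsigned short seq, unsigned short last_seq)
    {
            return (short)(seq - last_seq) <= 0;
    }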
From Corey: reproduced this on morph-03 last night after 56 iterations of revolver. From the SYSLOG:

Mar 16 14:15:08 morph-03 kernel: CMAN: node morph-04.lab.msp.redhat.com rejoining
Mar 16 14:15:15 morph-03 kernel: CMAN: node morph-03.lab.msp.redhat.com has been removed from the cluster : No response to messages
Mar 16 14:15:15 morph-03 kernel: CMAN: killed by NODEDOWN message
Mar 16 14:15:15 morph-03 kernel: CMAN: we are leaving the cluster. No response to messages

Corey, could you check quickly to make sure the cman kernel module isn't an old one?
Could you also please check the version of the cman_tool binary you were using? A recent cman_tool fix setting socket priorities could also easily be related to this.
I was using the rpms built on the 14th:

cman-1.0-0.pre29.0.i386.rpm
cman-kernheaders-2.6.9-28.0.i686.rpm
cman-kernel-hugemem-2.6.9-28.0.i686.rpm

My machines were just reimaged for 6.0 testing so I'm unable to find out what version that is.
There must be another cause here as well. I had I/O running most of the night on a 6 node cluster and tank-06 ended up kicking out tank-05, and then tank-01 did the fencing. tank-06: Mar 29 08:26:23 tank-06 kernel: CMAN: removing node tank-05.lab.msp.redhat.com from the cluster : Missed too many heartbeats Mar 29 08:26:29 tank-06 fenced[5563]: fencing deferred to tank-01.lab.msp.redhat.com Mar 29 08:27:00 tank-06 kernel: GFS: fsid=tank-cluster:gfs1.5: jid=4: Trying to acquire journal lock... Mar 29 08:27:00 tank-06 kernel: GFS: fsid=tank-cluster:gfs0.5: jid=4: Trying to acquire journal lock... Mar 29 08:27:00 tank-06 kernel: GFS: fsid=tank-cluster:gfs1.5: jid=4: Busy Mar 29 08:27:00 tank-06 kernel: GFS: fsid=tank-cluster:gfs0.5: jid=4: Busy tank-01: Mar 29 08:27:04 tank-01 kernel: CMAN: removing node tank-05.lab.msp.redhat.com from the cluster : Shutdown Mar 29 08:27:29 tank-01 fenced[5846]: tank-05.lab.msp.redhat.com not a cluster member after 20 secpost_fail_delay Mar 29 08:27:29 tank-01 fenced[5846]: fencing node "tank-05.lab.msp.redhat.com" Mar 29 08:27:33 tank-01 kernel: lpfc 0000:02:01.0: 0:0263 Cannot block scsi target. target ptr xf78ea6e0 Mar 29 08:27:34 tank-01 fenced[5846]: fence "tank-05.lab.msp.redhat.com" success Mar 29 08:27:41 tank-01 kernel: GFS: fsid=tank-cluster:gfs1.0: jid=4: Trying to acquire journal lock... Mar 29 08:27:41 tank-01 kernel: GFS: fsid=tank-cluster:gfs0.0: jid=4: Trying to acquire journal lock... Mar 29 08:27:41 tank-01 kernel: GFS: fsid=tank-cluster:gfs0.0: jid=4: Busy Mar 29 08:27:41 tank-01 kernel: GFS: fsid=tank-cluster:gfs1.0: jid=4: Busy tank-05: Mar 29 08:25:48 tank-05 kernel: CMAN: Being told to leave the cluster by node 4 Mar 29 08:25:48 tank-05 kernel: CMAN: we are leaving the cluster. Mar 29 08:25:48 tank-05 ccsd[5655]: Cluster manager shutdown. Attemping to reconnect... 
Mar 29 08:25:48 tank-05 kernel: SM: 00000001 sm_stop: SG still joined Mar 29 08:25:48 tank-05 kernel: SM: 01000003 sm_stop: SG still joined Mar 29 08:25:48 tank-05 kernel: SM: 02000005 sm_stop: SG still joined Mar 29 08:25:49 tank-05 kernel: dlm: dlm_unlock: lkid 5250061 lockspace not found Mar 29 08:25:49 tank-05 kernel: resend 3a013e lq 1 flg 200000 node -1/-1 " 7 Mar 29 08:25:49 tank-05 kernel: gfs0 resent 5 requests Mar 29 08:25:49 tank-05 kernel: gfs0 recover event 27 finished Mar 29 08:25:49 tank-05 kernel: clvmd mark waiting requests Mar 29 08:25:49 tank-05 kernel: clvmd marked 0 requests Mar 29 08:25:49 tank-05 kernel: clvmd purge locks of departed nodes Mar 29 08:25:49 tank-05 kernel: clvmd purged 0 locks Mar 29 08:25:49 tank-05 kernel: clvmd update remastered resources Mar 29 08:25:49 tank-05 kernel: clvmd updated 0 resources Mar 29 08:25:49 tank-05 kernel: clvmd rebuild locks Mar 29 08:25:49 tank-05 kernel: clvmd rebuilt 0 locks Mar 29 08:25:49 tank-05 kernel: clvmd recover event 27 done Mar 29 08:25:49 tank-05 kernel: clvmd move flags 0,0,1 ids 22,27,27 Mar 29 08:25:49 tank-05 kernel: clvmd process held requests Mar 29 08:25:49 tank-05 kernel: clvmd processed 0 requests Mar 29 08:25:49 tank-05 kernel: clvmd resend marked requests Mar 29 08:25:49 tank-05 kernel: clvmd resent 0 requests Mar 29 08:25:49 tank-05 kernel: clvmd recover event 27 finished Mar 29 08:25:49 tank-05 kernel: gfs1 mark waiting requests Mar 29 08:25:49 tank-05 kernel: gfs1 mark 2c0152 lq 1 nodeid -1 Mar 29 08:25:49 tank-05 kernel: gfs1 marked 1 requests Mar 29 08:25:49 tank-05 kernel: gfs1 purge locks of departed nodes Mar 29 08:25:49 tank-05 kernel: gfs1 purged 1 locks Mar 29 08:25:49 tank-05 kernel: gfs1 update remastered resources Mar 29 08:25:49 tank-05 kernel: gfs1 updated 2 resources Mar 29 08:25:49 tank-05 kernel: gfs1 rebuild locks Mar 29 08:25:49 tank-05 kernel: gfs1 rebuilt 2 locks Mar 29 08:25:49 tank-05 kernel: gfs1 recover event 27 done Mar 29 08:25:49 tank-05 kernel: gfs1 move flags 0,0,1 ids 25,27,27 Mar 29 08:25:49 tank-05 kernel: gfs1 process held requests Mar 29 08:25:49 tank-05 kernel: gfs1 processed 0 requests Mar 29 08:25:49 tank-05 kernel: gfs1 resend marked requests Mar 29 08:25:49 tank-05 kernel: gfs1 resend 2c0152 lq 1 flg 200000 node -1/-1 " 7 Mar 29 08:25:49 tank-05 kernel: gfs1 resent 1 requests Mar 29 08:25:49 tank-05 kernel: gfs1 recover event 27 finished Mar 29 08:25:49 tank-05 kernel: emove 7,7307ded Mar 29 08:25:49 tank-05 kernel: 6540 un 7,7307ded 65802f7 5 0 Mar 29 08:25:49 tank-05 kernel: 6163 qc 7,7307ded 5,5 id 65802f7 sts -65538 0 Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4 Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6540 ex punlock 0 Mar 29 08:25:49 tank-05 kernel: 6540 en plock 7,7307ded Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4 Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 0,5 id 20365 sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6540 req 7,7307ded ex 16cad4-16d0f7 lkf 2000 wait 1 Mar 29 08:25:49 tank-05 kernel: 6540 lk 7,7307ded id 0 -1,5 2000 Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4 Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6163 qc 7,7307ded -1,5 id 65102fe sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6540 ex plock 0 Mar 29 08:25:49 tank-05 kernel: 6540 en punlock 7,7307ded Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4 Mar 29 08:25:49 tank-05 kernel: 6163 qc 
11,7307ded 0,5 id 20365 sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6540 remove 7,7307ded Mar 29 08:25:49 tank-05 kernel: 6540 un 7,7307ded 65102fe 5 0 Mar 29 08:25:49 tank-05 kernel: 6163 qc 7,7307ded 5,5 id 65102fe sts -65538 0 Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4 Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6540 ex punlock 0 Mar 29 08:25:49 tank-05 kernel: 6540 en plock 7,7307ded Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4 Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 0,5 id 20365 sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6540 req 7,7307ded ex 16d0f7-16f243 lkf 2000 wait 1 Mar 29 08:25:49 tank-05 kernel: 6540 lk 7,7307ded id 0 -1,5 2000 Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4 Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6163 qc 7,7307ded -1,5 id 69902a0 sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6540 ex plock 0 Mar 29 08:25:49 tank-05 kernel: 6540 en punlock 7,7307ded Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4 Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 0,5 id 20365 sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6540 remove 7,7307ded Mar 29 08:25:49 tank-05 kernel: 6540 un 7,7307ded 69902a0 5 0 Mar 29 08:25:49 tank-05 kernel: 6163 qc 7,7307ded 5,5 id 69902a0 sts -65538 0 Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4 Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6540 ex punlock 0 Mar 29 08:25:49 tank-05 kernel: 6540 en plock 7,7307ded Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4 Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 0,5 id 20365 sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6540 req 7,7307ded ex 1747e3-174a23 lkf 2000 wait 1 Mar 29 08:25:49 tank-05 kernel: 6540 lk 7,7307ded id 0 -1,5 2000 Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4 Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6163 qc 7,7307ded -1,5 id 63b02ec sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6540 ex plock 0 Mar 29 08:25:49 tank-05 kernel: 6540 en punlock 7,7307ded Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4 Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 0,5 id 20365 sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6540 remove 7,7307ded Mar 29 08:25:49 tank-05 kernel: 6540 un 7,7307ded 63b02ec 5 0 Mar 29 08:25:49 tank-05 kernel: 6163 qc 7,7307ded 5,5 id 63b02ec sts -65538 0 Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4 Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6540 ex punlock 0 Mar 29 08:25:49 tank-05 kernel: 6540 en plock 7,7307ded Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4 Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 0,5 id 20365 sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6540 req 7,7307ded ex 1761a5-176d9d lkf 2000 wait 1 Mar 29 08:25:49 tank-05 kernel: 6540 lk 7,7307ded id 0 -1,5 2000 Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4 Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6163 qc 7,7307ded -1,5 id 61401d2 sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6540 ex plock 0 Mar 29 08:25:49 tank-05 kernel: 6540 en punlock 7,7307ded Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4 Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 
0,5 id 20365 sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6540 remove 7,7307ded Mar 29 08:25:49 tank-05 kernel: 6540 un 7,7307ded 61401d2 5 0 Mar 29 08:25:49 tank-05 kernel: 6163 qc 7,7307ded 5,5 id 61401d2 sts -65538 0 Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4 Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6540 ex punlock 0 Mar 29 08:25:49 tank-05 kernel: 6540 en plock 7,7307ded Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4 Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 0,5 id 20365 sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6540 req 7,7307ded ex 0-156acf lkf 2000 wait 1 Mar 29 08:25:49 tank-05 kernel: 6540 lk 7,7307ded id 0 -1,5 2000 Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4 Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6163 qc 7,7307ded -1,5 id 69200eb sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6540 ex plock 0 Mar 29 08:25:49 tank-05 kernel: 6540 en punlock 7,7307ded Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4 Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 0,5 id 20365 sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6540 remove 7,7307ded Mar 29 08:25:49 tank-05 kernel: 6540 un 7,7307ded 69200eb 5 0 Mar 29 08:25:49 tank-05 kernel: 6163 qc 7,7307ded 5,5 id 69200eb sts -65538 0 Mar 29 08:25:49 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4 Mar 29 08:25:49 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0 Mar 29 08:25:49 tank-05 kernel: 6540 ex punlock 0 Mar 29 08:25:50 tank-05 kernel: 6540 en plock 7,7307ded Mar 29 08:25:50 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4 Mar 29 08:25:50 tank-05 kernel: 6163 qc 11,7307ded 0,5 id 20365 sts 0 0 Mar 29 08:25:50 tank-05 kernel: 6540 req 7,7307ded ex 16a777-1761a7 lkf 2000 wait 1 Mar 29 08:25:50 tank-05 kernel: 6540 lk 7,7307ded id 0 -1,5 2000 Mar 29 08:25:50 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4 Mar 29 08:25:50 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0 Mar 29 08:25:50 tank-05 kernel: 6163 qc 7,7307ded -1,5 id 64e0201 sts 0 0 Mar 29 08:25:50 tank-05 kernel: 6540 ex plock 0 Mar 29 08:25:50 tank-05 kernel: 6540 en punlock 7,7307ded Mar 29 08:25:50 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4 Mar 29 08:25:50 tank-05 kernel: 6163 qc 11,7307ded 0,5 id 20365 sts 0 0 Mar 29 08:25:50 tank-05 kernel: 6540 remove 7,7307ded Mar 29 08:25:50 tank-05 kernel: 6540 un 7,7307ded 64e0201 5 0 Mar 29 08:25:50 tank-05 kernel: 6163 qc 7,7307ded 5,5 id 64e0201 sts -65538 0 Mar 29 08:25:50 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4 Mar 29 08:25:50 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0 Mar 29 08:25:50 tank-05 kernel: 6540 ex punlock 0 Mar 29 08:25:50 tank-05 kernel: 6540 en plock 7,7307ded Mar 29 08:25:50 tank-05 kernel: 6540 lk 11,7307ded id 20365 0,5 4 Mar 29 08:25:50 tank-05 kernel: 6163 qc 11,7307ded 0,5 id 20365 sts 0 0 Mar 29 08:25:50 tank-05 kernel: 6540 req 7,7307ded ex 1761a7-176494 lkf 2000 wait 1 Mar 29 08:25:50 tank-05 kernel: 6540 lk 7,7307ded id 0 -1,5 2000 Mar 29 08:25:50 tank-05 kernel: 6540 lk 11,7307ded id 20365 5,0 4 Mar 29 08:25:50 tank-05 kernel: 6163 qc 11,7307ded 5,0 id 20365 sts 0 0 Mar 29 08:25:50 tank-05 kernel: 6306 un 5,9282610 5250061 3 0 Mar 29 08:25:50 tank-05 kernel: Mar 29 08:25:50 tank-05 kernel: lock_dlm: Assertion failed on line 353 of file /usr/src/build/547226-i686/BUILD/hugemem/src/dlm/lock.c [root@tank-01 ~]# cat /proc/cluster/nodes Node Votes Exp Sts Name 1 1 6 M 
tank-01.lab.msp.redhat.com
   2    1    6   M   tank-04.lab.msp.redhat.com
   3    1    6   M   tank-02.lab.msp.redhat.com
   4    1    6   M   tank-06.lab.msp.redhat.com
   5    1    6   M   tank-03.lab.msp.redhat.com
   6    1    6   X   tank-05.lab.msp.redhat.com
hit this again last night on tank-06 after 25 iterations of revolver: Mar 29 19:51:04 tank-06 kernel: CMAN: node tank-01.lab.msp.redhat.com has been removed from the cluster : Missed too many heartbeats Mar 29 19:51:12 tank-06 kernel: CMAN: node tank-02.lab.msp.redhat.com has been removed from the cluster : No response to messages Mar 29 19:51:13 tank-06 fenced[5567]: fencing deferred to tank-03.lab.msp.redhat.com Mar 29 19:51:21 tank-06 sshd(pam_unix)[5811]: session opened for user root by (uid=0) Mar 29 19:51:22 tank-06 sshd(pam_unix)[5830]: session opened for user root by (uid=0) Mar 29 19:52:07 tank-06 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=5: Trying to acquire journal lock... Mar 29 19:52:07 tank-06 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=5: Looking at journal... Mar 29 19:52:08 tank-06 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=5: Acquiring the transaction lock... Mar 29 19:52:09 tank-06 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=5: Replaying journal... Mar 29 19:52:09 tank-06 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=5: Replayed 172 of 187 blocks Mar 29 19:52:09 tank-06 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=5: replays = 172, skips = 0, sames = 15 Mar 29 19:52:09 tank-06 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=5: Journal replayed in 1s Mar 29 19:52:09 tank-06 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=5: Done Mar 29 19:52:09 tank-06 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=3: Trying to acquire journal lock... Mar 29 19:52:09 tank-06 kernel: GFS: fsid=tank-cluster:gfs0.1: jid=3: Busy Mar 29 19:54:55 tank-06 kernel: CMAN: node tank-01.lab.msp.redhat.com rejoining Mar 29 19:55:00 tank-06 kernel: CMAN: node tank-02.lab.msp.redhat.com rejoining Mar 29 19:55:04 tank-06 kernel: CMAN: node tank-06.lab.msp.redhat.com has been removed from the cluster : No response to messages Mar 29 19:55:04 tank-06 kernel: CMAN: killed by NODEDOWN message Mar 29 19:55:04 tank-06 kernel: CMAN: we are leaving the cluster. No response to messages Mar 29 19:55:05 tank-06 kernel: dlm: gfs0: remote_stage error -105 3303be Mar 29 19:55:06 tank-06 kernel: dlm: dlm_unlock: lkid 2c03ea lockspace not found Mar 29 19:55:06 tank-06 kernel: 1 nodeid -1 Mar 29 19:55:06 tank-06 kernel: gfs0 marked 3 requests Mar 29 19:55:06 tank-06 kernel: gfs0 purge locks of departed nodes Mar 29 19:55:06 tank-06 kernel: gfs0 purged 2 locks Mar 29 19:55:06 tank-06 kernel: gfs0 update remastered resources Mar 29 19:55:06 tank-06 kernel: gfs0 updated 3129 resources Mar 29 19:55:06 tank-06 kernel: gfs0 rebuild locks Mar 29 19:55:06 tank-06 kernel: gfs0 rebuilt 3137 locks Mar 29 19:55:06 tank-06 kernel: gfs0 recover event 5 done Mar 29 19:55:06 tank-06 kernel: clvmd mark waiting requests Mar 29 19:55:06 tank-06 ccsd[5455]: Cluster manager shutdown. Attemping to reconnect... Not sure if the proc info from the other nodes left up is still valid but... 
[root@tank-03 ~]# cat /proc/cluster/dlm_debug rged 2 requests clvmd mark waiting requests clvmd marked 0 requests clvmd purge locks of departed nodes clvmd purged 1 locks clvmd update remastered resources clvmd updated 0 resources clvmd rebuild locks clvmd rebuilt 0 locks clvmd recover event 10 done gfs0 mark waiting requests gfs0 mark 220138 lq 1 nodeid -1 gfs0 mark 230230 lq 1 nodeid -1 gfs0 marked 2 requests gfs0 purge locks of departed nodes gfs0 purged 3136 locks gfs0 update remastered resources gfs0 updated 1 resources gfs0 rebuild locks gfs0 rebuilt 0 locks gfs0 recover event 10 done gfs0 move flags 0,0,1 ids 9,10,10 gfs0 process held requests gfs0 processed 0 requests gfs0 resend marked requests gfs0 resend 220138 lq 1 flg 200000 node -1/-1 " 2 gfs0 resend 230230 lq 1 flg 200000 node -1/-1 " 7 gfs0 resent 2 requests gfs0 recover event 10 finished clvmd move flags 0,0,1 ids 9,10,10 clvmd process held requests clvmd processed 0 requests clvmd resend marked requests clvmd resent 0 requests clvmd recover event 10 finished [root@tank-04 ~]# cat /proc/cluster/dlm_debug urged 0 requests gfs0 mark waiting requests gfs0 mark 9c027b lq 1 nodeid -1 gfs0 mark ae0235 lq 1 nodeid -1 gfs0 marked 2 requests gfs0 purge locks of departed nodes gfs0 purged 1 locks gfs0 update remastered resources gfs0 updated 1 resources gfs0 rebuild locks gfs0 rebuilt 1 locks gfs0 recover event 20 done gfs0 move flags 0,0,1 ids 19,20,20 gfs0 process held requests gfs0 processed 0 requests gfs0 resend marked requests gfs0 resend 9c027b lq 1 flg 200000 node -1/-1 " 7 gfs0 resend ae0235 lq 1 flg 200000 node -1/-1 " 7 gfs0 resent 2 requests gfs0 recover event 20 finished clvmd mark waiting requests clvmd marked 0 requests clvmd purge locks of departed nodes clvmd purged 0 locks clvmd update remastered resources clvmd updated 0 resources clvmd rebuild locks clvmd rebuilt 0 locks clvmd recover event 20 done clvmd move flags 0,0,1 ids 19,20,20 clvmd process held requests clvmd processed 0 requests clvmd resend marked requests clvmd resent 0 requests clvmd recover event 20 finished [root@tank-05 ~]# cat /proc/cluster/dlm_debug locks clvmd rebuilt 0 locks clvmd recover event 58 done gfs0 mark waiting requests gfs0 mark 12f01e7 lq 1 nodeid -1 gfs0 mark 15f0142 lq 1 nodeid -1 gfs0 mark 15c038f lq 1 nodeid -1 gfs0 mark 1770141 lq 1 nodeid -1 gfs0 marked 4 requests gfs0 purge locks of departed nodes gfs0 purged 2472 locks gfs0 update remastered resources gfs0 updated 1 resources gfs0 rebuild locks gfs0 rebuilt 1 locks gfs0 recover event 58 done gfs0 move flags 0,0,1 ids 57,58,58 gfs0 process held requests gfs0 processed 0 requests gfs0 resend marked requests gfs0 resend 12f01e7 lq 1 flg 200000 node -1/-1 " 5 gfs0 resend 15f0142 lq 1 flg 200000 node -1/-1 " 7 gfs0 resend 15c038f lq 1 flg 200000 node -1/-1 " 7 gfs0 resend 1770141 lq 1 flg 200000 node -1/-1 " 7 gfs0 resent 4 requests gfs0 recover event 58 finished clvmd move flags 0,0,1 ids 57,58,58 clvmd process held requests clvmd processed 0 requests clvmd resend marked requests clvmd resent 0 requests clvmd recover event 58 finished [root@tank-03 ~]# grep dlm /proc/slabinfo|cut -b1-40 dlm_conn 10 35 112 dlm_lvb/range 10140 12971 32 dlm_resdir(s) 2663 2992 44 dlm_resdir(l) 1 49 80 dlm_lkb 12704 17640 132 dlm_rsb(large) 1 20 196 dlm_rsb(small) 5900 5900 160 [root@tank-04 ~]# grep dlm /proc/slabinfo|cut -b1-40 dlm_conn 8 35 112 dlm_lvb/range 129 238 32 dlm_resdir(s) 2753 3080 44 dlm_resdir(l) 0 0 80 dlm_lkb 5697 5880 132 dlm_rsb(large) 1 20 196 dlm_rsb(small) 5682 5750 160 
[root@tank-05 ~]# grep dlm /proc/slabinfo|cut -b1-40
dlm_conn              10     35    112
dlm_lvb/range       7580   9996     32
dlm_resdir(s)       2698   4400     44
dlm_resdir(l)          0      0     80
dlm_lkb            14394  17070    132
dlm_rsb(large)         1     20    196
dlm_rsb(small)      7661   8600    160
This is going to be a pain but can you tcpdump port 6809 (or whatever you're running cman over) please?
This should be assigned to me
Here are the messages with the latest cman kernel. Still trying to get a tcpdump though... Apr 5 14:08:07 link-10 kernel: CMAN: removing node link-11.lab.msp.redhat.com from the cluster : Missed too many heartbeats Apr 5 14:08:11 link-10 sshd(pam_unix)[6368]: session opened for user root by (uid=0) Apr 5 14:08:28 link-10 fenced[5981]: link-11.lab.msp.redhat.com not a cluster member after 20 sec post_fail_delay Apr 5 14:08:28 link-10 fenced[5981]: fencing node "link-11.lab.msp.redhat.com" Apr 5 14:08:32 link-10 fenced[5981]: fence "link-11.lab.msp.redhat.com" success Apr 5 14:08:39 link-10 kernel: GFS: fsid=link-cluster:corey.1: jid=0: Trying to acquire journal lock... Apr 5 14:08:39 link-10 kernel: GFS: fsid=link-cluster:corey.1: jid=0: Busy Apr 5 14:10:45 link-10 kernel: CMAN: node link-11.lab.msp.redhat.com rejoining Apr 5 14:10:54 link-10 sshd(pam_unix)[6388]: session opened for user root by (uid=0) Apr 5 14:10:55 link-10 sshd(pam_unix)[6407]: session opened for user root by (uid=0) Apr 5 14:10:59 link-10 sshd(pam_unix)[6426]: session opened for user root by (uid=0) Apr 5 14:11:10 link-10 sshd(pam_unix)[6445]: session opened for user root by (uid=0) Apr 5 14:12:44 link-10 kernel: CMAN: removing node link-11.lab.msp.redhat.com from the cluster : Missed too many heartbeats Apr 5 14:12:49 link-10 sshd(pam_unix)[6464]: session opened for user root by (uid=0) Apr 5 14:12:52 link-10 kernel: CMAN: node link-10.lab.msp.redhat.com has been removed from the cluster : No response to messages Apr 5 14:12:52 link-10 kernel: CMAN: killed by NODEDOWN message Apr 5 14:12:52 link-10 kernel: CMAN: we are leaving the cluster. No response to messages Apr 5 14:12:52 link-10 ccsd[3811]: Cluster manager shutdown. Attemping to reconnect... Apr 5 14:12:52 link-10 kernel: SM: 00000001 sm_stop: SG still joined Apr 5 14:12:52 link-10 kernel: SM: 01000002 sm_stop: SG still joined Apr 5 14:12:52 link-10 kernel: SM: 02000004 sm_stop: SG still joined Apr 5 14:12:53 link-10 kernel: ,1,0 ids 5,7,5 Apr 5 14:12:53 link-10 kernel: clvmd move use event 7 Apr 5 14:12:53 link-10 kernel: clvmd recover event 7 Apr 5 14:12:53 link-10 kernel: clvmd add node 3 Apr 5 14:12:53 link-10 kernel: clvmd total nodes 4 Apr 5 14:12:53 link-10 kernel: clvmd rebuild resource directory Apr 5 14:12:53 link-10 kernel: clvmd rebuilt 0 resources Apr 5 14:12:53 link-10 kernel: clvmd purge requests Apr 5 14:12:54 link-10 kernel: clvmd purged 0 requests Apr 5 14:12:54 link-10 kernel: clvmd mark waiting requests Apr 5 14:12:54 link-10 kernel: clvmd marked 0 requests Apr 5 14:12:54 link-10 kernel: clvmd recover event 7 done Apr 5 14:12:54 link-10 kernel: clvmd move flags 0,0,1 ids 5,7,7 Apr 5 14:12:54 link-10 kernel: clvmd process held requests Apr 5 14:12:54 link-10 kernel: clvmd processed 0 requests Apr 5 14:12:54 link-10 kernel: clvmd resend marked requests Apr 5 14:12:54 link-10 kernel: clvmd resent 0 requests Apr 5 14:12:54 link-10 kernel: clvmd recover event 7 finished Apr 5 14:12:54 link-10 kernel: corey move flags 1,0,0 ids 5,5,5 Apr 5 14:12:54 link-10 kernel: corey move flags 0,1,0 ids 5,8,5 Apr 5 14:12:54 link-10 kernel: corey move use event 8 Apr 5 14:12:54 link-10 kernel: corey recover event 8 Apr 5 14:12:54 link-10 kernel: corey add node 3 Apr 5 14:12:54 link-10 kernel: corey total nodes 4 Apr 5 14:12:54 link-10 kernel: corey rebuild resource directory Apr 5 14:12:54 link-10 kernel: corey rebuilt 751 resources Apr 5 14:12:54 link-10 kernel: corey purge requests Apr 5 14:12:54 link-10 kernel: corey purged 0 requests Apr 5 14:12:54 link-10 
kernel: corey mark waiting requests Apr 5 14:12:54 link-10 kernel: corey marked 0 requests Apr 5 14:12:54 link-10 kernel: corey recover event 8 done Apr 5 14:12:54 link-10 kernel: corey add_to_requestq cmd 9 fr 4 Apr 5 14:12:54 link-10 kernel: corey move flags 0,0,1 ids 5,8,8 Apr 5 14:12:54 link-10 kernel: corey process held requests Apr 5 14:12:54 link-10 kernel: corey process_requestq cmd 9 fr 4 Apr 5 14:12:54 link-10 kernel: corey processed 1 requests Apr 5 14:12:54 link-10 kernel: corey resend marked requests Apr 5 14:12:54 link-10 kernel: corey resent 0 requests Apr 5 14:12:54 link-10 kernel: corey recover event 8 finished Apr 5 14:12:54 link-10 kernel: 5538 0 Apr 5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 5,0 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,55 5,0 id 300f1 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6350 ex punlock 0 Apr 5 14:12:54 link-10 kernel: 6350 en plock 7,55 Apr 5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 0,5 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,55 0,5 id 300f1 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6350 req 7,55 ex 6504-6974 lkf 2000 wait 1 Apr 5 14:12:54 link-10 kernel: 6350 lk 7,55 id 0 -1,5 2000 Apr 5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 5,0 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,55 5,0 id 300f1 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6143 qc 7,55 -1,5 id 600090 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6350 ex plock 0 Apr 5 14:12:54 link-10 kernel: 6143 qc 7,57 -1,5 id 7202fb sts 0 0 Apr 5 14:12:54 link-10 kernel: 6354 ex plock 0 Apr 5 14:12:54 link-10 kernel: 6350 en punlock 7,55 Apr 5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 0,5 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,55 0,5 id 300f1 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6350 remove 7,55 Apr 5 14:12:54 link-10 kernel: 6350 un 7,55 600090 5 0 Apr 5 14:12:54 link-10 kernel: 6143 qc 7,55 5,5 id 600090 sts -65538 0 Apr 5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 5,0 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,55 5,0 id 300f1 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6350 ex punlock 0 Apr 5 14:12:54 link-10 kernel: 6350 en plock 7,55 Apr 5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 0,5 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,55 0,5 id 300f1 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6350 req 7,55 ex 6975-735d lkf 2000 wait 1 Apr 5 14:12:54 link-10 kernel: 6350 lk 7,55 id 0 -1,5 2000 Apr 5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 5,0 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,55 5,0 id 300f1 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6143 qc 7,55 -1,5 id 5d030a sts 0 0 Apr 5 14:12:54 link-10 kernel: 6354 en punlock 7,57 Apr 5 14:12:54 link-10 kernel: 6354 lk 11,57 id 10144 0,5 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,57 0,5 id 10144 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6354 remove 7,57 Apr 5 14:12:54 link-10 kernel: 6354 un 7,57 7202fb 5 0 Apr 5 14:12:54 link-10 kernel: 6143 qc 7,57 5,5 id 7202fb sts -65538 0 Apr 5 14:12:54 link-10 kernel: 6354 lk 11,57 id 10144 5,0 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,57 5,0 id 10144 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6354 ex punlock 0 Apr 5 14:12:54 link-10 kernel: 6354 en plock 7,57 Apr 5 14:12:54 link-10 kernel: 6354 lk 11,57 id 10144 0,5 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,57 0,5 id 10144 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6354 req 7,57 ex 2e984f-2eb2f3 lkf 2000 wait 1 Apr 5 14:12:54 link-10 kernel: 6354 lk 7,57 id 0 -1,5 2000 Apr 5 14:12:54 link-10 kernel: 6354 lk 11,57 id 10144 5,0 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,57 5,0 id 10144 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6350 ex plock 0 
Apr 5 14:12:54 link-10 kernel: 6350 en punlock 7,55 Apr 5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 0,5 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,55 0,5 id 300f1 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6350 remove 7,55 Apr 5 14:12:54 link-10 kernel: 6350 un 7,55 5d030a 5 0 Apr 5 14:12:54 link-10 kernel: 6143 qc 7,55 5,5 id 5d030a sts -65538 0 Apr 5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 5,0 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,55 5,0 id 300f1 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6350 ex punlock 0 Apr 5 14:12:54 link-10 kernel: 6350 en plock 7,55 Apr 5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 0,5 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,55 0,5 id 300f1 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6350 req 7,55 ex 0-18e5 lkf 2000 wait 1 Apr 5 14:12:54 link-10 kernel: 6350 lk 7,55 id 0 -1,5 2000 Apr 5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 5,0 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,55 5,0 id 300f1 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6143 qc 7,55 -1,5 id 5f0330 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6350 ex plock 0 Apr 5 14:12:54 link-10 kernel: 6350 en punlock 7,55 Apr 5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 0,5 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,55 0,5 id 300f1 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6350 remove 7,55 Apr 5 14:12:54 link-10 kernel: 6350 un 7,55 5f0330 5 0 Apr 5 14:12:54 link-10 kernel: 6143 qc 7,55 5,5 id 5f0330 sts -65538 0 Apr 5 14:12:54 link-10 kernel: 6143 qc 7,57 -1,5 id 57012f sts 0 0 Apr 5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 5,0 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,55 5,0 id 300f1 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6350 ex punlock 0 Apr 5 14:12:54 link-10 kernel: 6350 en plock 7,55 Apr 5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 0,5 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,55 0,5 id 300f1 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6350 req 7,55 ex 18e6-5b60 lkf 2000 wait 1 Apr 5 14:12:54 link-10 kernel: 6350 lk 7,55 id 0 -1,5 2000 Apr 5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 5,0 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,55 5,0 id 300f1 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6354 ex plock 0 Apr 5 14:12:54 link-10 kernel: 6143 qc 7,55 -1,5 id 630243 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6354 en punlock 7,57 Apr 5 14:12:54 link-10 kernel: 6354 lk 11,57 id 10144 0,5 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,57 0,5 id 10144 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6354 remove 7,57 Apr 5 14:12:54 link-10 kernel: 6354 un 7,57 57012f 5 0 Apr 5 14:12:54 link-10 kernel: 6143 qc 7,57 5,5 id 57012f sts -65538 0 Apr 5 14:12:54 link-10 kernel: 6354 lk 11,57 id 10144 5,0 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,57 5,0 id 10144 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6354 ex punlock 0 Apr 5 14:12:54 link-10 kernel: 6354 en plock 7,57 Apr 5 14:12:54 link-10 kernel: 6354 lk 11,57 id 10144 0,5 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,57 0,5 id 10144 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6354 req 7,57 ex 2ec9ec-2edddf lkf 2000 wait 1 Apr 5 14:12:54 link-10 kernel: 6354 lk 7,57 id 0 -1,5 2000 Apr 5 14:12:54 link-10 kernel: 6354 lk 11,57 id 10144 5,0 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,57 5,0 id 10144 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6350 ex plock 0 Apr 5 14:12:54 link-10 kernel: 6350 en punlock 7,55 Apr 5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 0,5 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,55 0,5 id 300f1 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6143 qc 7,57 -1,5 id 600278 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6350 remove 7,55 Apr 5 14:12:54 link-10 kernel: 
6350 un 7,55 630243 5 0 Apr 5 14:12:54 link-10 kernel: 6143 qc 7,55 5,5 id 630243 sts -65538 0 Apr 5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 5,0 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,55 5,0 id 300f1 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6350 ex punlock 0 Apr 5 14:12:54 link-10 kernel: 6350 en plock 7,55 Apr 5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 0,5 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,55 0,5 id 300f1 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6350 req 7,55 ex 5b61-6f3f lkf 2000 wait 1 Apr 5 14:12:54 link-10 kernel: 6350 lk 7,55 id 0 -1,5 2000 Apr 5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 5,0 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,55 5,0 id 300f1 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6354 ex plock 0 Apr 5 14:12:54 link-10 kernel: 6354 en punlock 7,57 Apr 5 14:12:54 link-10 kernel: 6354 lk 11,57 id 10144 0,5 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,57 0,5 id 10144 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6354 remove 7,57 Apr 5 14:12:54 link-10 kernel: 6354 un 7,57 600278 5 0 Apr 5 14:12:54 link-10 kernel: 6143 qc 7,57 5,5 id 600278 sts -65538 0 Apr 5 14:12:54 link-10 kernel: 6143 qc 7,55 -1,5 id 5b00bc sts 0 0 Apr 5 14:12:54 link-10 kernel: 6354 lk 11,57 id 10144 5,0 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,57 5,0 id 10144 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6354 ex punlock 0 Apr 5 14:12:54 link-10 kernel: 6354 en plock 7,57 Apr 5 14:12:54 link-10 kernel: 6354 lk 11,57 id 10144 0,5 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,57 0,5 id 10144 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6354 req 7,57 ex 2edde0-2edfe9 lkf 2000 wait 1 Apr 5 14:12:54 link-10 kernel: 6354 lk 7,57 id 0 -1,5 2000 Apr 5 14:12:54 link-10 kernel: 6354 lk 11,57 id 10144 5,0 4 Apr 5 14:12:54 link-10 kernel: 6143 qc 11,57 5,0 id 10144 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6350 ex plock 0 Apr 5 14:12:54 link-10 kernel: 6143 qc 7,57 -1,5 id 610239 sts 0 0 Apr 5 14:12:54 link-10 kernel: 6354 ex plock 0 Apr 5 14:12:54 link-10 kernel: 6350 en punlock 7,55 Apr 5 14:12:54 link-10 kernel: 6350 lk 11,55 id 300f1 0,5 4 Apr 5 14:12:54 link-10 kernel: Apr 5 14:12:54 link-10 kernel: lock_dlm: Assertion failed on line 411 of file /usr/src/build/548215-i686/BUILD/smp/src/dlm/lock.c Apr 5 14:12:54 link-10 kernel: lock_dlm: assertion: "!error" Apr 5 14:12:54 link-10 kernel: lock_dlm: time = 180423 Apr 5 14:12:54 link-10 kernel: corey: num=11,55 err=-22 cur=0 req=5 lkf=4
I'm bordering on being stumped here. We know that the cman processes are all running fine, because the node receives and processes a NODEDOWN message after the recovery. The reason given is "no response to messages", which tells us that it's ACKs that are not being sent or received. But the ack processor is on the fast path of the cman comms process and is not going to deadlock with anything, and there are no sendmsg errors, so the ACKs are not failing to be sent because of, e.g., a memory shortage.

So the only cause I can see that makes sense (if you have any other ideas, pipe up) is that some other process is spinning during the crucial 4 seconds in which the node needs to be sending enough ACKs to convince a remote node that it's still working. 4 seconds is a long time, of course. Maybe coincidentally, the node is responding again 4 seconds later in that last log, and 8 in the previous one.

Can you increase the retries and see if that helps? Run echo "10" > /proc/cluster/config/cman/max_retries before running cman_tool join.

Looking for a spinning thread is going to be hard work (still assuming that's what it is). I doubt it's the DLM, because we're into GFS recovery at this stage (in both logs) and the lock counts don't look at all high anyway. So it could be GFS recovery, or it could be the lpfc driver. I'm not pointing any fingers, just speculating. Almost needless to say, I've never seen this, but I'm using iSCSI - my lpfc disks are still not working for some reason.
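Spelled out as one sequence (the proc path is the one mentioned above; the cat lines are only there to confirm the value, and this assumes the tunable exists on the build you're testing):

  cat /proc/cluster/config/cman/max_retries           # see what it is currently set to
  echo "10" > /proc/cluster/config/cman/max_retries   # bump the retry count
  cat /proc/cluster/config/cman/max_retries           # confirm it took
  cman_tool join                                      # then join as usual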
Reproduced it! I'm now almost certain it's a looping kernel thread. The logs and accompanying tcpdump show the killed node being totally silent for 5 seconds, despite there being plenty of work for it to do. Now to find the culprit...
Courtesy of Dave, added some schedule()s to SM in places where it could spend a lot of time. I've now had revolver running all night (178 iterations) with no problems.

Checking in sm.h;
/cvs/cluster/cluster/cman-kernel/src/sm.h,v  <--  sm.h
new revision: 1.2.2.1; previous revision: 1.2
done
Checking in sm_membership.c;
/cvs/cluster/cluster/cman-kernel/src/sm_membership.c,v  <--  sm_membership.c
new revision: 1.1.2.3; previous revision: 1.1.2.2
done
Checking in sm_recover.c;
/cvs/cluster/cluster/cman-kernel/src/sm_recover.c,v  <--  sm_recover.c
new revision: 1.3.2.1; previous revision: 1.3
done
I hit this scenario today with the latest build:

CMAN 2.6.9-33.0 (built Apr 11 2005 15:15:54) installed

[root@link-10 ~]# rpm -qa | grep cman
cman-1.0-0.pre32.0
cman-kernheaders-2.6.9-33.0
cman-kernel-smp-2.6.9-33.0

This time, however, I didn't see any SM messages:

Apr 12 11:07:35 link-10 kernel: CMAN: node link-10.lab.msp.redhat.com has been removed from the cluster : No response to messages
Apr 12 11:07:35 link-10 kernel: CMAN: killed by NODEDOWN message
Apr 12 11:07:35 link-10 kernel: CMAN: we are leaving the cluster. No response to messages
Apr 12 11:07:35 link-10 kernel: dlm: gfs2: remote_stage error -105 12034c
Apr 12 11:07:35 link-10 kernel: dlm: gfs0: remote_stage error -105 1e01d5
Apr 12 11:07:35 link-10 kernel: dlm: gfs3: remote_stage error -105 4d028f
Apr 12 11:07:35 link-10 kernel: dlm: gfs3: remote_stage error -105 50021d
Apr 12 11:07:36 link-10 kernel: rebuild locks
Apr 12 11:07:36 link-10 kernel: clvmd rebuilt 0 locks
Apr 12 11:07:36 link-10 kernel: clvmd recover event 32 done
Apr 12 11:07:36 link-10 kernel: gfs2 add_to_requestq cmd 1 fr 4
Apr 12 11:07:36 link-10 kernel: gfs2 move flags 0,0,1 ids 28,32,32
Apr 12 11:07:36 link-10 kernel: gfs2 process held requests
Apr 12 11:07:36 link-10 kernel: gfs2 process_requestq cmd 1 fr 4
Apr 12 11:07:36 link-10 kernel: gfs2 processed 1 requests
Apr 12 11:07:36 link-10 kernel: gfs2 resend marked requests
Apr 12 11:07:36 link-10 ccsd[3971]: Cluster manager shutdown. Attemping to reconnect...
Apr 12 11:07:36 link-10 kernel: gfs2 resend 1103c2 lq 1 flg 200000 node -1/-1 " 7
Apr 12 11:07:38 link-10 kernel: gfs2 resent 1 requests
Apr 12 11:07:38 link-10 kernel: gfs2 recover event 32 finished
Apr 12 11:07:38 link-10 kernel: gfs0 move flags 0,0,1 ids 24,32,32
Apr 12 11:07:38 link-10 kernel: gfs0 process held requests
Apr 12 11:07:38 link-10 kernel: gfs0 processed 0 requests
Apr 12 11:07:38 link-10 kernel: gfs0 resend marked requests
Apr 12 11:07:38 link-10 kernel: gfs0 resent 0 requests
Apr 12 11:07:38 link-10 kernel: gfs0 recover event 32 finished
Apr 12 11:07:38 link-10 kernel: clvmd move flags 0,0,1 ids 15,32,32
Apr 12 11:07:38 link-10 kernel: clvmd process held requests
Apr 12 11:07:38 link-10 kernel: clvmd processed 0 requests
Apr 12 11:07:38 link-10 kernel: clvmd resend marked requests
Apr 12 11:07:38 link-10 kernel: clvmd resent 0 requests
Apr 12 11:07:38 link-10 kernel: clvmd recover event 32 finished
Apr 12 11:07:38 link-10 kernel: gfs1 move flags 0,0,1 ids 26,32,32
Apr 12 11:07:38 link-10 kernel: gfs1 process held requests
Apr 12 11:07:38 link-10 kernel: gfs1 processed 0 requests
[...]

I'll try to gather more info and add to the bug.
This was a 4-node cluster (link-10, link-11, link-12, link-08). link-08 was shot by revolver and caused link-10 to bail. Additional info in: /home/msp/cmarthal/pub/bugs/139738
OK, I've crossed the border into stumped territory now. This certainly looks like the same problem, at least. Have you tried the "echo" I mentioned in comment #24? It's also helpful to get as much info from the killed node as possible (rather than just the others), as that's where the problem seems to be. How about:

tar -czf /tmp/proc.tgz --exclude=*core* /proc

(you can tell I'm desperate now, can't you?)
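For what it's worth, the collection on the killed node could look something like this; the tar line is the one above, and the rest (a dmesg dump and a copy of /var/log/messages) is just extra context that's cheap to grab:

  tar -czf /tmp/proc.tgz --exclude=*core* /proc   # snapshot /proc, skipping core images
  dmesg > /tmp/dmesg.txt                          # kernel ring buffer around the time of the kill
  cp /var/log/messages /tmp/messages.copy         # syslog from around the event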
Interestingly, I hit something that looked a lot like this last night while testing for 155729. No GFS involved at all, but /lots/ of disk logging. I wonder if it's just disk syncing?
Set max_retries to default to 5. Five retries seems to give the system enough time to recover.
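A quick way to confirm which default a particular build ships with (same proc path as the echo in comment #24):

  cat /proc/cluster/config/cman/max_retries   # should print 5 on builds carrying this change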
I'm not sure if this is the same issue as this bug, but I saw these messages with the latest build cluster-i686-2005-08-03-1549 last night while running recovery tests on a two-node cluster (link-10, link-11). Revolver had gone through 207 iterations before seeing this on both nodes:

link-11:

Aug 3 21:53:32 link-11 kernel: CMAN: sendmsg failed: -101
Aug 3 21:53:32 link-11 kernel: CMAN: No functional network interfaces, leaving cluster
Aug 3 21:53:32 link-11 kernel: CMAN: sendmsg failed: -101
Aug 3 21:53:32 link-11 kernel: CMAN: sendmsg failed: -101
Aug 3 21:53:32 link-11 kernel: CMAN: we are leaving the cluster.
Aug 3 21:53:32 link-11 kernel: Synchronizing SCSI cache for disk sdd:
Aug 3 21:53:32 link-11 clurgmgrd[2817]: <warning> #67: Shutting down uncleanly
Aug 3 21:53:32 link-11 kernel: SM: 00000001 sm_stop: SG still joined
Aug 3 21:53:32 link-11 kernel: SM: 01000002 sm_stop: SG still joined
Aug 3 21:53:32 link-11 kernel: SM: 02000005 sm_stop: SG still joined
Aug 3 21:53:32 link-11 kernel: SM: 03000003 sm_stop: SG still joined
Aug 3 21:53:32 link-11 ccsd[2278]: Cluster manager shutdown. Attemping to reconnect...
Aug 3 21:53:32 link-11 kernel: SM: send_nodeid_message error -107 to 1
Aug 3 21:53:33 link-11 kernel: sts
Aug 3 21:53:34 link-11 kernel: clvmd recover event 10 finished
Aug 3 21:53:34 link-11 kernel: gfs0 move flags 1,0,0 ids 8,8,8
Aug 3 21:53:34 link-11 kernel: gfs0 move flags 0,1,0 ids 8,11,8
Aug 3 21:53:34 link-11 kernel: gfs0 move use event 11
Aug 3 21:53:34 link-11 kernel: gfs0 recover event 11
Aug 3 21:53:34 link-11 kernel: gfs0 add node 1
Aug 3 21:53:34 link-11 kernel: gfs0 total nodes 2
Aug 3 21:53:34 link-11 kernel: gfs0 rebuild resource directory
Aug 3 21:53:34 link-11 kernel: gfs0 rebuilt 1163 resources
Aug 3 21:53:34 link-11 kernel: gfs0 purge requests
Aug 3 21:53:34 link-11 kernel: gfs0 purged 0 requests
[...] eventually asserts

link-10:

Aug 3 22:21:04 link-10 kernel: CMAN: sendmsg failed: -101
Aug 3 22:21:04 link-10 kernel: CMAN: No functional network interfaces, leaving cluster
Aug 3 22:21:04 link-10 kernel: CMAN: sendmsg failed: -101
Aug 3 22:21:04 link-10 kernel: CMAN: sendmsg failed: -101
Aug 3 22:21:04 link-10 kernel: CMAN: we are leaving the cluster.
Aug 3 22:21:04 link-10 clurgmgrd[2854]: <warning> #67: Shutting down uncleanly
Aug 3 22:21:04 link-10 kernel: SM: 00000001 sm_stop: SG still joined
Aug 3 22:21:04 link-10 kernel: SM: 01000002 sm_stop: SG still joined
Aug 3 22:21:04 link-10 kernel: SM: 02000005 sm_stop: SG still joined
Aug 3 22:21:04 link-10 kernel: SM: 03000003 sm_stop: SG still joined
Aug 3 22:21:04 link-10 ccsd[2279]: Cluster manager shutdown. Attemping to reconnect...
Aug 3 22:21:04 link-10 kernel: SM: send_nodeid_message error -107 to 2
Aug 3 22:21:05 link-10 kernel: ests
Aug 3 22:21:05 link-10 kernel: clvmd recover event 10 finished
Aug 3 22:21:05 link-10 kernel: gfs0 move flags 1,0,0 ids 8,8,8
Aug 3 22:21:05 link-10 kernel: gfs0 move flags 0,1,0 ids 8,11,8
Aug 3 22:21:05 link-10 kernel: gfs0 move use event 11
Aug 3 22:21:06 link-10 kernel: gfs0 recover event 11
Aug 3 22:21:06 link-10 kernel: gfs0 add node 2
Aug 3 22:21:06 link-10 kernel: gfs0 total nodes 2
Aug 3 22:21:06 link-10 kernel: gfs0 rebuild resource directory
Aug 3 22:21:06 link-10 kernel: gfs0 rebuilt 1139 resources
Aug 3 22:21:06 link-10 kernel: gfs0 purge requests
Aug 3 22:21:06 link-10 kernel: gfs0 purged 0 requests
Aug 3 22:21:06 link-10 kernel: gfs0 mark waiting requests
Aug 3 22:21:06 link-10 kernel: gfs0 marked 0 requests
[...] also eventually asserts
No, it's not even slightly the same. The message "No functional network interfaces" is the clue: it means that all attempts to send HELLO messages failed (e.g. eth0 was downed), so cman has to quit. It's a new message from bz #163687. I suggest you look at what else was happening when this occurred. Was the machine being rebooted without the cluster services being shut down first, perhaps?
Filed bz165160 for comments #32 and #33.
Have not seen this bug in over 10 months, marking verified. Will reopen if seen again.