Description of problem:

I saw machines assert quite a few times last night, after seeing these "network interfaces" messages, with the latest build (cluster-i686-2005-08-03-1549) while running recovery tests on an x86 two-node cluster (link-10, link-11). Revolver had gone through 207 iterations of killing (reboot -fin) one node at a time, which caused these to happen:

link-11:

Aug 3 21:53:32 link-11 kernel: CMAN: sendmsg failed: -101
Aug 3 21:53:32 link-11 kernel: CMAN: No functional network interfaces, leaving cluster
Aug 3 21:53:32 link-11 kernel: CMAN: sendmsg failed: -101
Aug 3 21:53:32 link-11 kernel: CMAN: sendmsg failed: -101
Aug 3 21:53:32 link-11 kernel: CMAN: we are leaving the cluster.
Aug 3 21:53:32 link-11 kernel: Synchronizing SCSI cache for disk sdd:
Aug 3 21:53:32 link-11 clurgmgrd[2817]: <warning> #67: Shutting down uncleanly
Aug 3 21:53:32 link-11 kernel: SM: 00000001 sm_stop: SG still joined
Aug 3 21:53:32 link-11 kernel: SM: 01000002 sm_stop: SG still joined
Aug 3 21:53:32 link-11 kernel: SM: 02000005 sm_stop: SG still joined
Aug 3 21:53:32 link-11 kernel: SM: 03000003 sm_stop: SG still joined
Aug 3 21:53:32 link-11 ccsd[2278]: Cluster manager shutdown. Attemping to reconnect...
Aug 3 21:53:32 link-11 kernel: SM: send_nodeid_message error -107 to 1
Aug 3 21:53:33 link-11 kernel: sts
Aug 3 21:53:34 link-11 kernel: clvmd recover event 10 finished
Aug 3 21:53:34 link-11 kernel: gfs0 move flags 1,0,0 ids 8,8,8
Aug 3 21:53:34 link-11 kernel: gfs0 move flags 0,1,0 ids 8,11,8
Aug 3 21:53:34 link-11 kernel: gfs0 move use event 11
Aug 3 21:53:34 link-11 kernel: gfs0 recover event 11
Aug 3 21:53:34 link-11 kernel: gfs0 add node 1
Aug 3 21:53:34 link-11 kernel: gfs0 total nodes 2
Aug 3 21:53:34 link-11 kernel: gfs0 rebuild resource directory
Aug 3 21:53:34 link-11 kernel: gfs0 rebuilt 1163 resources
Aug 3 21:53:34 link-11 kernel: gfs0 purge requests
Aug 3 21:53:34 link-11 kernel: gfs0 purged 0 requests
[...] eventually asserts

link-10:

Aug 3 22:21:04 link-10 kernel: CMAN: sendmsg failed: -101
Aug 3 22:21:04 link-10 kernel: CMAN: No functional network interfaces, leaving cluster
Aug 3 22:21:04 link-10 kernel: CMAN: sendmsg failed: -101
Aug 3 22:21:04 link-10 kernel: CMAN: sendmsg failed: -101
Aug 3 22:21:04 link-10 kernel: CMAN: we are leaving the cluster.
Aug 3 22:21:04 link-10 clurgmgrd[2854]: <warning> #67: Shutting down uncleanly
Aug 3 22:21:04 link-10 kernel: SM: 00000001 sm_stop: SG still joined
Aug 3 22:21:04 link-10 kernel: SM: 01000002 sm_stop: SG still joined
Aug 3 22:21:04 link-10 kernel: SM: 02000005 sm_stop: SG still joined
Aug 3 22:21:04 link-10 kernel: SM: 03000003 sm_stop: SG still joined
Aug 3 22:21:04 link-10 ccsd[2279]: Cluster manager shutdown. Attemping to reconnect...
Aug 3 22:21:04 link-10 kernel: SM: send_nodeid_message error -107 to 2
Aug 3 22:21:05 link-10 kernel: ests
Aug 3 22:21:05 link-10 kernel: clvmd recover event 10 finished
Aug 3 22:21:05 link-10 kernel: gfs0 move flags 1,0,0 ids 8,8,8
Aug 3 22:21:05 link-10 kernel: gfs0 move flags 0,1,0 ids 8,11,8
Aug 3 22:21:05 link-10 kernel: gfs0 move use event 11
Aug 3 22:21:06 link-10 kernel: gfs0 recover event 11
Aug 3 22:21:06 link-10 kernel: gfs0 add node 2
Aug 3 22:21:06 link-10 kernel: gfs0 total nodes 2
Aug 3 22:21:06 link-10 kernel: gfs0 rebuild resource directory
Aug 3 22:21:06 link-10 kernel: gfs0 rebuilt 1139 resources
Aug 3 22:21:06 link-10 kernel: gfs0 purge requests
Aug 3 22:21:06 link-10 kernel: gfs0 purged 0 requests
Aug 3 22:21:06 link-10 kernel: gfs0 mark waiting requests
Aug 3 22:21:06 link-10 kernel: gfs0 marked 0 requests
[...] also eventually asserts

Version-Release number of selected component (if applicable):

CMAN 2.6.9-38.2 (built Aug 3 2005 16:22:35)

[root@link-10 ~]# cman_tool -V
cman_tool 1.0.2 (built Aug 1 2005 14:45:22)
Copyright (C) Red Hat, Inc. 2004 All rights reserved.

Actual results:
assertion

Expected results:
withdraw from the cluster
Err, can you post the assert, please?
Aug 3 22:02:31 link-10 kernel: 3638 req 7,3b ex 0-651e77 lkf 2000 wait 1
Aug 3 22:02:31 link-10 kernel: 3638 ex plock 0
Aug 3 22:02:31 link-10 kernel: 3638 en punlock 7,3b
Aug 3 22:02:31 link-10 kernel: 3638 remove 7,3b
Aug 3 22:02:31 link-10 kernel: 3638 ex punlock 0
Aug 3 22:02:31 link-10 kernel: 3638 en plock 7,3b
Aug 3 22:02:31 link-10 kernel: 3638 req 7,3b ex 651e78-6901fb lkf 2000 wait 1
Aug 3 22:02:31 link-10 kernel: 3638 ex plock 0
Aug 3 22:02:31 link-10 kernel: 3638 en punlock 7,3b
Aug 3 22:02:31 link-10 kernel:
Aug 3 22:02:31 link-10 kernel: lock_dlm: Assertion failed on line 428 of file /usr/src/build/598004-i686/BUILD/gfs-kernel-2.6.9-36/smp/src/dlm/lock.c
Aug 3 22:02:31 link-10 kernel: lock_dlm: assertion: "!error"
Aug 3 22:02:31 link-10 kernel: lock_dlm: time = 4294825539
Aug 3 22:02:31 link-10 kernel: gfs0: num=11,3b err=-22 cur=0 req=5 lkf=4
Aug 3 22:02:31 link-10 kernel:
Aug 3 22:02:31 link-10 kernel: ------------[ cut here ]------------
Aug 3 22:02:31 link-10 kernel: kernel BUG at /usr/src/build/598004-i686/BUILD/gfs-kernel-2.6.9-36/smp/src/dlm/lock.c:428!
Aug 3 22:02:31 link-10 kernel: invalid operand: 0000 [#1]
Aug 3 22:02:31 link-10 kernel: SMP
Aug 3 22:02:31 link-10 kernel: Modules linked in: parport_pc lp parport autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) dm_mirror dlm(U) cman(U) md5 ipv6 sunrpc button battery ac uhci_hcd ehci_hcd e1000 floppy ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
Aug 3 22:02:31 link-10 kernel: CPU: 0
Aug 3 22:02:31 link-10 kernel: EIP: 0060:[<e0374779>] Not tainted VLI
Aug 3 22:02:31 link-10 kernel: EFLAGS: 00010246 (2.6.9-12.ELsmp)
Aug 3 22:02:31 link-10 kernel: EIP is at do_dlm_lock+0x134/0x14e [lock_dlm]
Aug 3 22:02:31 link-10 kernel: eax: 00000001 ebx: ffffffea ecx: dd328e4c edx: e0379175
Aug 3 22:02:31 link-10 kernel: esi: e0374798 edi: df700e00 ebp: d9886800 esp: dd328e48
Aug 3 22:02:31 link-10 kernel: ds: 007b es: 007b ss: 0068
Aug 3 22:02:31 link-10 kernel: Process doio (pid: 3638, threadinfo=dd328000 task=d79188b0)
Aug 3 22:02:31 link-10 kernel: Stack: e0379175 20202020 31312020 20202020 20202020 20202020 62332020 00000018
Aug 3 22:02:31 link-10 kernel: 00000000 dead4ead d9886800 d9886840 d988683c e03747d0 dd328e98 d44aa580
Aug 3 22:02:31 link-10 kernel: dd328f08 dcc1112c e037626d d9886800 0000003b 00000000 00000011 d44aa580
Aug 3 22:02:31 link-10 kernel: Call Trace:
Aug 3 22:02:31 link-10 kernel: [<e03747d0>] do_dlm_lock_sync+0x33/0x42 [lock_dlm]
Aug 3 22:02:31 link-10 kernel: [<e037626d>] lock_resource+0x70/0x93 [lock_dlm]
Aug 3 22:02:31 link-10 kernel: [<e0377b38>] lm_dlm_punlock+0xcc/0x14f [lock_dlm]
Aug 3 22:02:31 link-10 kernel: [<e03ba684>] gfs_lm_punlock+0x30/0x3f [gfs]
Aug 3 22:02:31 link-10 kernel: [<e03c59f0>] gfs_lock+0xd4/0xf2 [gfs]
Aug 3 22:02:31 link-10 kernel: [<e03c591c>] gfs_lock+0x0/0xf2 [gfs]
Aug 3 22:02:31 link-10 kernel: [<c016cb66>] fcntl_setlk+0x156/0x291
Aug 3 22:02:31 link-10 kernel: [<c0168fdf>] do_fcntl+0x10c/0x155
Aug 3 22:02:31 link-10 kernel: [<c01690ee>] sys_fcntl64+0x6c/0x7d
Aug 3 22:02:31 link-10 kernel: [<c02d0ecf>] syscall_call+0x7/0xb
Aug 3 22:02:31 link-10 kernel: [<c02d007b>] __lock_text_end+0x326/0x101b
Aug 3 22:02:31 link-10 kernel: Code: 26 50 0f bf 45 24 50 53 ff 75 08 ff 75 04 ff 75 0c ff 77 18 68 a0 92 37 e0 e8 9e db da df 83 c4 38 68 75 91 37 e0 e8 91 db da df <0f> 0b ac 01 b2 90 37 e0 68 77 91 37 e0 e8 4c d3 da df 83 c4 20
Aug 3 22:02:31 link-10 kernel: <0>Fatal exception: panic in 5 seconds
Oh, this is the "ape shit" bug we had ages ago that got marked "resolved" even though it never was. When cman shuts down peremptorily, the subsystems can't cope. It needs reassigning to whatever component lock_dlm is part of.
Looks like lock_dlm needs to do something a little less drastic when it gets EINVAL back from the DLM.
There have been so many bz entries dealing with this one problem that it's hard to know which to pick. I tried once to consolidate them into bz 148788, but that didn't help very much. For the record again: when cman shuts down, it also shuts down the dlm. This causes gfs's lock requests to return an error. GFS/lock_dlm can do little when lock requests fail except panic. Cman has shut down, of course, because the network has failed. So, ultimately: network failure -> gfs assertion failure/panic. The one alternative to panicking is to just sit and wait to be fenced.
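To make that "wait to be fenced" alternative concrete, here is a minimal user-space sketch of the idea; the type, the stub, and the message are assumptions, not the shipped lock_dlm code:

/* Sketch only: when the DLM returns -EINVAL because cman has already
 * shut it down, hand the error back to GFS (failing the fcntl() with
 * an errno) instead of asserting "!error" and taking the kernel down. */
#include <errno.h>
#include <stdio.h>

struct gdlm_lock { int lkid; };                /* stand-in type */

static int dlm_lock_stub(struct gdlm_lock *lp) /* stands in for dlm_lock() */
{
        (void)lp;
        return -EINVAL;                        /* DLM already torn down */
}

static int do_dlm_lock(struct gdlm_lock *lp)
{
        int error = dlm_lock_stub(lp);

        /* Today this path is effectively an assert/BUG() on any error. */
        if (error == -EINVAL) {
                /* Cluster manager is gone: fail the request and let the
                 * node sit and wait to be fenced instead of panicking. */
                fprintf(stderr,
                        "lock_dlm: lock failed (%d); waiting for fencing\n",
                        error);
                return error;
        }
        return error;
}

int main(void)
{
        struct gdlm_lock lk = { 0 };
        return do_dlm_lock(&lk) == 0 ? 0 : 1;
}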
I hit this again twice over the weekend. After reading your last comment, is this just going to be the expected behavior when the network interface "fails"? We just assert or panic? If so, shouldn't this bug be closed NOTABUG (expected behavior) or WONTFIX? What was checked in to get it marked MODIFIED? And if it is not supposed to behave like this, then it should go to FAILSQA or back to ASSIGNED.

Is the network actually failing, or is it just that one node can't talk to the other nodes? If it is really failing, then why is this only seen during recovery tests?

CMAN: sendmsg failed: -101
CMAN: No functional network interfaces, leaving cluster
CMAN: sendmsg failed: -101
CMAN: sendmsg failed: -101
1 add node 2
gfs1 total nodes 3
gfs1 rebuild resource
SM: 00000001 sm_stop: SG still joined
SM: 01000003 sm_stop: SG still joined
SM: 02000005 sm_stop: SG still joined
SM: 0300000a sm_stop: SG still joined
directory
gfs1 rebuilt 365 resources
gfs1 purge requests
gfs1 purged 0 requests
gfs2 recover event 11 done
gfs1 mark waiting requests
gfs1 marked 0 requests
gfs1 recover event 12 done
gfs2 move flags 0,0,1 ids 0,11,11
gfs2 process held requests
gfs2 processed 0 requests
gfs2 recover event 11 finished
gfs1 move flags 0,0,1 ids 7,12,12
gfs1 process held requests
gfs1 processed 0 requests
gfs1 resend marked requests
gfs1 resent 0 requests
gfs1 recover event 12 finished
gfs2 move flags 1,0,0 ids 11,11,11
gfs2 move flags 0,1,0 ids 11,15,11
gfs2 move use event 15
gfs2 recover event 15
gfs2 add node 2
gfs2 total nodes 3
gfs2 rebuild resource directory
gfs2 rebuilt 388 resources
gfs2 purge requests
gfs2 purged 0 requests
gfs2 mark waiting requests
gfs2 marked 0 requests
gfs2 recover event 15 done
gfs2 move flags 0,0,1 ids 11,15,15
gfs2 process held requests
gfs2 processed 0 requests
gfs2 resend marked requests
gfs2 resent 0 requests
gfs2 recover event 15 finished
71 add node 2
gfs1 total nodes 3
gfs1 rebuild resource directory
gfs1 rebuilt 365 resources
gfs1 purge requests
gfs1 purged 0 requests
gfs2 recover event 11 done
gfs1 mark waiting requests
gfs1 marked 0 requests
gfs1 recover event 12 done
gfs2 move flags 0,0,1 ids 0,11,11
gfs2 process held requests
[...]
3768 req 7,3c ex 418c1a-55d28b lkf 2000 wait 1
3731 req 7,fffc ex 74affa-752481 lkf 2000 wait 1
3768 ex plock 0
3731 ex plock 0
3731 en punlock 7,fffc
3731 remove 7,fffc
3768 en punlock 7,3c
3731 ex punlock 0
3731 en plock 7,fffc
[From here the console output of the two asserting CPUs is interleaved character-by-character. De-interleaved, the legible portion of the first assertion reads:]
lock_dlm: Assertion failed on line 428 of file /usr/src/build/603160-x86_64/BUILD/gfs-kernel-2.6.9-39/smp/src/dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 42949874... [exact value garbled]
gfs2: num=11,fffc err=-22 cur=0 req=5 lkf=4
[...interleaved plock debug from pids 3731 and 3768 omitted...]
3731 req 7,fffc ex 52459e-58333e lkf 2000 wait 1
3731 ex plock 0
3731 en punlock 7,fffc
3731 remove 7,fffc
3731 ex punlock 0
3731 en plock 7,fffc
3731 req 7,fffc ex 73fb89-743b9e lkf 2000 wait 1
3731 ex plock 0
3768 en punlock 7,3c
3768 remove 7,3c
3731 en punlock 7,fffc
3731 remove 7,fffc
3768 ex punlock 0
3731 ex punlock 0
3731 en plock 7,fffc
3731 req 7,fffc ex 752978-752cec lkf 2000 wait 1
3731 ex plock 0
3731 en punlock 7,fffc
3731 remove 7,fffc
3731 ex punlock 0
3731 en plock 7,fffc
3731 req 7,fffc ex 752cec-752f52 lkf 2000 wait 1
3731 ex plock 0
3731 en punlock 7,fffc
3731 remove 7,fffc
3731 ex punlock 0
3768 en plock 7,3c
3768 req 7,3c ex 74f39b-750714 lkf 2000 wait 1
[...]
3731 ex punlock 0
3731 en plock 7,fffc
3768 req 7,3c ex 418c1a-55d28b lkf 2000 wait 1
3731 req 7,fffc ex 74affa-752481 lkf 2000 wait 1
3768 ex plock 0
3731 ex plock 0
3731 en punlock 7,fffc
3731 remove 7,fffc
3768 en punlock 7,3c
3731 ex punlock 0
3731 en plock 7,fffc

lock_dlm: Assertion failed on line 428 of file /usr/src/build/603160-x86_64/BUILD/gfs-kernel-2.6.9-39/smp/src/dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 4294987247
gfs1: num=11,fffc err=-22 cur=0 req=5 lkf=4

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:428
invalid operand: 0000 [1]
SMP
CPU 0
Modules linked in: parport_pc lp parport autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket pcmcia_core ohci_hcd hw_random tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
Pid: 3731, comm: doio Not tainted 2.6.9-16.ELsmp
RIP: 0010:[<ffffffffa02389e7>] <ffffffffa02389e7>{:lock_dlm:do_dlm_lock+365}
RSP: 0018:0000010033b47c78 EFLAGS: 00010212
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 0000000000020000
RDX: 0000000000006d42 RSI: 0000000000000246 RDI: ffffffff803d7be0
RBP: 0000010033f3c780 R08: 00000000fffffffb R09: 00000000ffffffea
R10: 0000000000000000 R11: 0000000000000000 R12: 00000100022cce00
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000752481
FS: 0000002a95576b00(0000) GS:ffffffff804d3300(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003a1b929000 CR3: 0000000000101000 CR4: 00000000000006e0
Process doio (pid: 3731, threadinfo 0000010033b46000, task 0000010033a62030)
Stack: 0000000000000005 0000000000000004 3131202020202020 2020202020202020
6366666620202020 0000000000000018 0000010033f3c780 0000010030638880
0000010033f3c7d8 00000100022cce00
Call Trace:
<ffffffffa0238a44>{:lock_dlm:do_dlm_lock_sync+54}
<ffffffffa023a660>{:lock_dlm:lock_resource+129}
<ffffffffa023b909>{:lock_dlm:lm_dlm_plock+588}
<ffffffff8013199d>{try_to_wake_up+734}
<ffffffff80134797>{autoremove_wake_function+9}
<ffffffff80132f4a>{__wake_up_common+67}
<ffffffffa01ff9bb>{:gfs:gfs_lm_plock+46}
<ffffffffa020a6a0>{:gfs:gfs_lock+191}
<ffffffff8018b66a>{fcntl_setlk+296}
<ffffffff80302ea4>{thread_return+42}
<ffffffff80187a34>{sys_fcntl+733}
<ffffffff80110052>{system_call+126}
Code: 0f 0b 52 d1 23 a0 ff ff ff ff ac 01 48 c7 c7 57 d1 23 a0 31
RIP <ffffffffa02389e7>{:lock_dlm:do_dlm_lock+365} RSP <0000010033b47c78>
[Interleaved with the second CPU's oops banner; de-interleaved:]
<0>Kernel panic - not syncing: Oops
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:428
invalid operand: 0000 [2]
SMP
CPU 1
Modules linked in: parport_pc lp parport autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket pcmcia_core ohci_hcd hw_random tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
Pid: 3768, comm: doio Not tainted 2.6.9-16.ELsmp
RIP: 0010:[<ffffffffa02389e7>] <ffffffffa02389e7>{:lock_dlm:do_dlm_lock+365}
RSP: 0018:00000100323d1d48 EFLAGS: 00010212
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 0000000000000246
RDX: 0000000000006db6 RSI: 0000000000000246 RDI: ffffffff803d7be0
RBP: 0000010031493d80 R08: 000000000000000d R09: 00000000ffffffea
R10: 0000000000000000 R11: 0000000000000000 R12: 000001003ff52200
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000418c1a
FS: 0000002a95576b00(0000) GS:ffffffff804d3380(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000006c1a10 CR3: 000000003ff58000 CR4: 00000000000006e0
Process doio (pid: 3768, threadinfo 00000100323d0000, task 0000010032361030)
Stack: 0000000000000005 0000000000000004 3131202020202020 2020202020202020
6333202020202020 0000000000550018 0000010031493d80 00000100301891c0
0000010031493dd8 0000010031a06060
Call Trace:
<ffffffffa0238a44>{:lock_dlm:do_dlm_lock_sync+54}
<ffffffffa023a660>{:lock_dlm:lock_resource+129}
<ffffffffa023b213>{:lock_dlm:lm_dlm_punlock+207}
<ffffffffa01ff9f6>{:gfs:gfs_lm_punlock+46}
<ffffffffa020a68e>{:gfs:gfs_lock+173}
<ffffffff8018b66a>{fcntl_setlk+296}
<ffffffff80302ea4>{thread_return+42}
<ffffffff80187a34>{sys_fcntl+733}
<ffffffff80110052>{system_call+126}
Code: 0f 0b 52 d1 23 a0 ff ff ff ff ac 01 48 c7 c7 57 d1 23 a0 31
RIP <ffffffffa02389e7>{:lock_dlm:do_dlm_lock+365} RSP <00000100323d1d48>
Badness in do_unblank_screen at drivers/char/vt.c:2876
Call Trace:
<ffffffff80231642>{do_unblank_screen+61}
<ffffffff80122e60>{bust_spinlocks+28}
<ffffffff80111834>{oops_end+18}
<ffffffff80111961>{die+54}
<ffffffff80111d24>{do_invalid_op+145}
<ffffffffa02389e7>{:lock_dlm:do_dlm_lock+365}
<ffffffff8013725a>{release_console_sem+369}
<ffffffff80137488>{vprintk+498}
<ffffffff80137532>{printk+141}
<ffffffff80110aed>{error_exit+0}
<ffffffffa02389e7>{:lock_dlm:do_dlm_lock+365}
<ffffffffa02389e7>{:lock_dlm:do_dlm_lock+365}
<ffffffffa0238a44>{:lock_dlm:do_dlm_lock_sync+54}
<ffffffffa023a660>{:lock_dlm:lock_resource+129}
<ffffffffa023b213>{:lock_dlm:lm_dlm_punlock+207}
<ffffffffa01ff9f6>{:gfs:gfs_lm_punlock+46}
<ffffffffa020a68e>{:gfs:gfs_lock+173}
<ffffffff8018b66a>{fcntl_setlk+296}
<ffffffff80302ea4>{thread_return+42}
<ffffffff80187a34>{sys_fcntl+733}
<ffffffff80110052>{system_call+126}
Marking this FAILSQA in order to get it back on the radar
and reassigning to Patrick...
I've no idea what I'm expected to do with this. If I leave cman running when the network goes away*, I get a bug saying that cman doesn't shut down and loops with an error message. If I make it shut down, I get a bugzilla saying that someone else's code is panicking!

Think about this sensibly, please. If the cluster manager cannot continue because its communication paths are severed, what should happen? cman itself shuts down; it notifies (as best it can) the subsystems that rely on it. It can do no more. What those subsystems do then is up to them.

This problem is not going to go away by being batted around between different people who deny it's their problem, and it's certainly not going to go away with the new code for RHEL5. Cluster managers shut down; code that depends on them needs to be able to deal with that.

Now, if we need a channel from the cluster manager that says explicitly "I am shutting down NOW, whether you like it or not", then I can provide that - but the code that hooks into it needs to know what it will do under those circumstances. Whatever that is, I imagine it won't be pretty, but perhaps it needn't be as ugly as a kernel panic - even if it is functionally the same thing ;-)

* By that I mean the interface has been downed by ifconfig or similar, or the whole IP stack has been shut down. Though this "bug" also happens when cman is killed for other, more clustery reasons, such as being isolated from a quorate cluster for too long.
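For what such a channel might look like, a hedged user-space sketch follows; none of these names (cman_register_emergency_cb and so on) exist in cman - they only illustrate the shape of a notifier that dependent subsystems could hook:

/* Sketch only: an "I am shutting down NOW" notifier that subsystems
 * (lock_dlm, clvmd, fenced, ...) could register with. */
#include <stdio.h>

#define MAX_SUBSYS 8

typedef void (*cman_emergency_cb)(void *private);

static struct {
        cman_emergency_cb cb;
        void *private;
} subsys[MAX_SUBSYS];
static int nr_subsys;

int cman_register_emergency_cb(cman_emergency_cb cb, void *private)
{
        if (nr_subsys >= MAX_SUBSYS)
                return -1;
        subsys[nr_subsys].cb = cb;
        subsys[nr_subsys].private = private;
        nr_subsys++;
        return 0;
}

/* Called once cman decides it cannot continue (e.g. ENETUNREACH on
 * every interface): tell the dependents before the sockets go away. */
void cman_emergency_shutdown(void)
{
        for (int i = nr_subsys - 1; i >= 0; i--)
                subsys[i].cb(subsys[i].private);
}

static void lock_dlm_quiesce(void *p)
{
        (void)p;
        /* Stop submitting lock requests and fail new ones with an errno
         * instead of asserting when they bounce off a dead DLM. */
        printf("lock_dlm: quiescing, waiting to be fenced\n");
}

int main(void)
{
        cman_register_emergency_cb(lock_dlm_quiesce, NULL);
        cman_emergency_shutdown();
        return 0;
}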
I just don't understand why the network interface is going "down" on these machines once they come back up from being rebooted.

Cluster = smith-01, smith-02, smith-03

smith-01 and smith-02 are rebooted by revolver, taken out of the cluster, come back up, rejoin the cluster, start the proper services, and mount the filesystems. Then all of a sudden (in one in 20 to 50 revolver iterations) you see this "CMAN: sendmsg failed: -101" (isn't that supposed to be fixed by bz 163587?) followed by the "No functional network interfaces" message (on smith-01). The network really dies? It was fine just a second ago and is still fine on the other nodes in the cluster (smith-02, smith-03). :)

This causes the next node (which just came back up) to notice the dead node and then die as well:

smith-02:

CMAN: removing node smith-01 from the cluster : Missed too many heartbeats
GFS: fsid=smith-cluster:gfs1.0: jid=1: Trying to acquire journal lock...
GFS: fsid=smith-cluster:gfs0.0: jid=1: Trying to acquire journal lock...
GFS: fsid=smith-cluster:gfs1.0: jid=1: Looking at journal...
GFS: fsid=smith-cluster:gfs0.0: jid=1: Looking at journal...
GFS: fsid=smith-cluster:gfs1.0: jid=1: Acquiring the transaction lock...
GFS: fsid=smith-cluster:gfs0.0: jid=1: Acquiring the transaction lock...
GFS: fsid=smith-cluster:gfs1.0: jid=1: Replaying journal...
GFS: fsid=smith-cluster:gfs0.0: jid=1: Replaying journal...
GFS: fsid=smith-cluster:gfs0.0: jid=1: Replayed 26 of 243 blocks
GFS: fsid=smith-cluster:gfs0.0: jid=1: replays = 26, skips = 77, sames = 140
GFS: fsid=smith-cluster:gfs0.0: jid=1: Journal replayed in 1s
GFS: fsid=smith-cluster:gfs0.0: jid=1: Done
GFS: fsid=smith-cluster:gfs1.0: jid=1: Replayed 2317 of 2317 blocks
GFS: fsid=smith-cluster:gfs1.0: jid=1: replays = 2317, skips = 0, sames = 0
GFS: fsid=smith-cluster:gfs1.0: jid=1: Journal replayed in 1s
GFS: fsid=smith-cluster:gfs1.0: jid=1: Done
CMAN: removing node smith-01 from the cluster : Missed too many heartbeats
CMAN: removing node smith-03 from the cluster : No response to messages
CMAN: quorum lost, blocking activity
CMAN: quorum regained, resuming activity
GFS: fsid=smith-cluster:gfs1.0: jid=1: Trying to acquire journal lock...
GFS: fsid=smith-cluster:gfs0.0: jid=1: Trying to acquire journal lock...
GFS: fsid=smith-cluster:gfs1.0: jid=1: Looking at journal...
GFS: fsid=smith-cluster:gfs0.0: jid=1: Looking at journal...
GFS: fsid=smith-cluster:gfs1.0: jid=1: Acquiring the transaction lock...
GFS: fsid=smith-cluster:gfs0.0: jid=1: Acquiring the transaction lock...
GFS: fsid=smith-cluster:gfs0.0: jid=1: Replaying journal...
GFS: fsid=smith-cluster:gfs1.0: jid=1: Replaying journal...
GFS: fsid=smith-cluster:gfs1.0: jid=1: Replayed 1557 of 1712 blocks
GFS: fsid=smith-cluster:gfs1.0: jid=1: replays = 1557, skips = 141, sames = 14
GFS: fsid=smith-cluster:gfs1.0: jid=1: Journal replayed in 1s
GFS: fsid=smith-cluster:gfs1.0: jid=1: Done
GFS: fsid=smith-cluster:gfs1.0: jid=2: Trying to acquire journal lock...
GFS: fsid=smith-cluster:gfs1.0: jid=2: Looking at journal...
GFS: fsid=smith-cluster:gfs1.0: jid=2: Acquiring the transaction lock...
GFS: fsid=smith-cluster:gfs1.0: jid=2: Replaying journal...
GFS: fsid=smith-cluster:gfs0.0: jid=1: Replayed 1177 of 2572 blocks
GFS: fsid=smith-cluster:gfs0.0: jid=1: replays = 1177, skips = 910, sames = 485
GFS: fsid=smith-cluster:gfs0.0: jid=1: Journal replayed in 2s
GFS: fsid=smith-cluster:gfs0.0: jid=1: Done
GFS: fsid=smith-cluster:gfs0.0: jid=2: Trying to acquire journal lock...
GFS: fsid=smith-cluster:gfs0.0: jid=2: Looking at journal...
GFS: fsid=smith-cluster:gfs1.0: jid=2: Replayed 555 of 579 blocks
GFS: fsid=smith-cluster:gfs1.0: jid=2: replays = 555, skips = 1, sames = 23
GFS: fsid=smith-cluster:gfs0.0: jid=2: Acquiring the transaction lock...
GFS: fsid=smith-cluster:gfs0.0: jid=2: Replaying journal...
GFS: fsid=smith-cluster:gfs1.0: jid=2: Journal replayed in 1s
GFS: fsid=smith-cluster:gfs1.0: jid=2: Done
GFS: fsid=smith-cluster:gfs0.0: jid=2: Replayed 111 of 111 blocks
GFS: fsid=smith-cluster:gfs0.0: jid=2: replays = 111, skips = 0, sames = 0
GFS: fsid=smith-cluster:gfs0.0: jid=2: Journal replayed in 1s
GFS: fsid=smith-cluster:gfs0.0: jid=2: Done
CMAN: sendmsg failed: -101
CMAN: No functional network interfaces, leaving cluster
CMAN: sendmsg failed: -101
CMAN: sendmsg failed: -101

The node which has been up this whole time again notices that they are dead, again:

smith-03:

CMAN: removing node smith-01 from the cluster : Missed too many heartbeats
CMAN: removing node smith-02 from the cluster : No response to messages
CMAN: quorum lost, blocking activity

Could there be a bigger Linux networking issue here?
Hang on, which bug is this? If it's "CMAN: sendmsg -101", then it's a duplicate of 163587 (and see my latest comments there). I thought it was about the lock_dlm assertion. I know they get seen at the same time (because one is often "caused" by the other), but we need to try to keep them separate or we're all going to get confused.
/me doesn't know which bug this is anymore. :) All I know is that I can no longer run extended recovery stress testing on cman/dlm clusters without hitting this network issue/message, which then causes the notorious cman assertion/panic. I wish I knew why cman thinks the network is dead. Nothing I did caused the network to die. The nodes come up after being rebooted, and the init scripts join the cluster and mount gfs just fine. So why, after that, are there network problems for cman, which cause panics?
The thing is, there's nothing I've done to cman that would change the frequency of occurrence of this message. All I've done is add a colon to it and make it exit (it doesn't panic; lock_dlm panics!) rather than looping endlessly. Now, I suppose it's possible that you were getting them before, but then the network came back up again so it went unnoticed. I've no idea how that would happen, though.

I suppose the thing to do is one or more of the following:
- Check the interfaces & routes as soon as possible after cman quits (a standalone check like the sketch below would do).
- tcpdump-check for ICMP messages that might tell the kernel that the broadcast route is down (I have no idea if these even exist, let alone where they might come from).
- Remove the "quit_threads=1" from the code and see if the messages still appear (I can't see how they would not) and whether the system recovers.

If the latter happens, then I suppose we could make it dependent on 2 or 3 fatal network errors before quitting, but in that case the node is still probably going to be fenced anyway on three heartbeat failures.
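For the first item, a standalone user-space check along these lines could be run the moment cman exits; this is purely illustrative and not part of cman:

/* Dump interface state right after cman quits, to see whether the
 * kernel really thinks the interfaces are gone. */
#include <ifaddrs.h>
#include <net/if.h>
#include <stdio.h>

int main(void)
{
        struct ifaddrs *ifa, *p;

        if (getifaddrs(&ifa) < 0) {
                perror("getifaddrs");
                return 1;
        }
        for (p = ifa; p; p = p->ifa_next) {
                /* IFF_UP/IFF_RUNNING (plus the routing table) are what
                 * a -101 (ENETUNREACH) usually hinges on. */
                printf("%-8s up=%d running=%d broadcast=%d\n",
                       p->ifa_name,
                       !!(p->ifa_flags & IFF_UP),
                       !!(p->ifa_flags & IFF_RUNNING),
                       !!(p->ifa_flags & IFF_BROADCAST));
        }
        freeifaddrs(ifa);
        return 0;
}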
Oh, one other thing: does it make a difference whether you're using broadcast or multicast? (And I still think this is the wrong bug report to discuss this problem!)
Right, I saw the error this morning using revolver, but it was on a node that was being shut down, so this may not be relevant. Now, my test nodes never shut down properly anyway, for a number of reasons unrelated to this, but can you check that the node that dies after the -101 message isn't one that revolver has just tried to kill?
See comment #10 and comment #12. The nodes which see the -101 messages and then end up panicking _are indeed_ the node(s) which revolver just shot. HOWEVER, after being shot by revolver and *before* the -101 messages and panic, these nodes come back up, just like they are supposed to, join the cluster, and mount the filesystems, and everything appears just fine for a moment - until that message happens and the machine(s) panic and go down *again*, for good. This only happens when revolver shoots more than one node and quorum is lost, and since I don't run with multicast, this is only seen with broadcast. Again, from the time these nodes are shot by revolver to the time they see that -101 message and panic is about 4 or 5 minutes, during which time the nodes come back up and join the cluster.
The reasons for -101 (ENETUNREACH) are often to do with routing. Is there any chance that something is mucking about with the routing tables? I can stick a MSG_DONTROUTE flag on the messages when we get a failure, just to try to force the messages out in that instance - if you could test it.
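A hedged sketch of that fallback in plain user-space sockets - the real change would live in cman's kernel send path, and the helper name and port number here are assumptions:

/* On ENETUNREACH (-101), retry the same datagram with MSG_DONTROUTE
 * to bypass a possibly-mangled routing table. */
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

static ssize_t cluster_sendmsg(int sock, struct msghdr *msg)
{
        ssize_t ret = sendmsg(sock, msg, 0);

        if (ret < 0 && errno == ENETUNREACH) {
                /* Route lookup failed; the interface itself may still
                 * be up, so try to push the datagram out directly. */
                ret = sendmsg(sock, msg, MSG_DONTROUTE);
        }
        return ret;
}

int main(void)
{
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in dst = { 0 };
        char payload[] = "heartbeat";
        struct iovec iov = { .iov_base = payload, .iov_len = sizeof(payload) };
        struct msghdr msg = { 0 };

        dst.sin_family = AF_INET;
        dst.sin_port = htons(6809);          /* assumed cman port */
        dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        msg.msg_name = &dst;
        msg.msg_namelen = sizeof(dst);
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;

        if (cluster_sendmsg(sock, &msg) < 0)
                perror("cluster_sendmsg");
        close(sock);
        return 0;
}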
Setting to NEEDINFO, as no one has yet answered any of the questions I've asked in this bz.
OK, I did a bit more work on this today. The only way I can reproduce the -101 is to actually take the interface down. However, it only takes one instance of this to bring CMAN down, which is obviously too intolerant. This checkin should give it "max_retries" attempts before giving up.

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v <-- cnxman.c
new revision: 1.42.2.17; previous revision: 1.42.2.16
done
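Since the diff itself isn't quoted here, the following is a hedged user-space reconstruction of the behaviour described; the names, the default of 3, and the reset-on-success rule are assumptions, not the actual cnxman.c change:

/* Tolerate max_retries consecutive fatal send errors before declaring
 * the network dead and leaving the cluster. */
#include <stdio.h>

#define MAX_RETRIES 3   /* assumed default */

static int send_failures;

/* Call with the (negative errno) result of each heartbeat sendmsg(). */
static int note_send_result(int err)
{
        if (err >= 0) {
                send_failures = 0;      /* one success resets the count */
                return 0;
        }
        if (++send_failures < MAX_RETRIES) {
                fprintf(stderr, "CMAN: sendmsg failed: %d (attempt %d)\n",
                        err, send_failures);
                return 0;               /* keep trying */
        }
        fprintf(stderr,
                "CMAN: No functional network interfaces, leaving cluster\n");
        return -1;                      /* now give up */
}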
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0166.html