Description of problem:
While trying to get my cluster running, I ran into the following BUG:

kernel BUG at /usr/src/build/496234-i686/BUILD/cman-kernel-2.6.9-3/src/membership.c:244!
invalid operand: 0000 [#1]
Modules linked in: cman(U) md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc dm_mod button battery ac uhci_hcd hw_random e1000 floppy ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU: 0
EIP: 0060:[<f8b0c030>] Not tainted VLI
EFLAGS: 00010246 (2.6.9-1.906_EL)
EIP is at do_process_endtrans+0x240/0x407 [cman]
eax: 00000014 ebx: f8b20a20 ecx: f8b14ec5 edx: f5f56f38
esi: f8b14ec5 edi: f5e5a160 ebp: f8b14eb0 esp: f5f56f3c
ds: 007b es: 007b ss: 0068
Process cman_memb (pid: 9979, threadinfo=f5f56000 task=f5f9e230)
Stack: f5f3f700 00000007 f8b20a20 f8b20a20 00000014 f5f56f84 f5f56f7c f8b0a9f1
       ffffffff f5f56f7c 00000000 f5f56f84 f699b500 f8b0d3ce f8b20a34 000005c8
       01000000 00000001 f5f56f7c 00000008 f5f56f74 00000001 00000000 00000000
Call Trace:
 [<f8b0a9f1>] do_membership_packet+0x141/0x1c0 [cman]
 [<f8b0d3ce>] dispatch_messages+0x85/0xb2 [cman]
 [<f8b096d8>] membership_kthread+0x3f8/0x577 [cman]
 [<c011b9fa>] default_wake_function+0x0/0xc
 [<f8b092e0>] membership_kthread+0x0/0x577 [cman]
 [<c01041d9>] kernel_thread_helper+0x5/0xb
Code: b2 f8 5a 8b 54 24 04 8b 04 90 ff 70 08 68 b0 4e b1 f8 e8 29 40 61 c7 5d 58 8b 0c 24 ff 71 08 68 c5 4e b1 f8 e8 17 40 61 c7 5e 5f <0f> 0b f4 00 5b 4d b1 f8 81 3d 40 16 b2 f8 3c 4b 24 1d 74 0f 68

Version-Release number of selected component (if applicable):
[root@trin-01 ~]# rpm -qa | grep -E "(cman|ccs|906)"
kernel-2.6.9-1.906_EL
cman-kernel-2.6.9-3.3
ccs-0.9-0
cman-1.0-0.pre5.0

How reproducible:
I've only seen it once. Mike Tilstra has also reported seeing the BUG.

Steps to Reproduce:
None of the nodes were running ccsd initially. I don't think I had the cman modules loaded on the nodes initially either, but I am less certain about that.
1. broadcast root@trin-0{1,2,3,4,6,7,8,9} -- killall ccsd
2. broadcast root@trin-0{1,2,3,4,6,7,8,9} -- ccsd
3. broadcast root@trin-0{1,2,3,4,6,7,8,9} -- modprobe cman
4. broadcast root@trin-0{1,2,3,4,6,7,8,9} -- cman_tool join -c mantis

broadcast is basically just a script that wraps calls to ssh in parallel. It is similar to:

for n in root@trin-0{1,2,3,4,6,7,8,9}
do
    ssh $n $command &
done
wait

Additional info:
One other interesting item: trin-06 never joined the cluster. I'm not sure why. This is from the logs on trin-06:

Dec 13 18:19:34 trin-06 kernel: CMAN <CVS> (built Dec 13 2004 14:56:03) installed
Dec 13 18:19:34 trin-06 kernel: NET: Registered protocol family 30
Dec 13 18:19:38 trin-06 kernel: CMAN: Waiting to join or form a Linux-cluster
Dec 13 18:20:10 trin-06 kernel: CMAN: sending membership request
Dec 13 18:20:57 trin-06 last message repeated 9 times
Dec 13 18:22:00 trin-06 last message repeated 86 times
Dec 13 18:23:01 trin-06 last message repeated 86 times
Dec 13 18:24:02 trin-06 last message repeated 81 times
Dec 13 18:25:05 trin-06 last message repeated 74 times
Dec 13 18:25:26 trin-06 last message repeated 24 times
Dec 13 18:25:27 trin-06 kernel: CMAN: got node trin-04
Dec 13 18:25:27 trin-06 kernel: CMAN: got node trin-02
Dec 13 18:25:27 trin-06 kernel: CMAN: got node trin-08
Dec 13 18:25:27 trin-06 kernel: CMAN: got node trin-03
Dec 13 18:25:27 trin-06 kernel: CMAN: got node trin-01
Dec 13 18:25:27 trin-06 kernel: CMAN: got node trin-09
Dec 13 18:25:27 trin-06 kernel: CMAN: got node trin-07
Dec 13 18:25:27 trin-06 kernel: CMAN: quorum regained, resuming activity

This appeared in the logs of trin-03 (the node that produced the above BUG) immediately before the node BUG'ed out (ignore the timestamps, as they don't match those of node trin-06):

Dec 13 19:26:21 trin-03 kernel: CMAN <CVS> (built Dec 13 2004 14:56:03) installed
Dec 13 19:26:21 trin-03 kernel: NET: Registered protocol family 30
Dec 13 19:26:25 trin-03 kernel: CMAN: Waiting to join or form a Linux-cluster
Dec 13 19:26:57 trin-03 kernel: CMAN: sending membership request
Dec 13 19:27:07 trin-03 last message repeated 3 times
Dec 13 19:27:08 trin-03 kernel: CMAN: got node trin-07
Dec 13 19:27:08 trin-03 kernel: CMAN: got node trin-09
Dec 13 19:27:08 trin-03 kernel: CMAN: got node trin-01
Dec 13 19:27:08 trin-03 kernel: CMAN: got node trin-08
Dec 13 19:27:09 trin-03 kernel: CMAN: quorum regained, resuming activity
Dec 13 19:27:12 trin-03 kernel: CMAN: got node trin-02
Dec 13 19:27:13 trin-03 kernel: CMAN: got node trin-06
Dec 13 19:27:20 trin-03 kernel: CMAN: nmembers in HELLO message from 2 does not match our view (got 7, exp 6)
Dec 13 19:27:20 trin-03 kernel: CMAN: got node trin-04
Dec 13 19:27:34 trin-03 kernel: Attempt to re-add node with id 7
Dec 13 19:27:34 trin-03 kernel: existing node is trin-06
Dec 13 19:27:34 trin-03 kernel: new node is trin-04

Some more cryptic messages appeared on node trin-01 shortly after node trin-03 BUG'ed (again, the timestamps are a bit off):

Dec 13 19:28:32 trin-01 kernel: CMAN <CVS> (built Dec 13 2004 14:56:03) installed
Dec 13 19:28:32 trin-01 kernel: NET: Registered protocol family 30
Dec 13 19:28:36 trin-01 kernel: CMAN: Waiting to join or form a Linux-cluster
Dec 13 19:29:08 trin-01 kernel: CMAN: forming a new cluster
Dec 13 19:29:08 trin-01 kernel: CMAN: got node trin-09
Dec 13 19:29:13 trin-01 kernel: CMAN: got node trin-07
Dec 13 19:29:18 trin-01 kernel: CMAN: got node trin-03
Dec 13 19:29:19 trin-01 kernel: CMAN: got node trin-08
Dec 13 19:29:20 trin-01 kernel: CMAN: quorum regained, resuming activity
Dec 13 19:29:23 trin-01 kernel: CMAN: got node trin-02
Dec 13 19:29:24 trin-01 kernel: CMAN: nmembers in HELLO message from 4 does not match our view (got 5, exp 6)
Dec 13 19:29:31 trin-01 kernel: CMAN: got node trin-04
Dec 13 19:32:30 trin-01 kernel: CMAN: too many transition restarts - will die
Dec 13 19:32:30 trin-01 kernel: CMAN: we are leaving the cluster. Reason is 5
Dec 13 19:32:30 trin-01 kernel:
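
For anyone trying to reproduce this, the broadcast wrapper described under "Steps to Reproduce" could be approximated by something like the bash function below. This is only a sketch under assumptions, not the reporter's actual script: the function body, the "--" argument parsing, and the RUNNER variable (added so the function can be dry-run with echo instead of ssh) are all hypothetical.

```shell
#!/bin/bash
# Hypothetical sketch of a "broadcast"-style wrapper; not the reporter's script.
# RUNNER is an assumption added here so the function can be dry-run
# (for example RUNNER=echo) without touching any real node.
RUNNER="${RUNNER:-ssh}"

broadcast() {
    # Usage: broadcast host1 host2 ... -- command [args...]
    local hosts=()
    # Collect host arguments up to the "--" separator.
    while [ "$#" -gt 0 ] && [ "$1" != "--" ]; do
        hosts+=("$1")
        shift
    done
    # No arguments left means the "--" separator was never seen.
    if [ "$#" -eq 0 ]; then
        echo "usage: broadcast host... -- command [args...]" >&2
        return 2
    fi
    shift  # drop the "--" separator
    local n
    for n in "${hosts[@]}"; do
        # Start one command per host in the background so they run in parallel.
        "$RUNNER" "$n" "$@" &
    done
    # Block until every background job has finished.
    wait
}
```

With this sketch, step 1 would be run as `broadcast root@trin-0{1,2,3,4,6,7,8,9} -- killall ccsd`; bash expands the brace pattern into the eight host names before the function sees them. Because every node starts the command at nearly the same moment, steps 3 and 4 have all eight nodes loading cman and joining the cluster at nearly the same time.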
See attachment #108576 on bug #133512 for logs of a test run that produces the BUG().
See attachment #108628 on bug #142984, comment #1, for the init script that will produce the BUG().
This seems to work for me. I'll run my tests overnight before changing the status of this bug, though.

Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.42; previous revision: 1.41
done
*** This bug has been marked as a duplicate of 133512 ***
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.