Bug 142853 - kernel BUG at membership.c:244!
Summary: kernel BUG at membership.c:244!
Keywords:
Status: CLOSED DUPLICATE of bug 133512
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: cman
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Christine Caulfield
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On: 133512
Blocks: 133240
 
Reported: 2004-12-14 18:03 UTC by Adam "mantis" Manthei
Modified: 2009-04-16 19:59 UTC

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2006-02-21 19:07:41 UTC
Embargoed:



Description Adam "mantis" Manthei 2004-12-14 18:03:10 UTC
Description of problem:
While trying to get my cluster running, I ran into the following BUG:

kernel BUG at /usr/src/build/496234-i686/BUILD/cman-kernel-2.6.9-3/src/membership.c:244!
invalid operand: 0000 [#1]
Modules linked in: cman(U) md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc dm_mod button battery ac uhci_hcd hw_random e1000 floppy ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<f8b0c030>]    Not tainted VLI
EFLAGS: 00010246   (2.6.9-1.906_EL)
EIP is at do_process_endtrans+0x240/0x407 [cman]
eax: 00000014   ebx: f8b20a20   ecx: f8b14ec5   edx: f5f56f38
esi: f8b14ec5   edi: f5e5a160   ebp: f8b14eb0   esp: f5f56f3c
ds: 007b   es: 007b   ss: 0068
Process cman_memb (pid: 9979, threadinfo=f5f56000 task=f5f9e230)
Stack: f5f3f700 00000007 f8b20a20 f8b20a20 00000014 f5f56f84 f5f56f7c f8b0a9f1
       ffffffff f5f56f7c 00000000 f5f56f84 f699b500 f8b0d3ce f8b20a34 000005c8
       01000000 00000001 f5f56f7c 00000008 f5f56f74 00000001 00000000 00000000
Call Trace:
 [<f8b0a9f1>] do_membership_packet+0x141/0x1c0 [cman]
 [<f8b0d3ce>] dispatch_messages+0x85/0xb2 [cman]
 [<f8b096d8>] membership_kthread+0x3f8/0x577 [cman]
 [<c011b9fa>] default_wake_function+0x0/0xc
 [<f8b092e0>] membership_kthread+0x0/0x577 [cman]
 [<c01041d9>] kernel_thread_helper+0x5/0xb
Code: b2 f8 5a 8b 54 24 04 8b 04 90 ff 70 08 68 b0 4e b1 f8 e8 29 40 61 c7 5d 58 8b 0c 24 ff 71 08 68 c5 4e b1 f8 e8 17 40 61 c7 5e 5f <0f> 0b f4 00 5b 4d b1 f8 81 3d 40 16 b2 f8 3c 4b 24 1d 74 0f 68




Version-Release number of selected component (if applicable):
[root@trin-01 ~]# rpm -qa | grep -E "(cman|ccs|906)"
kernel-2.6.9-1.906_EL
cman-kernel-2.6.9-3.3
ccs-0.9-0
cman-1.0-0.pre5.0

How reproducible:
I've only seen it once.  Mike Tilstra has also reported seeing the BUG.

Steps to Reproduce:

Initially, none of the nodes were running ccsd.  I don't think that I
had the cman modules loaded on the nodes initially either, but I am not
as certain about that.

1. broadcast root@trin-0{1,2,3,4,6,7,8,9} -- killall ccsd
2. broadcast root@trin-0{1,2,3,4,6,7,8,9} -- ccsd
3. broadcast root@trin-0{1,2,3,4,6,7,8,9} -- modprobe cman
4. broadcast root@trin-0{1,2,3,4,6,7,8,9} -- cman_tool join -c mantis

broadcast is basically just a script that wraps calls to ssh in parallel.
It is similar to:

        for n in root@trin-0{1,2,3,4,6,7,8,9}
        do
                ssh "$n" "$command" &
        done
        wait
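
A minimal sketch of what such a wrapper might look like as a bash function,
matching the "hosts -- command" form used in the steps above.  This is not
the reporter's actual script: the argument parsing, the usage message, and
the SSH override variable are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical stand-in for the "broadcast" wrapper described above.
# SSH may be overridden (e.g. SSH=echo) for a dry run; this variable is
# an assumption of the sketch, not part of the original script.
SSH=${SSH:-ssh}

broadcast() {
    # Usage: broadcast host [host ...] -- command [args ...]
    local hosts=()
    while [ "$#" -gt 0 ] && [ "$1" != "--" ]; do
        hosts+=("$1")
        shift
    done
    if [ "$#" -eq 0 ]; then
        # No "--" separator was found, so there is no command to run.
        echo "usage: broadcast host... -- command" >&2
        return 2
    fi
    shift   # drop the "--" separator

    local n
    for n in "${hosts[@]}"; do
        # Start one ssh per host in the background so they run in parallel.
        "$SSH" "$n" "$@" &
    done
    wait    # block until every background ssh has exited
}
```

Because every ssh is started at nearly the same moment, each step (for
example "cman_tool join") reaches all eight nodes almost simultaneously,
which is likely relevant to reproducing a membership race like this one.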


Additional info:
One other interesting item: trin-06 never joined the cluster.  I'm not
sure why.  This is from the logs on trin-06:

Dec 13 18:19:34 trin-06 kernel: CMAN <CVS> (built Dec 13 2004 14:56:03) installed
Dec 13 18:19:34 trin-06 kernel: NET: Registered protocol family 30
Dec 13 18:19:38 trin-06 kernel: CMAN: Waiting to join or form a Linux-cluster
Dec 13 18:20:10 trin-06 kernel: CMAN: sending membership request
Dec 13 18:20:57 trin-06 last message repeated 9 times
Dec 13 18:22:00 trin-06 last message repeated 86 times
Dec 13 18:23:01 trin-06 last message repeated 86 times
Dec 13 18:24:02 trin-06 last message repeated 81 times
Dec 13 18:25:05 trin-06 last message repeated 74 times
Dec 13 18:25:26 trin-06 last message repeated 24 times
Dec 13 18:25:27 trin-06 kernel: CMAN: got node trin-04
Dec 13 18:25:27 trin-06 kernel: CMAN: got node trin-02
Dec 13 18:25:27 trin-06 kernel: CMAN: got node trin-08
Dec 13 18:25:27 trin-06 kernel: CMAN: got node trin-03
Dec 13 18:25:27 trin-06 kernel: CMAN: got node trin-01
Dec 13 18:25:27 trin-06 kernel: CMAN: got node trin-09
Dec 13 18:25:27 trin-06 kernel: CMAN: got node trin-07
Dec 13 18:25:27 trin-06 kernel: CMAN: quorum regained, resuming activity



This appeared in the logs of trin-03 (the node that produced the above
BUG) immediately before the node BUG'ed out (ignore the timestamps, as
they don't match those of node trin-06):

Dec 13 19:26:21 trin-03 kernel: CMAN <CVS> (built Dec 13 2004 14:56:03) installed
Dec 13 19:26:21 trin-03 kernel: NET: Registered protocol family 30
Dec 13 19:26:25 trin-03 kernel: CMAN: Waiting to join or form a Linux-cluster
Dec 13 19:26:57 trin-03 kernel: CMAN: sending membership request
Dec 13 19:27:07 trin-03 last message repeated 3 times
Dec 13 19:27:08 trin-03 kernel: CMAN: got node trin-07
Dec 13 19:27:08 trin-03 kernel: CMAN: got node trin-09
Dec 13 19:27:08 trin-03 kernel: CMAN: got node trin-01
Dec 13 19:27:08 trin-03 kernel: CMAN: got node trin-08
Dec 13 19:27:09 trin-03 kernel: CMAN: quorum regained, resuming activity
Dec 13 19:27:12 trin-03 kernel: CMAN: got node trin-02
Dec 13 19:27:13 trin-03 kernel: CMAN: got node trin-06
Dec 13 19:27:20 trin-03 kernel: CMAN: nmembers in HELLO message from 2 does not match our view (got 7, exp 6)
Dec 13 19:27:20 trin-03 kernel: CMAN: got node trin-04
Dec 13 19:27:34 trin-03 kernel: Attempt to re-add node with id 7
Dec 13 19:27:34 trin-03 kernel: existing node is trin-06
Dec 13 19:27:34 trin-03 kernel: new node is trin-04


Some more cryptic messages that appeared on node trin-01 shortly after
node trin-03 BUG'ed (again...the time stamps are a bit off).

Dec 13 19:28:32 trin-01 kernel: CMAN <CVS> (built Dec 13 2004 14:56:03) installed
Dec 13 19:28:32 trin-01 kernel: NET: Registered protocol family 30
Dec 13 19:28:36 trin-01 kernel: CMAN: Waiting to join or form a Linux-cluster
Dec 13 19:29:08 trin-01 kernel: CMAN: forming a new cluster
Dec 13 19:29:08 trin-01 kernel: CMAN: got node trin-09
Dec 13 19:29:13 trin-01 kernel: CMAN: got node trin-07
Dec 13 19:29:18 trin-01 kernel: CMAN: got node trin-03
Dec 13 19:29:19 trin-01 kernel: CMAN: got node trin-08
Dec 13 19:29:20 trin-01 kernel: CMAN: quorum regained, resuming activity
Dec 13 19:29:23 trin-01 kernel: CMAN: got node trin-02
Dec 13 19:29:24 trin-01 kernel: CMAN: nmembers in HELLO message from 4 does not match our view (got 5, exp 6)
Dec 13 19:29:31 trin-01 kernel: CMAN: got node trin-04
Dec 13 19:32:30 trin-01 kernel: CMAN: too many transition restarts - will die
Dec 13 19:32:30 trin-01 kernel: CMAN: we are leaving the cluster. Reason is 5
Dec 13 19:32:30 trin-01 kernel:

Comment 1 Adam "mantis" Manthei 2004-12-14 22:42:28 UTC
See attachment #108576 on bug #133512 for logs of the test run that produced the BUG().

Comment 2 Adam "mantis" Manthei 2004-12-15 16:28:29 UTC
See attachment #108628 on bug #142984, comment #1, for the init script that will
produce the BUG().

Comment 3 Christine Caulfield 2004-12-21 15:58:31 UTC
This seems to work for me. I'll run my tests overnight before changing
the status of this bug, though.

Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.42; previous revision: 1.41
done         


Comment 4 Christine Caulfield 2004-12-22 11:49:08 UTC

*** This bug has been marked as a duplicate of 133512 ***

Comment 5 Red Hat Bugzilla 2006-02-21 19:07:41 UTC
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.

