Bug 172086 - Kernel panic when other node joins the cluster with cman and bonding
Status: CLOSED DUPLICATE of bug 173621
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: cman
Version: 4
Hardware: i686 Linux
Priority: medium
Severity: high
Assigned To: Christine Caulfield
QA Contact: Cluster QE
Reported: 2005-10-31 02:09 EST by tulldata
Modified: 2009-04-16 16:00 EDT
Doc Type: Bug Fix
Last Closed: 2005-11-21 08:52:14 EST
Description tulldata 2005-10-31 02:09:29 EST
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)

Description of problem:
After configuring a cluster, the first node's kernel panics when the other node joins with cman.

Console message:
SM: 01000002 process_stop_request: uevent already set

SM:  Assertion failed on line 106 of file /usr/src/build/572273-i686/BUILD/cman-kernel-2.6.9-36/src/sm_membership.c
SM:  assertion:  "node"
SM:  time = 4294735772
nodeid=1

Kernel panic - not syncing: SM:  Record message above and reboot.


Version-Release number of selected component (if applicable):
RHEL4 Update 1 GFS 6.1 ClusterSuite 3

How reproducible:
Always

Steps to Reproduce:
1. Configure a two-node cluster.
2. Start member 1.
3. Start member 2; member 1 panics almost every time.
  

Actual Results:  Kernel panic

Expected Results:  Member should have joined the cluster successfully.

Additional info:
Comment 1 tulldata 2005-10-31 06:47:17 EST
This does not occur when we disable the bonding...
/etc/modprobe.conf
alias eth0 tg3
alias eth1 tg3
alias scsi_hostadapter cciss
alias scsi_hostadapter1 qla2300
alias eth2 e1000
alias usb-controller ohci-hcd
# Bonding driver configuration
alias bond0 bonding
# High availability, failover after 100ms
options bond0 mode=1 miimon=100 mode=1 miimon=100
options hangcheck-timer hangcheck_tick=30 hangcheck_margin=180


cat /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BOOTPROTO=static
IPADDR=10.0.2.75
NETMASK=255.255.255.0
ONBOOT=yes


cat /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
HWADDR=00:14:C2:3D:6C:C4
IPADDR=138.215.28.73
NETMASK=255.255.255.0
ONBOOT=yes
TYPE=Ethernet

cat /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
HWADDR=00:14:C2:3D:6C:C3
ONBOOT=yes
MASTER=bond0
SLAVE=yes


cat /etc/sysconfig/network-scripts/ifcfg-eth2
DEVICE=eth2
HWADDR=00:04:23:BD:68:4D
ONBOOT=yes
MASTER=bond0
SLAVE=yes

Comment 2 Christine Caulfield 2005-11-02 05:11:09 EST
I can't reproduce this here. Can you give more details about how your network is
set up, please?
Comment 3 tulldata 2005-11-02 06:07:20 EST
The machines are attached to one switch, with the interconnect configured to a 
separate vlan.
The regular interface (eth0) is connected to a different vlan.
Comment 4 tulldata 2005-11-07 08:56:37 EST
Is it possible to turn on some debugging info?
Comment 5 Christine Caulfield 2005-11-07 10:52:38 EST
Not without rebuilding from source and uncommenting the DEBUG_COMMS line near
the bottom of cnxman-private.h

You might like to try the attached (completely untested) patch:

--- cnxman.c    14 Sep 2005 08:15:33 -0000      1.42.2.16
+++ cnxman.c    7 Nov 2005 15:54:50 -0000
@@ -859,7 +859,7 @@ static void process_incoming_packet(stru
         /* Have we received this message before ? If so just ignore it, it's a
         * resend for someone else's benefit */
        if (!(flags & MSG_NOACK) &&
-           rem_node && le16_to_cpu(header->seq) == rem_node->last_seq_recv) {
+           rem_node && (short)le16_to_cpu(header->seq) <= (short)rem_node->last_seq_recv) {
                P_COMMS
                    ("Discarding message - Already seen this sequence number %d\n",
                     rem_node->last_seq_recv);
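
For illustration only (this is not code from the cman source): the patch above changes the duplicate check from "sequence number equals the last one received" to "sequence number is less than or equal to the last one received, compared as signed 16-bit values". The minimal C sketch below contrasts the two checks on plain integers. The function names are hypothetical, and the third variant (serial-number style subtraction) is an assumption added for comparison, not something the patch proposes. The casts to int16_t assume the usual two's-complement behaviour.

```c
#include <stdint.h>

/* Original check: only an exact repeat of the last sequence number
 * counts as a duplicate. (Hypothetical helper name.) */
static int dup_exact(uint16_t seq, uint16_t last)
{
    return seq == last;
}

/* Patched check: compare both values as signed 16-bit numbers, so any
 * sequence number at or below the last one received is discarded. */
static int dup_signed_le(uint16_t seq, uint16_t last)
{
    return (int16_t)seq <= (int16_t)last;
}

/* Assumption, not part of the patch: compare the 16-bit difference
 * instead, which keeps working when the counter crosses 32767/32768. */
static int dup_serial(uint16_t seq, uint16_t last)
{
    return (int16_t)(uint16_t)(seq - last) <= 0;
}
```

With last = 5, an older packet with seq = 4 is accepted by dup_exact() but discarded by the other two. The variants differ at the signed boundary: with last = 32767 and seq = 32768, dup_signed_le() treats the newer packet as already seen, while dup_serial() does not.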
Comment 6 Christine Caulfield 2005-11-15 04:27:08 EST
If you can't change the source, how about a tcpdump of the traffic on UDP port
6809? (Ideally time-synchronized for each interface.)
Comment 7 David Juran 2005-11-21 08:52:14 EST
Bz 173621 was opened in response to a service request about this very issue, so
I'm closing this one.

*** This bug has been marked as a duplicate of 173621 ***
