Bug 229797 - cman panic in free_cluster_sockets after having membership request denied
Status: CLOSED ERRATA
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: cman
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Christine Caulfield
QA Contact: Cluster QE
Reported: 2007-02-23 10:46 EST by Corey Marthaler
Modified: 2009-04-16 16:01 EDT
CC: 1 user

Fixed In Version: RHBA-2007-0135
Doc Type: Bug Fix
Last Closed: 2007-05-10 17:22:34 EDT
Description Corey Marthaler 2007-02-23 10:46:30 EST
Description of problem:
taft-04 panicked after all four nodes attempted to join the cluster.



Feb 23 09:38:48 taft-04 kernel: CMAN 2.6.9-48.6 (built Feb 14 2007 14:45:35) installed
Feb 23 09:38:48 taft-04 kernel: NET: Registered protocol family 30
Feb 23 09:38:48 taft-04 kernel: DLM 2.6.9-46.12 (built Feb 14 2007 16:48:10) installed
Feb 23 09:38:48 taft-04 ccsd[4793]: cluster.conf (cluster name = TAFT_CLUSTER, version
Feb 23 09:38:48 taft-04 ccsd[4793]: Remote copy of cluster.conf is newer than local cop
Feb 23 09:38:48 taft-04 ccsd[4793]:  Local version # : 4
Feb 23 09:38:48 taft-04 ccsd[4793]:  Remote version #: 5
Feb 23 09:38:49 taft-04 kernel: CMAN: Waiting to join or form a Linux-cluster
Feb 23 09:38:49 taft-04 ccsd[4793]: Connected to cluster infrastruture via: CMAN/SM Plu
Feb 23 09:38:49 taft-04 ccsd[4793]: Initial status:: Inquorate
CMAN: Cluster membership rejected
WARNING: dlm_emergency_shutdown
WARNING: dlm_emergency_shutdown
Feb 23 09:39:21 taft-04 kernel: CMAN: sending membership request
Feb 23 09:39:21 taft-04 kernel: CMAN: Cluster membership rejected
Feb 23 09:39:21 taft-04 kernel: WARNING: dlm_emergency_shutdown
Feb 23 09:39:21 taft-04 kernel: WARNING: dlm_emergency_shutdown
Feb 23 09:39:21 taft-04 ccsd[4793]: Cluster manager shutdown.  Attemping to reconnect..
Feb 23 09:39:21 taft-04 kernel: CMAN: Waiting to join or form a Linux-cluster
Feb 23 09:39:21 taft-04 ccsd[4793]: Connected to cluster infrastruture via: CMAN/SM Plu
Feb 23 09:39:21 taft-04 ccsd[4793]: Initial status:: Inquorate
CMAN: Cluster membership rejected
Feb 23 09:39:26 taft-04 kernel: CMAN: sending membership request
Feb 23 09:39:26 taft-04 kernel: CMAN: Cluster membership rejected
Feb 23 09:39:26 taft-04 ccsd[4793]: Cluster manager shutdown.  Attemping to reconnect..
Feb 23 09:39:26 taft-04 ccsd[4793]: Connected to cluster infrastruture via: CMAN/SM Plu
Feb 23 09:39:26 taft-04 kernel: CMAN: Waiting to join or form a Linux-cluster
Feb 23 09:39:26 taft-04 ccsd[4793]: Initial status:: Inquorate
CMAN: Cluster membership rejected
Feb 23 09:39:31 taft-04 kernel: CMAN: sending membership request
Feb 23 09:39:31 taft-04 kernel: CMAN: Cluster membership rejected
Feb 23 09:39:31 taft-04 ccsd[4793]: Cluster manager shutdown.  Attemping to reconnect..
Feb 23 09:39:31 taft-04 ccsd[4793]: Unable to connect to cluster infrastructure after 6
Feb 23 09:39:31 taft-04 kernel: CMAN: Waiting to join or form a Linux-cluster
Feb 23 09:39:32 taft-04 ccsd[4793]: Connected to cluster infrastruture via: CMAN/SM Plu
Feb 23 09:39:32 taft-04 ccsd[4793]: Initial status:: Inquorate
CMAN: Cluster membership rejected
CMAN: Cluster membership rejected
Unable to handle kernel paging request at 0000000000100108 RIP:
<ffffffffa02323c6>{:cman:free_cluster_sockets+33}
PML4 21370a067 PGD 212466067 PMD 0
Oops: 0002 [1] SMP
CPU 0
Modules linked in: dlm(U) cman(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc
ds yentd
Pid: 5701, comm: cman_comms Not tainted 2.6.9-48.ELsmp
RIP: 0010:[<ffffffffa02323c6>] <ffffffffa02323c6>{:cman:free_cluster_sockets+33}
RSP: 0018:000001020f387d88  EFLAGS: 00010207
RAX: 0000000000100100 RBX: 0000010213f665c0 RCX: 0000010213f66630
RDX: 0000000000200200 RSI: ffffffff804ee800 RDI: 00000101ffffd680
RBP: 0000000000100100 R08: 0000000100000000 R09: 0000000000000246
R10: 0000000000000000 R11: 0000010213f66630 R12: 0000010213ddc598
R13: 0000010215943200 R14: ffffffffa0252de0 R15: ffffffffa0252ca0
FS:  0000002a95562b00(0000) GS:ffffffff804ed500(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000100108 CR3: 0000000000101000 CR4: 00000000000006e0
Process cman_comms (pid: 5701, threadinfo 000001020f386000, task 00000102167cf7f0)
Stack: ffffffffa0252da0 ffffffffa0252da0 ffffffffa0252d68 ffffffffa0235afe
       ffffffffa0252c50 fffffff500000002 0000010213f665c0 ffffffffa0252cd0
       0000000000000001 000000000000001f
Call Trace:<ffffffffa0235afe>{:cman:cluster_kthread+5142}
<ffffffff8014335f>{do_notify_
       <ffffffff80134660>{default_wake_function+0} <ffffffff80110f47>{child_rip+8}
       <ffffffffa02346e8>{:cman:cluster_kthread+0} <ffffffff80110f3f>{child_rip+0}


Code: 48 89 50 08 48 89 02 48 8b 79 c8 48 c7 01 00 01 10 00 48 c7
RIP <ffffffffa02323c6>{:cman:free_cluster_sockets+33} RSP <000001020f387d88>
CR2: 0000000000100108
 <0>Kernel panic - not syncing: Oops


Version-Release number of selected component (if applicable):
2.6.9-48.ELsmp
RHEL4-U5-re20070216.2
CMAN 2.6.9-48.6 (built Feb 14 2007 14:45:35) installed
DLM 2.6.9-46.12 (built Feb 14 2007 16:48:10) installed
Comment 1 Corey Marthaler 2007-02-23 11:00:20 EST
The node with the "bad" config file, taft-02, wasn't even the node that panicked. The only problems with its copy were that the nodeid wasn't there and the version was one lower. The other three nodes (including taft-04) had the proper .conf file.

<?xml version="1.0"?>
<cluster config_version="4" name="TAFT_CLUSTER">
  <cman>
  </cman>
  <fence_daemon clean_start="0" post_fail_delay="30" post_join_delay="30"/>
  <clusternodes>
    <clusternode name="taft-01">
      <fence>
        <method name="1">
          <device name="qawti-01" option="off" port="1"/>
          <device name="qawti-01" option="off" port="5"/>
          <device name="qawti-01" option="on" port="1"/>
          <device name="qawti-01" option="on" port="5"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="taft-02">
      <fence>
        <method name="1">
          <device name="qawti-01" option="off" port="2"/>
          <device name="qawti-01" option="off" port="6"/>
          <device name="qawti-01" option="on" port="2"/>
          <device name="qawti-01" option="on" port="6"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="taft-03">
      <fence>
        <method name="1">
          <device name="qawti-01" option="off" port="3"/>
          <device name="qawti-01" option="off" port="7"/>
          <device name="qawti-01" option="on" port="3"/>
          <device name="qawti-01" option="on" port="7"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="taft-04">
      <fence>
        <method name="1">
          <device name="qawti-01" option="off" port="4"/>
          <device name="qawti-01" option="off" port="8"/>
          <device name="qawti-01" option="on" port="4"/>
          <device name="qawti-01" option="on" port="8"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_wti" ipaddr="10.15.89.2" name="qawti-01" passwd=" "/>
  </fencedevices>
</cluster>
Comment 2 Christine Caulfield 2007-02-23 11:11:19 EST
Yeah, it looks like an unlucky race condition during shutdown.
Comment 3 Christine Caulfield 2007-02-26 09:25:34 EST
I don't know how easy this is to reproduce, but my guess is that it is a race between the startup script looping round to attempt another join while the kernel is still tidying up after being refused.

I'll put some locking around the offending list, but in the meantime a sleep between being rejected and trying to rejoin should avoid it (yes, it's one of those!).

If it happens again (with or without the sleep), would it be possible to get a dump of the processes on the system, please, just to confirm or refute my hypothesis?
Comment 4 Christine Caulfield 2007-02-27 05:40:12 EST
I can't reproduce the actual bug shown here but I can reproduce some very odd
behaviour that is caused by the same (I think) bug.

So I've added a flag that will prevent cman_tool (or anything else, for that matter) from trying to start the cluster whilst it is in the process of tidying up.

-rRHEL4:
Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.28; previous revision: 1.42.2.27
done

-rSTABLE
Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.12.4.1.2.16; previous revision: 1.42.2.12.4.1.2.15
done
Comment 7 Red Hat Bugzilla 2007-05-10 17:22:34 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0135.html
