Bug 229797

Summary: cman panic in free_cluster_sockets after having membership request denied
Product: [Retired] Red Hat Cluster Suite
Reporter: Corey Marthaler <cmarthal>
Component: cman
Assignee: Christine Caulfield <ccaulfie>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Docs Contact:
Priority: medium
Version: 4
CC: cluster-maint
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: RHBA-2007-0135
Doc Type: Bug Fix
Last Closed: 2007-05-10 21:22:34 UTC
Type: ---

Description Corey Marthaler 2007-02-23 15:46:30 UTC
Description of problem:
taft-04 panicked while all four nodes were attempting to join a cluster.



Feb 23 09:38:48 taft-04 kernel: CMAN 2.6.9-48.6 (built Feb 14 2007 14:45:35) installed
Feb 23 09:38:48 taft-04 kernel: NET: Registered protocol family 30
Feb 23 09:38:48 taft-04 kernel: DLM 2.6.9-46.12 (built Feb 14 2007 16:48:10) installed
Feb 23 09:38:48 taft-04 ccsd[4793]: cluster.conf (cluster name = TAFT_CLUSTER, version
Feb 23 09:38:48 taft-04 ccsd[4793]: Remote copy of cluster.conf is newer than local cop
Feb 23 09:38:48 taft-04 ccsd[4793]:  Local version # : 4
Feb 23 09:38:48 taft-04 ccsd[4793]:  Remote version #: 5
Feb 23 09:38:49 taft-04 kernel: CMAN: Waiting to join or form a Linux-cluster
Feb 23 09:38:49 taft-04 ccsd[4793]: Connected to cluster infrastruture via: CMAN/SM Plu
Feb 23 09:38:49 taft-04 ccsd[4793]: Initial status:: Inquorate
CMAN: Cluster membership rejected
WARNING: dlm_emergency_shutdown
WARNING: dlm_emergency_shutdown
Feb 23 09:39:21 taft-04 kernel: CMAN: sending membership request
Feb 23 09:39:21 taft-04 kernel: CMAN: Cluster membership rejected
Feb 23 09:39:21 taft-04 kernel: WARNING: dlm_emergency_shutdown
Feb 23 09:39:21 taft-04 kernel: WARNING: dlm_emergency_shutdown
Feb 23 09:39:21 taft-04 ccsd[4793]: Cluster manager shutdown.  Attemping to reconnect..
Feb 23 09:39:21 taft-04 kernel: CMAN: Waiting to join or form a Linux-cluster
Feb 23 09:39:21 taft-04 ccsd[4793]: Connected to cluster infrastruture via: CMAN/SM Plu
Feb 23 09:39:21 taft-04 ccsd[4793]: Initial status:: Inquorate
CMAN: Cluster membership rejected
Feb 23 09:39:26 taft-04 kernel: CMAN: sending membership request
Feb 23 09:39:26 taft-04 kernel: CMAN: Cluster membership rejected
Feb 23 09:39:26 taft-04 ccsd[4793]: Cluster manager shutdown.  Attemping to reconnect..
Feb 23 09:39:26 taft-04 ccsd[4793]: Connected to cluster infrastruture via: CMAN/SM Plu
Feb 23 09:39:26 taft-04 kernel: CMAN: Waiting to join or form a Linux-cluster
Feb 23 09:39:26 taft-04 ccsd[4793]: Initial status:: Inquorate
CMAN: Cluster membership rejected
Feb 23 09:39:31 taft-04 kernel: CMAN: sending membership request
Feb 23 09:39:31 taft-04 kernel: CMAN: Cluster membership rejected
Feb 23 09:39:31 taft-04 ccsd[4793]: Cluster manager shutdown.  Attemping to reconnect..
Feb 23 09:39:31 taft-04 ccsd[4793]: Unable to connect to cluster infrastructure after 6
Feb 23 09:39:31 taft-04 kernel: CMAN: Waiting to join or form a Linux-cluster
Feb 23 09:39:32 taft-04 ccsd[4793]: Connected to cluster infrastruture via: CMAN/SM Plu
Feb 23 09:39:32 taft-04 ccsd[4793]: Initial status:: Inquorate
CMAN: Cluster membership rejected
Unable to handle kernel paging request at 0000000000100108 RIP:
<ffffffffa02323c6>{:cman:free_cluster_sockets+33}
PML4 21370a067 PGD 212466067 PMD 0
Oops: 0002 [1] SMP
CPU 0
Modules linked in: dlm(U) cman(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc ds yentd
Pid: 5701, comm: cman_comms Not tainted 2.6.9-48.ELsmp
RIP: 0010:[<ffffffffa02323c6>] <ffffffffa02323c6>{:cman:free_cluster_sockets+33}
RSP: 0018:000001020f387d88  EFLAGS: 00010207
RAX: 0000000000100100 RBX: 0000010213f665c0 RCX: 0000010213f66630
RDX: 0000000000200200 RSI: ffffffff804ee800 RDI: 00000101ffffd680
RBP: 0000000000100100 R08: 0000000100000000 R09: 0000000000000246
R10: 0000000000000000 R11: 0000010213f66630 R12: 0000010213ddc598
R13: 0000010215943200 R14: ffffffffa0252de0 R15: ffffffffa0252ca0
FS:  0000002a95562b00(0000) GS:ffffffff804ed500(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000100108 CR3: 0000000000101000 CR4: 00000000000006e0
Process cman_comms (pid: 5701, threadinfo 000001020f386000, task 00000102167cf7f0)
Stack: ffffffffa0252da0 ffffffffa0252da0 ffffffffa0252d68 ffffffffa0235afe
       ffffffffa0252c50 fffffff500000002 0000010213f665c0 ffffffffa0252cd0
       0000000000000001 000000000000001f
Call Trace:<ffffffffa0235afe>{:cman:cluster_kthread+5142} <ffffffff8014335f>{do_notify_
       <ffffffff80134660>{default_wake_function+0} <ffffffff80110f47>{child_rip+8}
       <ffffffffa02346e8>{:cman:cluster_kthread+0} <ffffffff80110f3f>{child_rip+0}


Code: 48 89 50 08 48 89 02 48 8b 79 c8 48 c7 01 00 01 10 00 48 c7
RIP <ffffffffa02323c6>{:cman:free_cluster_sockets+33} RSP <000001020f387d88>
CR2: 0000000000100108
 <0>Kernel panic - not syncing: Oops

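The register dump above is consistent with a second list_del() on an entry that had already been unlinked. RAX and RDX hold the poison values that 2.6 kernels write into a deleted entry's next and prev pointers (LIST_POISON1 and LIST_POISON2 in include/linux/list.h), and the faulting address in CR2 is RAX plus 8, the offset of prev in struct list_head with 8-byte pointers. A small Python check of that arithmetic follows; the register values are copied from the oops, and the helper name is made up for illustration.

```python
# Values copied from the oops above.
RAX = 0x0000000000100100
RDX = 0x0000000000200200
CR2 = 0x0000000000100108

# Poison values from include/linux/list.h in 2.6 kernels.
LIST_POISON1 = 0x00100100  # written to entry->next by list_del()
LIST_POISON2 = 0x00200200  # written to entry->prev by list_del()

# struct list_head is { next, prev }; with 8-byte pointers prev is at offset 8.
PREV_OFFSET = 8


def looks_like_double_list_del(next_ptr, prev_ptr, fault_addr):
    """True if the fault matches 'next->prev = prev' on a poisoned entry."""
    return (next_ptr == LIST_POISON1
            and prev_ptr == LIST_POISON2
            and fault_addr == next_ptr + PREV_OFFSET)


print(looks_like_double_list_del(RAX, RDX, CR2))  # True
```

This only shows that the fault pattern fits a double delete; it does not identify which two code paths removed the entry.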

Version-Release number of selected component (if applicable):
2.6.9-48.ELsmp
RHEL4-U5-re20070216.2
CMAN 2.6.9-48.6 (built Feb 14 2007 14:45:35) installed
DLM 2.6.9-46.12 (built Feb 14 2007 16:48:10) installed

Comment 1 Corey Marthaler 2007-02-23 16:00:20 UTC
The node with the "bad" config file was taft-02, not the node that panicked
(taft-04). The only problems with its file were that the nodeid attributes were
missing and the version was one lower. The other three nodes (including taft-04)
had the proper .conf file.

<?xml version="1.0"?>
<cluster config_version="4" name="TAFT_CLUSTER">
  <cman>
  </cman>
  <fence_daemon clean_start="0" post_fail_delay="30" post_join_delay="30"/>
  <clusternodes>
    <clusternode name="taft-01">
      <fence>
        <method name="1">
          <device name="qawti-01" option="off" port="1"/>
          <device name="qawti-01" option="off" port="5"/>
          <device name="qawti-01" option="on" port="1"/>
          <device name="qawti-01" option="on" port="5"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="taft-02">
      <fence>
        <method name="1">
          <device name="qawti-01" option="off" port="2"/>
          <device name="qawti-01" option="off" port="6"/>
          <device name="qawti-01" option="on" port="2"/>
          <device name="qawti-01" option="on" port="6"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="taft-03">
      <fence>
        <method name="1">
          <device name="qawti-01" option="off" port="3"/>
          <device name="qawti-01" option="off" port="7"/>
          <device name="qawti-01" option="on" port="3"/>
          <device name="qawti-01" option="on" port="7"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="taft-04">
      <fence>
        <method name="1">
          <device name="qawti-01" option="off" port="4"/>
          <device name="qawti-01" option="off" port="8"/>
          <device name="qawti-01" option="on" port="4"/>
          <device name="qawti-01" option="on" port="8"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_wti" ipaddr="10.15.89.2" name="qawti-01" passwd=" "/>
  </fencedevices>
</cluster>


Comment 2 Christine Caulfield 2007-02-23 16:11:19 UTC
Yeah, it looks like an unlucky race condition during shutdown.

Comment 3 Christine Caulfield 2007-02-26 14:25:34 UTC
I don't know how easy this is to reproduce, but my guess is that it is a race
between the startup script looping round to attempt another join while the
kernel is still tidying up after being refused.

I'll put some locking around the offending list but in the meantime a sleep
between being rejected and trying to rejoin should fix it (yes, it's one of those!).

If it happens again (with or without the sleep), would it be possible to get a
dump of the processes on the system please, just to confirm or refute my hypothesis?
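
The sleep workaround described above could look roughly like the following. This is a minimal Python sketch rather than the actual startup script: try_join is a hypothetical callable standing in for whatever the script runs to attempt a join, and the attempt count and five-second pause are assumptions, not values taken from this report.

```python
import time


def join_with_pause(try_join, attempts=5, pause_seconds=5.0, sleep=time.sleep):
    """Call try_join() until it succeeds, pausing after each rejection.

    try_join is a hypothetical callable that returns True when the node was
    admitted and False when membership was rejected. The pause gives the
    kernel time to finish tidying up before the next join is attempted.
    Returns the 1-based attempt number that succeeded, or None.
    """
    for attempt in range(1, attempts + 1):
        if try_join():
            return attempt
        if attempt < attempts:
            # Wait before retrying so the rejoin does not race the cleanup.
            sleep(pause_seconds)
    return None
```

A pause only narrows the window; it does not close it, which is why locking is still proposed as the real fix.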

Comment 4 Christine Caulfield 2007-02-27 10:40:12 UTC
I can't reproduce the actual bug shown here but I can reproduce some very odd
behaviour that is caused by the same (I think) bug.

So I've added a flag that will prevent cman_tool (or anything else for that
matter) trying to start the cluster whilst it is in the process of tidying up.
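
The idea of such a flag can be modelled in user space as below. This is only an illustrative Python sketch, not the change committed to cnxman.c: the class, method, and flag names are hypothetical, and returning -EBUSY while cleanup is in progress is an assumption about how a refused start might be reported.

```python
import errno
import threading


class ClusterState:
    """User-space model of a 'tidying up' guard; all names are hypothetical."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tidying_up = False
        self.running = False

    def start(self):
        """Return 0 on success, or -EBUSY while cleanup is in progress."""
        with self._lock:
            if self._tidying_up:
                return -errno.EBUSY
            self.running = True
            return 0

    def begin_shutdown(self):
        """Mark cleanup as started; start() is refused from this point on."""
        with self._lock:
            self._tidying_up = True
            self.running = False

    def finish_shutdown(self):
        """Mark cleanup as complete; start() is allowed again."""
        with self._lock:
            self._tidying_up = False
```

Checking and setting the flag under one lock means a start request either sees the cleanup as in progress and is refused, or runs entirely before the cleanup begins.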

-rRHEL4:
Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.28; previous revision: 1.42.2.27
done

-rSTABLE
Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.12.4.1.2.16; previous revision: 1.42.2.12.4.1.2.15
done


Comment 7 Red Hat Bugzilla 2007-05-10 21:22:34 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0135.html