Bug 229797
Summary: | cman panic in free_cluster_sockets after having membership request denied | ||
---|---|---|---|
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Corey Marthaler <cmarthal> |
Component: | cman | Assignee: | Christine Caulfield <ccaulfie> |
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4 | CC: | cluster-maint |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | RHBA-2007-0135 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2007-05-10 21:22:34 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Corey Marthaler
2007-02-23 15:46:30 UTC
The node with the "bad" config file wasn't even the node that panicked; it was taft-02. The only problem was that its nodeid was missing and its config_version was one lower. The other three nodes (including taft-04) had the proper .conf file:

```xml
<?xml version="1.0"?>
<cluster config_version="4" name="TAFT_CLUSTER">
  <cman>
  </cman>
  <fence_daemon clean_start="0" post_fail_delay="30" post_join_delay="30"/>
  <clusternodes>
    <clusternode name="taft-01">
      <fence>
        <method name="1">
          <device name="qawti-01" option="off" port="1"/>
          <device name="qawti-01" option="off" port="5"/>
          <device name="qawti-01" option="on" port="1"/>
          <device name="qawti-01" option="on" port="5"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="taft-02">
      <fence>
        <method name="1">
          <device name="qawti-01" option="off" port="2"/>
          <device name="qawti-01" option="off" port="6"/>
          <device name="qawti-01" option="on" port="2"/>
          <device name="qawti-01" option="on" port="6"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="taft-03">
      <fence>
        <method name="1">
          <device name="qawti-01" option="off" port="3"/>
          <device name="qawti-01" option="off" port="7"/>
          <device name="qawti-01" option="on" port="3"/>
          <device name="qawti-01" option="on" port="7"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="taft-04">
      <fence>
        <method name="1">
          <device name="qawti-01" option="off" port="4"/>
          <device name="qawti-01" option="off" port="8"/>
          <device name="qawti-01" option="on" port="4"/>
          <device name="qawti-01" option="on" port="8"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_wti" ipaddr="10.15.89.2" name="qawti-01" passwd=" "/>
  </fencedevices>
</cluster>
```

Yeah, it looks like an unlucky race condition during shutdown. I don't know how easy this is to reproduce, but my guess is that it is a race between the startup script looping round to attempt another join and the kernel still tidying up after being refused.
I'll put some locking around the offending list, but in the meantime a sleep between being rejected and trying to rejoin should fix it (yes, it's one of those!). If it happens again (with or without the sleep), would it be possible to get a dump of the processes on the system, please, just to confirm or refute my hypothesis?

I can't reproduce the actual bug shown here, but I can reproduce some very odd behaviour that is caused by the same (I think) bug. So I've added a flag that will prevent cman_tool (or anything else, for that matter) from trying to start the cluster whilst it is in the process of tidying up.

RHEL4:
```
Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.28; previous revision: 1.42.2.27
done
```

STABLE:
```
Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.12.4.1.2.16; previous revision: 1.42.2.12.4.1.2.15
done
```

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0135.html
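The fix described above can be illustrated with a minimal userspace sketch. This is not the actual cnxman.c patch; the function names (`begin_cleanup`, `end_cleanup`, `cluster_join`) and the `-EBUSY` return are assumptions chosen for illustration. The idea is simply that shutdown sets a "tidying up" flag before teardown starts, and the join path refuses to proceed while the flag is set, so a startup script looping round to rejoin cannot race with the kernel's cleanup:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical sketch of the described fix (not the real cnxman.c code):
 * a flag that is raised while shutdown cleanup is in progress, checked
 * on every join attempt.
 */
static bool tidying_up = false;

/* Shutdown path: mark cleanup in progress before tearing down. */
static void begin_cleanup(void)
{
    tidying_up = true;
    /* ... free_cluster_sockets() and the rest of teardown would run here ... */
}

/* Called once all cluster state has been freed. */
static void end_cleanup(void)
{
    tidying_up = false;
}

/* Join path: refuse to start the cluster while cleanup is still running. */
static int cluster_join(void)
{
    if (tidying_up)
        return -EBUSY;
    return 0; /* join may proceed */
}
```

In the real kernel code the flag check and teardown would need locking against concurrent joiners (the comment above mentions "some locking around the offending list"); the plain flag here only conveys the shape of the guard.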