Bug 233521
Summary: | cman panic in start_transition after cmirror device failure caused node to shut down | ||
---|---|---|---|
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Corey Marthaler <cmarthal> |
Component: | cman-kernel | Assignee: | Christine Caulfield <ccaulfie> |
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4 | CC: | cluster-maint |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | RHBA-2007-0990 | Doc Type: | Bug Fix |
Last Closed: | 2007-11-21 21:54:06 UTC | Type: | --- |
Description
Corey Marthaler
2007-03-22 21:26:23 UTC
Here's a little more info from the only node still "in" the cluster:

[root@link-07 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    4   X   link-08
   2    1    4   M   link-07
   3    1    4   X   link-02
   4    1    4   X   link-04

[root@link-07 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           3   2 recover 0 -
[2]

DLM Lock Space:  "clvmd"                           377 257 recover 0 -
[2]

DLM Lock Space:  "clustered_log"                   379 259 recover 0 -
[2]

[root@link-07 cluster]# cat status
Protocol version: 5.0.1
Config version: 2
Cluster name: LINK_128
Cluster ID: 19208
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 4
Total_votes: 1
Quorum: 3   Activity blocked
Active subsystems: 4
Node name: link-07
Node ID: 2
Node addresses: 10.15.89.157

[root@link-07 cluster]# cat sm_debug
 count 3
00000003 remove node 4 count 2
0100017b remove node 3 count 3
0100017b remove node 4 count 2
01000179 remove node 3 count 3
01000179 remove node 4 count 2
00000003 remove node 1 count 1
0100017b remove node 1 count 1
01000179 remove node 1 count 1

Just another note... I have automated tests that do this sort of thing over and over, so it's surprising that this occurred the one time I did it by hand.

To date, this has only been seen one time. I think it's one of those fluke occurrences...

Looking at the code, it seems most likely that a NULL node structure address was passed into start_transition(), but it's really not clear to me how that can happen. Putting extra debugging into the code is almost certainly going to hide the problem, if it's even reproducible at all - which seems unlikely given what you've said.

I found one unchecked use of the node structure being passed into start_transition(). It's pretty unlikely, but this doesn't seem to be a common bug, so they might be related!

Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/membership.c,v  <--  membership.c
new revision: 1.44.2.27; previous revision: 1.44.2.26
done

Feel free, of course, to reopen this if it happens again.
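As a cross-check of the diagnostics above, here is a minimal Python sketch that ties the sm_debug entries back to the groups listed by cman_tool services and confirms why activity is blocked. It rests on two assumptions that this report does not confirm: that the low 24 bits of each sm_debug ID carry the service GID (the values happen to match: 0x17b = 379, 0x179 = 377, 0x3 = 3), and that quorum follows the simple majority rule expected_votes // 2 + 1, which agrees with the "Quorum: 3" shown in the status output. The names GID_MASK, decode_sm_debug, majority_quorum and is_quorate are made up for illustration and are not part of cman.

```python
# Values copied from the cman_tool services and sm_debug output in this report.
SERVICES = {
    3: 'Fence Domain "default"',
    377: 'DLM Lock Space "clvmd"',
    379: 'DLM Lock Space "clustered_log"',
}

SM_DEBUG = """\
 count 3
00000003 remove node 4 count 2
0100017b remove node 3 count 3
0100017b remove node 4 count 2
01000179 remove node 3 count 3
01000179 remove node 4 count 2
00000003 remove node 1 count 1
0100017b remove node 1 count 1
01000179 remove node 1 count 1
"""

# ASSUMPTION: the low 24 bits of the hex ID hold the service GID.
GID_MASK = 0x00FFFFFF


def decode_sm_debug(text):
    """Return (service name, removed node ID, count) for each complete entry."""
    events = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the truncated first entry (" count 3") and anything malformed.
        if len(fields) != 6 or fields[1:3] != ["remove", "node"]:
            continue
        gid = int(fields[0], 16) & GID_MASK
        service = SERVICES.get(gid, "unknown")
        events.append((service, int(fields[3]), int(fields[5])))
    return events


def majority_quorum(expected_votes):
    # ASSUMPTION: simple majority rule; matches "Quorum: 3" for 4 expected votes.
    return expected_votes // 2 + 1


def is_quorate(total_votes, expected_votes):
    return total_votes >= majority_quorum(expected_votes)


if __name__ == "__main__":
    for service, node, count in decode_sm_debug(SM_DEBUG):
        print(f"{service}: remove node {node} (count {count})")
    print("quorum needed:", majority_quorum(4))
    print("quorate with 1 vote:", is_quorate(1, 4))
```

Read this way, each of the three groups logged the removal of nodes 3, 4 and 1 in turn, and the surviving node holds 1 vote against a quorum of 3, which is consistent with the "Activity blocked" state and the groups sitting in "recover".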
Setting flags for 4.6.

Marking this bug verified as it hasn't been seen in over 7 months.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0990.html