Use case: create 4 nodes on a cluster with cman & AIS, using a redundant ring configuration.
- Break the network to isolate one node. (Issue: cman does not exit on the node that lost quorum.)
- Re-establish the network. (Issue: the 3 surviving nodes exit, not the 1 node that rejoined.)

Packages used: cman-2.0.98-1.el5_3.1.hotfix.2 openais-0.80.3-22.el5_3.7
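For reference, the "break the network" step in this reproduction can be driven with iptables. A minimal sketch, assuming the cluster interconnect is on eth1 (the interface name and the IPT dry-run hook are assumptions, not part of the original test):

```shell
#!/bin/sh
# Hedged repro sketch: isolate a node by dropping all traffic on the
# cluster interface, then restore it. Requires root when run for real.
IPT=${IPT:-iptables}      # set IPT="echo iptables" for a dry run

break_net() {             # break_net <iface>: isolate this node
    $IPT -I INPUT  -i "$1" -j DROP
    $IPT -I OUTPUT -o "$1" -j DROP
}

restore_net() {           # restore_net <iface>: remove the DROP rules again
    $IPT -D INPUT  -i "$1" -j DROP
    $IPT -D OUTPUT -o "$1" -j DROP
}
```

Usage would be `break_net eth1`, wait for the other nodes to notice the missing member, then `restore_net eth1` and watch what cman does on rejoin.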
Need logs from all nodes in the cluster
Also, for reproducing this issue... when you say 'break the network' are you breaking all of the rings for a given node? i.e. w/ redundant ring there would be multiple network connections for each node. Or is only a single link/ring getting pulled?
Redundant ring is totally unsupported and untested software. Is it possible to test this without RRP enabled? Not only will it eliminate a potentially huge variable, but if the problem persists it will simplify the logs hugely, I suspect.
This isn't a redundant ring configuration - just a single ring using a dedicated network interface on each node. I'll provide the logs shortly.
{From issue} I'm still having the issues with cman, and I think it's related to a multicast issue we're seeing on the switch. Essentially, one host in the cluster keeps dropping in and out of the IGMP snooping configuration on the switch, which causes it to drop in and out of the cluster. When it drops out, it is correctly shown as down in cman_tool; when it comes back, the rest of the cluster commits suicide. The logs from the rest of the cluster are essentially identical to the one I sent before.
The cluster config:

<?xml version="1.0"?>
<cluster config_version="11" name="testcluster">
  <clusternodes>
    <clusternode name="lnaiqlv21-cl2" votes="1" nodeid="1"> </clusternode>
    <clusternode name="lnaiqlv22-cl2" votes="1" nodeid="2"> </clusternode>
    <clusternode name="lnaiqlv23-cl2" votes="1" nodeid="3"> </clusternode>
    <clusternode name="lnaiqlv24-cl2" votes="1" nodeid="4"> </clusternode>
  </clusternodes>
  <cman port="5405">
    <multicast addr="239.255.255.1"/>
  </cman>
  <fencedevices/>
  <rm/>
  <totem version="2" secauth="off" threads="0"/> <!-- rrp_mode="active"/> -->
  <logging/>
  <amf mode="disabled"/>
  <event/>
  <aisexec/>
  <group/>
</cluster>

The log:

Jul 21 12:57:19 lnaiqlv21 openais[20947]: [TOTEM] Sending initial ORF token
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM ] CLM CONFIGURATION CHANGE
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM ] New Configuration:
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM ] r(0) ip(10.229.21.244)
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM ] r(0) ip(10.229.21.245)
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM ] r(0) ip(10.229.21.246)
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM ] Members Left:
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM ] Members Joined:
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM ] CLM CONFIGURATION CHANGE
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM ] New Configuration:
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM ] r(0) ip(10.229.21.244)
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM ] r(0) ip(10.229.21.245)
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM ] r(0) ip(10.229.21.246)
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM ] r(0) ip(10.229.21.247)
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM ] Members Left:
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM ] Members Joined:
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [CLM ] r(0) ip(10.229.21.247)
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [SYNC ] This node is within the primary component and will provide service.
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [TOTEM] entering OPERATIONAL state.
Jul 21 12:57:19 lnaiqlv21 openais[20947]: [MAIN ] Killing node lnaiqlv24-cl2 because it has rejoined the cluster without cman_tool join
Jul 21 12:57:29 lnaiqlv21 openais[20947]: [TOTEM] The token was lost in the OPERATIONAL state.
Jul 21 12:57:29 lnaiqlv21 openais[20947]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Jul 21 12:57:29 lnaiqlv21 openais[20947]: [TOTEM] Transmit multicast socket send buffer size (288000 bytes).
Jul 21 12:57:29 lnaiqlv21 openais[20947]: [TOTEM] entering GATHER state from 2.
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] entering GATHER state from 11.
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] Creating commit token because I am the rep.
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] Saving state aru 6 high seq received 6
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] Storing new sequence id for ring 188d4
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] entering COMMIT state.
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] entering RECOVERY state.
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] position [0] member 10.229.21.244:
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] previous ring seq 100560 rep 10.229.21.244
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] aru 6 high delivered 6 received flag 1
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] position [1] member 10.229.21.245:
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] previous ring seq 100560 rep 10.229.21.244
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] aru 6 high delivered 6 received flag 1
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] position [2] member 10.229.21.246:
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] previous ring seq 100560 rep 10.229.21.244
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] aru 6 high delivered 6 received flag 1
Jul 21 12:57:34 lnaiqlv21 openais[20947]: CMAN: Joined a cluster with disallowed nodes. must die
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] Did not need to originate any messages in recovery.
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] Sending initial ORF token
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM ] CLM CONFIGURATION CHANGE
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM ] New Configuration:
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM ] r(0) ip(10.229.21.244)
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM ] r(0) ip(10.229.21.245)
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM ] r(0) ip(10.229.21.246)
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM ] Members Left:
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM ] r(0) ip(10.229.21.247)
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM ] Members Joined:
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM ] CLM CONFIGURATION CHANGE
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM ] New Configuration:
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM ] r(0) ip(10.229.21.244)
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM ] r(0) ip(10.229.21.245)
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM ] r(0) ip(10.229.21.246)
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM ] Members Left:
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM ] Members Joined:
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [SYNC ] This node is within the primary component and will provide service.
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [TOTEM] entering OPERATIONAL state.
Jul 21 12:57:34 lnaiqlv21 openais[20947]: [CLM ] got nodejoin message 10.229.21.245
Jul 21 12:57:34 lnaiqlv21 dlm_controld[20972]: cluster is down, exiting
Jul 21 12:57:34 lnaiqlv21 gfs_controld[20978]: groupd_dispatch error -1 errno 11
Jul 21 12:57:34 lnaiqlv21 fenced[20966]: groupd is down, exiting
Jul 21 12:57:34 lnaiqlv21 kernel: dlm: closing connection to node 3
Jul 21 12:57:34 lnaiqlv21 gfs_controld[20978]: groupd connection died
Jul 21 12:57:34 lnaiqlv21 kernel: dlm: closing connection to node 2
Jul 21 12:57:34 lnaiqlv21 gfs_controld[20978]: cluster is down, exiting
Jul 21 12:57:34 lnaiqlv21 kernel: dlm: closing connection to node 1
Jul 21 12:58:01 lnaiqlv21 ccsd[20939]: Unable to connect to cluster infrastructure after 30 seconds.
Jul 21 12:58:32 lnaiqlv21 ccsd[20939]: Unable to connect to cluster infrastructure after 60 seconds.
packages used: cman-2.0.98-1.el5_3.4 openais-0.80.3-22.el5_3.8
I managed to make this happen using the STABLE3 code on Fedora 11. I'll go through the logs in detail on Monday.
Committed to the RHEL55 branch of git.

commit 34bccfffdb35f368a72e2fa6859f15f6e8f9ebb8
Author: Christine Caulfield <ccaulfie>
Date:   Wed Jul 29 11:17:47 2009 +0100

    cman: Fix a situation where cman could kill the wrong nodes
Chrissie, I've written up a new test to cover this bug and I would like to know if we should be covering both the INPUT and the OUTPUT cases (where we put the DROP iptables rule in either chain)?
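To make the question concrete, here is a hedged sketch of the two variants. The interface name eth1 is an assumption and port 5405 is taken from the cluster config in this report; the drop_totem helper itself is hypothetical, not part of the actual test:

```shell
#!/bin/sh
# Hypothetical helper showing the two fault-injection variants under discussion.
IPT=${IPT:-iptables}      # set IPT="echo iptables" for a dry run

drop_totem() {            # drop_totem input|output
    case "$1" in
    input)
        # INPUT case: this node stops hearing the cluster, but its own
        # multicasts still reach the other nodes.
        $IPT -I INPUT -i eth1 -p udp --dport 5405 -j DROP ;;
    output)
        # OUTPUT case: the cluster stops hearing this node, but this node
        # still receives the others' traffic.
        $IPT -I OUTPUT -o eth1 -p udp --dport 5405 -j DROP ;;
    esac
}
```

Since the two chains produce asymmetric partitions (and may therefore exercise different membership transitions), covering both seems worthwhile.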
Created attachment 398643 [details]
Test output including log captures during test case.

I'm still hitting some problems when running this on higher node counts. At times I get multiple partitions in cman with the rest of the nodes in openais membership as disallowed:

============================================================
Iteration 1: west-01 OUTPUT
============================================================
Setting up log capture: west-01 west-02 west-03 west-04 west-05 west-06 west-07 west-08
Stopping traffic from west-01
Waiting for other nodes to notice.
Restarting traffic from west-01
Waiting up to 60 seconds for things to blow up
west-01 killed by node 2 because it joined without a full restart
west-03 killing west-01 because it has rejoined the cluster with exisiting state
west-02 killing west-01 because it has rejoined the cluster with exisiting state
west-05 killing west-01 because it has rejoined the cluster with exisiting state
west-06 killing west-01 because it has rejoined the cluster with exisiting state
west-04 killing west-01 because it has rejoined the cluster with exisiting state
Error while checking for missing node
Cluster state - rows are 'cman_tool nodes' output from that node
        west-01 west-02 west-03 west-04 west-05 west-06 west-07 west-08
========================================================================
west-01
west-02 X       M       *d      *d      *d      *d      *d      *d
west-03 X       *d      M       M       M       M       *d      *d
west-04 X       *d      M       M       M       M       *d      *d
west-05 X       *d      M       M       M       M       *d      *d
west-06 X       *d      M       M       M       M       *d      *d
west-07 X       *d      *d      *d      *d      *d      M       M
west-08 X       *d      *d      *d      *d      *d      M       M
unexpected states marked with *
The disallowed state is generally not part of this bug. If we need to tune openais for higher node counts, that should really be in a separate BZ. I managed to get 32 nodes working, but there will very likely be workloads that break at lower node counts.
it might also be related to https://bugzilla.redhat.com/show_bug.cgi?id=556804
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2010-0266.html