Description of problem:

Heterogeneous cluster of HP DL580 G5 nodes (Intel 16 core, 64 GB and AMD 8 core, 72 GB) running RHEL 5.2 AP, recently updated. The cluster is formed using luci, the post-join timeout is changed, and fences (HP iLO) are added. In this state the cluster appears okay: the daemons appear to be up and there are no unusual entries in the messages file.

Upon a reboot, the nodes appear to join, then totem appears to fail on the rebooted node, which causes it to be fenced.

Lon discovered that with rgmanager disabled, the cluster rejoined and seemed stable. However, if rgmanager was then started by hand on the rebooted node, the node failed. Even with rgmanager disabled, a cluster.conf update was attempted, and the update appeared to cause the same or similar issues.

Nov 3 15:10:50 monet ccsd[8302]: Update of cluster.conf complete (version 5 -> 6).
Nov 3 15:11:07 monet openais[8309]: [TOTEM] The token was lost in the OPERATIONAL state.
Nov 3 15:11:07 monet openais[8309]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Nov 3 15:11:07 monet openais[8309]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Nov 3 15:11:07 monet openais[8309]: [TOTEM] entering GATHER state from 2.
Nov 3 15:11:12 monet openais[8309]: [TOTEM] entering GATHER state from 0.
Nov 3 15:11:12 monet openais[8309]: [TOTEM] Creating commit token because I am the rep.
Nov 3 15:11:12 monet openais[8309]: [TOTEM] Saving state aru 36 high seq received 36
Nov 3 15:11:12 monet openais[8309]: [TOTEM] Storing new sequence id for ring 8b8
Nov 3 15:11:12 monet openais[8309]: [TOTEM] entering COMMIT state.
Nov 3 15:11:12 monet openais[8309]: [TOTEM] entering RECOVERY state.
Nov 3 15:11:12 monet openais[8309]: [TOTEM] position [0] member 10.10.10.100:
Nov 3 15:11:12 monet openais[8309]: [TOTEM] previous ring seq 2228 rep 10.10.10.100
Nov 3 15:11:12 monet openais[8309]: [TOTEM] aru 36 high delivered 36 received flag 1
Nov 3 15:11:12 monet openais[8309]: [TOTEM] Did not need to originate any messages in recovery.
Nov 3 15:11:12 monet openais[8309]: [TOTEM] Sending initial ORF token
Nov 3 15:11:12 monet openais[8309]: [CLM ] CLM CONFIGURATION CHANGE
Nov 3 15:11:12 monet openais[8309]: [CLM ] New Configuration:
Nov 3 15:11:12 monet openais[8309]: [CLM ] r(0) ip(10.10.10.100)
Nov 3 15:11:12 monet openais[8309]: [CLM ] Members Left:
Nov 3 15:11:12 monet kernel: dlm: closing connection to node 1
Nov 3 15:11:12 monet openais[8309]: [CLM ] r(0) ip(10.10.10.102)
Nov 3 15:11:12 monet openais[8309]: [CLM ] Members Joined:
Nov 3 15:11:12 monet openais[8309]: [CLM ] CLM CONFIGURATION CHANGE
Nov 3 15:11:12 monet openais[8309]: [CLM ] New Configuration:
Nov 3 15:11:12 monet openais[8309]: [CLM ] r(0) ip(10.10.10.100)
Nov 3 15:11:12 monet openais[8309]: [CLM ] Members Left:
Nov 3 15:11:12 monet openais[8309]: [CLM ] Members Joined:
Nov 3 15:11:12 monet openais[8309]: [SYNC ] This node is within the primary component and will provide service.
Nov 3 15:11:12 monet openais[8309]: [TOTEM] entering OPERATIONAL state.
Nov 3 15:11:12 monet openais[8309]: [CLM ] got nodejoin message 10.10.10.100
Nov 3 15:11:12 monet openais[8309]: [CPG ] got joinlist message from node 2
Nov 3 15:11:12 monet fenced[8325]: renoir-ic.lab.bos.redhat.com not a cluster member after 0 sec post_fail_delay
Nov 3 15:11:12 monet fenced[8325]: fencing node "renoir-ic.lab.bos.redhat.com"
Nov 3 15:12:18 monet ccsd[8302]: Attempt to close an unopened CCS descriptor (25470).
Nov 3 15:12:18 monet ccsd[8302]: Error while processing disconnect: Invalid request descriptor
Nov 3 15:12:18 monet fenced[8325]: fence "renoir-ic.lab.bos.redhat.com" success
Nov 3 15:16:06 monet openais[8309]: [TOTEM] entering GATHER state from 11.
Nov 3 15:16:06 monet openais[8309]: [TOTEM] Creating commit token because I am the rep.
Nov 3 15:16:06 monet openais[8309]: [TOTEM] Saving state aru 11 high seq received 11

Nov 3 15:10:50 renoir ccsd[7221]: Update of cluster.conf complete (version 5 -> 6).
Nov 3 15:10:58 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov 3 15:10:58 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov 3 15:10:58 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov 3 15:10:58 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov 3 15:10:58 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov 3 15:10:59 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov 3 15:10:59 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov 3 15:10:59 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov 3 15:10:59 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov 3 15:11:00 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov 3 15:11:00 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov 3 15:11:00 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov 3 15:11:00 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov 3 15:11:01 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov 3 15:11:01 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov 3 15:11:01 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov 3 15:11:01 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov 3 15:11:02 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov 3 15:11:02 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov 3 15:11:02 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov 3 15:11:02 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov 3 15:11:03 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov 3 15:11:03 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov 3 15:11:03 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov 3 15:11:03 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov 3 15:11:04 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov 3 15:11:04 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov 3 15:11:04 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov 3 15:11:04 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov 3 15:11:05 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov 3 15:11:05 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov 3 15:11:05 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov 3 15:11:05 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov 3 15:11:06 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov 3 15:11:06 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov 3 15:11:06 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov 3 15:11:06 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov 3 15:11:07 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov 3 15:11:07 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov 3 15:11:12 renoir openais[7229]: [TOTEM] entering GATHER state from 0.
Nov 3 15:11:26 renoir openais[7229]: [TOTEM] The consensus timeout expired.
Nov 3 15:11:26 renoir openais[7229]: [TOTEM] entering GATHER state from 3.
Nov 3 15:11:32 renoir gnome-power-manager: (root) GNOME interactive logout because the power button has been pressed

Version-Release number of selected component (if applicable):

[root@renoir crash]# cat /etc/redhat-release ; uname -a
Red Hat Enterprise Linux Server release 5.2 (Tikanga)
Linux renoir.lab.bos.redhat.com 2.6.18-92.1.13.el5xen #1 SMP Thu Sep 4 04:07:08 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
[root@renoir crash]#

BTW - the 2 nodes have had guests (4 vCPU, 4G, RHEL 5.2) on them that have formed a stable cluster on which none of these issues have been seen.

How reproducible:

This instability was reproduced several times:
- with iptables and SELinux enabled and disabled
- using the existing packages and newly downloaded ones
- using a qdisk and not using a qdisk
- using other nodes (another DL580, similarly configured)
- using various interconnect switches (including one that is known to be functioning with other clusters)
- using manual fencing instead of the HP iLO

Steps to Reproduce:
1. form cluster
2. update post join timeout and fences
3. reboot a node

Actual results: Unstable cluster

Expected results: stable cluster, even with rgmanager running

Additional info:
Please indicate the version of the openais package you are using: -15 or a later version? Is remote access to the machines available for debugging? That would help quite a bit. I have never seen this kind of field failure before, but it indicates that the system is failing to receive any multicast messages for over 30 token rotations, which is significant. Running rgmanager just sends multicast messages, and that triggers the failed-to-receive state because those messages are not being received.
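To see how long a node stays in the failed-to-receive condition described above, the syslog excerpt can be summarized per host. The following is a minimal Python sketch, not part of openais: the function name `summarize_failed_to_receive` and the regular expression are assumptions made for illustration, based only on the log line layout shown in this report.

```python
import re
from collections import Counter

# Matches syslog lines in the layout seen in this report, for example:
#   Nov 3 15:10:58 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
# Groups: time of day, host name, message text after the [TOTEM] tag.
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+(\d\d:\d\d:\d\d)\s+(\S+)\s+openais\[\d+\]:\s+\[TOTEM\]\s+(.*)$"
)


def summarize_failed_to_receive(lines):
    """Return {host: (count, first_time, last_time)} for FAILED TO RECEIVE entries."""
    counts = Counter()
    first, last = {}, {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not an openais [TOTEM] line
        time_of_day, host, message = match.groups()
        if message.strip() != "FAILED TO RECEIVE":
            continue
        counts[host] += 1
        first.setdefault(host, time_of_day)  # keep the earliest occurrence
        last[host] = time_of_day             # overwrite with the latest
    return {host: (counts[host], first[host], last[host]) for host in counts}


# A few lines copied from the report above.
sample = [
    "Nov 3 15:10:58 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE",
    "Nov 3 15:10:58 renoir openais[7229]: [TOTEM] entering GATHER state from 6.",
    "Nov 3 15:10:59 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE",
    "Nov 3 15:11:07 monet openais[8309]: [TOTEM] The token was lost in the OPERATIONAL state.",
]
summary = summarize_failed_to_receive(sample)
print(summary)
```

Feeding it the full messages file from each node shows which host reports the condition and over what time window, which is easier to compare across nodes than the raw log.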
Version : AIS Executive Service RELEASE 'subrev 1358 version 0.80.3'
openais.x86_64 0.80.3-15.el5 installed

The machines are available; we will just need to co-ordinate a little.
The system has Xen bridging. Moving cman to run at runlevel 99 fixes the problem. Steve's coworker reported that this problem was introduced by a recent kernel upgrade, and his system only used virbr. This is really out of my domain of expertise; it would be helpful to have the person who wrote the xenbr magic code in the cman init script take a look at this issue and fix the compatibility problem. Regards -steve
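The workaround above depends on cman starting after the Xen bridge setup. A rough way to check the relative start order is to compare the SysV start links in a runlevel directory. The sketch below is an illustration only: it reads "runlevel 99" as start priority 99 (an assumption), the helper names are hypothetical, and the priorities in the example listing are invented, not taken from the affected systems.

```python
import re

# SysV start links are named "S" + two-digit priority + service name, e.g. "S21cman".
START_LINK_RE = re.compile(r"^S(\d{2})(.+)$")


def start_priorities(link_names):
    """Map service name -> start priority for every S-link in the listing."""
    priorities = {}
    for name in link_names:
        match = START_LINK_RE.match(name)
        if match:
            priorities[match.group(2)] = int(match.group(1))
    return priorities


def starts_after(priorities, service, other):
    """True if `service` is started after `other`.

    Links are run in lexical order of their names, so equal priorities
    are ordered by service name.
    """
    return (priorities[service], service) > (priorities[other], other)


# Hypothetical directory listing; the numbers are illustrative only.
example_links = ["S10network", "S21cman", "S98xend", "K01smartd"]
priorities = start_priorities(example_links)
print(starts_after(priorities, "cman", "xend"))
```

On a live system the listing could come from `os.listdir("/etc/rc3.d")`, assuming runlevel 3 is the one in use.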
The co-worker's problem does not seem to be related; changing his init sequence number did not change his issue. I also confirmed that if the configuration does not have the Xen network bridges configured, the cluster is stable. When the bridges were put back in place, the instability returned on the next reboot.
Apparently this is resolved in 5.3. Please try 5.3 and, if the problem persists, reopen the defect. Thanks.
Hello,

A customer with a similar issue did an update today (28/Jul/09) to RHEL 5.3 and the problem persists. The log shows this message many times:

Jul 28 08:12:50 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:50 server01 openais[2552]: [TOTEM] entering GATHER state from 6.

Could you please confirm what this message means? Is it about multicast messages? Here is the log:

Jul 28 08:12:34 server01 openais[2552]: [TOTEM] The token was lost in the OPERATIONAL state.
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] entering GATHER state from 2.
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] Creating commit token because I am the rep.
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] Saving state aru 3a high seq received 3a
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] Storing new sequence id for ring 944
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] entering COMMIT state.
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] entering RECOVERY state.
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] position [0] member 192.168.100.13:
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] previous ring seq 2368 rep 192.168.100.13
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] aru 3a high delivered 3a received flag 0
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] position [1] member 192.168.100.14:
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] previous ring seq 2368 rep 192.168.100.13
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] aru 3b high delivered 3b received flag 1
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] Did not need to originate any messages in recovery.
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] Sending initial ORF token
Jul 28 08:12:34 server01 openais[2552]: [CLM ] CLM CONFIGURATION CHANGE
Jul 28 08:12:34 server01 openais[2552]: [CLM ] New Configuration:
Jul 28 08:12:34 server01 openais[2552]: [CLM ] r(0) ip(192.168.100.13)
Jul 28 08:12:34 server01 openais[2552]: [CLM ] r(0) ip(192.168.100.14)
Jul 28 08:12:34 server01 openais[2552]: [CLM ] Members Left:
Jul 28 08:12:34 server01 openais[2552]: [CLM ] Members Joined:
Jul 28 08:12:34 server01 openais[2552]: [CLM ] CLM CONFIGURATION CHANGE
Jul 28 08:12:34 server01 openais[2552]: [CLM ] New Configuration:
Jul 28 08:12:34 server01 openais[2552]: [CLM ] r(0) ip(192.168.100.13)
Jul 28 08:12:34 server01 openais[2552]: [CLM ] r(0) ip(192.168.100.14)
Jul 28 08:12:34 server01 openais[2552]: [CLM ] Members Left:
Jul 28 08:12:34 server01 openais[2552]: [CLM ] Members Joined:
Jul 28 08:12:34 server01 openais[2552]: [SYNC ] This node is within the primary component and will provide service.
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] entering OPERATIONAL state.
Jul 28 08:12:34 server01 openais[2552]: [CLM ] got nodejoin message 192.168.100.13
Jul 28 08:12:41 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:41 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:41 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:41 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:42 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:42 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:42 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:42 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:43 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:43 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:43 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:43 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:44 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:44 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:44 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:44 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:45 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:45 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:45 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:45 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:46 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:46 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:46 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:46 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:47 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:47 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:47 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:47 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:48 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:48 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:48 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:48 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:49 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:49 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:49 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:49 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:50 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:50 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:50 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:50 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] Creating commit token because I am the rep.
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] Saving state aru 5 high seq received 5
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] Storing new sequence id for ring 948
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] entering COMMIT state.
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] entering RECOVERY state.
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] position [0] member 192.168.100.13:
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] previous ring seq 2372 rep 192.168.100.13
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] aru 5 high delivered 5 received flag 1
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] position [1] member 192.168.100.14:
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] previous ring seq 2372 rep 192.168.100.13
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] aru 2 high delivered 2 received flag 0
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] copying all old ring messages from 3-5.
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] Originated 3 messages in RECOVERY.
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] Originated for recovery: 3 4 5
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] Not Originated for recovery:
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] Sending initial ORF token

Thanks in advance.
Regards,
Florencia
My guess is that the customer's iptables rules were not set properly during the update. This error means that no multicast messages could be received by the receivers. Regards -steve
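Since the error is about multicast traffic not arriving, one way to separate a network problem (switch, bridge, or firewall) from an openais problem is to test multicast delivery between the two nodes with a small sender and receiver. The following Python sketch uses only the standard `socket` module. The group address `239.192.0.1` and port `50000` are hypothetical test values, not the ones the cluster uses, and this is not how openais itself sends or receives.

```python
import socket
import struct

GROUP = "239.192.0.1"  # hypothetical test group, not the cluster's address
PORT = 50000           # hypothetical test port, chosen to avoid the cluster's port


def membership_request(group, interface_ip="0.0.0.0"):
    """Build the 8-byte ip_mreq value: group address followed by local interface."""
    return socket.inet_aton(group) + socket.inet_aton(interface_ip)


def listen(group=GROUP, port=PORT, timeout=10.0):
    """Join `group` and wait for one datagram; return (sender_ip, data) or None."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group))
    sock.settimeout(timeout)
    try:
        data, sender = sock.recvfrom(1024)
        return sender[0], data
    except socket.timeout:
        return None  # nothing arrived within the timeout
    finally:
        sock.close()


def send(message=b"ping", group=GROUP, port=PORT, ttl=1):
    """Send one datagram to the multicast group; return the number of bytes sent."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # TTL 1 keeps the test traffic on the local network segment.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, struct.pack("b", ttl))
    try:
        return sock.sendto(message, (group, port))
    finally:
        sock.close()
```

Running `listen()` on one node and then `send()` on the other, and repeating in the opposite direction, shows whether multicast datagrams cross the interconnect at all. If `listen()` returns `None` in either direction, the switch, bridge, or firewall configuration is the more likely place to look.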
Thanks Steve for your quick reply. Iptables is disabled and the customer doesn't have a Cisco switch (which I read could have problems with multicast messages). The nodes are two blades in an IBM BladeCenter. Here is the information:

Chassis: BladeCenter-H
Description: BladeCenter-H
Machine Type/Model: 88524YU

Nodes:
Product Name: HS21-XM Blade Server, 2 dual- or quad-core Intel Xeon
Description: HS21 XM (Type 7995)

Could you please let me know whether the error with multicast messages is related to openais or is a hardware/configuration issue?

Thanks again.
Regards,
Florencia
They used the software with 5.2 and an upgrade to 5.3 caused this problem? I believe the BladeCenter uses Cisco switches internally. Usually this would be a hardware configuration issue with the switch or an iptables issue.
I reread the bugzilla and was curious: is the user using Xen bridging or virbr?
Hello,

> They used the software with 5.2 and an upgrade to 5.3 caused this problem?

No, the problem was in RHEL 5.2 and persists after the upgrade to RHEL 5.3.

> I reread bugzilla and was curious, is the user using xen bridging or virbr?

Neither of them. It's not using Xen.

Thanks
Hello, The switch included in the BladeCenter is "IBM Server Connectivity Module for BladeCenter 39y9324". Do you have any workaround to check, similar to the one for Cisco switches (http://kbase.redhat.com/faq/docs/DOC-5933)? I know it's not about openais, but perhaps you have seen this behavior before. Thanks in advance.
Hello, I saw that "IGMP Snooping" is enabled by default in the BladeCenter's switch, as in Cisco switches. (http://publib.boulder.ibm.com/infocenter/bladectr/documentation/topic/com.ibm.bladecenter.io_39Y9324.doc/31r1755.pdf) What are the switch requirements for openais to work properly? Thanks in advance.
First I'd check that the user doesn't have xend starting by default. Another option is to change the init script for cman to run at runlevel 99. If that doesn't work, then it could be a switch problem. We have docs on the product which describe the switch requirements, but I am not sure where they are. Ask Paul Kennedy. Regards -steve
Florencia, I don't know that we have a clear picture of the switch requirements. One thing you might try, if you haven't already, is to turn on fast port forwarding on the switch.
Florencia, Here is a relevant kbase article: http://kbase.redhat.com/faq/docs/DOC-5933
Thanks Steven for your help. We followed these guidelines and it seems to be working now:

Multicast Addresses
http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Cluster_Administration/s1-multicast-considerations-CA.html

Why are my Red Hat Enterprise Linux cluster nodes having problems communicating when connected to a Cisco Catalyst switch?
http://kbase.redhat.com/faq/docs/DOC-5933

Thanks and regards,
--
Florencia Fotorello
Global Support Services
Red Hat Latin America