Bug 469874

Summary:	Openais appears to fail, causing cluster member to fence
Product:	Red Hat Enterprise Linux 5	Reporter:	Steve Reichard <sreichar>
Component:	openais	Assignee:	Steven Dake <sdake>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Cluster QE <mspqa-list>
Severity:	high	Docs Contact:
Priority:	high
Version:	5.2	CC:	cluster-maint, edamato, ffotorel, tao
Target Milestone:	rc
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	5.3	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2009-01-20 20:40:00 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Steve Reichard 2008-11-04 16:01:17 UTC

Description of problem:

Heterogenous cluster of HP DL580 G5 (Intel 16 core, 64 GB and AMD 8 core 72 GB)
running RHEL 5.2 AP and recently updated

Cluster is formed using luci.   post-join timeout chagned.   fences added (HP ilo)

In this state the cluster appears okay, daemons appear up and no unusual entried in the messages file.   

Upon a reboot, nodes appear to join,  then totem appears to fail on the rebooted node, which cuases it to be fenced.

Lon discovered that upon disabling rgmanager, the cluster rejoined and seemed stable.  However if rgmanager was started by hand on the rebooted node, it then failed.

Even with rgmanager disabled,  a cluster.conf update was attempted, and it appeared that the updated caused the same or similar issues.

Nov  3 15:10:50 monet ccsd[8302]: Update of cluster.conf complete (version 5 -> 6).
Nov  3 15:11:07 monet openais[8309]: [TOTEM] The token was lost in the OPERATIONAL state.
Nov  3 15:11:07 monet openais[8309]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Nov  3 15:11:07 monet openais[8309]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Nov  3 15:11:07 monet openais[8309]: [TOTEM] entering GATHER state from 2.
Nov  3 15:11:12 monet openais[8309]: [TOTEM] entering GATHER state from 0.
Nov  3 15:11:12 monet openais[8309]: [TOTEM] Creating commit token because I am the rep.
Nov  3 15:11:12 monet openais[8309]: [TOTEM] Saving state aru 36 high seq received 36
Nov  3 15:11:12 monet openais[8309]: [TOTEM] Storing new sequence id for ring 8b8
Nov  3 15:11:12 monet openais[8309]: [TOTEM] entering COMMIT state.
Nov  3 15:11:12 monet openais[8309]: [TOTEM] entering RECOVERY state.
Nov  3 15:11:12 monet openais[8309]: [TOTEM] position [0] member 10.10.10.100:
Nov  3 15:11:12 monet openais[8309]: [TOTEM] previous ring seq 2228 rep 10.10.10.100
Nov  3 15:11:12 monet openais[8309]: [TOTEM] aru 36 high delivered 36 received flag 1
Nov  3 15:11:12 monet openais[8309]: [TOTEM] Did not need to originate any messages in recovery.
Nov  3 15:11:12 monet openais[8309]: [TOTEM] Sending initial ORF token
Nov  3 15:11:12 monet openais[8309]: [CLM  ] CLM CONFIGURATION CHANGE
Nov  3 15:11:12 monet openais[8309]: [CLM  ] New Configuration:
Nov  3 15:11:12 monet openais[8309]: [CLM  ]    r(0) ip(10.10.10.100)
Nov  3 15:11:12 monet openais[8309]: [CLM  ] Members Left:
Nov  3 15:11:12 monet kernel: dlm: closing connection to node 1
Nov  3 15:11:12 monet openais[8309]: [CLM  ]    r(0) ip(10.10.10.102)
Nov  3 15:11:12 monet openais[8309]: [CLM  ] Members Joined:
Nov  3 15:11:12 monet openais[8309]: [CLM  ] CLM CONFIGURATION CHANGE
Nov  3 15:11:12 monet openais[8309]: [CLM  ] New Configuration:
Nov  3 15:11:12 monet openais[8309]: [CLM  ]    r(0) ip(10.10.10.100)
Nov  3 15:11:12 monet openais[8309]: [CLM  ] Members Left:
Nov  3 15:11:12 monet openais[8309]: [CLM  ] Members Joined:
Nov  3 15:11:12 monet openais[8309]: [SYNC ] This node is within the primary component and will provide service.
Nov  3 15:11:12 monet openais[8309]: [TOTEM] entering OPERATIONAL state.
Nov  3 15:11:12 monet openais[8309]: [CLM  ] got nodejoin message 10.10.10.100
Nov  3 15:11:12 monet openais[8309]: [CPG  ] got joinlist message from node 2
Nov  3 15:11:12 monet fenced[8325]: renoir-ic.lab.bos.redhat.com not a cluster member after 0 sec post_fail_delay
Nov  3 15:11:12 monet fenced[8325]: fencing node "renoir-ic.lab.bos.redhat.com"
Nov  3 15:12:18 monet ccsd[8302]: Attempt to close an unopened CCS descriptor (25470).
Nov  3 15:12:18 monet ccsd[8302]: Error while processing disconnect: Invalid request descriptor
Nov  3 15:12:18 monet fenced[8325]: fence "renoir-ic.lab.bos.redhat.com" success
Nov  3 15:16:06 monet openais[8309]: [TOTEM] entering GATHER state from 11.
Nov  3 15:16:06 monet openais[8309]: [TOTEM] Creating commit token because I am the rep.
Nov  3 15:16:06 monet openais[8309]: [TOTEM] Saving state aru 11 high seq received 11


Nov  3 15:10:50 renoir ccsd[7221]: Update of cluster.conf complete (version 5 -> 6).
Nov  3 15:10:58 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov  3 15:10:58 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov  3 15:10:58 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov  3 15:10:58 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov  3 15:10:58 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov  3 15:10:59 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov  3 15:10:59 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov  3 15:10:59 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov  3 15:10:59 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov  3 15:11:00 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov  3 15:11:00 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov  3 15:11:00 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov  3 15:11:00 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov  3 15:11:01 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov  3 15:11:01 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov  3 15:11:01 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov  3 15:11:01 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov  3 15:11:02 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov  3 15:11:02 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov  3 15:11:02 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov  3 15:11:02 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov  3 15:11:03 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov  3 15:11:03 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov  3 15:11:03 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov  3 15:11:03 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov  3 15:11:04 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov  3 15:11:04 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov  3 15:11:04 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov  3 15:11:04 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov  3 15:11:05 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov  3 15:11:05 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov  3 15:11:05 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov  3 15:11:05 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov  3 15:11:06 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov  3 15:11:06 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov  3 15:11:06 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov  3 15:11:06 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov  3 15:11:07 renoir openais[7229]: [TOTEM] FAILED TO RECEIVE
Nov  3 15:11:07 renoir openais[7229]: [TOTEM] entering GATHER state from 6.
Nov  3 15:11:12 renoir openais[7229]: [TOTEM] entering GATHER state from 0.
Nov  3 15:11:26 renoir openais[7229]: [TOTEM] The consensus timeout expired.
Nov  3 15:11:26 renoir openais[7229]: [TOTEM] entering GATHER state from 3.
Nov  3 15:11:32 renoir gnome-power-manager: (root) GNOME interactive logout because the power button has been pressed




Version-Release number of selected component (if applicable):

[root@renoir crash]# cat /etc/redhat-release ;  uname -a
Red Hat Enterprise Linux Server release 5.2 (Tikanga)
Linux renoir.lab.bos.redhat.com 2.6.18-92.1.13.el5xen #1 SMP Thu Sep 4 04:07:08 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
[root@renoir crash]# 


BTW - the 2 nodes have had guests (4 vCPU, 4G, RHEL 5.2) on them that have formed a stable cluster which none of these issues have been seen.

How reproducible:
This instability was reproduced several times:
   iptables and SElunix enabled and disabled
   using existing packages and downloading
   using a qdisk and not using a qdisk
   Using other nodes (another DL580, similarly configured)
   Using various interconnect switches (including one that is known to be functioning with other clusters)
   Using manual fencing instead of the HP ilo




Steps to Reproduce:
1.  form cluster
2.  update post join timeout and fences
3.  reboot a node
  
Actual results:
Unstable cluster

Expected results:
stable cluster, even with rgmanager running

Additional info:

Comment 1 Steven Dake 2008-11-04 22:50:34 UTC

Please indicate the version of the openais package you are using.  -15 or a later version?

Is there remote access available to the machines to debug?  That would help quite a bit.

I have never seen this kind of field failure before but it indicates the system is failing to receive any multicast messages for over 30 token rotations which is signifnicant.  Rgmanager running just sends multicast messages which triggers the failed to receive state because of lack of received messages.

Comment 2 Steve Reichard 2008-11-05 14:17:01 UTC

Version :

 AIS Executive Service RELEASE 'subrev 1358 version 0.80.3' 
openais.x86_64                           0.80.3-15.el5          installed       


machines are available, just will need to co-ordinate a little

Comment 3 Steven Dake 2008-11-10 16:57:51 UTC

system has xen bridging.  Moving cman to run at runlevel 99 fixes problem.  Steve's coworker reported in a recent kernel upgrade this problem was introduced and his system only used virbr.

This is really out of my domain of expertise - would be helpful to have the person that did the xenbr magic code in cman init script take a look at this issue and fix the compatability problem.

Regards
-steve

Comment 4 Steve Reichard 2008-11-10 22:52:39 UTC

The co-workers problems does not seem to be related, changing his init sequence number did not change his issue.

Also confirmed that if the configuration does not have the Xen network bridges configured, the cluster was stable.  Put the bridges back in place and the instability returned on the next reboot.

Comment 5 Steven Dake 2009-01-20 20:40:00 UTC

apparently this is resolved in 5.3.  Please try 5.3 and if problem persists, reopen defect.

Thanks.

Comment 6 Florencia Fotorello 2009-07-28 15:45:56 UTC

Hello,

A customer with a similar issue did an update today (28/Jul/09) to RHEL 5.3 and the problem persists.

The log shows this message many times:

Jul 28 08:12:50 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE  
Jul 28 08:12:50 server01 openais[2552]: [TOTEM] entering GATHER state from 6.  

Could you please confirm what this message means? Is is regarding multicast messages?


Here are the log:

Jul 28 08:12:34 server01 openais[2552]: [TOTEM] The token was lost in the OPERATIONAL state.  
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).  
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).  
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] entering GATHER state from 2.  
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] Creating commit token because I am the rep.  
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] Saving state aru 3a high seq received 3a  
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] Storing new sequence id for ring 944  
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] entering COMMIT state.  
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] entering RECOVERY state.  
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] position [0] member 192.168.100.13:  
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] previous ring seq 2368 rep 192.168.100.13  
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] aru 3a high delivered 3a received flag 0  
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] position [1] member 192.168.100.14:  
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] previous ring seq 2368 rep 192.168.100.13  
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] aru 3b high delivered 3b received flag 1  
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] Did not need to originate any messages in recovery.  
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] Sending initial ORF token  
Jul 28 08:12:34 server01 openais[2552]: [CLM  ] CLM CONFIGURATION CHANGE  
Jul 28 08:12:34 server01 openais[2552]: [CLM  ] New Configuration:  
Jul 28 08:12:34 server01 openais[2552]: [CLM  ] r(0) ip(192.168.100.13)  
Jul 28 08:12:34 server01 openais[2552]: [CLM  ] r(0) ip(192.168.100.14)  
Jul 28 08:12:34 server01 openais[2552]: [CLM  ] Members Left:  
Jul 28 08:12:34 server01 openais[2552]: [CLM  ] Members Joined:  
Jul 28 08:12:34 server01 openais[2552]: [CLM  ] CLM CONFIGURATION CHANGE  
Jul 28 08:12:34 server01 openais[2552]: [CLM  ] New Configuration:  
Jul 28 08:12:34 server01 openais[2552]: [CLM  ] r(0) ip(192.168.100.13)  
Jul 28 08:12:34 server01 openais[2552]: [CLM  ] r(0) ip(192.168.100.14)  
Jul 28 08:12:34 server01 openais[2552]: [CLM  ] Members Left:  
Jul 28 08:12:34 server01 openais[2552]: [CLM  ] Members Joined:  
Jul 28 08:12:34 server01 openais[2552]: [SYNC ] This node is within the primary component and will provide service.  
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] entering OPERATIONAL state.  
Jul 28 08:12:34 server01 openais[2552]: [CLM  ] got nodejoin message 192.168.100.13  
Jul 28 08:12:41 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE  
Jul 28 08:12:41 server01 openais[2552]: [TOTEM] entering GATHER state from 6.  
Jul 28 08:12:41 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE  
Jul 28 08:12:41 server01 openais[2552]: [TOTEM] entering GATHER state from 6.  
Jul 28 08:12:42 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE  
Jul 28 08:12:42 server01 openais[2552]: [TOTEM] entering GATHER state from 6.  
Jul 28 08:12:42 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE  
Jul 28 08:12:42 server01 openais[2552]: [TOTEM] entering GATHER state from 6.  
Jul 28 08:12:43 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE  
Jul 28 08:12:43 server01 openais[2552]: [TOTEM] entering GATHER state from 6.  
Jul 28 08:12:43 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE  
Jul 28 08:12:43 server01 openais[2552]: [TOTEM] entering GATHER state from 6.  
Jul 28 08:12:44 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE  
Jul 28 08:12:44 server01 openais[2552]: [TOTEM] entering GATHER state from 6.  
Jul 28 08:12:44 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE  
Jul 28 08:12:44 server01 openais[2552]: [TOTEM] entering GATHER state from 6.  
Jul 28 08:12:45 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE  
Jul 28 08:12:45 server01 openais[2552]: [TOTEM] entering GATHER state from 6.  
Jul 28 08:12:45 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE  
Jul 28 08:12:45 server01 openais[2552]: [TOTEM] entering GATHER state from 6.  
Jul 28 08:12:46 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE  
Jul 28 08:12:46 server01 openais[2552]: [TOTEM] entering GATHER state from 6.  
Jul 28 08:12:46 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE  
Jul 28 08:12:46 server01 openais[2552]: [TOTEM] entering GATHER state from 6.  
Jul 28 08:12:47 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE  
Jul 28 08:12:47 server01 openais[2552]: [TOTEM] entering GATHER state from 6.  
Jul 28 08:12:47 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE  
Jul 28 08:12:47 server01 openais[2552]: [TOTEM] entering GATHER state from 6.  
Jul 28 08:12:48 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE  
Jul 28 08:12:48 server01 openais[2552]: [TOTEM] entering GATHER state from 6.  
Jul 28 08:12:48 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE  
Jul 28 08:12:48 server01 openais[2552]: [TOTEM] entering GATHER state from 6.  
Jul 28 08:12:49 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE  
Jul 28 08:12:49 server01 openais[2552]: [TOTEM] entering GATHER state from 6.  
Jul 28 08:12:49 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE  
Jul 28 08:12:49 server01 openais[2552]: [TOTEM] entering GATHER state from 6.  
Jul 28 08:12:50 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE  
Jul 28 08:12:50 server01 openais[2552]: [TOTEM] entering GATHER state from 6.  
Jul 28 08:12:50 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE  
Jul 28 08:12:50 server01 openais[2552]: [TOTEM] entering GATHER state from 6.  
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] Creating commit token because I am the rep.  
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] Saving state aru 5 high seq received 5  
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] Storing new sequence id for ring 948  
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] entering COMMIT state.  
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] entering RECOVERY state.  
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] position [0] member 192.168.100.13:  
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] previous ring seq 2372 rep 192.168.100.13  
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] aru 5 high delivered 5 received flag 1  
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] position [1] member 192.168.100.14:  
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] previous ring seq 2372 rep 192.168.100.13  
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] aru 2 high delivered 2 received flag 0  
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] copying all old ring messages from 3-5.  
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] Originated 3 messages in RECOVERY.  
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] Originated for recovery: 3 4 5  
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] Not Originated for recovery:  
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] Sending initial ORF token  


Thanks in advance.

Regards,

Florencia

Comment 7 Steven Dake 2009-07-28 16:09:05 UTC

My guess is the customer's iptable rules were not set properly during the update.

This error means that no multicast messages could be received by the receivers.

Regards
-steve

Comment 8 Florencia Fotorello 2009-07-28 17:17:29 UTC

Thanks Steve for your quick reply.

Iptables is disabled and customer hasn't a Cisco Swith (which I read could have problems with multicast messages). 

The nodes are two Blades from an IBM Blade Center. Here is the information:

Chasis: BladeCenter-H.
 Description    BladeCenter-H  
 Machine Type/Model    88524YU   
Nodes:
Product Name    HS21-XM Blade Server, 2 dual- or quad-core Intel Xeon  
 Description    HS21 XM (Type 7995)   

Could you please let me know if the error with multicast messages is related with openais or is a hardware/configuration issue?

Thanks again.

Regards,

Florencia

Comment 9 Steven Dake 2009-07-28 17:24:09 UTC

THey used the software with 5.2 and an upgrade to 5.3 caused this problem?  I believe the bladecenter uses cisco switches internally.

Usually this would be a hardware configuration issue with the switch or a iptables issue.

Comment 10 Steven Dake 2009-07-28 17:26:51 UTC

I reread bugzilla and was curious, is the user using xen briding or virbr?

Comment 11 Florencia Fotorello 2009-07-28 18:35:18 UTC

Hello,

> THey used the software with 5.2 and an upgrade to 5.3 caused this problem? 
No, the problem was in RHEL 5.2 and persists after the upgrade to RHEL 5.3.

> I reread bugzilla and was curious, is the user using xen briding or virbr?  
Neither of them. It's not using Xen.

Thanks

Comment 12 Florencia Fotorello 2009-07-29 17:48:12 UTC

Hello,

The switch included in the BlaceCenter is "IBM Server Connectivity Module for BladeCenter 39y9324". 

Do you have any workaround to check, similar to the one for Cisco Switches (http://kbase.redhat.com/faq/docs/DOC-5933)?

I know it's not regarding openais, but perhaps you saw this behavior before.

Thanks in advance.

Comment 13 Florencia Fotorello 2009-07-30 13:06:59 UTC

Hello,

I saw that "IGMP Snoopping" is enable by default in BladeCenter's switch, as in Cisco switches. (http://publib.boulder.ibm.com/infocenter/bladectr/documentation/topic/com.ibm.bladecenter.io_39Y9324.doc/31r1755.pdf)

Which are the switch's requirements for openais works properly?

Thanks in advance.

Comment 14 Steven Dake 2009-07-30 17:02:40 UTC

First i'd check that the user doesn't have xend starting by default.  Another option is to change the init script for cman to run at runlevel 99.  If that doesn't work, then it could be a switch problem.

We have docs on the product which describe the switch requirements, but I am not sure where they are.  Ask paul kennedy.

Regards
-steve

Comment 15 Steven Dake 2009-08-05 20:41:38 UTC

Florencia,
I don't know that we have a clear picture of the switch reqirements.  One thing you might try if you haven't is to turn on fast port forwarding on the switch.

Comment 16 Steven Dake 2009-08-05 20:50:04 UTC

Florencia,

Here is a relevant kbase article:
http://kbase.redhat.com/faq/docs/DOC-5933

Comment 17 Florencia Fotorello 2009-08-05 21:03:17 UTC

Thanks Steven for your help.

We follow this guidelines and it seems it's working now:

Multicast Addresses
http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Cluster_Administration/s1-multicast-considerations-CA.html

Why are my Red Hat Enterprise Linux cluster nodes having problems communicating when connected to a Cisco Catalyst switch?
http://kbase.redhat.com/faq/docs/DOC-5933  

Thanks and regards,

--
Florencia Fotorello
Global Support Services
Red Hat Latin America