Bug 469874
| Summary: | Openais appears to fail, causing cluster member to fence | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Steve Reichard <sreichar> |
| Component: | openais | Assignee: | Steven Dake <sdake> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 5.2 | CC: | cluster-maint, edamato, ffotorel, tao |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | 5.3 | Doc Type: | Bug Fix |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2009-01-20 20:40:00 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Steve Reichard
2008-11-04 16:01:17 UTC
Please indicate the version of the openais package you are using: -15, or a later version? Is there remote access available to the machines to debug? That would help quite a bit. I have never seen this kind of field failure before, but it indicates the system is failing to receive any multicast messages for over 30 token rotations, which is significant. Rgmanager just sends multicast messages, which triggers the failed-to-receive state because of the lack of received messages.

Version: AIS Executive Service RELEASE 'subrev 1358 version 0.80.3'; openais.x86_64 0.80.3-15.el5 is installed. The machines are available, we will just need to coordinate a little. The system has Xen bridging. Moving cman to run at runlevel 99 fixes the problem. Steve's coworker reported that this problem was introduced by a recent kernel upgrade, and his system only used virbr.

This is really out of my domain of expertise. It would be helpful to have the person that did the xenbr magic code in the cman init script take a look at this issue and fix the compatibility problem. Regards -steve

The coworker's problem does not seem to be related; changing his init sequence number did not change his issue. Also confirmed that if the configuration does not have the Xen network bridges configured, the cluster is stable. Put the bridges back in place and the instability returned on the next reboot.

Apparently this is resolved in 5.3. Please try 5.3 and, if the problem persists, reopen the defect. Thanks.

Hello, a customer with a similar issue updated to RHEL 5.3 today (28/Jul/09) and the problem persists. The log shows this message many times:

```
Jul 28 08:12:50 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:50 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
```

Could you please confirm what this message means? Is it regarding multicast messages? Here is the log:

```
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] The token was lost in the OPERATIONAL state.
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] entering GATHER state from 2.
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] Creating commit token because I am the rep.
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] Saving state aru 3a high seq received 3a
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] Storing new sequence id for ring 944
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] entering COMMIT state.
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] entering RECOVERY state.
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] position [0] member 192.168.100.13:
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] previous ring seq 2368 rep 192.168.100.13
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] aru 3a high delivered 3a received flag 0
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] position [1] member 192.168.100.14:
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] previous ring seq 2368 rep 192.168.100.13
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] aru 3b high delivered 3b received flag 1
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] Did not need to originate any messages in recovery.
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] Sending initial ORF token
Jul 28 08:12:34 server01 openais[2552]: [CLM  ] CLM CONFIGURATION CHANGE
Jul 28 08:12:34 server01 openais[2552]: [CLM  ] New Configuration:
Jul 28 08:12:34 server01 openais[2552]: [CLM  ]     r(0) ip(192.168.100.13)
Jul 28 08:12:34 server01 openais[2552]: [CLM  ]     r(0) ip(192.168.100.14)
Jul 28 08:12:34 server01 openais[2552]: [CLM  ] Members Left:
Jul 28 08:12:34 server01 openais[2552]: [CLM  ] Members Joined:
Jul 28 08:12:34 server01 openais[2552]: [CLM  ] CLM CONFIGURATION CHANGE
Jul 28 08:12:34 server01 openais[2552]: [CLM  ] New Configuration:
Jul 28 08:12:34 server01 openais[2552]: [CLM  ]     r(0) ip(192.168.100.13)
Jul 28 08:12:34 server01 openais[2552]: [CLM  ]     r(0) ip(192.168.100.14)
Jul 28 08:12:34 server01 openais[2552]: [CLM  ] Members Left:
Jul 28 08:12:34 server01 openais[2552]: [CLM  ] Members Joined:
Jul 28 08:12:34 server01 openais[2552]: [SYNC ] This node is within the primary component and will provide service.
Jul 28 08:12:34 server01 openais[2552]: [TOTEM] entering OPERATIONAL state.
Jul 28 08:12:34 server01 openais[2552]: [CLM  ] got nodejoin message 192.168.100.13
Jul 28 08:12:41 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:41 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:41 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:41 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:42 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:42 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:42 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:42 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:43 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:43 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:43 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:43 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:44 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:44 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:44 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:44 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:45 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:45 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:45 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:45 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:46 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:46 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:46 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:46 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:47 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:47 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:47 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:47 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:48 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:48 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:48 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:48 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:49 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:49 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:49 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:49 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:50 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:50 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:50 server01 openais[2552]: [TOTEM] FAILED TO RECEIVE
Jul 28 08:12:50 server01 openais[2552]: [TOTEM] entering GATHER state from 6.
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] Creating commit token because I am the rep.
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] Saving state aru 5 high seq received 5
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] Storing new sequence id for ring 948
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] entering COMMIT state.
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] entering RECOVERY state.
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] position [0] member 192.168.100.13:
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] previous ring seq 2372 rep 192.168.100.13
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] aru 5 high delivered 5 received flag 1
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] position [1] member 192.168.100.14:
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] previous ring seq 2372 rep 192.168.100.13
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] aru 2 high delivered 2 received flag 0
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] copying all old ring messages from 3-5.
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] Originated 3 messages in RECOVERY.
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] Originated for recovery: 3 4 5
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] Not Originated for recovery:
Jul 28 08:12:51 server01 openais[2552]: [TOTEM] Sending initial ORF token
```

Thanks in advance. Regards, Florencia

My guess is the customer's iptables rules were not set properly during the update. This error means that no multicast messages could be received by the receivers. Regards -steve

Thanks, Steve, for your quick reply. Iptables is disabled, and the customer does not have a Cisco switch (which I read could have problems with multicast messages).
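The diagnosis above ("no multicast messages could be received by the receivers") can be checked independently of openais with a small send/receive test. The following is a minimal sketch, not the cluster's actual configuration: the group, port, and loopback interface below are arbitrary test values chosen so the script is self-contained; for a real diagnosis you would run a sender on one node and a receiver on another, using the interface the cluster traffic crosses.

```python
import socket
import struct

# Arbitrary test values -- NOT the cluster's real totem group/port.
GROUP = "239.255.42.42"
PORT = 50000
IFACE = "127.0.0.1"  # loopback keeps this sketch self-contained

# Receiver: bind the port and join the multicast group on IFACE.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
rx.bind(("", PORT))
mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton(IFACE))
rx.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
rx.settimeout(5.0)

# Sender: route the datagram out the same interface, loopback enabled.
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF, socket.inet_aton(IFACE))
tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, 1)
tx.sendto(b"totem-test", (GROUP, PORT))

try:
    data, addr = rx.recvfrom(1024)
    print("multicast OK:", data.decode(), "from", addr[0])
except socket.timeout:
    # This is the condition the log's FAILED TO RECEIVE indicates:
    # datagrams go out, but nothing is ever delivered back.
    print("no multicast received -- check IGMP snooping / iptables")
```

If the cross-node variant of this test times out while unicast between the same nodes works, the problem is in the multicast path (switch IGMP snooping, iptables, or bridge configuration) rather than in openais itself.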
The nodes are two blades from an IBM BladeCenter. Here is the information:

Chassis: BladeCenter-H
- Description: BladeCenter-H
- Machine Type/Model: 88524YU

Nodes:
- Product Name: HS21-XM Blade Server, 2 dual- or quad-core Intel Xeon
- Description: HS21 XM (Type 7995)

Could you please let me know if the error with multicast messages is related to openais, or is a hardware/configuration issue? Thanks again. Regards, Florencia

They used the software with 5.2 and an upgrade to 5.3 caused this problem? I believe the BladeCenter uses Cisco switches internally. Usually this would be a hardware configuration issue with the switch or an iptables issue. I reread the bugzilla and was curious: is the user using Xen bridging or virbr?

Hello,
> They used the software with 5.2 and an upgrade to 5.3 caused this problem?
No, the problem was present in RHEL 5.2 and persists after the upgrade to RHEL 5.3.
> is the user using Xen bridging or virbr?
Neither of them. It's not using Xen. Thanks

Hello, the switch included in the BladeCenter is the "IBM Server Connectivity Module for BladeCenter 39Y9324". Do you have any workaround to check, similar to the one for Cisco switches (http://kbase.redhat.com/faq/docs/DOC-5933)? I know it's not regarding openais, but perhaps you have seen this behavior before. Thanks in advance.

Hello, I saw that IGMP snooping is enabled by default in the BladeCenter's switch, as in Cisco switches (http://publib.boulder.ibm.com/infocenter/bladectr/documentation/topic/com.ibm.bladecenter.io_39Y9324.doc/31r1755.pdf). What are the switch's requirements for openais to work properly? Thanks in advance.

First I'd check that the user doesn't have xend starting by default. Another option is to change the init script for cman to run at runlevel 99. If that doesn't work, then it could be a switch problem. We have docs on the product which describe the switch requirements, but I am not sure where they are. Ask Paul Kennedy.
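The "runlevel 99" workaround mentioned in this thread refers to the SysV init start priority: making cman start last, after xend has finished rearranging the bridge interfaces. A sketch of how that is typically done on RHEL 5 with chkconfig follows; the priority numbers in the comments are illustrative, so check the actual header in your own /etc/init.d/cman before editing.

```shell
# Inspect the current boot ordering of the two services (SysV init).
chkconfig --list cman
chkconfig --list xend

# Steve's first suggestion: disable xend at boot if Xen is not required.
chkconfig xend off

# The runlevel-99 workaround: raise cman's start priority to 99 so it
# starts after the Xen network scripts.  Edit the chkconfig header in
# /etc/init.d/cman -- e.g. change a line of the form
#     # chkconfig: - 21 79
# to
#     # chkconfig: - 99 01
# (the exact original numbers depend on the package), then re-register
# the service so the S/K symlinks are rebuilt from the new header:
chkconfig --del cman
chkconfig --add cman
```

After the change, `ls /etc/rc3.d/ | grep cman` should show an S99cman link, ordering cman after the network and Xen scripts.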
Regards -steve

Florencia, I don't know that we have a clear picture of the switch requirements. One thing you might try, if you haven't, is to turn on fast port forwarding on the switch.

Florencia, here is a relevant kbase article: http://kbase.redhat.com/faq/docs/DOC-5933

Thanks, Steven, for your help. We followed these guidelines and it seems to be working now:

Multicast Addresses
http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Cluster_Administration/s1-multicast-considerations-CA.html

Why are my Red Hat Enterprise Linux cluster nodes having problems communicating when connected to a Cisco Catalyst switch?
http://kbase.redhat.com/faq/docs/DOC-5933

Thanks and regards,
--
Florencia Fotorello
Global Support Services
Red Hat Latin America