Bug 754237 - cman fails to start (fine after system is up)
Summary: cman fails to start (fine after system is up)
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.7
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Jan Friesse
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 807971
Reported: 2011-11-15 20:01 UTC by Shad L. Lords
Modified: 2012-05-30 06:23 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-05-30 06:23:33 UTC
Target Upstream Version:


Attachments
Requested files (28.30 KB, application/octet-stream)
2012-04-10 16:17 UTC, Shad L. Lords

Description Shad L. Lords 2011-11-15 20:01:13 UTC
Description of problem:

When booting a three-node cluster on an Intel Modular Server (IMS), the cluster fails to come up correctly.  cman appears to start on all three nodes and they form a cluster, but then two of them start logging "[TOTEM] Retransmit List: xxxx", with the same xxxx repeated 3-4 times a second.  This never stops.
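
To put a number on how often the message repeats, something like the following could be run on each node. This is only a sketch: it assumes the openais messages end up in /var/log/messages, and the function name count_retransmits is made up for illustration.

```shell
#!/bin/bash
# Count "[TOTEM] Retransmit List:" lines in a log file.
# Assumption: openais logs via syslog to /var/log/messages; pass another
# path as the first argument if the logs live elsewhere.
count_retransmits() {
    local logfile="${1:-/var/log/messages}"
    # -F matches the pattern as a fixed string, so the brackets are literal;
    # -c prints the number of matching lines instead of the lines themselves.
    grep -cF '[TOTEM] Retransmit List:' "$logfile"
}
```

Running `count_retransmits /var/log/messages` twice a few seconds apart shows whether the count is still climbing.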

Version-Release number of selected component (if applicable):

cman-2.0.115-85.el5_7.2

How reproducible:

Always

Steps to Reproduce:
1. Reboot or start cluster fresh.
  
Actual results:

Cluster fails to initialize correctly

Expected results:

Cluster comes up without issues.

Additional info:

If I disable the cman, clvmd, and rgmanager services, I can run the following after the machine is fully booted and everything comes up as it should.  I wait about 2-3 seconds between commands.  (cexec executes the command on all three nodes at the same time.)

cexec service cman start
cexec service clvmd start
cexec service cmirror restart
cexec service gfs start
cexec service rgmanager start

Comment 1 Shad L. Lords 2011-11-18 17:42:09 UTC
A few more notes on this.  I first noticed an issue that might be the same thing when I added the first IMS blade to an existing 5-node cluster.  When the IMS blade started (automatically), the cluster exhibited the same behavior: all but one node would log "[TOTEM] Retransmit List: xxxx".  If I manually ran fence_node against the node that wasn't logging the Retransmit List message, the cluster would recover and start working fine as soon as that node was successfully fenced.

I was fine with this method as long as there was a non-IMS part of the cluster.  (The slowest box always seemed to be the one that wasn't throwing the retransmit message.)  I would just make sure no resources were on that node so I could fence it easily.

However, once the cluster was running only on the IMS, this no longer worked.  When one of the IMS nodes reboots and automatically starts the cman services, it throws things into a tizzy.  The only way I've been able to get things working is to start the services manually after the box is fully up.

Comment 2 RHEL Program Management 2012-04-02 10:52:29 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.

Comment 3 Lon Hohberger 2012-04-09 17:28:10 UTC
This was addressed in a fix to openais:

http://rhn.redhat.com/errata/RHBA-2012-0180.html

*** This bug has been marked as a duplicate of bug 781773 ***

Comment 4 Shad L. Lords 2012-04-09 17:47:08 UTC
Does it matter that this is a RHEL5 bug and was duped to Fedora 16?

Comment 5 Shad L. Lords 2012-04-09 18:31:57 UTC
This has not been fixed by RHBA-2012-0180.  I've installed all updates as of today and restarted the cluster with services enabled.  2 of the 3 nodes start spitting out "[TOTEM] Retransmit List: xxxx" with the same number over and over and the cluster hangs.

Comment 6 Fabio Massimo Di Nitto 2012-04-10 03:26:27 UTC
Please attach sosreports or cluster.conf and logs from the nodes.

Comment 7 RHEL Program Management 2012-04-10 03:37:06 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.

Comment 8 Jan Friesse 2012-04-10 06:59:42 UTC
Along with the sosreports, please attach the iptables configuration (I believe it's in the sosreport, but ...), because this really looks like a problem with the network configuration.

Comment 9 Shad L. Lords 2012-04-10 16:17:29 UTC
Created attachment 576508
Requested files

Comment 10 Shad L. Lords 2012-04-10 16:25:12 UTC
I've attached a file that includes the messages log for each member of the cluster during yesterday's attempts to bring the system up.  The first boot is with all the services enabled.  The second one is with the services disabled; after all 3 nodes were up, I ran a script that contains the following:

[root@xen2 ~]# cat ~/bin/start_cluster
#!/bin/bash

service cman start
service cmirror start
sleep 1
service clvmd start
sleep 1
service gfs start
sleep 5
service rgmanager start

I've run this exact same configuration on a different set of hardware (slower machines) and everything comes up as it should (see first few comments of this bug).  It was only after adding the IMS nodes to the cluster that I started seeing this issue.

Comment 12 Steven Dake 2012-04-10 18:26:16 UTC
Shad,

Please submit your issue through your support representative.  Surprisingly, they can troubleshoot these switch configuration issues better than engineering, since they see all customer issues.

Regards
-steve

Comment 14 Jan Friesse 2012-04-24 13:32:58 UTC
Shad,
please let me know whether GSS was able to solve your issue so I can eventually close the bug.

Comment 15 Shad L. Lords 2012-05-30 02:05:07 UTC
I ended up fixing this by setting NETWORK_BRIDGE_SCRIPT in /etc/sysconfig/cman to the network script I was using.  My custom script was setting up 4 bridges.  Once I set NETWORK_BRIDGE_SCRIPT to match my custom script, everything worked as expected.
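
As a rough illustration of that change: the script name below is a placeholder, not the one actually used here, and whether the variable takes a bare script name or a full path is an assumption worth checking against the comments in /etc/init.d/cman on your own system.

```shell
# /etc/sysconfig/cman (fragment)
# Assumption: "network-custom" is a made-up name standing in for the custom
# script that sets up the bridges; substitute the script your nodes really use.
NETWORK_BRIDGE_SCRIPT="network-custom"
```

The file is sourced by the cman init script, so the setting takes effect the next time the cman service starts.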

This can be closed.

Comment 16 Jan Friesse 2012-05-30 06:23:33 UTC
Shad,
thanks for the good news; I hope everything keeps working as expected.

Closing as NOTABUG.

