Description of problem:
When booting a three-node cluster on an Intel Modular Server (IMS), the cluster fails to come up correctly. cman appears to start on all three nodes and they form a cluster, but then two of them start logging "[TOTEM] Retransmit List: xxxx", with the same xxxx repeated 3-4 times a second. This never stops.

Version-Release number of selected component (if applicable):
cman-2.0.115-85.el5_7.2

How reproducible:
Always

Steps to Reproduce:
1. Reboot or start cluster fresh.

Actual results:
Cluster fails to initialize correctly.

Expected results:
Cluster comes up without issues.

Additional info:
If I disable cman, clvmd, and rgmanager at boot, I can run the following after the machine is fully booted and things come up as they should. I wait about 2-3 seconds between each command (cexec executes the command on all three nodes at the same time):

cexec service cman start
cexec service clvmd start
cexec service cmirror restart
cexec service gfs start
cexec service rgmanager start
A few more notes on this. I first noticed what might be the same issue when I added the first IMS blade to an existing 5-node cluster. When the IMS blade started (automatically), the cluster exhibited the same behavior: all but one node would log "[TOTEM] Retransmit List: xxxx". If I manually ran fence_node against the node that wasn't logging the Retransmit List, the cluster would recover and start working fine as soon as that node was successfully fenced. I was fine with this method as long as part of the cluster was non-IMS hardware. (The slowest box always seemed to be the one that wasn't logging the retransmit message.) I just made sure no resources were on that node so I could fence it easily. However, now that the cluster runs only on the IMS, this no longer works. When one of the IMS nodes reboots and automatically starts the cman services, the cluster hangs in the same way. The only way I've been able to get things working is to start the services manually after the box is fully up.
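To confirm that a node is stuck repeating the same retransmit entry, the log can be summarized by how often each value appears. This is only a sketch: the function name retransmit_summary is made up for illustration, and it assumes the messages contain the literal text "[TOTEM] Retransmit List:" followed by the list value, as quoted above.

```shell
#!/bin/bash
# Hypothetical helper (not part of cman or openais): count how often each
# "[TOTEM] Retransmit List" value appears in a log file, most frequent first.
retransmit_summary() {
    local logfile="$1"
    # -F matches the bracketed text literally instead of as a regex.
    grep -F '[TOTEM] Retransmit List:' "$logfile" \
        | sed 's/.*Retransmit List: *//' \
        | sort | uniq -c | sort -rn
}

# Example invocation (assumes the node logs to /var/log/messages):
#   retransmit_summary /var/log/messages | head
```

A single value with a very large count at the top of the output would match the behavior described here, where the same number repeats several times a second.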
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release.
This was addressed in a fix to openais: http://rhn.redhat.com/errata/RHBA-2012-0180.html

*** This bug has been marked as a duplicate of bug 781773 ***
Does it matter that this is a RHEL5 bug and was duped to Fedora 16?
This has not been fixed by RHBA-2012-0180. I've installed all updates as of today and restarted the cluster with services enabled. 2 of the 3 nodes start spitting out "[TOTEM] Retransmit List: xxxx" with the same number over and over and the cluster hangs.
Please attach sosreports or cluster.conf and logs from the nodes.
Along with the sosreports, please attach your iptables configuration (I believe it's included in the sosreport, but just in case), because this really looks like a problem with the network configuration.
Created attachment 576508: Requested files
I've attached a file that includes the messages log for each member of the cluster during yesterday's attempts to bring the system up. The first boot is with all the services enabled. The second is with the services disabled; after all 3 nodes were up, I ran a script that contains the following:

[root@xen2 ~]# cat ~/bin/start_cluster
#!/bin/bash
service cman start
service cmirror start
sleep 1
service clvmd start
sleep 1
service gfs start
sleep 5
service rgmanager start

I've run this exact same configuration on a different set of hardware (slower machines) and everything comes up as it should (see the first few comments of this bug). It was only after adding the IMS nodes to the cluster that I started seeing this issue.
Shad,

Please submit your issue through your support representative. Surprisingly, they can troubleshoot these switch configuration issues better than engineering, since they see all customer issues.

Regards
-steve
Shad, please let me know whether GSS was able to solve your issue so I can eventually close this bug.
I ended up fixing this by setting NETWORK_BRIDGE_SCRIPT in /etc/sysconfig/cman to the network script I was using. My custom script was setting up 4 bridges. Once I set the NETWORK_BRIDGE_SCRIPT to match my custom script everything works as expected. This can be closed.
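For anyone who hits the same symptom, the change amounts to one line in /etc/sysconfig/cman. This is a minimal sketch: the script path below is hypothetical, since the actual name of the custom bridge script is not given in this report.

```shell
# /etc/sysconfig/cman
# Point cman at the network script that actually creates the bridges.
# The path is a placeholder; substitute the custom script in use on the node.
NETWORK_BRIDGE_SCRIPT="/etc/xen/scripts/network-custom-bridges"
```

The value should name the same script the node really uses to build its bridges, which here was a custom script setting up 4 bridges.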
Shad, thanks for the good news. I hope everything works as expected. Closing as NOTABUG.