Description of problem:
I have a two-node cluster. When qdiskd starts, the following errors appear and the cluster goes down:

Feb 3 00:20:19 nodo2 qdiskd[2649]: <info> Assuming master role
Feb 3 00:20:20 nodo2 qdiskd[2649]: <err> cman_dispatch: Host is down
Feb 3 00:20:20 nodo2 qdiskd[2649]: <err> Halting qdisk operations

Version-Release number of selected component (if applicable):
cman-2.0.73-1.el5_1.1

How reproducible:
I have one heuristic for qdiskd in cluster.conf:

<quorumd device="/dev/gnbd2" votes="1" min_score="5" label="quorum" tko="20" interval="1">
    <heuristic interval="2" program="ping -c1 -t1 192.168.1.254" score="2"/>
</quorumd>

I started cman and then ran qdiskd in the foreground ("qdiskd -f -d"). It shows the errors above when it detects the Active node, and then the cman daemon goes down.

Actual results:

Expected results:

Additional info:
If I remove the heuristic, the qdiskd daemon starts, but clustat shows it as offline.
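A side note on the quorumd stanza above: it sets min_score="5" while its only heuristic contributes score="2". As far as I recall from qdisk(5), min_score must not exceed the sum of the heuristic scores, otherwise the quorum disk can never become available. The sketch below checks that arithmetic for the stanza as posted. The helper name check_quorumd, the default score of 1, and the floor((n+1)/2) fallback for an unset min_score are assumptions taken from my reading of the man page, not from this report, and the check says nothing about why cman itself goes down.

```python
import xml.etree.ElementTree as ET

# The quorumd stanza exactly as posted in the report.
QUORUMD_XML = """
<quorumd device="/dev/gnbd2" votes="1" min_score="5" label="quorum" tko="20" interval="1">
    <heuristic interval="2" program="ping -c1 -t1 192.168.1.254" score="2"/>
</quorumd>
"""


def check_quorumd(xml_text):
    """Return (min_score, max_score, reachable) for a quorumd stanza.

    Hypothetical helper: it only compares min_score with the best score
    the configured heuristics could ever add up to.
    """
    root = ET.fromstring(xml_text)
    # Assumption: a heuristic without an explicit score counts as 1.
    scores = [int(h.get("score", "1")) for h in root.findall("heuristic")]
    max_score = sum(scores)
    min_score = int(root.get("min_score", "0"))
    if min_score == 0:
        # Assumption: unset or 0 falls back to floor((n + 1) / 2).
        min_score = (max_score + 1) // 2
    return min_score, max_score, max_score >= min_score


min_score, max_score, reachable = check_quorumd(QUORUMD_XML)
print(f"min_score={min_score} max_score={max_score} reachable={reachable}")
```

For the stanza as posted, the single heuristic can reach at most 2 against a min_score of 5, so the configured score is out of reach no matter how the ping behaves.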
Are you sure you have a running two-node cluster? That second line says that cman was down, i.e. this node is not in the cluster, or has just left it. Are there any messages in syslog to indicate why cman has shut down? If the problem is reproducible, then try starting cman with 'cman_tool join -d' to get more debugging information. Also, if this really is RHEL5.0, try upgrading. I think a lot of qdisk problems were fixed in 5.1.
I think cman-2.0.73-1.el5_1.1 is 5.1 + errata. I've never seen this before, however. Could you attach your cluster.conf as well?
I'm sure that there are two nodes. Once qdiskd is running, it shuts down the cman daemon. This only happens with one heuristic. I added two more heuristics and it was fine, but the configured services do not fail over. I also tried removing the heartbeat link: the quorum is dissolved immediately and the services do not fail over either. In addition, the rgmanager daemon could not be shut down; it cannot be stopped, not even by rebooting the system, and the only way out is to reset the server. I thought the quorum disk would resolve the split-brain problem, but it does not. Maybe I am wrong or the cluster is badly set up; I only want to report the lab test that I did. I attach the cluster.conf and messages. Thanks
Created attachment 294072 [details] messages and cluster.conf
It looks like cman is being restarted without restarting the other services. If you shut cman down, it's important to make sure that everything else is also shut down. Normally it will check that for you if you use cman_tool or the init scripts - I'm not entirely sure what's happened in this case. The best way to be sure of this is to always use the init scripts (at least) or even to reboot the entire node to make sure that there is no state left lying around. If you reboot the node, does it join the cluster properly?
It's probably the openais IPC bug that causes openais to 'splode when qdiskd advertises master-status.

Feb 3 02:15:44 apache2 openais[2863]: [MAIN ] AIS Executive Service RELEASE 'subrev 1324 version 0.80.2'

It wasn't fixed until 0.80.3-some_rev.
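To check other nodes for the same condition, the version can be pulled out of that startup line and compared with 0.80.3. This is only a rough sketch: the function name openais_version and the regular expression are illustrative, and since the fix landed in some package revision of 0.80.3 rather than in 0.80.3 as such, a result of 0.80.3 still needs the package release checked by hand.

```python
import re

# The openais startup line quoted from the attached messages file.
LOG_LINE = (
    "Feb 3 02:15:44 apache2 openais[2863]: [MAIN ] AIS Executive Service "
    "RELEASE 'subrev 1324 version 0.80.2'"
)

# Upstream version named in this comment as carrying the fix; the exact
# package revision is not stated, so only the upstream part is compared.
FIXED_IN = (0, 80, 3)


def openais_version(line):
    """Return the openais version in a log line as a tuple of ints, or None."""
    match = re.search(r"version (\d+(?:\.\d+)*)'", line)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


version = openais_version(LOG_LINE)
needs_upgrade = version is not None and version < FIXED_IN
print(version, needs_upgrade)
```

Comparing tuples of integers avoids the usual trap of string comparison, where "0.80.10" would sort before "0.80.3".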
http://rpm.pbone.net/index.php3/stat/4/idpl/5531749/com/openais-0.80.3-7.el5.i386.rpm.html
http://rpm.pbone.net/index.php3/stat/4/idpl/5534741/com/openais-0.80.3-7.el5.x86_64.rpm.html

Try one of those.
*** This bug has been marked as a duplicate of 314641 ***
(In reply to comment #6)
> It's probably the openais IPC bug that causes openais to 'splode when qdiskd
> advertises master-status.
>
> Feb 3 02:15:44 apache2 openais[2863]: [MAIN ] AIS Executive Service RELEASE
> 'subrev 1324 version 0.80.2'
>
> It wasn't fixed until 0.80.3-some_rev.

Yes, the member node joins the cluster successfully.
The openais-0.80.3-7 packages fix at least one problem specific to qdiskd/cman/openais interaction - let us know if it fixes your qdiskd problem.