Description of problem:

Hello! I'm not sure whether I'm doing something wrong or whether this is really a bug.

I'm testing the new RHEL5 Cluster Suite in VMware and have installed the newest cluster RPMs (cman-2.0.70-1.el5, rgmanager-2.0.28-1.el5, lvm2-2.02.26-2.el5, luci-0.10.0-2.el5, ricci-0.10.0-2.el5).

My setup:
* 2-node cluster
* quorum disk
* heuristic just for testing: [ -f /tmp/has_qu ] -> this file exists on both nodes

I created a quorum disk on a shared (VMware) disk using mkqdisk, and mkqdisk -L shows this disk on both nodes. But every time I start qdiskd, the node reboots. The last messages are:

Aug 22 11:24:06 rhel5n1 qdiskd[20035]: <info> Quorum Daemon Initializing
Aug 22 11:24:07 rhel5n1 qdiskd[20035]: <info> Heuristic: '[ -f /tmp/has_qu ]' UP
Aug 22 11:25:46 rhel5n1 qdiskd[20035]: <info> Initial score 10/10
Aug 22 11:25:46 rhel5n1 qdiskd[20035]: <info> Initialization complete
Aug 22 11:25:46 rhel5n1 qdiskd[20035]: <notice> Score sufficient for master operation (10/10; required=10); upgrading
Aug 22 11:25:46 rhel5n1 openais[1766]: [CMAN ] quorum device registered
Aug 22 11:26:46 rhel5n1 qdiskd[20035]: <info> Assuming master role
Aug 22 11:26:56 rhel5n1 gfs_controld[1794]: groupd_dispatch error -1 errno 11
Aug 22 11:26:56 rhel5n1 gfs_controld[1794]: groupd connection died
Aug 22 11:26:56 rhel5n1 gfs_controld[1794]: cluster is down, exiting
Aug 22 11:26:56 rhel5n1 dlm_controld[1788]: cluster is down, exiting
Aug 22 11:26:56 rhel5n1 kernel: dlm: closing connection to node 2
Aug 22 11:26:56 rhel5n1 kernel: dlm: closing connection to node 1
Aug 22 11:26:56 rhel5n1 clurgmgrd[2260]: <crit> Watchdog: Daemon died, rebooting...
Aug 22 11:26:57 rhel5n1 kernel: md: stopping all md devices.

clustat shows the following:

Member Status: Quorate

 Member Name                        ID   Status
 ------ ----                        ---- ------
 rhel5n1.icoserve.test                 1 Online, rgmanager
 rhel5n2.icoserve.test                 2 Online, Local, rgmanager
 /dev/sdd                              0 Offline, Quorum Disk

 Service Name              Owner (Last)                    State
 ------- ----              ----- ------                    -----
 service:testscript1       rhel5n1.icoserve.test           started
 service:testscript2       rhel5n2.icoserve.test           started

My quorumd configuration was:

<quorumd interval="10" tko="10" votes="1" device="/dev/sdd" status_file="/tmp/qu01.log">
  <heuristic program="[ -f /tmp/has_qu ]" score="10" interval="5"/>
</quorumd>

The votes:

<cman expected_votes="2" two_node="0"/>

Output of the qdisk status file:

On node 1:
Time Stamp: Wed Aug 22 11:26:56 2007
Node ID: 1
Score: 10/10 (Minimum required = 10)
Current state: Master
Initializing Set: { }
Visible Set: { 1 2 }
Master Node ID: 1
Quorate Set: { 2 3 }

On node 2:
Time Stamp: Wed Aug 22 11:26:47 2007
Node ID: 2
Score: 10/10 (Minimum required = 10)
Current state: Running
Initializing Set: { }
Visible Set: { 1 2 }
Master Node ID: 1
Quorate Set: { 2 3 }

I'm using the same VMware disks for 2 shared GFS partitions. This works fine, so I don't think the problem is caused by the use of VMware disks.

Version-Release number of selected component (if applicable):
RHEL5 U0 + updates of the cluster RPMs

How reproducible:
* 2-node cluster + 1 quorum disk
* start cman, clvmd, (rgmanager - same behavior) -> everything OK
* start qdiskd -> start OK
* after some time, qdiskd is "Assuming master role"
* reboot (groupd_dispatch error -1 errno 11)

Steps to Reproduce:
1. Set up a cluster with 2 nodes, expected_votes=2, two_node=0
2. Set up the quorum disk
3. Start cman, clvmd, qdiskd

Actual results:
Reboot after "Assuming master role"

Expected results:
qdisk online and working

Additional info:
Tests were done in VMware GSX with shared virtual disks. Boot parameters are "clocksource=pit nosmp noapic nolapic" to work around an rtc bug. The date is synchronized on both nodes.
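
For completeness, here is roughly how my cluster.conf is laid out. This is only a sketch: the cluster name is made up here, and the fencing and rm sections are trimmed; the quorumd and cman elements are the ones quoted above.

<cluster name="testcluster" config_version="1">   <!-- cluster name is illustrative -->
  <cman expected_votes="2" two_node="0"/>
  <clusternodes>
    <clusternode name="rhel5n1.icoserve.test" nodeid="1" votes="1">
      <fence/>   <!-- fencing trimmed for brevity -->
    </clusternode>
    <clusternode name="rhel5n2.icoserve.test" nodeid="2" votes="1">
      <fence/>
    </clusternode>
  </clusternodes>
  <quorumd interval="10" tko="10" votes="1" device="/dev/sdd" status_file="/tmp/qu01.log">
    <heuristic program="[ -f /tmp/has_qu ]" score="10" interval="5"/>
  </quorumd>
  <rm/>   <!-- services testscript1/testscript2 trimmed -->
</cluster>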
Visible Set: { 1 2 }
Master Node ID: 1
Quorate Set: { 2 3 }
            ^^^^

This is strange; are your node IDs 1 and 2 in cluster.conf? (It shouldn't matter that they are virtual machines or that you are using virtual disks.)
Yes, they are. I set up the cluster using system-config-cluster, which, by the way, has a bug too: in the Management tab it does not show the cluster members, but it shows the services correctly. Maybe this problem is related to the node IDs. What's strange about them? I guess they should be "0" and "1"? When the quorum disk joins the cluster (it is currently offline), clustat -x says the node ID of the quorum disk is "0". I'll try it with node IDs "0" and "1" and will report the results here.
I stopped the cluster, edited the config with vim, and changed the node ID of node 1 to "0" and node 2 to "1". Then I scp-ed the cluster.conf to the other node and tried to start cman. This is the output:

[root@rhel5n2 ~]# /etc/init.d/cman start
Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... failed
cman not started: No node ID for rhel5n1.icoserve.test, run 'ccs_tool addnodeids' to fix
/usr/sbin/cman_tool: aisexec daemon didn't start
                                                           [FAILED]

The cluster.conf was changed after this failed start: the version is <version + 1> and the node IDs are now:
* Node 1 has "2"
* Node 2 has "1"

I think ccsd made these changes - but cman still does not start. I have to set node 1's ID to "1" and node 2's ID to "2" to get cman running.
For what it's worth, you can't use node ID = 0 for nodes. The visible set should be nodes {1 2} after they get up and running; qdiskd shouldn't display node 0 in the quorate set. It could be that the quorate set output is backwards, but I doubt that. I'll look more into it. Are you using multipath / LVM / etc for qdisk?
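
In cluster.conf terms, the node entries should keep node IDs starting at 1, something like this (a sketch using your node names; the votes values are just an example):

<clusternodes>
  <clusternode name="rhel5n1.icoserve.test" nodeid="1" votes="1"/>
  <clusternode name="rhel5n2.icoserve.test" nodeid="2" votes="1"/>
</clusternodes>
<!-- node ID 0 is not valid for cluster nodes; clustat reports the quorum device itself as node 0 -->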
I'm using VMware Workstation, so the disks are virtual disks connected to both nodes. They are raw - no LVM, no multipath Fibre Channel. GFS works fine in this setup, and Red Hat Cluster Suite 3 with a quorum disk works too, so I don't think the problem is related to the fact that they are virtual disks. Here is the disk configuration for both virtual machines:

disk.locking = "FALSE"
scsi1.sharedBus = "virtual"
scsi1.virtualDev = "lsilogic"
scsi1.present = "TRUE"
scsi1:0.deviceType = "disk"
scsi1:0.writeThrough = "TRUE"
scsi1:0.present = "TRUE"
scsi1:0.fileName = "/opt/vmware/machines/vmrhel3_san/qu01.vmdk"
scsi1:0.mode = "independent-persistent"
scsi1:1.deviceType = "disk"
scsi1:1.writeThrough = "TRUE"
scsi1:1.present = "TRUE"
scsi1:1.fileName = "/opt/vmware/machines/vmrhel3_san/qu02.vmdk"
scsi1:1.mode = "independent-persistent"
Ah, so cman / aisexec probably crashed on activation?
yes, it seems so
Herbert, I'm pretty sure this is fixed in 5.1 - there are two bugs related to the cman/qdisk interaction which were fixed:

(1) device names could only be 15 characters (you were using /dev/sdd, so this probably isn't the problem)
(2) a timer logic bug in openais caused cman/openais to die when qdiskd advertised fitness

Could you retest with the 5.1 cman + ccs packages and let me know if it's not fixed? It should be. If it is fixed, could you close this bug?
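
To see what you currently have installed, a plain rpm query is enough (the fixed builds are the ones shipped with 5.1):

# list the cluster-related package versions currently installed
rpm -q cman openais rgmanager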
Sounds good - where can I find these packages? In RHN the latest versions are:

cman-2.0.73-1.el5_1.1.i386       Red Hat Enterprise Linux (v. 5 for 32-bit x86)
rgmanager-2.0.31-1.el5.i386      RHEL Clustering (v. 5 for 32-bit x86)

With these versions the problem is not solved - cman died a few minutes after starting qdiskd.
You also need openais-0.80.3-7.el5 if you don't already have it. If you do, then this is a new bug - and it seems specific to your configuration (VMware).
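
Assuming the box is subscribed to the RHEL 5.1 channels, something like this should pull in all three packages at once:

# update the cluster stack to the 5.1 builds
yum update cman openais rgmanager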
Yum-ed up to RHEL 5.1 and tested again -> now it's working; the quorum disk is online and appears to be working. Thank you! (I cannot close the bug - Bugzilla says only the owner can do that.)
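
For anyone hitting the same thing, this is how I checked the result (output omitted here):

# the quorum disk should now show up as Online instead of Offline
clustat
# the status file configured via status_file= in the quorumd element
cat /tmp/qu01.log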
Sweet.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0599.html
*** This bug has been marked as a duplicate of 314641 ***