Bug 253836
| Summary: | qdiskd causes node to reboot | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Herbert L. Plankl <h.plankl> |
| Component: | cman | Assignee: | Lon Hohberger <lhh> |
| Status: | CLOSED DUPLICATE | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 5.0 | CC: | ccaulfie, cluster-maint |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | i386 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | RHBA-2007-0599 | Doc Type: | Bug Fix |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2007-11-16 13:48:37 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 314641 | ||
| Bug Blocks: | |||
Visible Set: { 1 2 }
Master Node ID: 1
Quorate Set: { 2 3 }
^^^^
This is strange; are your nodeids 1 and 2 in cluster.conf?
(it shouldn't matter that they were virtual machines nor that you were using
virtual disks)
yes they are. I've set up the cluster using system-config-cluster, which btw. has a bug too: in the Managementtab it does not show the cluster-members but it shows the services correctly. Maybe this problem is related to the nodeids. What's strange about the nodeids? I guess they should be "0" and "1"? If the quorum joins the cluster (it is offline), clustat -x says the nodeid of the quorumdisk is "0". I'll try it with nodeid "0" and "1" and will report the results here.. I've stopped the cluster, vim-ed the config and changed nodeid of node 1 to "0"
and node 2 to "1". Then I scp-ed the cluster.conf to the other node and tried to
start cman. Thats the output:
[root@rhel5n2 ~]# /etc/init.d/cman start
Starting cluster:
Loading modules... done
Mounting configfs... done
Starting ccsd... done
Starting cman... failed
cman not started: No node ID for rhel5n1.icoserve.test, run 'ccs_tool
addnodeids' to fix /usr/sbin/cman_tool: aisexec daemon didn't start
[FAILED]
The cluster.conf is changed after this failed start: Version is <version + 1>
and the nodeids are
* Node 1 has "2"
* Node 2 has "1"
I think, ccs-daemon did this changes - but cman does not start although.. I have
to set node 1's id to "1" and node 2's id to "2" to get cman running..
For what it's worth, you can't use node ID = 0 for nodes. The visible set
should be nodes {1 2} after they get up and running; qdiskd shouldn't display
node 0 in the quorate set.
It could be that the quorate set output is backwards, but I doubt that. I'll
look more into it.
Are you using multipath / LVM / etc for qdisk?
I'm using VMware-Workstation. And so the disks are virtual disks connected to both nodes. They are raw - no LVM, no Multipath-Fibrechannel. GFS works fine on this setting and RedHat cluster 3 with quorum-disk works too (So I don't think the problem is related to the fact that they are virtual disks) Here is the disk setting for both virtual machines: disk.locking = "FALSE" scsi1.sharedBus = "virtual" scsi1.virtualDev = "lsilogic" scsi1.present = "TRUE" scsi1:0.deviceType = "disk" scsi1:0.writeThrough = "TRUE" scsi1:0.present = "TRUE" scsi1:0.fileName = "/opt/vmware/machines/vmrhel3_san/qu01.vmdk" scsi1:0.mode = "independent-persistent" scsi1:1.deviceType = "disk" scsi1:1.writeThrough = "TRUE" scsi1:1.present = "TRUE" scsi1:1.fileName = "/opt/vmware/machines/vmrhel3_san/qu02.vmdk" scsi1:1.mode = "independent-persistent" Ah, so cman / aisexec probably crashed on activation? yes, it seems so Herbert, I'm pretty sure this is fixed in 5.1 - there are two bugs related to cman qdisk interaction which were fixed: (1) device names could only be 15 characters (you were using /dev/sdd, so this probably isn't the problem) (2) timer logic bug in openais caused cman/openais to die when qdiskd advertised fitness Could you retest with the 5.1 cman + ccs packages and let me know if it's not fixed? It should be. If it is fixed, could you close this bug? Sounds good - where can I find this packages? In RHN the latest versions are: cman-2.0.73-1.el5_1.1.i386 Red Hat Enterprise Linux (v. 5 for 32-bit x86) rgmanager-2.0.31-1.el5.i386 RHEL Clustering (v. 5 for 32-bit x86) With this versions the problem is not solved - cman died a few minutes after starting qdiskd. You also need the openais-0.80.3-7.el5 if you don't already have it. If you do, then this is a new bug - and it seems specific to your configuration (VMWare). Yum-ed up to RHEL5.1 and tested again -> now its working, qdisk is online and seems to be working. Thank You! (I cannot close the bug - bugzilla says, only owner can do that..) Sweet. An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0599.html *** This bug has been marked as a duplicate of 314641 *** |
Description of problem: Hello! I'm not sure, if I'm doinig something wrong or if it's really a bug. My Problem: Im testing in vmware the new RHEL5 cluster suite and have installed the newest cluster-rpms (cman-2.0.70-1.el5 rgmanager-2.0.28-1.el5 lvm2-2.02.26-2.el5 luci-0.10.0-2.el5 ricci-0.10.0-2.el5). My setup is * 2-node-cluster * quorum-disk * heuristic just for testing [ -f /tmp/has_qu ] -> this file exists on both nodes I've created a quorum-disk on a shared disk (vmware) using mkqdisk. mkqdisk -L shows this disk on both nodes. But every time I start qdiskd, the node reboots. The last messages are: Aug 22 11:24:06 rhel5n1 qdiskd[20035]: <info> Quorum Daemon Initializing Aug 22 11:24:07 rhel5n1 qdiskd[20035]: <info> Heuristic: '[ -f /tmp/has_qu ]' UP Aug 22 11:25:46 rhel5n1 qdiskd[20035]: <info> Initial score 10/10 Aug 22 11:25:46 rhel5n1 qdiskd[20035]: <info> Initialization complete Aug 22 11:25:46 rhel5n1 qdiskd[20035]: <notice> Score sufficient for master operation (10/10; required=10); upgrading Aug 22 11:25:46 rhel5n1 openais[1766]: [CMAN ] quorum device registered Aug 22 11:26:46 rhel5n1 qdiskd[20035]: <info> Assuming master role Aug 22 11:26:56 rhel5n1 gfs_controld[1794]: groupd_dispatch error -1 errno 11 Aug 22 11:26:56 rhel5n1 gfs_controld[1794]: groupd connection died Aug 22 11:26:56 rhel5n1 gfs_controld[1794]: cluster is down, exiting Aug 22 11:26:56 rhel5n1 dlm_controld[1788]: cluster is down, exiting Aug 22 11:26:56 rhel5n1 kernel: dlm: closing connection to node 2 Aug 22 11:26:56 rhel5n1 kernel: dlm: closing connection to node 1 Aug 22 11:26:56 rhel5n1 clurgmgrd[2260]: <crit> Watchdog: Daemon died, rebooting... Aug 22 11:26:57 rhel5n1 kernel: md: stopping all md devices. clustat shows the following: Member Status: Quorate Member Name ID Status ------ ---- ---- ------ rhel5n1.icoserve.test 1 Online, rgmanager rhel5n2.icoserve.test 2 Online, Local, rgmanager /dev/sdd 0 Offline, Quorum Disk Service Name Owner (Last) State ------- ---- ----- ------ ----- service:testscript1 rhel5n1.icoserve.test started service:testscript2 rhel5n2.icoserve.test started My quorum-conf was: <quorumd interval="10" tko="10" votes="1" device="/dev/sdd" status_file="/tmp/qu01.log"> <heuristic program="[ -f /tmp/has_qu ]" score="10" interval="5"/> </quorumd> The votes: <cman expected_votes="2" two_node="0"/> Output of qdisk status-file: On node 1: Time Stamp: Wed Aug 22 11:26:56 2007 Node ID: 1 Score: 10/10 (Minimum required = 10) Current state: Master Initializing Set: { } Visible Set: { 1 2 } Master Node ID: 1 Quorate Set: { 2 3 } On node 2: Time Stamp: Wed Aug 22 11:26:47 2007 Node ID: 2 Score: 10/10 (Minimum required = 10) Current state: Running Initializing Set: { } Visible Set: { 1 2 } Master Node ID: 1 Quorate Set: { 2 3 } I'm using the same vmware-disks for 2 shared gfs-partitions. This works fine, so I dont't think the problem depends on use of vmware-disks. Version-Release number of selected component (if applicable): RHEL5 U0 + Updates of cluster-rpms How reproducible: * 2-Node-cluster + 1 quorum disk * start cman, clvmd, (rgmanager - same behavior) -> everything ok * start qdiskd -> start ok * after some time, qdiskd is "assuming master role" * Reboot (groupd_dispatch error -1 errno 11) Steps to Reproduce: 1. set up cluster with 2 nodes, expected_votes=2, two_node=0 2. set up quorum disk 3. start cman, clvmd, qdiskd Actual results: Reboot after "Assuming master role" Expected results: qdisk online and working Additional info: Tests were done in VMware-GSX with use of shared virtual disks. Bootparams are "clocksource=pit nosmp noapic noalpic" to workaround a rtc-bug. Date is synchron on both nodes.