Description of problem:

Hello! I'm not sure whether I'm doing something wrong or whether this is really a bug.

I'm testing the new RHEL5 Cluster Suite in VMware and have installed the newest cluster RPMs (cman-2.0.70-1.el5, rgmanager-2.0.28-1.el5, lvm2-2.02.26-2.el5, luci-0.10.0-2.el5, ricci-0.10.0-2.el5).

My setup:
* 2-node cluster
* quorum disk
* heuristic just for testing: [ -f /tmp/has_qu ] -> this file exists on both nodes

I created a quorum disk on a shared (VMware) disk using mkqdisk, and mkqdisk -L shows this disk on both nodes. But every time I start qdiskd, the node reboots. The last messages are:

Aug 22 11:24:06 rhel5n1 qdiskd[20035]: <info> Quorum Daemon Initializing
Aug 22 11:24:07 rhel5n1 qdiskd[20035]: <info> Heuristic: '[ -f /tmp/has_qu ]' UP
Aug 22 11:25:46 rhel5n1 qdiskd[20035]: <info> Initial score 10/10
Aug 22 11:25:46 rhel5n1 qdiskd[20035]: <info> Initialization complete
Aug 22 11:25:46 rhel5n1 qdiskd[20035]: <notice> Score sufficient for master operation (10/10; required=10); upgrading
Aug 22 11:25:46 rhel5n1 openais[1766]: [CMAN ] quorum device registered
Aug 22 11:26:46 rhel5n1 qdiskd[20035]: <info> Assuming master role
Aug 22 11:26:56 rhel5n1 gfs_controld[1794]: groupd_dispatch error -1 errno 11
Aug 22 11:26:56 rhel5n1 gfs_controld[1794]: groupd connection died
Aug 22 11:26:56 rhel5n1 gfs_controld[1794]: cluster is down, exiting
Aug 22 11:26:56 rhel5n1 dlm_controld[1788]: cluster is down, exiting
Aug 22 11:26:56 rhel5n1 kernel: dlm: closing connection to node 2
Aug 22 11:26:56 rhel5n1 kernel: dlm: closing connection to node 1
Aug 22 11:26:56 rhel5n1 clurgmgrd[2260]: <crit> Watchdog: Daemon died, rebooting...
Aug 22 11:26:57 rhel5n1 kernel: md: stopping all md devices.

clustat shows the following:

Member Status: Quorate

 Member Name                        ID   Status
 ------ ----                        ---- ------
 rhel5n1.icoserve.test                 1 Online, rgmanager
 rhel5n2.icoserve.test                 2 Online, Local, rgmanager
 /dev/sdd                              0 Offline, Quorum Disk

 Service Name              Owner (Last)                    State
 ------- ----              ----- ------                    -----
 service:testscript1       rhel5n1.icoserve.test           started
 service:testscript2       rhel5n2.icoserve.test           started

My quorumd configuration was:

<quorumd interval="10" tko="10" votes="1" device="/dev/sdd" status_file="/tmp/qu01.log">
  <heuristic program="[ -f /tmp/has_qu ]" score="10" interval="5"/>
</quorumd>

The votes:

<cman expected_votes="2" two_node="0"/>

Output of the qdisk status file:

On node 1:
Time Stamp: Wed Aug 22 11:26:56 2007
Node ID: 1
Score: 10/10 (Minimum required = 10)
Current state: Master
Initializing Set: { }
Visible Set: { 1 2 }
Master Node ID: 1
Quorate Set: { 2 3 }

On node 2:
Time Stamp: Wed Aug 22 11:26:47 2007
Node ID: 2
Score: 10/10 (Minimum required = 10)
Current state: Running
Initializing Set: { }
Visible Set: { 1 2 }
Master Node ID: 1
Quorate Set: { 2 3 }

I'm using the same VMware disks for 2 shared GFS partitions. This works fine, so I don't think the problem is caused by the use of VMware disks.

Version-Release number of selected component (if applicable):
RHEL5 U0 + updates of the cluster RPMs

How reproducible:
* 2-node cluster + 1 quorum disk
* start cman, clvmd, (rgmanager - same behavior) -> everything OK
* start qdiskd -> start OK
* after some time, qdiskd is "Assuming master role"
* reboot (groupd_dispatch error -1 errno 11)

Steps to Reproduce:
1. Set up a cluster with 2 nodes, expected_votes=2, two_node=0
2. Set up the quorum disk
3. Start cman, clvmd, qdiskd

Actual results:
Reboot after "Assuming master role"

Expected results:
qdisk online and working

Additional info:
Tests were done in VMware GSX with shared virtual disks. Boot parameters are "clocksource=pit nosmp noapic nolapic" to work around an rtc bug. The date is synchronized on both nodes.
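
For completeness, here is roughly how my cluster.conf is laid out. This is only a sketch: the cluster name is made up here, and the fencing and rm sections are trimmed; the quorumd and cman elements are the ones quoted above.

<cluster name="testcluster" config_version="1">   <!-- cluster name is illustrative -->
  <cman expected_votes="2" two_node="0"/>
  <clusternodes>
    <clusternode name="rhel5n1.icoserve.test" nodeid="1" votes="1">
      <fence/>   <!-- fencing trimmed for brevity -->
    </clusternode>
    <clusternode name="rhel5n2.icoserve.test" nodeid="2" votes="1">
      <fence/>
    </clusternode>
  </clusternodes>
  <quorumd interval="10" tko="10" votes="1" device="/dev/sdd" status_file="/tmp/qu01.log">
    <heuristic program="[ -f /tmp/has_qu ]" score="10" interval="5"/>
  </quorumd>
  <rm/>   <!-- services testscript1/testscript2 trimmed -->
</cluster>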
Visible Set: { 1 2 }
Master Node ID: 1
Quorate Set: { 2 3 }
            ^^^^

This is strange; are your node IDs 1 and 2 in cluster.conf? (It shouldn't matter that they are virtual machines or that you are using virtual disks.)
Yes, they are. I set up the cluster using system-config-cluster, which, by the way, has a bug too: in the Management tab it does not show the cluster members, but it shows the services correctly. Maybe this problem is related to the node IDs. What's strange about them? I guess they should be "0" and "1"? When the quorum disk joins the cluster (it is currently offline), clustat -x says the node ID of the quorum disk is "0". I'll try it with node IDs "0" and "1" and will report the results here.
I stopped the cluster, edited the config with vim, and changed the node ID of node 1 to "0" and node 2 to "1". Then I scp-ed the cluster.conf to the other node and tried to start cman. This is the output:

[root@rhel5n2 ~]# /etc/init.d/cman start
Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... failed
cman not started: No node ID for rhel5n1.icoserve.test, run 'ccs_tool addnodeids' to fix
/usr/sbin/cman_tool: aisexec daemon didn't start
                                                           [FAILED]

The cluster.conf was changed after this failed start: the version is <version + 1> and the node IDs are now:
* Node 1 has "2"
* Node 2 has "1"

I think ccsd made these changes - but cman still does not start. I have to set node 1's ID to "1" and node 2's ID to "2" to get cman running.
For what it's worth, you can't use node ID = 0 for nodes. The visible set should be nodes {1 2} after they get up and running; qdiskd shouldn't display node 0 in the quorate set. It could be that the quorate set output is backwards, but I doubt that. I'll look more into it. Are you using multipath / LVM / etc for qdisk?
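
In cluster.conf terms, the node entries should keep node IDs starting at 1, something like this (a sketch using your node names; the votes values are just an example):

<clusternodes>
  <clusternode name="rhel5n1.icoserve.test" nodeid="1" votes="1"/>
  <clusternode name="rhel5n2.icoserve.test" nodeid="2" votes="1"/>
</clusternodes>
<!-- node ID 0 is not valid for cluster nodes; clustat reports the quorum device itself as node 0 -->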
I'm using VMware Workstation, so the disks are virtual disks connected to both nodes. They are raw - no LVM, no multipath Fibre Channel. GFS works fine in this setup, and Red Hat Cluster Suite 3 with a quorum disk works too, so I don't think the problem is related to the fact that they are virtual disks. Here is the disk configuration for both virtual machines:

disk.locking = "FALSE"
scsi1.sharedBus = "virtual"
scsi1.virtualDev = "lsilogic"
scsi1.present = "TRUE"
scsi1:0.deviceType = "disk"
scsi1:0.writeThrough = "TRUE"
scsi1:0.present = "TRUE"
scsi1:0.fileName = "/opt/vmware/machines/vmrhel3_san/qu01.vmdk"
scsi1:0.mode = "independent-persistent"
scsi1:1.deviceType = "disk"
scsi1:1.writeThrough = "TRUE"
scsi1:1.present = "TRUE"
scsi1:1.fileName = "/opt/vmware/machines/vmrhel3_san/qu02.vmdk"
scsi1:1.mode = "independent-persistent"
Ah, so cman / aisexec probably crashed on activation?
yes, it seems so
Herbert, I'm pretty sure this is fixed in 5.1 - there are two bugs related to the cman/qdisk interaction which were fixed:

(1) device names could only be 15 characters (you were using /dev/sdd, so this probably isn't the problem)
(2) a timer logic bug in openais caused cman/openais to die when qdiskd advertised fitness

Could you retest with the 5.1 cman + ccs packages and let me know if it's not fixed? It should be. If it is fixed, could you close this bug?
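
To see what you currently have installed, a plain rpm query is enough (the fixed builds are the ones shipped with 5.1):

# list the cluster-related package versions currently installed
rpm -q cman openais rgmanager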
Sounds good - where can I find these packages? In RHN the latest versions are:

cman-2.0.73-1.el5_1.1.i386       Red Hat Enterprise Linux (v. 5 for 32-bit x86)
rgmanager-2.0.31-1.el5.i386      RHEL Clustering (v. 5 for 32-bit x86)

With these versions the problem is not solved - cman died a few minutes after starting qdiskd.
You also need openais-0.80.3-7.el5 if you don't already have it. If you do, then this is a new bug - and it seems specific to your configuration (VMware).
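
Assuming the box is subscribed to the RHEL 5.1 channels, something like this should pull in all three packages at once:

# update the cluster stack to the 5.1 builds
yum update cman openais rgmanager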
Yum-ed up to RHEL 5.1 and tested again -> now it's working; the quorum disk is online and appears to be working. Thank you! (I cannot close the bug - Bugzilla says only the owner can do that.)
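
For anyone hitting the same thing, this is how I checked the result (output omitted here):

# the quorum disk should now show up as Online instead of Offline
clustat
# the status file configured via status_file= in the quorumd element
cat /tmp/qu01.log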
Sweet.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0599.html
*** This bug has been marked as a duplicate of 314641 ***