431382 – qdiskd kills cman and self

Summary: qdiskd kills cman and self

Keywords:
Status:	CLOSED DUPLICATE of bug 314641
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	cman
Sub Component:
Version:	5.0
Hardware:	i386
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Christine Caulfield
QA Contact:	GFS Bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-02-03 23:14 UTC by Gustavo Prada
Modified:	2009-04-16 22:51 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-02-07 14:04:55 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
messages and cluster.conf (119.56 KB, application/octet-stream) 2008-02-06 03:20 UTC, Gustavo Prada

Description Gustavo Prada 2008-02-03 23:14:42 UTC

Description of problem:

I have a two node cluster, and when qdiskd starts the following error appears
and the cluster goes down.

Feb  3 00:20:19 nodo2 qdiskd[2649]: <info> Assuming master role
Feb  3 00:20:20 nodo2 qdiskd[2649]: <err> cman_dispatch: Host is down
Feb  3 00:20:20 nodo2 qdiskd[2649]: <err> Halting qdisk operations


Version-Release number of selected component (if applicable):
cman-2.0.73-1.el5_1.1

How reproducible:

I have one heuristic for qdiskd in cluster.conf:
         
<quorumd device="/dev/gnbd2" votes="1" min_score="5" label="quorum" tko="20" interval="1">
        <heuristic interval="2" program="ping -c1 -t1 192.168.1.254" score="2"/>
</quorumd>
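
As a side check on the configuration above (not the diagnosis reached in this report): as I read qdisk(5), min_score should not exceed the sum of the heuristic scores, a heuristic without an explicit score counts as 1, and a min_score that is omitted or 0 falls back to floor((n+1)/2). Those three rules are assumptions taken from the man page, and the helper names below are hypothetical. A minimal sketch that compares the two values for this quorumd block:

```python
import xml.etree.ElementTree as ET

# The quorumd block from the report, verbatim apart from whitespace.
QUORUMD_XML = """
<quorumd device="/dev/gnbd2" votes="1" min_score="5" label="quorum" tko="20" interval="1">
    <heuristic interval="2" program="ping -c1 -t1 192.168.1.254" score="2"/>
</quorumd>
"""


def score_summary(xml_text):
    """Return (max_score, min_score) for a <quorumd> element.

    Assumptions (from my reading of qdisk(5)): a heuristic with no
    explicit score counts as 1, and min_score omitted or 0 means
    floor((n + 1) / 2), where n is the sum of the heuristic scores.
    """
    quorumd = ET.fromstring(xml_text)
    max_score = sum(int(h.get("score", "1")) for h in quorumd.findall("heuristic"))
    min_score = int(quorumd.get("min_score", "0")) or (max_score + 1) // 2
    return max_score, min_score


def is_reachable(xml_text):
    """True if the heuristics can add up to at least min_score."""
    max_score, min_score = score_summary(xml_text)
    return max_score >= min_score


if __name__ == "__main__":
    max_score, min_score = score_summary(QUORUMD_XML)
    print(f"max attainable score: {max_score}, min_score: {min_score}")
    print("min_score reachable:", is_reachable(QUORUMD_XML))
```

With a single heuristic of score 2 and min_score="5", the check reports that the minimum cannot be reached. That may be worth ruling out separately from the cman failure.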

I started cman and ran qdiskd in the foreground ("qdiskd -f -d"). It shows errors
when it detects the active node, and then the cman daemon goes down.
  
Actual results:


Expected results:


Additional info:

If I remove the heuristics, the qdiskd daemon starts, but clustat shows it as
offline.

Comment 1 Christine Caulfield 2008-02-04 08:36:51 UTC

Are you sure you have a running two node cluster? That second line says that
cman was down .. ie this node is not in the cluster, or has just left it. Are
there any messages in syslog to indicate why cman has shut down? 

If the problem is reproducible then try starting cman with 'cman_tool join -d'
to get more debugging information.

Also ... if this really is RHEL5.0, try upgrading. I think a lot of qdisk
problems were fixed in 5.1

Comment 2 Lon Hohberger 2008-02-04 16:03:00 UTC

cman-2.0.73-1.el5_1.1 I think is 5.1+errata.  I've never seen this before,
however.  Could you attach your cluster.conf as well?

Comment 3 Gustavo Prada 2008-02-06 03:14:31 UTC

I'm sure that there are two nodes. Once qdisk is running, it shuts down the cman
daemon. This only happens with one heuristic. I added two more heuristics and
it was fine, but the configured services do not fail over. I also tried
removing the heartbeat link: the quorum is dissolved immediately and the services
do not fail over either. In addition, the rgmanager daemon could not be stopped,
not even by rebooting the system; the only way out was to reset the server. I
thought that the quorum disk would resolve the split-brain problem, but it did not.
Maybe I am wrong or the cluster is badly configured; I only want to report the
results of my lab test. I attach the cluster.conf and messages. Thanks

Comment 4 Gustavo Prada 2008-02-06 03:20:19 UTC

Created attachment 294072
messages and cluster.conf

Comment 5 Christine Caulfield 2008-02-06 08:51:59 UTC

It looks like cman is being restarted without restarting the other services. If
you shut cman down, it's important to make sure that everything else is also
shut down. Normally it will check that for you if you use cman_tool or the init
scripts - I'm not entirely sure what's happened in this case.

The best way to be sure of this is to always use the init scripts (at least) or
even to reboot the entire node to make sure that there is no state left lying
around.

If you reboot the node, does it join the cluster properly?

Comment 6 Lon Hohberger 2008-02-07 13:55:30 UTC

It's probably the openais IPC bug that causes openais to 'splode when qdiskd
advertises master-status.

Feb  3 02:15:44 apache2 openais[2863]: [MAIN ] AIS Executive Service RELEASE
'subrev 1324 version 0.80.2' 

It wasn't fixed until 0.80.3-some_rev.
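
To check other nodes for the same exposure, the version can be read from the openais startup banner in syslog and compared with 0.80.3. A minimal sketch, assuming the banner always has the form quoted above; the function names are hypothetical, and the comparison covers only the upstream version, not the package revision (the "some_rev" part), which this report does not pin down:

```python
import re

# Banner line quoted above, joined onto one line.
BANNER = ("Feb  3 02:15:44 apache2 openais[2863]: [MAIN ] AIS Executive "
          "Service RELEASE 'subrev 1324 version 0.80.2'")

# First upstream version said to carry the fix; the exact package
# revision is not stated in this report.
FIXED_IN = (0, 80, 3)


def openais_version(line):
    """Extract the dotted version from a banner line as a tuple of ints."""
    match = re.search(r"version (\d+(?:\.\d+)*)", line)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def predates_fix(line, fixed_in=FIXED_IN):
    """True if the banner reports a version older than fixed_in."""
    version = openais_version(line)
    if version is None:
        raise ValueError("no openais version found in line")
    return version < fixed_in


if __name__ == "__main__":
    print(openais_version(BANNER), "predates fix:", predates_fix(BANNER))
```

A node reporting 0.80.3 still needs its package revision checked by hand, since the tuple comparison cannot tell an early 0.80.3 build from a fixed one.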

Comment 7 Lon Hohberger 2008-02-07 13:57:25 UTC

http://rpm.pbone.net/index.php3/stat/4/idpl/5531749/com/openais-0.80.3-7.el5.i386.rpm.html
http://rpm.pbone.net/index.php3/stat/4/idpl/5534741/com/openais-0.80.3-7.el5.x86_64.rpm.html

Try one of those.

Comment 8 Lon Hohberger 2008-02-07 14:04:55 UTC


*** This bug has been marked as a duplicate of 314641 ***

Comment 9 Gustavo Prada 2008-02-08 16:35:56 UTC

(In reply to comment #6)
> It's probably the openais IPC bug that causes openais to 'splode when qdiskd
> advertises master-status.
> 
> Feb  3 02:15:44 apache2 openais[2863]: [MAIN ] AIS Executive Service RELEASE
> 'subrev 1324 version 0.80.2' 
> 
> It wasn't fixed until 0.80.3-some_rev.

Yes, the member node joins the cluster successfully.

Comment 10 Lon Hohberger 2008-02-08 16:42:26 UTC

The openais-0.80.3-7 packages fix at least one problem specific to
qdiskd/cman/openais interaction - let us know if it fixes your qdiskd problem.
