Bug 471415

Summary:	rgmanager needs to wait for fence domain join to complete on startup
Product:	Red Hat Enterprise Linux 5	Reporter:	Benjamin Kahn <bkahn>
Component:	rgmanager	Assignee:	Chris Feist <cfeist>
Status:	CLOSED ERRATA	QA Contact:	Cluster QE <mspqa-list>
Severity:	high	Docs Contact:
Priority:	urgent
Version:	5.2	CC:	cfeist, cluster-maint, edamato, jakub, kanderso, nstraz, pm-eus
Target Milestone:	rc	Keywords:	ZStream
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2008-11-25 20:52:30 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	459754
Bug Blocks:

Description Benjamin Kahn 2008-11-13 16:13:20 UTC

This bug has been copied from bug #459754 and has been proposed
to be backported to 5.2 z-stream (EUS).

Comment 3 Lon Hohberger 2008-11-13 16:16:46 UTC

Note that this bugzilla has *two* patches, and both must be applied for the fix to be considered complete.

One of the patches is in CMAN.

Comment 4 Lon Hohberger 2008-11-13 16:21:07 UTC

These patches have been pushed to the RHEL52 branch of git:

(cman):

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=24fe905a449a2faedc0ec703b24de87692e516e9

(rgmanager):

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=b3c91c9dd3290c5c571071542c9b539ae4cd9ba0

Comment 7 errata-xmlrpc 2008-11-25 20:52:30 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-1008.html

Comment 8 Jakub Suchy 2008-12-02 10:47:54 UTC

I think this fix has a side effect. Consider the following situation:

Two-node cluster, perform a failover to node2. Node1 is fenced. Now turn off node1 (turn off its power permanently and disconnect a fencing device, to simulate its hardware failure). Reboot node2.

Expected:
Node2 is rebooted; after a few minutes of waiting for fenced, fenced bails out and the cluster starts. Rgmanager is started and services are started (you expect to have at least one node, right?).

Actual results:
Node2 is rebooted and waits for fenced, which bails out after a few minutes, but rgmanager never starts because it stays at "Waiting for fence domain join operation"...

Is there a workaround, or could a timeout be added to rgmanager?
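
For illustration only, the kind of timeout asked about in this comment could look like the Python sketch below. This is not rgmanager code (rgmanager is written in C), and it is not taken from either patch. The names `wait_for_fence_join` and `is_joined`, and the default values, are hypothetical. The caller is assumed to supply its own check for whether the fence domain join has completed.

```python
import time
from typing import Callable


def wait_for_fence_join(
    is_joined: Callable[[], bool],
    timeout: float = 300.0,        # hypothetical default, in seconds
    poll_interval: float = 1.0,    # hypothetical default, in seconds
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_joined() until it returns True or `timeout` seconds elapse.

    Returns True if the join completed, False if the wait timed out.
    `is_joined` is a caller-supplied check (an assumption of this sketch).
    """
    deadline = clock() + timeout
    while True:
        if is_joined():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            # Give up instead of blocking forever.
            return False
        # Never sleep past the deadline.
        sleep(min(poll_interval, remaining))
```

With a bounded wait like this, the caller decides what to do on a `False` return (for example, log the failure and exit) rather than hanging indefinitely.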

Comment 9 Jakub Suchy 2008-12-02 11:00:05 UTC

I have temporarily fixed this using "clean_start=1", but I think a timeout would be better...

Comment 10 Jakub Suchy 2008-12-03 09:15:23 UTC

So clean_start=1 doesn't help, because the nodes are then sometimes killed with "Rejoined the cluster with existing state". The applied fix therefore introduces a deadlock in rgmanager.