
Bug 471415

Summary: rgmanager needs to wait for fence domain join to complete on startup
Product: Red Hat Enterprise Linux 5
Reporter: Benjamin Kahn <bkahn>
Component: rgmanager
Assignee: Chris Feist <cfeist>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: high
Priority: urgent
Version: 5.2
CC: cfeist, cluster-maint, edamato, jakub, kanderso, nstraz, pm-eus
Target Milestone: rc
Keywords: ZStream
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2008-11-25 20:52:30 UTC
Bug Depends On: 459754

Description Benjamin Kahn 2008-11-13 16:13:20 UTC
This bug has been copied from bug #459754 and has been proposed
to be backported to 5.2 z-stream (EUS).

Comment 3 Lon Hohberger 2008-11-13 16:16:46 UTC
Note that this bugzilla has *two* patches, and both must be applied for the fix to be considered complete.

One of the patches is in CMAN.

Comment 7 errata-xmlrpc 2008-11-25 20:52:30 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-1008.html

Comment 8 Jakub Suchy 2008-12-02 10:47:54 UTC
I think this fix has a side effect. Consider the following scenario:

Two-node cluster: perform a failover to node2, so node1 is fenced. Now turn off node1 permanently (cut its power and disconnect its fencing device to simulate a hardware failure). Reboot node2.

Expected:
Node2 is rebooted; after a few minutes of waiting for fenced, fenced bails out and the cluster starts. Rgmanager is started and services are started (you expect to have at least one node running, right?).

Actual results:
Node2 is rebooted and waits for fenced, which bails out after a few minutes, but rgmanager never starts because it stays stuck at "Waiting for fence domain join operation"...

Is there a workaround, or could a timeout be added to rgmanager?

Comment 9 Jakub Suchy 2008-12-02 11:00:05 UTC
I have temporarily fixed this using "clean_start=1", but I think a timeout would be better...

Comment 10 Jakub Suchy 2008-12-03 09:15:23 UTC
So clean_start=1 doesn't help after all, because the nodes are then sometimes killed with "Rejoined the cluster with existing state". Therefore the applied fix introduces a deadlock in rgmanager.
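Comment 8 proposes a timeout instead of an unbounded wait for the fence domain join. A minimal sketch of such a bounded wait follows, in Python for illustration only; it is not rgmanager's actual code. The function wait_for_fence_join and the is_joined probe are hypothetical: a real implementation would need some way to ask whether the local node has completed its fence domain join. The clock and sleep functions are injectable only so the logic can be exercised without real delays.

```python
import time
from typing import Callable


def wait_for_fence_join(
    is_joined: Callable[[], bool],
    timeout: float,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll a (hypothetical) fence-join probe until it succeeds or time runs out.

    Returns True if is_joined() reported True before the deadline,
    False if `timeout` seconds elapsed first.
    """
    deadline = clock() + timeout
    while True:
        if is_joined():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            # Give up instead of waiting forever (the deadlock in comment 8).
            return False
        # Never sleep past the deadline.
        sleep(min(interval, remaining))
```

What the caller does when this returns False is a policy decision, not something the sketch settles: the wait exists so that services are not started before the fence domain join completes, so starting services anyway after a timeout trades that protection for availability.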