Bug 471415 - rgmanager needs to wait for fence domain join to complete on startup
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: rgmanager
Version: 5.2
Hardware: All
OS: Linux
Priority: urgent
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Chris Feist
QA Contact: Cluster QE
Keywords: ZStream
Depends On: 459754
Blocks:
Reported: 2008-11-13 11:13 EST by Benjamin Kahn
Modified: 2009-04-16 18:17 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-11-25 15:52:30 EST


Attachments: None
Description Benjamin Kahn 2008-11-13 11:13:20 EST
This bug has been copied from bug #459754 and has been proposed
to be backported to 5.2 z-stream (EUS).
Comment 3 Lon Hohberger 2008-11-13 11:16:46 EST
Note that this bugzilla has *two* patches, and both must be applied for the fix to be considered complete.

One of the patches is in CMAN.
Comment 7 errata-xmlrpc 2008-11-25 15:52:30 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-1008.html
Comment 8 Jakub Suchy 2008-12-02 05:47:54 EST
I think there is a consequence to this. Consider a situation:

Two-node cluster; perform a failover to node2. Node1 is fenced. Now turn off node1 (cut its power permanently and disconnect its fencing device to simulate a hardware failure). Reboot node2.

Expected:
Node2 is rebooted; after a few minutes of waiting for fenced, fenced bails out and the cluster starts. Rgmanager is started and services are started (you expect to have at least one node, right?).

Actual results:
Node2 is rebooted and waits for fenced, which bails out after a few minutes, but rgmanager never starts because it stays stuck at "Waiting for fence domain join operation"...

Is there any workaround, or could a timeout be added to rgmanager?
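
A minimal sketch of the kind of timeout meant here, written in Python for brevity. This is not rgmanager's code (rgmanager is written in C); the function name `wait_for_fence_join`, the `is_joined` callback, and the 300-second default are all hypothetical and only illustrate a bounded wait in place of an unbounded one.

```python
import time


def wait_for_fence_join(is_joined, timeout_s=300.0, poll_s=1.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll is_joined() until it returns True or timeout_s elapses.

    Hypothetical helper: is_joined stands in for whatever check reports
    that the fence domain join has completed. Returns True if the join
    was observed, False if the wait timed out.
    """
    deadline = clock() + timeout_s
    while True:
        if is_joined():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            # Give up instead of blocking startup forever.
            return False
        # Never sleep past the deadline.
        sleep(min(poll_s, remaining))
```

With something like this, the caller could decide what to do on `False` (for example, log a warning and continue starting services) rather than hanging indefinitely.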
Comment 9 Jakub Suchy 2008-12-02 06:00:05 EST
I have temporarily fixed this using "clean_start=1", but I think a timeout would be better...
Comment 10 Jakub Suchy 2008-12-03 04:15:23 EST
So clean_start=1 doesn't help, because then the nodes are sometimes killed with "Rejoined the cluster with existing state". The applied fix therefore introduces a deadlock in rgmanager.
