Red Hat Bugzilla – Bug 471415
rgmanager needs to wait for fence domain join to complete on startup
Last modified: 2009-04-16 18:17:38 EDT
This bug has been copied from bug #459754 and has been proposed
to be backported to 5.2 z-stream (EUS).
Note that this bugzilla has *two* patches, and both must be applied for the fix to be considered complete.
One of the patches is in CMAN.
These patches have been pushed to the RHEL52 branch of git:
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
I think there is a consequence to this. Consider a situation:
Two-node cluster: perform a failover to node2. Node1 is fenced. Now turn off node1 (cut its power permanently and disconnect its fencing device, to simulate a hardware failure). Reboot node2.
Before the fix: node2 is rebooted, and after a few minutes of waiting for fenced, fenced bails out and the cluster starts. Rgmanager is started and the services are started (you expect to have at least one node, right?).
With the fix: node2 is rebooted and waits for fenced, which bails out after a few minutes, but rgmanager never starts because it is stuck at "Waiting for fence domain join operation"...
Is there any workaround, or could a timeout be added to rgmanager?
I have temporarily worked around this using "clean_start=1", but I think a timeout would be better...
So clean_start=1 doesn't help, because then the nodes are sometimes killed with "Rejoined the cluster with existing state". Therefore the applied fix introduces a deadlock in rgmanager.
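
To make the timeout request above concrete, here is a minimal sketch of a bounded wait in Python. It is not rgmanager code (rgmanager is written in C) and it does not use any real cluster API. The function name `wait_for_fence_domain` and the `is_joined` callback are hypothetical stand-ins for whatever check reports that the fence domain join has completed. The sketch only shows the shape of the idea: poll until a deadline, then hand control back to the caller instead of blocking forever.

```python
import time
from typing import Callable


def wait_for_fence_domain(
    is_joined: Callable[[], bool],  # hypothetical check: "has the fence domain join completed?"
    timeout_s: float,
    poll_s: float = 1.0,
) -> bool:
    """Poll is_joined() until it returns True or timeout_s seconds elapse.

    Returns True if the join was observed, False if the wait timed out.
    """
    # Use a monotonic clock so wall-clock adjustments cannot stretch the wait.
    deadline = time.monotonic() + timeout_s
    while True:
        if is_joined():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_s, remaining))
```

What the caller does on a `False` result (start services anyway, or give up with an error) is a policy decision that this sketch deliberately leaves open.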