Bug 471415
Summary: | rgmanager needs to wait for fence domain join to complete on startup | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Benjamin Kahn <bkahn> |
Component: | rgmanager | Assignee: | Chris Feist <cfeist> |
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
Severity: | high | Docs Contact: | |
Priority: | urgent | ||
Version: | 5.2 | CC: | cfeist, cluster-maint, edamato, jakub, kanderso, nstraz, pm-eus |
Target Milestone: | rc | Keywords: | ZStream |
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2008-11-25 20:52:30 UTC | Type: | --- |
Bug Depends On: | 459754 | ||
Bug Blocks: |
Description
Benjamin Kahn
2008-11-13 16:13:20 UTC
Note that this bugzilla has *two* patches, and both must be applied for the fix to be considered complete. One of the patches is in CMAN.

These patches have been pushed to the RHEL52 branch of git:
(cman): http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=24fe905a449a2faedc0ec703b24de87692e516e9
(rgmanager): http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=b3c91c9dd3290c5c571071542c9b539ae4cd9ba0

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2008-1008.html

I think there is a consequence to this. Consider this situation: in a two-node cluster, perform a failover to node2. Node1 is fenced. Now turn off node1 (turn off its power permanently and disconnect its fencing device, to simulate a hardware failure). Reboot node2.

Expected results: Node2 is rebooted. After a few minutes of waiting for fenced, fenced bails out and the cluster starts. Rgmanager is started and services are started (you expect to have at least one node, right?).

Actual results: Node2 is rebooted and waits for fenced, which bails out after a few minutes, but rgmanager never starts because it stays at "Waiting for fence domain join operation"...

Is there any workaround, or could a timeout be added to rgmanager? I have temporarily fixed this using "clean_start=1", but I think a timeout would be better...

So clean_start=1 doesn't help, because then the nodes are sometimes killed with "Rejoined the cluster with existing state". Therefore the applied fix introduces a deadlock in rgmanager.
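To make the requested timeout concrete, here is a minimal sketch of a bounded wait. It is not rgmanager's code and not the patch referenced in this report: the function name `wait_for_fence_domain`, the `is_joined` callback, and the default timeout and polling interval are all hypothetical, chosen only for illustration. The idea is that startup polls for the fence domain join but gives up after a deadline instead of blocking forever.

```python
import time
from typing import Callable


def wait_for_fence_domain(
    is_joined: Callable[[], bool],
    timeout: float = 300.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_joined() until it returns True or `timeout` seconds elapse.

    Hypothetical helper: `is_joined` stands in for whatever check the
    caller uses to decide that the fence domain join has completed.
    Returns True if the join was observed, False if the deadline passed.
    """
    deadline = clock() + timeout
    while True:
        if is_joined():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            # Deadline reached: report failure instead of waiting forever.
            return False
        # Never sleep past the deadline.
        sleep(min(interval, remaining))
```

With a check like this, the caller decides what to do when `False` comes back (for example, log the failure and either abort or continue startup), rather than hanging at the join wait indefinitely.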