Bug 843160
Summary: | dlm_controld recovery stuck in check_fencing | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | David Teigland <teigland> |
Component: | cluster | Assignee: | David Teigland <teigland> |
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
Severity: | high | Docs Contact: | |
Priority: | low | ||
Version: | 6.3 | CC: | ccaulfie, cluster-maint, djansa, fdinitto, jruemker, lhh, mjuricek, rpeterso, teigland |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | cluster-3.0.12.1-67.el6 | Doc Type: | Bug Fix |
Doc Text: |
Cause: Fencing time comparison does not work as expected when a fence agent completes quickly or the corosync callback is delayed.
Consequence: dlm recovery will be stuck waiting for fencing to complete.
Fix: Save and compare different time stamps that are not affected by the sequence of fencing and corosync callbacks.
Result: dlm recovery will not be stuck waiting for fencing to complete in this case.
|
Last Closed: | 2014-10-14 04:44:28 UTC | Type: | Bug |
Description
David Teigland
2012-07-25 18:24:50 UTC
The symptom of this problem is fencing being done but dlm still waiting for fencing:

# group_tool
fence domain
member count  1
victim count  0
victim now    0
master nodeid 1
wait state    none
members       1

dlm lockspaces
name          dlm_master
id            0x8321e41a
flags         0x00000004 kern_stop
change        member 2 joined 1 remove 0 failed 0 seq 2,2
members       1 2
new change    member 1 joined 0 remove 1 failed 1 seq 3,3
new status    wait_messages 0 wait_condition 1 fencing
new members   1

A possible solution might be to save the add_time that preceded the last failure as need_fenced_time and compare last_fenced_time to that.

No one has reported seeing this except me.

Pushing this out since I don't have time to work on it for 6.4

pushed patch to RHEL6 branch
https://git.fedorahosted.org/cgit/cluster.git/commit/?h=RHEL6&id=2d06dd478c27bf864ba1a5ac0cbb1ba6c3ed947f

I tested and verified this in two different ways:

1. added usleep(1000000) in del_configfs_node() before rmdir, recompiled. killed nodeB, nodeA recovered properly based on the adjusted time comparison.

2. (no code change necessary) suspended the dlm_controld process on nodeA, killed nodeB, waited for fenced to complete fencing on nodeA, resumed dlm_controld on nodeA, and nodeA recovered properly.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHBA-2014-1420.html
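
The time-stamp comparison proposed in the description (save the add_time that preceded the last failure as need_fenced_time, then compare last_fenced_time to it) can be modelled roughly as follows. This is only an illustrative Python sketch, not the dlm_controld C code from the pushed patch: the NodeHistory class, the helper functions, and the integer time values are hypothetical, and only the names add_time, need_fenced_time and last_fenced_time come from the report.

```python
from dataclasses import dataclass


@dataclass
class NodeHistory:
    """Hypothetical per-node record; not the real dlm_controld struct."""
    add_time: int = 0            # when the node last joined the lockspace
    need_fenced_time: int = 0    # add_time saved when the node fails
    check_fencing: bool = False  # True while recovery waits for fencing


def node_joined(node, now):
    """Record the time at which the node joined."""
    node.add_time = now


def node_failed(node):
    """Mark the node as needing fencing.

    The saved value is the join time that preceded the failure, so it does
    not depend on when the failure callback happens to be processed.
    """
    node.need_fenced_time = node.add_time
    node.check_fencing = True


def fencing_done(node, last_fenced_time):
    """Return True once the node has been fenced after it last joined."""
    if not node.check_fencing:
        return True
    if last_fenced_time > node.need_fenced_time:
        node.check_fencing = False
        return True
    return False


# Made-up timeline resembling verification method 2: the node joins at
# t=100, is killed, fencing completes at t=205, and the delayed failure
# callback is only processed at t=210.
node = NodeHistory()
node_joined(node, now=100)
node_failed(node)  # processed at t=210, but saves add_time=100
stale = fencing_done(node, last_fenced_time=50)   # fencing from before the join
fresh = fencing_done(node, last_fenced_time=205)  # fencing for this failure
```

In this model a fencing time older than the join is still rejected, while fencing that finished before the delayed callback ran is accepted. A check keyed to the moment the callback was processed (t=210 here) would, for example, reject the fencing at t=205, which is the kind of ordering problem the Doc Text describes.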