Bug 1268313
Summary: clvmd/dlm resource agent monitor action should recognize it is hung
Product: Red Hat Enterprise Linux 7
Component: pacemaker
Version: 7.2
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: michal novacek <mnovacek>
Assignee: Ken Gaillot <kgaillot>
QA Contact: cluster-qe <cluster-qe>
Docs Contact: Steven J. Levine <slevine>
CC: abeekhof, agk, cfeist, cluster-maint, fdinitto, ivlnka, jruemker, mnovacek, oalbrigt, phagara
Target Milestone: rc
Target Release: 7.3
Fixed In Version: pacemaker-1.1.15-1.2c148ac.git.el7
Doc Type: Release Note
Doc Text:
Fencing now occurs when DLM requires it, even when the cluster itself does not
Previously, DLM could require fencing due to quorum issues, even when the cluster itself did not require fencing, but would be unable to initiate it. As a consequence, DLM and DLM-based services could hang waiting for fencing that never happened. With this fix, the `ocf:pacemaker:controld` resource agent now checks whether DLM is in this state, and requests fencing if so. Fencing now occurs in this situation, allowing DLM to recover.
Last Closed: 2016-11-03 18:56:15 UTC
Type: Bug
Bug Blocks: 1287535, 1339661

Description
michal novacek
2015-10-02 13:38:59 UTC
What dlm and/or clvmd command should be executed to determine that they are hung? I'm not sure about the exact command. We should somehow find out that controld is in the non operational state that can be resolved only by fencing.

'dlm_tool ls' at the moment of the problem shows:

    dlm lockspaces
    name          clvmd
    id            0x4104eefa
    flags         0x00000004 kern_stop
    change        member 3 joined 1 remove 0 failed 0 seq 1,1
    members       1 2 3
    new change    member 3 joined 1 remove 0 failed 0 seq 2,4
    new status    wait fencing
    new members   1 2 3

Maybe the 'wait fencing' status gives us this information?

controld is a shell script that calls into the dlm/clvmd, so
> We should somehow find out that controld is in the non operational state that
> can be resolved only by fencing.
has no meaning.
We can have it look for 'wait fencing' though.
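
A minimal sketch of what looking for 'wait fencing' could mean in shell, for illustration only. The function name `dlm_waiting_for_fencing` is hypothetical and this is not the code from the attached patch or from the `ocf:pacemaker:controld` agent; it only assumes that `dlm_tool ls` prints the literal text `wait fencing` in the state shown above.

```shell
#!/bin/sh
# Hypothetical helper, not the actual resource agent code.
# Reads `dlm_tool ls` output on stdin and succeeds (exit status 0)
# if any line reports the "wait fencing" status.
dlm_waiting_for_fencing() {
    grep -q "wait fencing"
}

# Possible use on a cluster node (requires dlm_tool to be installed):
#   if dlm_tool ls | dlm_waiting_for_fencing; then
#       echo "DLM is waiting for fencing"
#   fi
```

A plain substring match flags the node if any lockspace is waiting; how the agent should react to that (for example, by requesting fencing) is a separate decision not covered by this sketch.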
Is there an equivalent command when controld is managing clvmd_controld?
Oyvind: By the way, the controld agent is in the pacemaker tree, in case you need to find it.
Is there an easy way to reproduce this issue? I only got it to go into "wait fencing" once by random timing, and the other times it goes into "wait fencing" when it gets network connectivity back causing fencing.

Scratch build available at: https://brewweb.devel.redhat.com/taskinfo?taskID=10124358

Created attachment 1097646
Working patch

The resource agent is in the pacemaker package.

This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions

RecoverySwitchFailure test scenario passed, marking as verified in pacemaker-1.1.15-1.2c148ac.git.el7

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2578.html