Bug 1268313 - clvmd/dlm resource agent monitor action should recognize it is hung
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 7.3
Assigned To: Ken Gaillot
QA Contact: cluster-qe@redhat.com
Docs Contact: Steven J. Levine
Depends On:
Blocks: 1287535 1339661
Reported: 2015-10-02 09:38 EDT by michal novacek
Modified: 2016-11-03 14:56 EDT

See Also:
Fixed In Version: pacemaker-1.1.15-1.2c148ac.git.el7
Doc Type: Release Note
Doc Text:
Fencing now occurs when DLM requires it, even when the cluster itself does not. Previously, DLM could require fencing due to quorum issues, even when the cluster itself did not require fencing, but would be unable to initiate it. As a consequence, DLM and DLM-based services could hang waiting for fencing that never happened. With this fix, the `ocf:pacemaker:controld` resource agent now checks whether DLM is in this state, and requests fencing if so. Fencing now occurs in this situation, allowing DLM to recover.
Story Points: ---
Clone Of:
: 1287535 1339661 (view as bug list)
Environment:
Last Closed: 2016-11-03 14:56:15 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Working patch (641 bytes, patch)
2015-11-23 08:48 EST, Oyvind Albrigtsen

Description michal novacek 2015-10-02 09:38:59 EDT
Description of problem:

This is a bug for the same problem as bz1241511, but filed against the resource-agent component.

Summary of bz1241511: all nodes lose and then resume quorum without fencing (caused by network connection loss). This leaves dlm waiting for fencing, which makes clvmd appear hung, and all access to LVs waits forever.

The resulting situation is a quorate cluster with clvmd-clone and dlm-clone reported as running correctly, while both are in fact hung. As concluded in bz1241511, the problem is neither pacemaker's nor dlm's.

I believe the correct resolution of this situation is for clvmd to fail its monitor action, leading to fencing.

Version-Release number of selected component (if applicable):
dlm-4.0.2-5.el7.x86_64
lvm2-cluster-2.02.115-3.el7.x86_64
pacemaker-1.1.12-22.el7.x86_64
corosync-2.3.4-4.el7.x86_64

How reproducible: very frequent

Steps to Reproduce:
1. Have a quorate pacemaker cluster.
2. Check the nodes' uptime.
3. Disable network communication between all nodes with iptables and wait for
   all nodes to become inquorate.
4. Re-enable network communication between the nodes at the same time.
5. Check whether fencing occurred; if it has not, check dlm status and logs.
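The partition step above can be sketched as a pair of small helpers. This is a minimal sketch, assuming iptables-based filtering; the firewall command is passed as an argument purely so the helpers can be dry-run, and the function names are hypothetical:

```shell
# block_cluster_traffic: drop all traffic to/from the given peer nodes.
# $1 is the firewall command ("iptables", or e.g. "echo iptables" for a
# dry run); the remaining arguments are the other nodes' addresses.
block_cluster_traffic() {
    ipt=$1; shift
    for peer in "$@"; do
        $ipt -A INPUT  -s "$peer" -j DROP
        $ipt -A OUTPUT -d "$peer" -j DROP
    done
}

# unblock_cluster_traffic: delete the rules added above, restoring
# communication between the nodes.
unblock_cluster_traffic() {
    ipt=$1; shift
    for peer in "$@"; do
        $ipt -D INPUT  -s "$peer" -j DROP
        $ipt -D OUTPUT -d "$peer" -j DROP
    done
}
```

Running block_cluster_traffic on every node at roughly the same time, waiting for all nodes to become inquorate, and then running unblock_cluster_traffic on all nodes simultaneously reproduces the window described in steps 3 and 4.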

Actual results: dlm hanging

Expected results: dlm happily working

Additional info:
# tail /var/log/messages
...
Jul  9 12:18:33 virt-020 pengine[2287]: warning: custom_action: Action dlm:2_stop_0 on virt-019 is unrunnable (offline)
Jul  9 12:18:33 virt-020 pengine[2287]: warning: custom_action: Action dlm:2_stop_0 on virt-019 is unrunnable (offline)
Jul  9 12:18:33 virt-020 pengine[2287]: notice: LogActions: Stop    dlm:1       (virt-018 - blocked)
Jul  9 12:18:33 virt-020 pengine[2287]: notice: LogActions: Stop    dlm:2       (virt-019 - blocked)
Jul  9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon joined 2 needs fencing
Jul  9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon joined 1 needs fencing
Jul  9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon node 1 stateful merge
Jul  9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon node 1 stateful merge
Jul  9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon node 2 stateful merge
Jul  9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon node 2 stateful merge
Jul  9 12:19:12 virt-020 dlm_controld[2438]: 183 fence work wait to clear merge 2 clean 1 part 0 gone 0
Jul  9 12:19:39 virt-020 dlm_controld[2438]: 210 clvmd wait for fencing
Comment 2 Andrew Beekhof 2015-10-06 03:23:17 EDT
What dlm and/or clvmd command should be executed to determine that they are hung?
Comment 3 michal novacek 2015-10-07 08:09:13 EDT
I'm not sure about the exact command. 

We should somehow find out that controld is in a non-operational state that can be resolved only by fencing.

'dlm_tool ls' at the moment of the problem shows:

dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x00000004 kern_stop
change        member 3 joined 1 remove 0 failed 0 seq 1,1
members       1 2 3 
new change    member 3 joined 1 remove 0 failed 0 seq 2,4
new status    wait fencing
new members   1 2 3


Maybe the 'wait fencing' status gives us this information?
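The marker is at least greppable. As a rough sketch (not the actual patch), assuming `dlm_tool ls` output like the above: the detection helper below takes the output text as an argument so the parsing can be exercised without a live cluster, and the monitor wiring shown in the comment is hypothetical:

```shell
# dlm_needs_fencing: succeed if `dlm_tool ls`-style output (passed as $1)
# shows a lockspace stuck waiting for fencing.
dlm_needs_fencing() {
    printf '%s\n' "$1" | grep -q "wait fencing"
}

# Hypothetical wiring into the agent's monitor action (illustration only;
# the real patch may differ):
#   if dlm_needs_fencing "$(dlm_tool ls)"; then
#       stonith_admin --fence "$(crm_node -n)"   # request self-fencing
#   fi
```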
Comment 4 Andrew Beekhof 2015-10-12 02:07:30 EDT
controld is a shell script that calls into the dlm/clvmd, so

> We should somehow find out that controld is in the non operational state that
> can be resolved only by fencing.

has no meaning.

We can have it look for 'wait fencing' though.
Is there an equivalent command when controld is managing clvmd_controld?


Oyvind: By the way, the controld agent is in the pacemaker tree, should you need to find it.
Comment 6 Oyvind Albrigtsen 2015-11-17 04:58:29 EST
Is there an easy way to reproduce this issue?

I only got it to go into "wait fencing" once, by random timing; the other times it goes into "wait fencing" when network connectivity comes back, which causes fencing.
Comment 7 Oyvind Albrigtsen 2015-11-19 08:53:26 EST
Scratch build available at: https://brewweb.devel.redhat.com/taskinfo?taskID=10124358
Comment 9 Oyvind Albrigtsen 2015-11-23 08:48 EST
Created attachment 1097646 [details]
Working patch
Comment 10 Oyvind Albrigtsen 2015-12-02 04:55:11 EST
https://github.com/ClusterLabs/pacemaker/pull/839
Comment 11 Oyvind Albrigtsen 2016-02-29 06:30:48 EST
The resource agent is in the pacemaker package.
Comment 14 Mike McCune 2016-03-28 18:52:26 EDT
This bug was accidentally moved from POST to MODIFIED via an error in automation; please contact mmccune@redhat.com with any questions.
Comment 18 Patrik Hagara 2016-09-15 05:59:57 EDT
RecoverySwitchFailure test scenario passed, marking as verified in pacemaker-1.1.15-1.2c148ac.git.el7
Comment 20 errata-xmlrpc 2016-11-03 14:56:15 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2578.html
