Bug 1268313

Summary: clvmd/dlm resource agent monitor action should recognize it is hung
Product: Red Hat Enterprise Linux 7
Reporter: michal novacek <mnovacek>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: unspecified
Docs Contact: Steven J. Levine <slevine>
Priority: unspecified
Version: 7.2
CC: abeekhof, agk, cfeist, cluster-maint, fdinitto, ivlnka, jruemker, mnovacek, oalbrigt, phagara
Target Milestone: rc
Target Release: 7.3
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: pacemaker-1.1.15-1.2c148ac.git.el7
Doc Type: Release Note
Doc Text:
Fencing now occurs when DLM requires it, even when the cluster itself does not. Previously, DLM could require fencing due to quorum issues, even when the cluster itself did not require fencing, but it would be unable to initiate it. As a consequence, DLM and DLM-based services could hang waiting for fencing that never happened. With this fix, the `ocf:pacemaker:controld` resource agent now checks whether DLM is in this state and requests fencing if so. Fencing now occurs in this situation, allowing DLM to recover.
Story Points: ---
Clone Of:
Clones: 1287535, 1339661 (view as bug list)
Environment:
Last Closed: 2016-11-03 18:56:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1287535, 1339661
Attachments:
Working patch (no flags)

Description michal novacek 2015-10-02 13:38:59 UTC
Description of problem:

This is a bug for the same problem as bz1241511, but with the resource-agent component.

Summary of bz1241511: all nodes lose and then regain quorum without fencing (caused
by network connection loss). This leaves dlm waiting for fencing, which causes clvmd
to appear hung, and all access to LVs waits forever.

The resulting situation is a quorate cluster with clvmd-clone and dlm-clone reported
as running correctly while both are in fact hung. As concluded in bz1241511, the
problem is neither pacemaker's nor dlm's.

I believe that the correct resolution of this situation is for clvmd to fail its
monitor action, leading to fencing.

Version-Release number of selected component (if applicable):
dlm-4.0.2-5.el7.x86_64
lvm2-cluster-2.02.115-3.el7.x86_64
pacemaker-1.1.12-22.el7.x86_64
corosync-2.3.4-4.el7.x86_64

How reproducible: very frequent

Steps to Reproduce:
1. Have a quorate pacemaker cluster.
2. Check the nodes' uptime.
3. Disable network communication between all nodes with iptables (see the sketch
   below) and wait for all nodes to turn inquorate.
4. Re-enable network communication between the nodes at the same time.
5. Check whether fencing occurred; if it has not, check dlm status and logs.
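A minimal sketch of one way to do steps 3 and 4, assuming the peer addresses
192.168.122.11 and 192.168.122.12 as placeholders (adjust for the real cluster):

# On every node at once: drop all traffic to/from the other cluster nodes.
for peer in 192.168.122.11 192.168.122.12; do
    iptables -A INPUT  -s "$peer" -j DROP
    iptables -A OUTPUT -d "$peer" -j DROP
done

# Wait until 'corosync-quorumtool -s' reports the node as inquorate, then
# remove the rules on all nodes at roughly the same time:
for peer in 192.168.122.11 192.168.122.12; do
    iptables -D INPUT  -s "$peer" -j DROP
    iptables -D OUTPUT -d "$peer" -j DROP
done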

Actual results: dlm hanging

Expected results: dlm happily working

Additional info:
# tail /var/log/messages
...
Jul  9 12:18:33 virt-020 pengine[2287]: warning: custom_action: Action dlm:2_stop_0 on virt-019 is unrunnable (offline)
Jul  9 12:18:33 virt-020 pengine[2287]: warning: custom_action: Action dlm:2_stop_0 on virt-019 is unrunnable (offline)
Jul  9 12:18:33 virt-020 pengine[2287]: notice: LogActions: Stop    dlm:1       (virt-018 - blocked)
Jul  9 12:18:33 virt-020 pengine[2287]: notice: LogActions: Stop    dlm:2       (virt-019 - blocked)
Jul  9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon joined 2 needs fencing
Jul  9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon joined 1 needs fencing
Jul  9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon node 1 stateful merge
Jul  9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon node 1 stateful merge
Jul  9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon node 2 stateful merge
Jul  9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon node 2 stateful merge
Jul  9 12:19:12 virt-020 dlm_controld[2438]: 183 fence work wait to clear merge 2 clean 1 part 0 gone 0
Jul  9 12:19:39 virt-020 dlm_controld[2438]: 210 clvmd wait for fencing

Comment 2 Andrew Beekhof 2015-10-06 07:23:17 UTC
what dlm and/or clvmd command should be executed to determine that they are hung?

Comment 3 michal novacek 2015-10-07 12:09:13 UTC
I'm not sure about the exact command. 

We should somehow find out that controld is in the non-operational state that can be resolved only by fencing.

'dlm_tool ls' at the moment of the problem shows:

dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x00000004 kern_stop
change        member 3 joined 1 remove 0 failed 0 seq 1,1
members       1 2 3 
new change    member 3 joined 1 remove 0 failed 0 seq 2,4
new status    wait fencing
new members   1 2 3


Maybe the 'wait fencing' status gives us this information?
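If it does, a minimal monitor-side check could be little more than grepping that
output; this is only an illustrative sketch, not an actual agent change:

# Sketch: report whether any dlm lockspace is stuck waiting for fencing,
# based solely on the 'dlm_tool ls' output shown above.
if dlm_tool ls 2>/dev/null | grep -q "wait fencing"; then
    echo "dlm is blocked waiting for fencing"
    exit 1   # monitor failure
fi
exit 0       # lockspaces look healthy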

Comment 4 Andrew Beekhof 2015-10-12 06:07:30 UTC
controld is a shell script that calls into the dlm/clvmd, so

> We should somehow find out that controld is in the non operational state that
> can be resolved only by fencing.

has no meaning.

We can have it look for 'wait fencing' though.
Is there an equivalent command when controld is managing clvmd_controld?


Oyvind: Btw. the controld agent is in the pacemaker tree when you need to find it.
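For illustration, a rough sketch of what such a check inside the controld agent's
monitor could look like, assuming the intent is to request fencing of the local node
when dlm reports 'wait fencing'; the function name and the use of stonith_admin here
are assumptions, not the actual patch:

# Hypothetical helper for the controld agent's monitor action (sketch only).
check_dlm_wait_fencing() {
    if dlm_tool ls | grep -q "wait fencing"; then
        # dlm is blocked on fencing that the cluster never initiated; ask the
        # fencer to reboot this node so dlm can recover.
        # 'crm_node -n' prints the local node name; 'stonith_admin --reboot'
        # requests fencing of the named node.
        stonith_admin --reboot "$(crm_node -n)"
        return 1
    fi
    return 0
}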

Comment 6 Oyvind Albrigtsen 2015-11-17 09:58:29 UTC
Is there an easy way to reproduce this issue?

I only got it to go into "wait fencing" once, by random timing; the other times it goes into "wait fencing" when network connectivity comes back, which triggers fencing.

Comment 7 Oyvind Albrigtsen 2015-11-19 13:53:26 UTC
Scratch build available at: https://brewweb.devel.redhat.com/taskinfo?taskID=10124358

Comment 9 Oyvind Albrigtsen 2015-11-23 13:48:09 UTC
Created attachment 1097646 [details]
Working patch

Comment 10 Oyvind Albrigtsen 2015-12-02 09:55:11 UTC
https://github.com/ClusterLabs/pacemaker/pull/839

Comment 11 Oyvind Albrigtsen 2016-02-29 11:30:48 UTC
The resource agent is in the pacemaker package.

Comment 14 Mike McCune 2016-03-28 22:52:26 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation; please see mmccune with any questions.

Comment 18 Patrik Hagara 2016-09-15 09:59:57 UTC
RecoverySwitchFailure test scenario passed, marking as verified in pacemaker-1.1.15-1.2c148ac.git.el7

Comment 20 errata-xmlrpc 2016-11-03 18:56:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2578.html