Bug 1268313 - clvmd/dlm resource agent monitor action should recognize it is hung
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 7.3
Assigned To: Ken Gaillot
QA Contact: cluster-qe@redhat.com
Docs Contact: Steven J. Levine
Depends On:
Blocks: 1287535 1339661
Reported: 2015-10-02 09:38 EDT by michal novacek
Modified: 2016-11-03 14:56 EDT

See Also:
Fixed In Version: pacemaker-1.1.15-1.2c148ac.git.el7
Doc Type: Release Note
Doc Text:
Fencing now occurs when DLM requires it, even when the cluster itself does not. Previously, DLM could require fencing due to quorum issues, even when the cluster itself did not require fencing, but would be unable to initiate it. As a consequence, DLM and DLM-based services could hang waiting for fencing that never happened. With this fix, the `ocf:pacemaker:controld` resource agent now checks whether DLM is in this state, and requests fencing if so. Fencing now occurs in this situation, allowing DLM to recover.
Story Points: ---
Clone Of:
: 1287535 1339661 (view as bug list)
Environment:
Last Closed: 2016-11-03 14:56:15 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Working patch (641 bytes, patch)
2015-11-23 08:48 EST, Oyvind Albrigtsen

Description michal novacek 2015-10-02 09:38:59 EDT
Description of problem:

This is a bug for the same problem as bz1241511, but filed against the resource-agent component.

Summary of bz1241511: all nodes lose and then resume quorum without fencing (caused by network connection loss). This leaves dlm waiting for fencing, which makes clvmd appear hung, and all access to LVs waits forever.

The resulting situation is a quorate cluster with clvmd-clone and dlm-clone reported as running correctly, while both are in fact hung. As concluded in bz1241511, the problem is neither pacemaker's nor dlm's.

I believe the correct resolution of this situation is for clvmd to fail its monitor action, leading to fencing.

Version-Release number of selected component (if applicable):
dlm-4.0.2-5.el7.x86_64
lvm2-cluster-2.02.115-3.el7.x86_64
pacemaker-1.1.12-22.el7.x86_64
corosync-2.3.4-4.el7.x86_64

How reproducible: very frequent

Steps to Reproduce:
1. Have a quorate pacemaker cluster.
2. Check the nodes' uptime.
3. Disable network communication between all nodes with iptables and wait for
   all nodes to become inquorate.
4. Re-enable network communication between the nodes at the same time.
5. Check whether fencing occurred; if it has not, check dlm status and logs.
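The partition step above can be sketched as a pair of small helpers. This is a minimal sketch, assuming iptables-based filtering; the firewall command is passed as an argument purely so the helpers can be dry-run, and the function names are hypothetical:

```shell
# block_cluster_traffic: drop all traffic to/from the given peer nodes.
# $1 is the firewall command ("iptables", or e.g. "echo iptables" for a
# dry run); the remaining arguments are the other nodes' addresses.
block_cluster_traffic() {
    ipt=$1; shift
    for peer in "$@"; do
        $ipt -A INPUT  -s "$peer" -j DROP
        $ipt -A OUTPUT -d "$peer" -j DROP
    done
}

# unblock_cluster_traffic: delete the rules added above, restoring
# communication between the nodes.
unblock_cluster_traffic() {
    ipt=$1; shift
    for peer in "$@"; do
        $ipt -D INPUT  -s "$peer" -j DROP
        $ipt -D OUTPUT -d "$peer" -j DROP
    done
}
```

Running block_cluster_traffic on every node at roughly the same time, waiting for all nodes to become inquorate, and then running unblock_cluster_traffic on all nodes simultaneously reproduces the window described in steps 3 and 4.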

Actual results: dlm hanging

Expected results: dlm happily working

Additional info:
# tail /var/log/messages
...
Jul  9 12:18:33 virt-020 pengine[2287]: warning: custom_action: Action dlm:2_stop_0 on virt-019 is unrunnable (offline)
Jul  9 12:18:33 virt-020 pengine[2287]: warning: custom_action: Action dlm:2_stop_0 on virt-019 is unrunnable (offline)
Jul  9 12:18:33 virt-020 pengine[2287]: notice: LogActions: Stop    dlm:1       (virt-018 - blocked)
Jul  9 12:18:33 virt-020 pengine[2287]: notice: LogActions: Stop    dlm:2       (virt-019 - blocked)
Jul  9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon joined 2 needs fencing
Jul  9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon joined 1 needs fencing
Jul  9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon node 1 stateful merge
Jul  9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon node 1 stateful merge
Jul  9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon node 2 stateful merge
Jul  9 12:18:41 virt-020 dlm_controld[2438]: 151 daemon node 2 stateful merge
Jul  9 12:19:12 virt-020 dlm_controld[2438]: 183 fence work wait to clear merge 2 clean 1 part 0 gone 0
Jul  9 12:19:39 virt-020 dlm_controld[2438]: 210 clvmd wait for fencing
Comment 2 Andrew Beekhof 2015-10-06 03:23:17 EDT
What dlm and/or clvmd command should be executed to determine that they are hung?
Comment 3 michal novacek 2015-10-07 08:09:13 EDT
I'm not sure about the exact command. 

We should somehow find out that controld is in a non-operational state that can be resolved only by fencing.

'dlm_tool ls' at the moment of the problem shows:

dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x00000004 kern_stop
change        member 3 joined 1 remove 0 failed 0 seq 1,1
members       1 2 3 
new change    member 3 joined 1 remove 0 failed 0 seq 2,4
new status    wait fencing
new members   1 2 3


Maybe the 'wait fencing' status gives us this information?
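The marker is at least greppable. As a rough sketch (not the actual patch), assuming `dlm_tool ls` output like the above: the detection helper below takes the output text as an argument so the parsing can be exercised without a live cluster, and the monitor wiring shown in the comment is hypothetical:

```shell
# dlm_needs_fencing: succeed if `dlm_tool ls`-style output (passed as $1)
# shows a lockspace stuck waiting for fencing.
dlm_needs_fencing() {
    printf '%s\n' "$1" | grep -q "wait fencing"
}

# Hypothetical wiring into the agent's monitor action (illustration only;
# the real patch may differ):
#   if dlm_needs_fencing "$(dlm_tool ls)"; then
#       stonith_admin --fence "$(crm_node -n)"   # request self-fencing
#   fi
```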
Comment 4 Andrew Beekhof 2015-10-12 02:07:30 EDT
controld is a shell script that calls into the dlm/clvmd, so

> We should somehow find out that controld is in the non operational state that
> can be resolved only by fencing.

has no meaning.

We can have it look for 'wait fencing' though.
Is there an equivalent command when controld is managing clvmd_controld?


Oyvind: By the way, the controld agent is in the pacemaker tree, should you need to find it.
Comment 6 Oyvind Albrigtsen 2015-11-17 04:58:29 EST
Is there an easy way to reproduce this issue?

I only got it to go into "wait fencing" once, by random timing; the other times it goes into "wait fencing" when network connectivity comes back, which causes fencing.
Comment 7 Oyvind Albrigtsen 2015-11-19 08:53:26 EST
Scratch build available at: https://brewweb.devel.redhat.com/taskinfo?taskID=10124358
Comment 9 Oyvind Albrigtsen 2015-11-23 08:48 EST
Created attachment 1097646 [details]
Working patch
Comment 10 Oyvind Albrigtsen 2015-12-02 04:55:11 EST
https://github.com/ClusterLabs/pacemaker/pull/839
Comment 11 Oyvind Albrigtsen 2016-02-29 06:30:48 EST
The resource agent is in the pacemaker package.
Comment 14 Mike McCune 2016-03-28 18:52:26 EDT
This bug was accidentally moved from POST to MODIFIED via an error in automation; please contact mmccune@redhat.com with any questions.
Comment 18 Patrik Hagara 2016-09-15 05:59:57 EDT
RecoverySwitchFailure test scenario passed, marking as verified in pacemaker-1.1.15-1.2c148ac.git.el7
Comment 20 errata-xmlrpc 2016-11-03 14:56:15 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2578.html
