Bug 1077404 - dlm hangs waiting for fence operation to complete
Summary: dlm hangs waiting for fence operation to complete
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: dlm
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: David Teigland
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1295577 1313485
 
Reported: 2014-03-17 22:32 UTC by David Vossel
Modified: 2021-09-03 12:07 UTC
CC List: 4 users

Fixed In Version: dlm-4.0.6-1.el7
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-04 06:34:39 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
  System ID: Red Hat Product Errata RHBA-2016:2445
  Private: 0
  Priority: normal
  Status: SHIPPED_LIVE
  Summary: dlm bug fix update
  Last Updated: 2016-11-03 14:03:09 UTC

Description David Vossel 2014-03-17 22:32:27 UTC
Description of problem:

dlm can hang in an infinite loop waiting for a fencing action to occur that has already happened.

Version-Release number of selected component (if applicable):


How reproducible:

difficult race condition

Steps to Reproduce:
1. Disconnect a node from a pacemaker cluster in a way that causes the corosync membership to be lost unexpectedly (a network disconnect will work).
2. Both pacemaker and the dlm detect the membership change and attempt to fence the lost node.
3. If pacemaker's fencing request is completely processed within a short interval before the dlm's request is processed, the two requests get merged. The dlm is told that the fencing operation occurred through the successful completion of the stonith_api_kick_helper function, but it then remains in an infinite loop waiting for the time of the last fencing request to update to the value it expects.


Actual results:

dlm hangs


Expected results:

dlm sees the fencing action occurred and moves on.

Additional info:

Here's the code in question.

>	rv = stonith_api_kick_helper(nodeid, 300, 0);

If stonith_api_kick_helper says we fenced the node, the node is fenced. This function should block until the fencing operation completes. If rv == 0, then fencing was successful.

>	if (rv) {
>		fprintf(stderr, "kick_helper error %d nodeid %d\n", rv, nodeid);
>		openlog("dlm_stonith", LOG_CONS | LOG_PID, LOG_DAEMON);
>		syslog(LOG_ERR, "kick_helper error %d nodeid %d\n", rv, nodeid);
>		return rv;
>	}
>
>	while (1) {
>		t = stonith_api_time_helper(nodeid, 0);
>		if (t >= fail_time)

We can't trust that this statement will ever be true. There are many conditions where both pacemaker and the dlm fence a node at seemingly the exact same time. Depending on the request order and how quickly the fencing devices respond, the last fencing action can occur slightly before fail_time. This is because the two fencing requests stonith receives (one from the dlm, one from pacemaker) are merged so that overlapping requests don't fence the node twice. This merging only happens when the two requests overlap or arrive within a very short time of one another.

>			return 0;
>		sleep(1);
>	}
>


The wait loop should not be necessary. I believe we can remove it entirely, and this race condition will be solved.
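For illustration, here is a minimal sketch (not an actual patch) of what the end of the helper could look like with the wait loop removed, trusting the kick_helper result as described above. The fence_node() wrapper name and the header path are assumptions on my part; the stonith_api_* calls are the ones already used in the quoted code.

#include <stdio.h>
#include <syslog.h>
#include <crm/stonith-ng.h>	/* assumed location of the stonith_api_*_helper wrappers */

/* Hypothetical wrapper mirroring the quoted dlm_stonith code, with the
 * polling loop on stonith_api_time_helper() removed. */
static int fence_node(int nodeid)
{
	int rv;

	rv = stonith_api_kick_helper(nodeid, 300, 0);
	if (rv) {
		fprintf(stderr, "kick_helper error %d nodeid %d\n", rv, nodeid);
		openlog("dlm_stonith", LOG_CONS | LOG_PID, LOG_DAEMON);
		syslog(LOG_ERR, "kick_helper error %d nodeid %d\n", rv, nodeid);
		return rv;
	}

	/* rv == 0: stonith reports the node was fenced (possibly via a request
	 * merged with pacemaker's), so report success instead of waiting for
	 * stonith_api_time_helper() to reach fail_time. */
	return 0;
}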

Comment 2 RHEL Program Management 2014-03-25 05:47:25 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 3 David Teigland 2014-04-01 17:05:11 UTC
https://git.fedorahosted.org/cgit/dlm.git/commit/?id=fb61984c9388cbbcc02c6a09c09948b21320412d

    dlm_stonith: use kick_helper result
    
    Don't depend on the fence time being later than
    the fail time after using the kick helper function.
    Make fail_time optional.
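As a rough reading of the "Make fail_time optional" part (a paraphrase of the commit message, not the actual diff; see the link above for the real change), the helper no longer needs a fail time to declare success once the kick result is trusted. Extending the hypothetical sketch from the description (same includes, plus <time.h> for time_t; names are placeholders, not dlm's code):

/* fail_time may be 0 when the caller has none to supply; either way the
 * kick result alone decides success. Placeholder name, not dlm's code. */
static int fence_node_opt(int nodeid, time_t fail_time)
{
	int rv;

	rv = stonith_api_kick_helper(nodeid, 300, 0);
	if (rv)
		return rv;	/* kick failed; caller handles logging */

	(void)fail_time;	/* optional and no longer consulted */
	return 0;
}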

Comment 5 David Teigland 2016-01-07 16:46:42 UTC
This would be included in the dlm rebase (bug 1295877).

Comment 7 Mike McCune 2016-03-28 23:39:55 UTC
This bug was accidentally moved from POST to MODIFIED by an error in automation; please contact mmccune with any questions.

Comment 11 errata-xmlrpc 2016-11-04 06:34:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2445.html

