Description of problem:

This commit
http://git.fedorahosted.org/git/?p=cluster.git;a=commitdiff;h=4039bf4817a96b6aab20de948389f43b89ce4a8e
fixing bug 594511 was not a complete solution. The chance remains that
dlm_controld will get stuck waiting for fencing that has already happened.

dlm_controld checks that last_fenced_time is >= the time the failed node was
removed from the cpg (fail_time). But a fast fencing agent and a delayed cpg
callback can cause fail_time to be later than last_fenced_time. Comparing
last_fenced_time to the time the failed node was removed via the cman api
probably has a similar problem.

To test this I added usleep(1000000) in del_configfs_node() before rmdir.

fence_tool dump:

>1343237151 cluster node 2 removed seq 488
>1343237151 fenced:daemon conf 1 0 1 memb 1 join left 2
>1343237151 fenced:daemon ring 1:488 1 memb 1
>1343237151 fenced:default conf 1 0 1 memb 1 join left 2
>1343237151 add_change cg 4 remove nodeid 2 reason 3
>1343237151 add_change cg 4 m 1 j 0 r 1 f 1
>1343237151 add_victims node 2
>1343237151 check_ringid cluster 488 cpg 1:484
>1343237151 fenced:default ring 1:488 1 memb 1
>1343237151 check_ringid done cluster 488 cpg 1:488
>1343237151 check_quorum done
>1343237151 send_start 1:4 flags 2 started 3 m 1 j 0 r 1 f 1
>1343237151 receive_start 1:4 len 152
>1343237151 match_change 1:4 matches cg 4
>1343237151 wait_messages cg 4 got all 1
>1343237151 set_master from 1 to complete node 1
>1343237151 delay post_fail_delay 0 quorate_from_last_update 0
>1343237151 dct-xen-11 not a cluster member after 0 sec post_fail_delay
>1343237151 fencing node dct-xen-11
>1343237151 fence dct-xen-11 dev 0.0 agent /bin/true result: success
>1343237151 fence dct-xen-11 success
>1343237151 send_victim_done cg 4 flags 2 victim nodeid 2
>1343237151 send_complete 1:4 flags 2 started 3 m 1 j 0 r 1 f 1
>1343237151 receive_victim_done 1:4 flags 2 len 80
>1343237151 receive_victim_done 1:4 remove victim 2 time 1343237151 how 1
>1343237151 receive_complete 1:4 len 152

dlm_tool dump:

>1343237151 cluster node 2 removed seq 488
>1343237151 del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/2"
>1343237152 dlm:controld conf 1 0 1 memb 1 join left 2
>1343237152 dlm:ls:dlm_master conf 1 0 1 memb 1 join left 2
>1343237152 dlm_master add_change cg 3 remove nodeid 2 reason 3
>1343237152 dlm_master add_change cg 3 counts member 1 joined 0 remove 1 failed 1
>1343237152 dlm_master stop_kernel cg 3
>1343237152 write "0" to "/sys/kernel/dlm/dlm_master/control"
>1343237152 dlm_master check_fencing 2 wait add 1343237075 fail 1343237152 last 1343237151

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
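To make the race concrete, here is a minimal C sketch of the kind of comparison described above. It is not the actual dlm_controld code; the function name is made up, and the parameter names just mirror the "fail" and "last" values in the check_fencing line of the dlm_tool dump.

#include <stdint.h>

/*
 * Illustration only, not the dlm_controld source.
 * fail_time        - when the cpg confchg reporting the failure arrived
 * last_fenced_time - when fenced reports the node was last fenced
 */
static int fencing_complete(uint64_t fail_time, uint64_t last_fenced_time)
{
        /*
         * "Was the node fenced after it failed?"  With a fast fence agent
         * and a delayed cpg callback, fail_time (1343237152 above) ends up
         * later than last_fenced_time (1343237151), so this never returns 1
         * and recovery waits for a fence that already happened.
         */
        if (last_fenced_time >= fail_time)
                return 1;       /* fencing considered complete */
        return 0;               /* keep waiting for fencing */
}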
The symptom of this problem is fencing being done but dlm still waiting for fencing:

# group_tool
fence domain
member count  1
victim count  0
victim now    0
master nodeid 1
wait state    none
members       1

dlm lockspaces
name          dlm_master
id            0x8321e41a
flags         0x00000004 kern_stop
change        member 2 joined 1 remove 0 failed 0 seq 2,2
members       1 2
new change    member 1 joined 0 remove 1 failed 1 seq 3,3
new status    wait_messages 0 wait_condition 1 fencing
new members   1
A possible solution might be to save the add_time that preceded the last failure as need_fenced_time and compare last_fenced_time to that.
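A rough sketch of that idea, for illustration only and not the committed patch: need_fenced_time is assumed to hold the add_time recorded before the last failure (the "add 1343237075" value in the check_fencing line above), and the function name is hypothetical.

#include <stdint.h>

/*
 * Sketch of the proposed comparison.  A fence that completed any time
 * after the node was last added satisfies the check, even if the cpg
 * failure callback arrived after the fence finished.
 */
static int fencing_complete_proposed(uint64_t need_fenced_time,
                                     uint64_t last_fenced_time)
{
        if (last_fenced_time >= need_fenced_time)
                return 1;       /* fenced since the node's last add: done */
        return 0;               /* fence predates this membership: wait */
}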
No one has reported seeing this except me.
Pushing this out since I don't have time to work on it for 6.4.
Pushed patch to RHEL6 branch:
https://git.fedorahosted.org/cgit/cluster.git/commit/?h=RHEL6&id=2d06dd478c27bf864ba1a5ac0cbb1ba6c3ed947f

I tested and verified this in two different ways:

1. Added usleep(1000000) in del_configfs_node() before rmdir and recompiled (see the sketch below). Killed nodeB; nodeA recovered properly based on the adjusted time comparison.

2. (No code change necessary.) Suspended the dlm_controld process on nodeA, killed nodeB, waited for fenced to complete fencing on nodeA, resumed dlm_controld on nodeA, and nodeA recovered properly.
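For reference, a rough sketch of the test instrumentation in method 1, assuming del_configfs_node() simply builds the configfs comms path and rmdir()s it, as the del_configfs_node line in the dlm_tool dump suggests; the real function in dlm_controld may differ.

#include <stdio.h>
#include <unistd.h>

#define COMMS_DIR "/sys/kernel/config/dlm/cluster/comms"

/* Sketch only: not the actual del_configfs_node() implementation. */
static void del_configfs_node(int nodeid)
{
        char path[256];

        snprintf(path, sizeof(path), "%s/%d", COMMS_DIR, nodeid);

        /* test-only delay injected before rmdir to widen the race window */
        usleep(1000000);

        if (rmdir(path) < 0)
                fprintf(stderr, "del_configfs_node: rmdir %s failed\n", path);
}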
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-1420.html